Jobs work directory and temporary directory

2nd proposal 26 May 2011

Document describing a proposal for managing the job working directory and temporary directory.

Problem description

Workdir

The workdir is the directory associated with a user's job. It is the directory where all files are created when the user's application code does not specify a path, or specifies a relative one. It is essentially the current directory (the path held in the $PWD environment variable) from which a process, i.e. a job, is run. In many configurations a batch job runs from a workdir contained in the unix user's home directory, and the home directories are often imported by the worker nodes from a shared file system. This can raise serious performance issues for the file servers, since many jobs end up accessing a distributed file system without actually needing distributed data.
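
As a minimal illustration (a hypothetical job payload, not part of any middleware), any file created with a relative path ends up under whatever directory the batch system started the job in:

 #!/bin/sh
 # Hypothetical job payload, for illustration only.
 echo "job running in: $PWD"   # the workdir assigned by the batch system,
                               # often the pool account's home directory
 touch output.dat              # created as $PWD/output.dat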

Temp dir

The temporary directory is used by jobs to create large temporary files. The unix convention is to use the /tmp directory. However, in some configurations it is better to point the jobs to a more suitable file system for large temporary files, either for performance reasons or to avoid clashes between jobs running on the same worker node.

Update 07 May 2012: an EGI policy proposal has been defined to address this issue: TMPDIR proposed policy

Proposed solutions

Workdir

Batch system configuration

A) The clean and correct way to address the problem is probably to configure the batch system properly, so that it sets the desired directory under which job working directories are created. This solution is up to the site managers who install and configure the batch system, provided the LRMS allows this setting to be tuned. It may also be worth specifying a different workdir for parallel jobs: parallel jobs may need to be run from a distributed file system.

JobWrapper customization points

B) If for some reason the previous solution is not possible, and the directory assigned by the batch system (by default usually the home directory of the local account) is not suitable, a possible solution is to use the customization points of the JobWrapper.

A user job (the one specified as 'executable' in the job JDL) is "included" in a JobWrapper which, besides running the user payload, is also responsible for other operations (input/output sandbox management, LB logging, etc.). This JobWrapper can be created by the WMS, for jobs submitted to LCG-CEs through the WMS, or by CREAM, for all job submission paths (direct submissions, submissions through the WMS, submissions through Condor).

In the current implementations of the JobWrapper (for both LCG-CE and CREAM-CE) there is no explicit "cd" operation before the creation of the job working directory. This means that the job working directory is created under the directory "assigned" by the batch system (which is, as said above, the home directory of the local account mapped to that Grid user).

As described in [R2] and [R3], customization points are scripts (provided by the local administrator) which are run by the JobWrapper (both the WMS JobWrapper and the CREAM JobWrapper). There are customization point scripts run before and after the job execution. In particular the first customization point:

 ${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh is executed at the beginning of the JobWrapper execution,
 before the creation of the job working directory.

Therefore this customization point, which must be created on each WN, could be used to "cd" to the desired site-specific directory.

In the case of MPI jobs, the customization can be implemented by determining the type of job and applying a different policy (if needed) for MPI and sequential jobs, for example by checking variables such as PBS_NODEFILE in Torque or PE_HOSTFILE in SGE; there are plenty of variables that can help here. For MPI jobs the LRMS administrator can control where they run, ideally in a shared area; if that is not possible, a shared area for MPI that is accessible to MPI-Start should be provided [E. Fernandez de Castillo]. A sketch of such a customization point follows.
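
The following is only a minimal sketch of such a cp_1.sh, not an official implementation: it assumes the script is sourced by the JobWrapper (so the "cd" takes effect), that /local/scratch is a site-chosen example path, and that the node-file checks below are enough to recognise multi-node jobs on the local LRMS.

 #!/bin/sh
 # Hypothetical ${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh -- sketch only.
 # Move sequential jobs off the shared home file system before the
 # JobWrapper creates the job working directory; leave multi-node (MPI)
 # jobs in the shared area they rely on.
 SCRATCH_BASE=/local/scratch   # assumed site-specific local path
 if [ -n "$PBS_NODEFILE" ] && [ "$(sort -u "$PBS_NODEFILE" | wc -l)" -gt 1 ]; then
     :   # multi-node job under Torque: stay on the shared file system
 elif [ -n "$PE_HOSTFILE" ] && [ "$(wc -l < "$PE_HOSTFILE")" -gt 1 ]; then
     :   # parallel environment under SGE: stay on the shared file system
 elif [ -d "$SCRATCH_BASE" ] && [ -w "$SCRATCH_BASE" ]; then
     cd "$SCRATCH_BASE"        # the job working directory will be created here
 fi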

Tempdir

Standard environment variable

Even if for most applications a temporary directory on the same file system as the working directory is a good option, it would be better to keep their definitions separate. While for parallel jobs a home directory on a shared file system might be a good option, there is no reason to use that file system for large temporary files, which belong to a single job and are used only by that job.

A variable specifying the proper path for (largish) temporary files might be set in the job environment before the execution, and used in the job code to create those files. The name of the variable should be standard across all batch systems; it would be part of the standard environment that every job can expect on the worker node.

Currently the definition of a temporary directory different from the workdir is not a problem that has been extensively addressed. The name of this variable and how to set it (from the batch system, in the JobWrapper, ...) are still open issues.
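
On the application side, whatever name is eventually standardized, the usage pattern would look roughly like the sketch below. It assumes the variable is called TMPDIR (the name proposed later in this document) and falls back to /tmp where it is not set.

 #!/bin/sh
 # Sketch of user-side usage; TMPDIR here is the proposed standard variable.
 SCRATCH=${TMPDIR:-/tmp}
 bigfile="$SCRATCH/intermediate_$$.dat"          # $$ keeps names per-process
 dd if=/dev/zero of="$bigfile" bs=1M count=100   # stand-in for real payload I/O
 # ... use the file ...
 rm -f "$bigfile"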

To be approved

Here follows a summary of the proposals to be discussed and approved by the operations managers.

Workdir

Solutions A) and B) are equally good, depending on the choice of the system administrators.

Tmp dir

A variable called TMPDIR will be created within every job environment. The user's application code should open large temporary files under $TMPDIR/ instead of /tmp. Since jobs can create many big temporary files, system administrators can configure the LRMS to place the TMPDIRs on a file system configured with the needed properties; a unique directory can be created in that space for every job, and the job's environment will get a TMPDIR pointing to that space. The LRMS can also be set up to clean up all the temporary data when the job exits.
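
As one concrete, site-level illustration (not part of the proposal itself): Torque's pbs_mom supports a $tmpdir directive in its MOM configuration, with which the MOM creates a per-job subdirectory under the given path, exports it to the job as TMPDIR, and removes it at job exit; Grid Engine offers a similar tmpdir queue attribute. The path below is only an example.

 # Example pbs_mom configuration line (e.g. in mom_priv/config on each WN):
 # Torque creates <path>/<jobid>, sets TMPDIR for the job, and cleans it up.
 $tmpdir /local/scratch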


If the operations managers agree with these proposals, the proposals will be brought to the technology providers as a requirement.

References

Comments from NGIs

Please provide here your feedback.

J. Templon

I am not speaking on behalf of my NGI, I was asked by Tiziana to comment.

After reviewing the very useful summary, which provides a fresh look at the entire problem, I conclude that we should turn the solution around. The problem we want to solve is, in a sense, created by the presence of MPI jobs on the grid. This type of job (as well as any other job spanning multiple nodes) uses a shared file system as a communication platform between the instances running on the independent nodes. If these jobs did not exist, there would be no real use case for shared home directories and the problem would disappear, as the site could, at the batch system layer, always make sure that $HOME was somewhere a job could write into without causing problems (most likely not on a shared file system), and also make sure that 'cd $HOME' was executed before the job wrapper began.

If we take that as the baseline approach ... which addresses by far the largest fraction of jobs on the grid ... then we do not care about TMPDIR.

The solution then does not involve any TMPDIR, what it needs is some method for communication between programs running on the various nodes of a multi-node job. I will assume for the moment that a shared area will be used. This should NOT be the home directory, it should be some other directory. The associated environment variable should have some good name, standardized in EMI, which I think we are free to choose. Something like GMULTI_SHARED_DIR where G is for Grid, Multi means for jobs spanning multiple nodes, shared means of course shared across all these nodes. People may have even better ideas for the name. I would prefer not to include "EGI" or "EMI" or "UMD" or whatever in the name!!! This gives hope for other middleware projects to be able to adopt this as a standard.

The MPI grid tools would need to be modified to use this new directory (via the environment variable) for the shared communication. Putting the work here is correct since the "multiple node" type of jobs are the only type of jobs that need this directory.
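
As an illustration only (GMULTI_SHARED_DIR is just the name proposed above, not an existing variable), the multi-node tooling could pick its communication directory roughly as follows:

 # Sketch only: prefer the proposed variable, fall back to $HOME.
 COMM_DIR=${GMULTI_SHARED_DIR:-$HOME}
 JOB_DIR="$COMM_DIR/${PBS_JOBID:-job_$$}"   # PBS_JOBID used as an example job id
 mkdir -p "$JOB_DIR" && cd "$JOB_DIR"       # shared across the nodes of the job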

This is my opinion.

Note I agree with Goncalo, MPI should work properly on the grid!


Comment from Goncalo on this (reported by Peter): This is also what many MPI sites do. They do not use shared home directories, but another directory in a shared file system configured for parallel job purposes. In the MPI-Start script, run when an MPI job starts, the workdir is changed. Site admins do not use environment variables; they directly customize MPI-Start.

NGI_Ibergrid (G. Borges, take 1)

In almost all the surveys, application people request: please make MPI work properly on the grid! While for sequential jobs it is fine to change the working dir to a scratch area, the situation is less clear in MPI scenarios since the MPI setup is, in 90% of the cases, based on shared homes. So the middleware has to deal with two different use cases:

  • Sequential jobs that do not want to use shared homes due to performance issues;
  • Parallel jobs that 90% of the time do want to use shared homes;

AFAIK, customization scripts inside the jobwrappers are completely insensitive to the user jobs (i.e. they act in the same way for all users, VOs and applications). The hypothesis of implementing a "change dir" (to a scratch space) in the customization scripts executed by the JobWrapper could break the MPI execution unless the JobWrapper is capable of recognizing the job type and acting accordingly. If I'm thinking right, I would not vote for this option as a valid option (as it is now), since it gives the message that the EGI infrastructure does not care about parallel processing. The adopted solution should be global, i.e. general enough for any kind of job type or application, and without the need to manually adjust the middleware behaviour to enable a specific job type to be executed.

(Tiziana): we should check whether the prologue mechanism can be used for this: "The prologue is an executable run within the WMS job wrapper before the user job is started. It can be used for purposes ranging from application-specific checks that the job environment has been correctly set on the WN to actions like data transfers, database updates or MPI pre scripts. If the prologue fails the job wrapper terminates and the job is considered for resubmission."
(Stuart Purdie): Using the prologue would not be a good solution. It would require MPI users to use the prologue explicitly - and therefore the user has to handle the difference between MPI and serial jobs, rather than the software. Secondly, it means that the user has to know _where_ the correct location to put the files is. (This becomes an issue with MPI jobs that handle large files.) Rather than using the prologue and forcing all these issues on the end user, the middleware should have a neat way for system managers to specify this - it could be as simple as passing the number of cores used into the customisation point scripts for gLite. Site managers can then ignore it, or use that data to route job directories appropriately. Spurdie 12:00, 15 March 2011 (UTC)

NGI_Ibergrid (G. Borges, take 2)

I think that we all agree that option A) is the best one.

Option B) is not so good and general because it is too gLite-specific. Also, if implemented without proper care, the change of the working dir to a scratch area could break the MPI execution. A workaround for that precise problem would be to provide a proper template for the customization scripts so that they recognize the different kinds of jobs/applications being executed, and act accordingly. Alternatively, site admins enabling MPI must be warned/advised to configure MPI-START in such a way that parallel executions are copied to and executed in a shared filesystem available somewhere (either the homes or some other place), even if started from a scratch area.

Finally, the prologue/epilogue WMS scripts, besides also being gLite-specific, leave to the user the responsibility of setting up probes to check the environment for his parallel job execution. This is no general solution since it depends on whether the user has the ability/know-how to implement such checks, which, in the first place, should be done by the infrastructure providers. I would tend not to suggest this option.

NGI Switzerland

Observations:

  • MPI in itself does not make any assumption about disk I/O capabilities.
  • The number of cores per CPU is constantly increasing. A worker node can easily have 32 jobs running, and local disks are not able to provide enough bandwidth and IO ops for all of them to meet IO requirements (5-10 MB/s per job). In that case, having something else, like a high-performance shared FS for scratch, is a must.
  • Criteria for disk space can be: speed, size (size and number of files), safety (backup vs. volatility of data), and availability from every worker node (WN). A typical home directory is often not that fast, not that large, but has a backup, and is usually shared between a head node and the WNs. A typical scratch directory (for temporary files of independent jobs) is usually assumed to be fast and large, does not need a backup, and does not need to be shared. The MPI jobs mentioned by Jeff Templon need "a shared file system as a communication platform between the instances running on the independent nodes".

Opinions:

  • We agree that it is not straightforward to implement a useful scratch directory. Potential solutions include:
    • a local SSD for every WN: these would be very fast, but not that large, and expensive. This is similar to building virtual disks in the WN's memory.
    • a shared parallel file system like Lustre or GPFS: this solution usually assumes the presence of a high-performance interconnect between the WNs and a large, powerful disk system.
  • We would like to see reports from sysadmins that have successfully implemented option A for at least a couple of different batch systems (say, SGE and TORQUE) before we can consider this solution a viable option.

Open questions:

  • The more general question is: what does the job require in terms of disk resources? How can these requirements be forwarded to the LRMS, e.g. Torque? (It is currently by no means clear that the content of some TMPDIR variable can be taken over by Torque and taken into account accordingly.)
  • On the other hand, what does the site provide? How is the site's capability advertised in the information system?
  • What should it be called then? $HOME is already reserved. /tmp is already reserved.

Solutions already in place:

  • At CSCS we are using a CREAM JobWrapper that works just fine.
  • With ARC, site submission scripts can be customized by sysadmins.