
Jobs work directory and temporary directory



Problem description

Workdir

The workdir is the directory associated with a user's job. It is the directory where all files are created when the user's application code does not specify a path, or specifies a relative one. It is basically the current directory (the path stored in the $PWD environment variable) from which a process, i.e. a job, is run. In many configurations a batch job is run from a workdir located in the Unix user's home directory. Often the home directories are imported by the worker nodes from a shared file system. This can raise serious performance issues for the file servers, since many jobs end up accessing a distributed file system without really needing distributed data.
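For illustration, a minimal sketch of what happens inside a job when no absolute path is given (the file name output.dat is just an example):

 #!/bin/sh
 # Sketch of a job payload: any file created with a relative path
 # ends up in the current working directory assigned by the batch system.
 echo "Running from: $PWD"   # often /home/<pool-account>, possibly on a shared file system
 touch output.dat            # created in $PWD, i.e. potentially on the shared home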

Temp dir

The temporary directory is used by jobs to create large temporary files. The Unix convention is to use the /tmp directory. However, in some configurations it is better to point the jobs to a more suitable file system for large temporary files, either for performance reasons or to avoid clashes between jobs running on the same worker node.

Proposed solutions

Workdir

Batch system configuration

A) The clean and correct way to address the problem is probably to configure the batch system properly, setting the desired directory under which job working directories are created. This solution is up to the site managers that install and configure the batch system, as long as the LRMS allows this configuration to be tuned. It may also be worthwhile to specify a different workdir for parallel jobs: parallel jobs may need to be run from a distributed file system.

JobWrapper customization points

B) If for some reason the previous solution is not possible, and the directory assigned by the batch system (which by default is usually the home directory of the local account) is not suitable, a possible solution is to use the customization points in the JobWrapper.

A user job (the one specified as 'executable' in the job JDL) is "included" in a JobWrapper which, besides running the user payload, is also responsible for other operations (input/output sandbox management, LB logging, etc.). This JobWrapper can be created by the WMS, for jobs submitted to LCG-CEs through the WMS, or by CREAM, for all job submission paths (direct submissions, submissions through the WMS, submissions through Condor).

In the current implementations of the JobWrapper (for both the LCG-CE and the CREAM-CE) there is no explicit "cd" operation before the creation of the job working directory. This means that the job working directory is created under the directory "assigned" by the batch system (which, as said above, is usually the home directory of the local account mapped to that Grid user).

As described in [R2] and [R3], customization points are scripts (provided by the local administrator) which are run by the JobWrapper (both the WMS JobWrapper and the CREAM JobWrapper). There are customization point scripts run before and after the job execution. In particular the first customization point:

 ${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh is executed at the beginning of the JobWrapper execution,
 before the creation of the job working directory.

Therefore this customization point, which must be created on each WN, can be used to "cd" to the desired site-specific directory (a sketch is given after the MPI note below).

In the case of MPI jobs, the customization can be implemented by figuring out the type of job and using a different policy (if needed) for MPI and sequential jobs, for example by checking variables such as PBS_NODEFILE in Torque or PE_HOSTFILE in SGE; there are plenty of variables that can help here. For MPI jobs the LRMS administrator can control where to run, ideally in a shared area; if not, there should be a shared area for MPI that is accessible to MPI-Start [E. Fernandez de Castillo]
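As an illustration, a minimal sketch of such a cp_1.sh follows. The scratch path /scratch and the per-type policy are only examples to be adapted by the site; the detection of parallel jobs via PBS_NODEFILE / PE_HOSTFILE is the one suggested above, and the sketch assumes that the customization point runs within the JobWrapper shell so that the "cd" takes effect.

 #!/bin/sh
 # ${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh
 # Executed by the JobWrapper before the job working directory is created.
 
 # Example site-specific scratch area on the WN local disk (adapt to the site).
 SITE_SCRATCH=/scratch
 
 # Parallel jobs (detected via the batch-system hostfile variables) are left
 # on the shared home so that MPI-Start can reach the files; sequential jobs
 # are moved to the local scratch area, under which the JobWrapper will then
 # create the per-job working directory.
 if [ -z "$PBS_NODEFILE" ] && [ -z "$PE_HOSTFILE" ]; then
     if [ -d "$SITE_SCRATCH" ] && [ -w "$SITE_SCRATCH" ]; then
         cd "$SITE_SCRATCH"
     fi
 fi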

Tempdir

Standard environment variable

Even if, for most applications, a temporary directory on the same file system as the working directory is a good option, it is better to keep the two definitions separate. While for parallel jobs a home directory on a shared file system might be a good option, there is no reason to use that file system for large temporary files, which belong to a single job and are used only by that job.

A variable specifying the proper path for (largish) temporary files might be set in the job environment before the execution, and used in the job code to create those files. The name of the variable should be a standard feature for all batch systems; it would be part of the standard environment that every job can expect on the worker node.

The definition of a temporary directory separate from the workdir has not been extensively addressed so far. The name of this variable and how to set it (from the batch system, in the JobWrapper…) are still open issues.

To be approved

Here follows a summary of the proposals to be discussed and approved by the operations managers.

Workdir

Solutions A) and B) are equally valid; the choice is left to the system administrators.

Tmp dir

A variable called TMPDIR will be created within every job environment. The user's application code should open large temporary files under $TMPDIR/ instead of /tmp. Since jobs can create many big temporary files, system administrators can configure the LRMS to place the TMPDIRs on a file system with the required properties; a unique directory can be created in that space for every job, and the job environment will get a TMPDIR pointing to it. The LRMS can also be set up to clean up all the temporary data when the job exits.
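As an illustration from the user side, a job script could use the variable with a fallback, so that it keeps working on sites that do not (yet) define it; the file name intermediate.dat and its size are just examples:

 #!/bin/sh
 # Sketch of a user job: put large temporary files under $TMPDIR if defined,
 # falling back to /tmp otherwise.
 SCRATCH="${TMPDIR:-/tmp}"
 
 # Example: create a large intermediate file in the scratch area instead of
 # the (possibly shared) working directory.
 dd if=/dev/zero of="$SCRATCH/intermediate.dat" bs=1M count=1024
 
 # ... run the application on the intermediate file ...
 
 rm -f "$SCRATCH/intermediate.dat"

On the LRMS side, some batch systems already offer this kind of feature: for example, Torque's pbs_mom $tmpdir configuration directive and the tmpdir attribute of a Grid Engine queue both create a per-job temporary directory, export it as TMPDIR and remove it when the job ends. Whether and how such mechanisms can be used is of course up to the site.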


If operations managers agree with these proposals, they will be brought to technology providers as a requirement.

References

Comments from NGIs

Please provide your feedback here.

NGI_Ibergrid (G. Borges, take 1)

In almost all the surveys, application people always request: please get MPI properly working on the grid! While for sequential jobs it is OK to change the working dir to a scratch area, the situation is more unclear in MPI scenarios, since the MPI setup is, in 90% of the cases, based on shared homes. So, the middleware has to deal with two different use cases:

  • Sequential jobs that do not want to use shared homes due to performance issues;
  • Parallel jobs that, 90% of the time, do want to use shared homes;

AFAIK, customization scripts inside the JobWrappers are completely insensitive to the user jobs (i.e. they act in the same way for all users, VOs and applications). The hypothesis of implementing a "change dir" (to a scratch space) in the customization scripts executed by the JobWrapper could break the MPI execution, unless the JobWrapper is capable of recognizing the job type and acting accordingly. If I'm thinking right, I would not vote for this option as a valid one (as it is now), since it gives the message that the EGI infrastructure does not care about parallel processing. The adopted solution should be global, i.e. general enough for any kind of job type/application, and without the need to manually adjust the middleware behaviour to enable a specific job type to be executed.

(Tiziana): we should check if the prologue mechanism can be used for this: "The prologue is an executable run within the WMS job wrapper before the user job is started. It can be used for purposes ranging from application-specific checks that the job environment has been correctly set on the WN to actions like data transfers, database updates or MPI pre script. If the prologue fails the job wrapper terminates and the job is considered for resubmission."
(Stuart Purdie): Using the prologue would not be a good solution. It would require MPI users to use the prologue explicitly - and therefore the user has to handle the difference between MPI and serial jobs, rather than the software. Secondly, it means that the user has to know _where_ the correct location to put the files is. (This becomes an issue with MPI jobs that handle large files.) Rather than using the prologue and forcing all these issues on the end user, the middleware should have a neat way for system managers to specify this - it could be as simple as passing the number of cores used into the customisation points scripts for gLite. Site managers can then ignore it, or use that data to route job directories appropriately. Spurdie 12:00, 15 March 2011 (UTC)

NGI_Ibergrid (G. Borges, take 2)

I think that we all agree that option A) is the best one.

Option B) is not as good or as general, because it is too gLite-specific. Also, if implemented without proper care, the change of the working dir to a scratch area could break the MPI execution. A workaround for that precise problem would be to provide a proper template for the customization scripts, so that they recognize the different kinds of jobs/applications being executed and act accordingly. Alternatively, site admins enabling MPI must be warned/advised to configure MPI-START in such a way that parallel executions are copied to and executed in a shared filesystem available somewhere (either the homes or some other place), even though they are started on a scratch area.

Finally, the prologue/epilogue WMS scripts, besides also being gLite-specific, leave to the user the responsibility of setting up probes to check the environment for his parallel job execution. This is no general solution, since it depends on whether the user has the ability/know-how to implement such checks, which in the first place should be done by the infrastructure providers. I would tend not to suggest this option.