Condor Setup at Michigan
Overview
Condor is a batch system from the University of Wisconsin for running jobs on CPU farms and/or arbitrary groups of desktop machines.
Condor jobs are controlled by submitting a "job" file to Condor which indicates which program/script to run, where to write output, the job CPU requirements, and other configuration details. Condor runs your job in a queue based on your requirements.
Michigan has 2 Condor batch queues at present: umrocks and umopt1. The umrocks CPU farm has about 100 dual-core AMD Athlon CPUs and umopt1 has 8 dual-core AMD Opterons (and will be expanding in fall 2006). To use these queues, log onto the head node of the given system (umrocks or umopt1) and submit your Condor jobs there.
Some useful condor commands.
Here are the most common Condor commands. For a full list see the Condor Manuals.
$ condor_submit : submit job file to queue
$ condor_q: get list of jobs on the queue. Gives output:
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
82333.0 zarzhit 4/2 19:32 0+03:44:50 I 20 512.0 runAOD4421aJob 000
ST = flag indicating if job is idle (I), running (R), or held (H)
$ condor_q -analyze : report on availability of queue to run a job. If your job never runs, this will tell you why.
$ condor_rm : Remove a job from condor queue
$ condor_status: Gives list/status of machines in condor queue
$ condor_status -submitters: Gives list of condor users
$ condor_hold : put a job in held (suspended) state.
$ condor_release : release a held job
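The ST column in the condor_q listing can be tallied with a few lines of shell. The following is only a sketch: count_states is a hypothetical helper name, and it assumes the column layout shown in the sample output above, where ST is the sixth whitespace-separated field.

```shell
# count_states (hypothetical helper): tally idle/running/held jobs
# from condor_q-style output read on stdin.
# Assumes job lines start with an ID like 82333.0 and ST is field 6.
count_states() {
  awk '$1 ~ /^[0-9]+\.[0-9]+$/ { n[$6]++ }
       END { printf "idle=%d running=%d held=%d\n", n["I"], n["R"], n["H"] }'
}

# On a head node:  condor_q | count_states
# Here the sample line from above is fed in so the sketch runs anywhere.
printf '%s\n' \
  'ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD' \
  '82333.0 zarzhit 4/2 19:32 0+03:44:50 I 20 512.0 runAOD4421aJob 000' \
  | count_states
# prints: idle=1 running=0 held=0
```

If the idle count stays high for a long time, condor_q -analyze is the next thing to try.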
Condor job file examples (see also :
#
# Sample condor job file
#
Universe = vanilla
# These don't work on umrocks/umopt1
# Notification = Complete
# Notify_user = diehl@umich.edu
GetEnv = True
Executable = /bin/echo
Arguments = Hello World
Initialdir = /net/data08/diehl/condor
Output = condor$(Process).out
Log = condor$(Process).log
Error = condor$(Process).err
Queue
Here is the meaning of the flags:
- Universe
- vanilla is for any executable program; standard means program compiled with condor_compile (and which can be moved from one CPU to another while executing -- a feature not needed in our clusters).
- Notification
- Complete means notify the user when the job is complete. Does not work at UM.
- Notify_user
- Email address of user to notify.
- GetEnv
- Pass the user's environment variables to the Condor process.
- Executable
- Command or script for Condor to run.
- Arguments
- Parameters to be passed to the script
- Initialdir
- Directory where Condor runs the script and where the output files go.
- Output/Log/Error
- Names of the output, log, and error files, respectively. $(Process) inserts the Condor sub-job number into the file name, so that names stay unique when multiple jobs are submitted.
- Queue
- This tells Condor to submit the job to the queue. If a number N is supplied (e.g. Queue 5), N copies of the job are submitted, with $(Process) running from 0 to N-1.
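To illustrate how Queue and $(Process) work together, here is a minimal sketch that writes a job file for three sub-jobs and submits it only if condor_submit is available. The file name hello.job is an assumption made for illustration. Initialdir is left out, so the output files land in the directory you submit from.

```shell
# Write a job file that queues three sub-jobs.
# hello.job is a hypothetical file name, not one used on umrocks/umopt1.
cat > hello.job <<'EOF'
Universe   = vanilla
GetEnv     = True
Executable = /bin/echo
Arguments  = Hello from sub-job $(Process)
Output     = condor$(Process).out
Log        = condor$(Process).log
Error      = condor$(Process).err
Queue 3
EOF

# Submit only where Condor is installed, e.g. on a head node.
if command -v condor_submit >/dev/null 2>&1; then
  condor_submit hello.job
else
  echo "condor_submit not found; wrote hello.job only"
fi
```

With Queue 3, $(Process) takes the values 0, 1 and 2, so the sub-jobs write condor0.out, condor1.out and condor2.out (and likewise for the .log and .err files).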
Stupid condor problems
- Condor cannot write to AFS. Do not put your Output, Log, or Error files on AFS; otherwise your job will become Held and sit there indefinitely. The problem is that Condor does not get AFS tokens, due to some bug or other. Write your output to NFS (e.g. /data08).
- At Michigan, Condor email notification does not work, so you can leave out the "Notification" and "Notify_user" job options. The problem is that the Condor head nodes are not authorized to send email at Michigan.
- Presently (Sept-20-2006), the condor queue on umopt1 does not actually run jobs due to a configuration error.
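The AFS problem above can be caught before submitting with a small check on the job file. This is only a sketch: check_afs is a hypothetical helper, and it assumes AFS is mounted under /afs. It looks only at absolute paths given to Initialdir, Output, Log and Error, so it will not notice a job submitted from an AFS directory with relative file names.

```shell
# check_afs (hypothetical helper): fail if a job file's Initialdir,
# Output, Log or Error setting points into AFS.
# Assumes AFS is mounted under /afs; adjust the pattern if it is not.
check_afs() {
  if grep -Ei '^[[:space:]]*(initialdir|output|log|error)[[:space:]]*=[[:space:]]*/afs/' "$1"; then
    echo "WARNING: $1 writes to AFS; the job would end up Held" >&2
    return 1
  fi
  return 0
}
```

Typical use is check_afs myjob.job && condor_submit myjob.job, where myjob.job stands for your own job file.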
-- EdwardDiehl - 19 Sep 2006