Job Queuing at Michigan State
NOTE: THIS PAGE IS PRETTY MUCH OUT OF DATE
The queuing system at Michigan State has not yet been established.
Job Queuing at the University of Michigan
The University of Michigan hardware uses Condor (current version 7.0.5) as its batch queuing system. As the Tier2 and Trash/Tier3 hardware share the same head node (umopt1), we implement group quotas to maintain the correct allocation and use of the resource. Roughly speaking:
- 240 job slots have ONLY Tier2 access (including T2 Analysis jobs)
- 56 job slots have ONLY Trash/Tier3 access
- An additional 16 slots have 4 hr time limits (Medium queue)
- An additional 6 slots have 1 hr time limits (Short queue)
- An additional 3 slots have 30 minute time limits (Test queue)
- 1072 job slots have shared Tier2/Trash/Tier3 access
- 400 job slots allow access to any submitted job (including T2 Analysis jobs and other VOs)
- 32 job slots are dedicated T2 Analysis-job slots
- The pool as a whole is managed using the Condor "Group Quota" mechanism.
Condor jobs may only be submitted to our queues from the 3 interactive machines, umt3int01/02/03.
The usual "condor_submit" command for starting a job accesses the pool of slots with 3-day time limits. Jobs longer than this
can be run by special request only.
To access the medium, short and test queues, special commands have been created so that job owners do not need to remember
the required Condor job parameters. These commands are:
- condor_submit_medium
- condor_submit_short
- condor_submit_test
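As a rough illustration of how the time limits above map onto the submit commands, the following Python sketch picks a command from an expected run time. The function name and the idea of choosing by run time alone are assumptions made for illustration; the time limits and command names are the ones listed on this page, and the test queue in particular has very few slots, so the shortest queue is not always the best choice.

```python
# Hypothetical helper: map an expected run time to one of the submit
# commands named on this page. Limits come from the slot list above.
QUEUE_LIMITS_MINUTES = [
    (30, "condor_submit_test"),         # Test queue: 30 minute limit
    (60, "condor_submit_short"),        # Short queue: 1 hr limit
    (240, "condor_submit_medium"),      # Medium queue: 4 hr limit
    (3 * 24 * 60, "condor_submit"),     # Regular pool: 3-day limit
]


def choose_submit_command(expected_minutes):
    """Return the first command whose time limit covers expected_minutes."""
    if expected_minutes <= 0:
        raise ValueError("expected run time must be positive")
    for limit, command in QUEUE_LIMITS_MINUTES:
        if expected_minutes <= limit:
            return command
    # Longer than 3 days: not covered by any standard command.
    raise ValueError("jobs longer than 3 days need a special request")
```

For example, a job expected to take about 45 minutes would map to condor_submit_short under these assumptions.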
Condor jobs that try to write to a user's AFS space cannot be run. Instead, the Condor logs, at a minimum, must be placed in an
NFS directory. Send an email to aglt2-help@umich.edu if you require such a directory and do not already have one.
In general, when running a Condor job with input files, it is best to copy the files to a /tmp directory on the compute node where the job runs. At job completion, any directory the job created should be deleted. Automatic cleanup will be performed on any such directory that is at least 5 days old. For an example of how this can be done, see:
- Examples directory /afs/atlas.umich.edu/opt/localTools/condor
- Condor submit job file condorJob2
- Executable script examples athenaCondor.csh and athenaCondor.sh
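The copy-to-/tmp-then-clean-up pattern described above can also be sketched in Python. This is not the content of the example scripts listed here; it is a minimal sketch under the assumption that the job payload can be wrapped in a callable. The names run_with_scratch, work and the "condor_job_" prefix are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_scratch(input_files, work, scratch_root="/tmp"):
    """Copy input_files into a fresh scratch directory, call
    work(scratch_dir), and delete the directory afterwards."""
    # mkdtemp creates a uniquely named directory, so concurrent jobs
    # on the same compute node do not collide.
    scratch_dir = Path(tempfile.mkdtemp(prefix="condor_job_", dir=scratch_root))
    try:
        for src in input_files:
            shutil.copy(src, scratch_dir)
        return work(scratch_dir)
    finally:
        # Runs even if work() raises, so the node is left clean.
        shutil.rmtree(scratch_dir, ignore_errors=True)
```

The try/finally block is the important part: the scratch directory is removed whether the job succeeds or fails, rather than being left for the 5-day automatic cleanup.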
Group Quota Mechanism
If all the regular users are queuing more jobs than their quotas allow, then,
once equilibrium is reached, each is guaranteed at least as many processors
as the number in its quota. Processors still available after
all quotas are met are split among the active users in the usual, obscure
Condor fashion. This tends to favor small-quota groups.
Condor requires that the sum of all quotas is less than
or equal to the total number of available processors, so the quotas
below sum to fewer than that total, guaranteeing that all quotas can be met.
Note that the actual split is based upon Accounting Groups, which may
include multiple users as part of a single group. No sub-group quotas
are possible.
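The constraint that quotas must not exceed the available processors can be checked with a few lines of Python. This is only a sketch of the arithmetic, not of how Condor itself enforces it; the function name and the numbers in the usage note are hypothetical.

```python
def unallocated_slots(quotas, total_slots):
    """Return the slots left over after every group quota is met.

    quotas maps an accounting group name to its quota.
    Raises ValueError if the quotas cannot all be satisfied.
    """
    total_quota = sum(quotas.values())
    if total_quota > total_slots:
        raise ValueError(
            f"quotas sum to {total_quota}, but only {total_slots} slots exist"
        )
    return total_slots - total_quota
```

With made-up numbers, unallocated_slots({"groupA": 100, "groupB": 20}, 128) gives 8: those 8 slots are the surplus that is shared out among active users once every quota has been reached.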
In general, the policy we have adopted gives equal access to users who
regularly access the system. We have specified "regularly access" here
so that we can maximize the per-group quota by minimizing the
divisor in "available cpus / count of regular users". If you are not a
"regular user", but are about to become one, you can notify me of
the expectation and the quotas can quickly be modified.
Any user that is not a "regular user" falls into a "generic" category
where a small quota is maintained that ensures access.
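The "available cpus / count of regular users" rule can be written out as a short Python sketch. It assumes that the generic quota is set aside first and that any remainder from the division is simply left unassigned; both are assumptions made for illustration, as are the function and parameter names.

```python
def equal_share_quota(available_cpus, regular_groups, generic_quota):
    """Split the CPUs left after the generic quota equally among
    the regular groups, using integer division."""
    if regular_groups <= 0:
        raise ValueError("need at least one regular group")
    shareable = available_cpus - generic_quota
    if shareable < 0:
        raise ValueError("generic quota exceeds available cpus")
    return shareable // regular_groups
```

For instance, with 64 CPUs available to local users, a generic quota of 4 and three regular groups, each group would get 20 under these assumptions. Adding a fourth regular group would lower the share to 15, which is why the divisor is kept as small as possible.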
This "default" policy will be modified for short durations so that
results to be presented at conferences, or other special needs, can be
accommodated. Such needs should be brought up as much in advance as
possible.
The following tables are out of date. New groups have been added, old groups deleted, and users have been
moved around as needed.
| Accounting Group | Default Quota | Current Quota |
| Tier2 | 172 | 152 |
| WW Group | 20 | 20 |
| zhengguo/zhangpei | 0 | 20 |
| Higgs Group | 20 | 20 |
| Muon Group | 20 | 20 |
| Generic Group | 4 | 4 |
Group Membership:
- WW Group -- aww, xuefeili, daits, [zhengguo, zhangpei]
- Temporary zh Group -- zhengguo, zhangpei
- Higgs Group -- qianj, armbrusa, dharp, jpurdham, liuhao, strandbe, rthun
- Muon Group -- dslevin, diehl, desalvo
- Generic Group -- anyone not listed above
-- BobBall - 26 Jun 2007