This document helps UM Tier3 users diagnose problems with their Condor jobs.
Submitting Machines
Tier 3 users can submit their Condor jobs from the following machines:
umt3int01.aglt2.org (SL6; jobs run on the SL6 queue, which has 80 cores available)
umt3int02.aglt2.org / umt3int03.aglt2.org / umt3int04.aglt2.org / umt3int05.aglt2.org (SL7; jobs run on the SL7 queue, which has 1000 cores shared with the Tier2 and is allowed to overflow, i.e. use more cores, when the Tier2 is not busy)
To run jobs on either the SL6 or SL7 nodes, it is safest to first compile your code on a machine running the corresponding OS.
Queue Options
Users can submit their jobs to four different queues by using different submission commands:
condor_submit
(submits to the regular queue; jobs are allowed to run for up to 3 days)
condor_submit_short
(submits to the short queue, which has 12 reserved cores; jobs are allowed to run for up to 2 hours)
condor_submit_medium (jobs are allowed to run for up to 4 hours)
condor_submit_unlimited
(with 16 reserved cores; jobs are allowed to run with no time limit)
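As a sketch, a minimal submit description file for any of these queues might look like the following; the executable name run_analysis.sh and the output/error/log file names are hypothetical, not part of the site documentation:

```
# Minimal-sketch HTCondor submit description file (file names are hypothetical)
universe   = vanilla
executable = run_analysis.sh
output     = job.$(Cluster).$(Process).out
error      = job.$(Cluster).$(Process).err
log        = job.$(Cluster).log
queue 1
```

You would then pass this file to one of the commands above, e.g. condor_submit job.sub for the regular queue or condor_submit_short job.sub for the short queue.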
Resource Options
Users can choose different resource requirements by using different submission commands:
condor_submit_mcore
(jobs are allowed to use 8 CPUs per job)
condor_submit_lmem
(jobs are allowed to use up to 6 GB of memory per job)
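The wrapper commands above presumably insert the appropriate resource requests for you; in a plain submit description file the same limits could be expressed with the standard HTCondor request_cpus and request_memory commands, as in this sketch:

```
# Sketch of explicit resource requests in a plain submit file
# (the condor_submit_mcore / condor_submit_lmem wrappers presumably
# set equivalents of these for you)
request_cpus   = 8
request_memory = 6 GB
```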
For more details about the submission options, check the contents of the files /usr/local/bin/condor_submit* on any interactive machine.
Things to avoid during submission
Before submitting the job, make sure:
1) Jobs must be submitted from the user's NFS directory, such as /atlas/data19/username, not from the AFS home directory (/afs/atlas.umich.edu/home/username) or a Lustre directory (/lustre/umt3/). Otherwise, job submission will fail.
2) The job scripts should not read from or write to any files stored in AFS directories. When a job runs on a worker node, it does not carry the user's AFS token, which the worker node would need to read or write AFS. Jobs that try will end up in the hold state.
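The first rule can be sketched as a small pre-submission check; the function name check_submit_dir and its messages are made up for illustration, while the path prefixes come from the directories named above:

```shell
# Sketch of a pre-submission directory check (function name and messages
# are hypothetical): refuse AFS and Lustre paths, accept the NFS area.
check_submit_dir() {
  case "$1" in
    /afs/*|/lustre/*) echo "bad" ;;      # submission would fail from here
    /atlas/*)         echo "ok" ;;       # NFS, e.g. /atlas/data19/username
    *)                echo "unknown" ;;
  esac
}

check_submit_dir /atlas/data19/username   # prints: ok
check_submit_dir /lustre/umt3/somedir     # prints: bad
```

A real wrapper would run such a check on "$PWD" before invoking condor_submit.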
What if the jobs stay idle for hours after submission?
When this happens, it is very likely that your jobs have no matching resources because of the resource requirements specified in your script.
To debug this:
1) Get the job ID of your idle job (replace xuwenhao with your own username):
-bash-4.2$ condor_q -constraint ' Owner=="xuwenhao" && JobStatus==1 '
277490.430 xuwenhao 5/1 10:43 0+00:00:00 I 10 97.7 run 430
277490.431 xuwenhao 5/1 10:43 0+00:00:00 I 10 97.7 run 431
2) Analyze the job:
-bash-4.2$ condor_q -analyze 277490.430
WARNING: Be advised:
No resources matched request's constraints
The Requirements expression for your job is:
( ( TARGET.TotalDisk >= 21000000 && TARGET.IsSL7WN is true &&
TARGET.AGLT2_SITE == "UM" && ( ( OpSysAndVer is "CentOS7" ) ) ) ) &&
( TARGET.Arch == "X86_64" ) && ( TARGET.Disk >= RequestDisk ) &&
( TARGET.Memory >= RequestMemory ) && ( ( TARGET.HasFileTransfer ) ||
( TARGET.FileSystemDomain == MY.FileSystemDomain ) )
Suggestions:
Condition Machines Matched Suggestion
--------- ---------------- ----------
1 ( ( OpSysAndVer is "CentOS7" ) ) 0 MODIFY TO "SL7"
2 TARGET.AGLT2_SITE == "UM" 1605
3 ( TARGET.Memory >= 4096 ) 2020
4 TARGET.IsSL7WN is true 2684
5 TARGET.TotalDisk >= 21000000 2706
6 ( TARGET.Arch == "X86_64" ) 2710
7 ( TARGET.Disk >= 3 ) 2710
8 ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "local" ) ) 2710
The reason this job has no matching resources is that it requests the target (worker node) OS to be "CentOS7", but our cluster's worker nodes run either SL6 or SL7. Change this requirement in your job script.
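For example, if your submit file pins the OS with a requirements expression, the fix suggested by condor_q -analyze would be a one-line change; this is a sketch, and the exact attribute your submit file uses may differ:

```
# before (matches no machines on this cluster):
#   requirements = ( OpSysAndVer is "CentOS7" )
# after (matches the SL7 worker nodes):
requirements = ( OpSysAndVer is "SL7" )
```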
There are other condor_q options for checking the matching status of your jobs:
-bash-4.2$ condor_q -better-analyze:summary
-- Schedd: umt3int02.aglt2.org : <10.10.1.51:9618?...
Analyzing matches for 2659 slots
Autocluster Matches Machine Running Serving
JobId Members/Idle Requirements Rejects Job Users Job Other User Available Owner
------------- ------------ ------------ ----------- ----------- ---------- --------- -----
272670.343 116/0 1153 141 116/116 846 50 aaronsw
277490.0 432/432 0 0 0 0 0 xuwenhao
-bash-4.2$ condor_q -better-analyze:summary -constraint 'Owner=="xuwenhao"'
-- Schedd: umt3int02.aglt2.org : <10.10.1.51:9618?...
Analyzing matches for 2656 slots
Autocluster Matches Machine Running Serving
JobId Members/Idle Requirements Rejects Job Users Job Other User Available Owner
------------- ------------ ------------ ----------- ----------- ---------- --------- -----
277490.0 432/432 0 0 0 0 0 xuwenhao
-bash-4.2$
More details can be viewed with the command "condor_q -help".
--
WenjingWu - 01 May 2019