This document helps UM Tier3 users diagnose their Condor job problems.
Submitting Machines
UMATLAS Tier3 users can submit their Condor jobs from the following login machines:
umt3int01.aglt2.org
umt3int02.aglt2.org
umt3int03.aglt2.org
umt3int04.aglt2.org
umt3int05.aglt2.org
gc-9-36.aglt2.org
gc-7-31.aglt2.org
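For example, to log in to one of these machines from your own computer (replace username with your own account name):
ssh username@umt3int01.aglt2.org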
Before submitting jobs, please make sure that:
1) No input or output files are stored in your AFS home directory. Jobs do not have permission to access your AFS home directory from the execution worker nodes.
2) Output files are written to your NFS directory, such as /atlas/data19/username, instead of Lustre, because the small I/O from job updates imposes a performance penalty on the Lustre file system (see the sketch after this list).
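For example, a minimal submit description file that keeps all job output on NFS might look like the sketch below. The executable name run.sh and the directory layout are only placeholders; replace username with your own account name.
universe   = vanilla
executable = run.sh
# keep stdout/stderr/log on NFS, not AFS or Lustre
output     = /atlas/data19/username/logs/job.$(Cluster).$(Process).out
error      = /atlas/data19/username/logs/job.$(Cluster).$(Process).err
log        = /atlas/data19/username/logs/job.$(Cluster).$(Process).log
queue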
Queue Options
Users can submit their jobs to four different kinds of queues by using different submission commands (a usage example follows the list):
condor_submit
(submit to the regular queues; jobs are allowed to run for up to 3 days)
condor_submit_short
(submit to the short queue, which has 12 reserved cores; jobs are allowed to run for up to 2 hours)
condor_submit_medium
(jobs are allowed to run for up to 4 hours)
condor_submit_unlimited
(with 16 reserved cores, jobs are allowed to run with no time limit)
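These wrapper commands are used in place of condor_submit. Assuming they take the same submit description file as condor_submit (you can confirm this by inspecting the wrapper scripts mentioned below; myjob.sub is a placeholder), a typical invocation looks like:
-bash-4.2$ condor_submit_short myjob.sub     # job must finish within 2 hours
-bash-4.2$ condor_submit myjob.sub           # regular queue, up to 3 days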
Resource Options
Users can choose different resource requirements by using different submission commands (see the sketch after the list):
condor_submit_mcore
(jobs are allowed to use 8 CPUs per job)
condor_submit_lmem
(jobs are allowed to use up to 6 GB of memory per job)
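In plain HTCondor submit syntax, the equivalent requests look like the sketch below; the site wrapper scripts may set these attributes for you, so treat this only as an illustration of what such jobs are allowed to ask for.
request_cpus   = 8        # multi-core job (cf. condor_submit_mcore)
request_memory = 6144     # in MB; 6144 MB = 6 GB (cf. condor_submit_lmem)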
For more details about the submission options, check the contents of the wrapper scripts /usr/local/bin/condor_submit* on any of the interactive machines.
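For example:
-bash-4.2$ ls -1 /usr/local/bin/condor_submit*
-bash-4.2$ cat /usr/local/bin/condor_submit_mcore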
Things to avoid during submission
Before submitting the job, make sure:
1) User jobs must be submitted from your NFS directory, such as /atlas/data19/username, not from your AFS home directory (/afs/atlas.umich.edu/home/username) or a Lustre directory (/lustre/umt3/); otherwise, job submission will fail (see the example after this list).
2) The job scripts should not read from or write to any files stored in AFS directories. When a job runs on a worker node, it does not carry the user's AFS token that would allow the node to access AFS, so jobs that try to do so will go into the held state.
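A typical submission therefore starts from your NFS directory; the directory and file names below are only placeholders:
-bash-4.2$ cd /atlas/data19/username/myanalysis
-bash-4.2$ condor_submit myjob.sub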
What if the jobs stay idle for hours after submission?
When this happens, it is very likely that your jobs have no matching resources because of the resource requirements specified in your submit file.
To debug this:
1) Get the job id of your idle job. JobStatus==1 selects idle jobs; replace xuwenhao with your own username.
-bash-4.2$ condor_q -constraint ' Owner=="xuwenhao" && JobStatus==1 '
277490.430 xuwenhao   5/1 10:43  0+00:00:00 I  10  97.7 run 430
277490.431 xuwenhao   5/1 10:43  0+00:00:00 I  10  97.7 run 431
2) Analyze the job:
-bash-4.2$ condor_q -analyze 277490.430
WARNING: Be advised:
No resources matched request's constraints
The Requirements expression for your job is:
( ( TARGET.TotalDisk >= 21000000 && TARGET.IsSL7WN is true &&
TARGET.AGLT2_SITE == "UM" && ( ( OpSysAndVer is "CentOS7" ) ) ) ) &&
( TARGET.Arch == "X86_64" ) && ( TARGET.Disk >= RequestDisk ) &&
( TARGET.Memory >= RequestMemory ) && ( ( TARGET.HasFileTransfer ) ||
( TARGET.FileSystemDomain == MY.FileSystemDomain ) )
Suggestions:
Condition Machines Matched Suggestion
--------- ---------------- ----------
1 ( ( OpSysAndVer is "CentOS7" ) ) 0 MODIFY TO "SL7"
2 TARGET.AGLT2_SITE == "UM" 1605
3 ( TARGET.Memory >= 4096 ) 2020
4 TARGET.IsSL7WN is true 2684
5 TARGET.TotalDisk >= 21000000 2706
6 ( TARGET.Arch == "X86_64" ) 2710
7 ( TARGET.Disk >= 3 ) 2710
8 ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "local" ) )  2710
The reason this job has no matching resources is that it requires the target (worker node) OS to be "CentOS7", but our cluster worker nodes run either SL6 or SL7, so change this requirement in your job script.
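A sketch of the fix in the submit description file (your original requirements line may look different; the point is to request an OS value the worker nodes actually advertise, per the MODIFY TO "SL7" suggestion above):
# before: no machines match
# requirements = ( OpSysAndVer == "CentOS7" )
# after: matches the SL7 worker nodes
requirements = ( OpSysAndVer == "SL7" )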
There are some other condor_q options for checking the matching status of your jobs:
-bash-4.2$ condor_q -better-analyze:summary
-- Schedd: umt3int02.aglt2.org : <10.10.1.51:9618?...
Analyzing matches for 2659 slots
Autocluster Matches Machine Running Serving
JobId Members/Idle Requirements Rejects Job Users Job Other User Available Owner
------------- ------------ ------------ ----------- ----------- ---------- --------- -----
272670.343 116/0 1153 141 116/116 846 50 aaronsw
277490.0 432/432 0 0 0 0 0 xuwenhao
-bash-4.2$ condor_q -better-analyze:summary -constraint 'Owner=="xuwenhao"'
-- Schedd: umt3int02.aglt2.org : <10.10.1.51:9618?...
Analyzing matches for 2656 slots
Autocluster Matches Machine Running Serving
JobId Members/Idle Requirements Rejects Job Users Job Other User Available Owner
------------- ------------ ------------ ----------- ----------- ---------- --------- -----
277490.0 432/432 0 0 0 0 0 xuwenhao
-bash-4.2$
More details about these options can be viewed with "condor_q -help".
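For example, the full matchmaking analysis for a single job (here the job id from the example above) can be requested with:
-bash-4.2$ condor_q -better-analyze 277490.430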
--
WenjingWu - 01 May 2019