This document helps UM Tier3 users diagnose problems with their Condor jobs.

Submitting Machines

UM ATLAS Tier3 users can submit their Condor jobs from the following login machines (a login example follows the list):

umt3int01.aglt2.org

umt3int02.aglt2.org

umt3int03.aglt2.org

umt3int04.aglt2.org

umt3int05.aglt2.org

gc-9-36.aglt2.org

gc-7-31.aglt2.org
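
For example, to log in to one of these machines (replace username with your own account name):

-bash-4.2$ ssh username@umt3int01.aglt2.org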

Before submitting jobs, please make sure that:

1) None of your input or output files are stored in your AFS home directory. Jobs do not have permission to access your AFS home directory from the execution worker nodes.

2) Output files are written to your NFS directory, such as /atlas/data19/username, instead of Lustre, because the small I/O operations from job updates impose a performance penalty on the Lustre file system.

Queue Options

Users can submit their jobs to four different kinds of queues by using different submission commands (a minimal submit file example follows the list):

condor_submit (submits to the regular queues; jobs are allowed to run up to 3 days)

condor_submit_short (submits to the short queue, which has 12 reserved cores; jobs are allowed to run up to 2 hours)

condor_submit_medium (jobs are allowed to run up to 4 hours)

condor_submit_unlimited (has 16 reserved cores; jobs are allowed to run for unlimited time)
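
All of these commands take a standard HTCondor submit description file. As a minimal sketch (the file name myjob.sub and the script run.sh are hypothetical):

-bash-4.2$ cat myjob.sub
# Minimal submit description file (hypothetical example)
executable = run.sh
arguments  = $(Process)
output     = run.$(Process).out
error      = run.$(Process).err
log        = run.log
queue 2
-bash-4.2$ condor_submit_short myjob.sub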

Resource Options

Users can choose different resource requirements by using different submission commands:

condor_submit_mcore (jobs are allowed to use 8 CPUs per job)

condor_submit_lmem (jobs are allowed to use up to 6 GB of memory per job)

For more details about the submission options, check the contents of the files /usr/local/bin/condor_submit* on any of the interactive machines.
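
For example, to list the wrapper scripts and inspect one of them:

-bash-4.2$ ls /usr/local/bin/condor_submit*
-bash-4.2$ less /usr/local/bin/condor_submit_mcore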

Things to avoid during submission

Before submitting the job, make sure of the following (a combined example is shown after the list):

1) User jobs must be submitted from your NFS directory, such as /atlas/data19/username, not from your AFS home directory (/afs/atlas.umich.edu/home/username) or a Lustre directory (/lustre/umt3/). Otherwise, the job submission will fail.

2) The job scripts must not refer to (read or write) any files stored in AFS directories, because a job running on a worker node does not carry the user's AFS token that would allow the worker node to read/write the AFS directory. Otherwise, the jobs will end up in held status.
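
Putting both rules together, a safe workflow is to keep the submit file, job script, and all input/output in NFS and submit from there (the directory and file names below are placeholders):

-bash-4.2$ cd /atlas/data19/username/myjob
-bash-4.2$ condor_submit myjob.sub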

What if jobs stay idle for hours after submission?

When this happens, it is very likely that your jobs have no matching resources because of the resource requirements specified in your script.

To debug this:

1) Get the job ID of your idle job (replace xuwenhao with your own username):
-bash-4.2$ condor_q -constraint ' Owner=="xuwenhao" && JobStatus==1 '
277490.430 xuwenhao        5/1  10:43   0+00:00:00 I  10  97.7 run 430
277490.431 xuwenhao        5/1  10:43   0+00:00:00 I  10  97.7 run 431

2) Analyze the job:
-bash-4.2$ condor_q -analyze 277490.430

WARNING:  Be advised:
   No resources matched request's constraints

The Requirements expression for your job is:

    ( ( TARGET.TotalDisk >= 21000000 && TARGET.IsSL7WN is true &&
        TARGET.AGLT2_SITE == "UM" && ( ( OpSysAndVer is "CentOS7" ) ) ) ) &&
    ( TARGET.Arch == "X86_64" ) && ( TARGET.Disk >= RequestDisk ) &&
    ( TARGET.Memory >= RequestMemory ) && ( ( TARGET.HasFileTransfer ) ||
      ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )


Suggestions:

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ( OpSysAndVer is "CentOS7" ) )  0                   MODIFY TO "SL7"
2   TARGET.AGLT2_SITE == "UM"         1605                 
3   ( TARGET.Memory >= 4096 )         2020                 
4   TARGET.IsSL7WN is true            2684                 
5   TARGET.TotalDisk >= 21000000      2706                 
6   ( TARGET.Arch == "X86_64" )       2710                 
7   ( TARGET.Disk >= 3 )              2710                 
8   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "local" ) )
                                      2710                 

The reason this job has no matching resources is that it requests the target (worker node) OS to be "CentOS7", but our cluster worker nodes run either SL6 or SL7. Change this requirement in your job script accordingly.
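
For example, if your submit description file pins the OS explicitly, a change along these lines (your actual requirements expression may differ) makes the job match the SL7 worker nodes, following the MODIFY TO "SL7" suggestion above:

# Before: no machines match
requirements = ( OpSysAndVer == "CentOS7" )
# After: matches the SL7 worker nodes
requirements = ( OpSysAndVer == "SL7" )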

There are some other condor_q options for checking the matching status of your jobs:
-bash-4.2$ condor_q -better-analyze:summary 

-- Schedd: umt3int02.aglt2.org : <10.10.1.51:9618?...
Analyzing matches for 2659 slots
               Autocluster    Matches     Machine     Running     Serving            
 JobId        Members/Idle Requirements Rejects Job  Users Job  Other User Available Owner
------------- ------------ ------------ ----------- ----------- ---------- --------- -----
272670.343    116/0                1153         141     116/116        846        50 aaronsw
277490.0      432/432                 0           0           0          0         0 xuwenhao


-bash-4.2$ condor_q -better-analyze:summary -constraint 'Owner=="xuwenhao"'

-- Schedd: umt3int02.aglt2.org : <10.10.1.51:9618?...
Analyzing matches for 2656 slots
               Autocluster    Matches     Machine     Running     Serving            
 JobId        Members/Idle Requirements Rejects Job  Users Job  Other User Available Owner
------------- ------------ ------------ ----------- ----------- ---------- --------- -----
277490.0      432/432                 0           0           0          0         0 xuwenhao
-bash-4.2$ 

More details can be viewed with the command "condor_q -help".

-- WenjingWu - 01 May 2019