Setting up the Bypass Queue
Users requested a queue that bypasses the timed queues, i.e., a queue with no limits on it. The agreed-upon way to mark such a job was to add "IsBypassJob = True" to the condor submit script.
The logic for this queue closely follows that of the timed queues, so please read up on them first.
Plan for implementation
We will create a new macro/variable for the bypass jobs and modify the code of the timed queues so that none of their limits apply to bypass jobs.
As before, a job that does not explicitly name a queue is put in the short queue. In addition, a job that has two labels (bypass and long, for example) is treated as a bypass job only.
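The two rules above can be modelled in a few lines of Python. This is only an illustrative sketch and not part of the condor configuration: the function name effective_queues and the plain string labels are assumptions made here. The plan does not say what happens to a job labelled both medium and long, so the sketch simply keeps both labels.

```python
def effective_queues(requested):
    """Resolve the queue labels a job requests into the queues it counts as."""
    requested = set(requested)
    if "bypass" in requested:
        # A bypass label overrides every other label on the job.
        return {"bypass"}
    timed = requested & {"medium", "long"}
    # No explicit medium/long label means the job defaults to the short queue.
    return timed or {"short"}
```

For example, effective_queues(["bypass", "long"]) gives only the bypass queue, and effective_queues([]) gives the short queue.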
Implementation
Check out the cfengine SVN repository
Start by checking out the cfengine SVN repository as described here:
https://www.aglt2.org/wiki/bin/view/AGLT2/CfenginePolicyWorkflow
Implementing the bypass queue on the submit/login nodes
When we added the timed queues, we modified cfengine/masterfiles/stash/condor_msut3/58-host-submit.conf to look like this:
# Create some place holder variables to shorten up later code
IsUserMediumJob = (IsMediumJob =?= true) # Checks a job's class ads to see if IsMediumJob is set to true
IsUserLongJob = (IsLongJob =?= true) # Checks a job's class ads to see if IsLongJob is set to true
IsUserShortJob = ( !$(IsUserMediumJob) && !$(IsUserLongJob) ) # If a job is not a medium or long job then it is a short job (i.e. short job is the default)
# Hold a job if...
SYSTEM_PERIODIC_HOLD = ( JobStatus == 2 ) && (\ # the job is in the running state and...
( $(IsUserShortJob) && RemoteUserCpu > 3*60*60 ) || \ # the job is a short job that has used more than 3 hours of CPU time, or...
( $(IsUserMediumJob) && RemoteUserCpu > 2*24*60*60 ) || \ # the job is a medium job that has used more than 2 days of CPU time, or...
( $(IsUserLongJob) && RemoteUserCpu > 7*24*60*60 ) ) # the job is a long job that has used more than 7 days of CPU time
# Set the hold reason (so the user knows what went wrong)
SYSTEM_PERIODIC_HOLD_REASON = ifThenElse($(IsUserShortJob) && RemoteUserCpu > 3*60*60, "Job exceeded short queue time of 3 cpu hours.", \
ifThenElse($(IsUserMediumJob) && RemoteUserCpu > 2*24*60*60, "Job exceeded medium queue time of 48 cpu hours.", \
ifThenElse($(IsUserLongJob) && RemoteUserCpu > 7*24*60*60, "Job exceeded long queue time of 7 cpu days.", "Unknown periodic hold reason") ) )
# Remove a job that has been in the held state for more than 24 hours
SYSTEM_PERIODIC_REMOVE = ( JobStatus == 5) && (CurrentTime - EnteredCurrentStatus > 24*60*60)
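To create the bypass queue, we only need to change the first block of code to the following. A bypass job then matches none of the three timed-queue macros, so no clause of SYSTEM_PERIODIC_HOLD applies to it: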
IsUserBypassJob = (IsBypassJob =?= true)
IsUserMediumJob = (IsMediumJob =?= true && !$(IsUserBypassJob) )
IsUserLongJob = (IsLongJob =?= true && !$(IsUserBypassJob) )
IsUserShortJob = ( !$(IsUserMediumJob) && !$(IsUserLongJob) && !$(IsUserBypassJob) )
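The effect of the modified macros on the hold expression can be checked with a small Python model. This is a sketch under stated assumptions rather than anything condor runs: the job's class ad is represented as a plain dict, the function name should_hold is made up for illustration, and a missing RemoteUserCpu is treated as 0. The test "is True" mimics the =?= comparison, which is true only when the attribute is defined and equal to true.

```python
HOUR = 60 * 60
DAY = 24 * HOUR
RUNNING = 2  # JobStatus value for a running job


def should_hold(job_ad):
    """Model of SYSTEM_PERIODIC_HOLD with the bypass-aware macros."""
    is_bypass = job_ad.get("IsBypassJob") is True
    is_medium = job_ad.get("IsMediumJob") is True and not is_bypass
    is_long = job_ad.get("IsLongJob") is True and not is_bypass
    is_short = not (is_medium or is_long or is_bypass)
    cpu = job_ad.get("RemoteUserCpu", 0)  # assumption: undefined counts as 0
    return job_ad.get("JobStatus") == RUNNING and (
        (is_short and cpu > 3 * HOUR)
        or (is_medium and cpu > 2 * DAY)
        or (is_long and cpu > 7 * DAY)
    )
```

With this model, a running job with no label is held after more than 3 CPU hours, while a job that sets IsBypassJob is never held, even if it also sets IsLongJob.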
Implementing the limits on the worker nodes
When we added the timed queues, we modified cfengine/masterfiles/stash/condor_msut3/55-host-pe1950.conf to look like this:
# Check what type of job is asking to start, need to add TARGET. to the front because this job is coming from the submit node
IsUserMediumJob = (TARGET.IsMediumJob =?= True)
IsUserLongJob = (TARGET.IsLongJob =?= True)
IsUserShortJob = ( !$(IsUserMediumJob) && !$(IsUserLongJob) )
# start the job if...
START = $(START) && ( \ # the start condition from higher level config files is met and...
$(IsUserShortJob) || \ # the job is a short job or...
( (SlotID != 8) && $(IsUserMediumJob) ) || \ # the job is a medium job and this is not job slot number 8 or...
( (SlotID == 8) && $(IsUserLongJob) ) ) # the job is a long job and this is job slot number 8.
In addition, we made similar changes to cfengine/masterfiles/stash/condor_msut3/55-host-r610.conf
To create the bypass queue we will modify the first block of code in the PE 1950 config, as well as modify a small portion of the START macro.
IsUserBypassJob = (TARGET.IsBypassJob =?= True)
IsUserMediumJob = (TARGET.IsMediumJob =?= True && !$(IsUserBypassJob) )
IsUserLongJob = (TARGET.IsLongJob =?= True && !$(IsUserBypassJob) )
IsUserShortJob = ( !$(IsUserMediumJob) && !$(IsUserLongJob) && !$(IsUserBypassJob) )
# short and bypass jobs should run on all slots, medium jobs on all but one, and long jobs on only one.
START = $(START) && ( \
$(IsUserShortJob) || $(IsUserBypassJob) || \
( (SlotID != 8) && $(IsUserMediumJob) ) || \
( (SlotID == 8) && $(IsUserLongJob) ) )
For the R610 config, the changes are:
IsUserBypassJob = (TARGET.IsBypassJob =?= True)
IsUserMediumJob = (TARGET.IsMediumJob =?= True && !$(IsUserBypassJob) )
IsUserLongJob = (TARGET.IsLongJob =?= True && !$(IsUserBypassJob) )
IsUserShortJob = ( !$(IsUserMediumJob) && !$(IsUserLongJob) && !$(IsUserBypassJob) )
# short, medium, and bypass jobs should run on all slots, while long jobs should run on only one.
START = $(START) && $(START_14_SLOT_MSU_RESERVE_SHORT) && ( \
$(IsUserShortJob) || $(IsUserMediumJob) || $(IsUserBypassJob) || \
( (SlotID == 12) && $(IsUserLongJob) ) )
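The slot rules for both worker node types can be compared side by side with a small Python model. This is only a sketch: the names can_start and usable_slots are hypothetical, the job's class ad is a plain dict, and the inherited_start argument stands in for $(START) and, on the R610s, $(START_14_SLOT_MSU_RESERVE_SHORT), which are defined in other config files. The slot counts used in the examples below (8 and 12) are assumptions chosen only so that the reserved slots exist.

```python
def can_start(node_type, slot_id, job_ad, inherited_start=True):
    """Model of the START expression on a pe1950 or r610 worker node."""
    bypass = job_ad.get("IsBypassJob") is True
    medium = job_ad.get("IsMediumJob") is True and not bypass
    long_ = job_ad.get("IsLongJob") is True and not bypass
    short = not (medium or long_ or bypass)
    if node_type == "pe1950":
        # Medium jobs avoid slot 8; long jobs may only use slot 8.
        allowed = short or bypass or (slot_id != 8 and medium) or (slot_id == 8 and long_)
    elif node_type == "r610":
        # Medium jobs run anywhere; long jobs may only use slot 12.
        allowed = short or medium or bypass or (slot_id == 12 and long_)
    else:
        raise ValueError("unknown node type: " + node_type)
    return inherited_start and allowed


def usable_slots(node_type, job_ad, slot_count):
    """List the slot numbers (1..slot_count) on which the job may start."""
    return [s for s in range(1, slot_count + 1) if can_start(node_type, s, job_ad)]
```

For example, usable_slots("pe1950", {"IsLongJob": True}, 8) returns only slot 8, whereas adding IsBypassJob to the same job makes every slot usable.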
Create a test policy and test the queues
Create a test policy as described here:
https://www.aglt2.org/wiki/bin/view/AGLT2/CfenginePolicyWorkflow
Pick one login node, one pe1950 worker node, and one r610 worker node. Switch their policy to the test policy as described here:
https://www.aglt2.org/wiki/bin/view/AGLT2/CfenginePolicyWorkflow
Run a batch of test jobs to check that each part of the new queues is working. Check that the time limits are enforced, that the number of jobs that can run at once matches what you would expect for each queue, that bypass jobs are not held, and that held jobs are removed after 24 hours.
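One way to prepare the test jobs is to generate a submit description per queue with a short Python script. This is a sketch under several assumptions: the executable name burn_cpu.sh is hypothetical (the time limits are based on RemoteUserCpu, so a test job needs to consume CPU time rather than sleep), and the attributes are written with the "+" prefix that condor submit files use for custom job class ad attributes. Adjust both to match the submit scripts actually used on the cluster.

```python
# Job class ad attributes to set for each test case.
QUEUE_ATTRIBUTES = {
    "short": [],                                  # default queue, no label needed
    "medium": ["IsMediumJob"],
    "long": ["IsLongJob"],
    "bypass": ["IsBypassJob"],
    "bypass+long": ["IsBypassJob", "IsLongJob"],  # should behave as bypass only
}


def submit_description(attributes, executable="burn_cpu.sh"):
    """Build the text of a minimal submit description for one test job."""
    lines = ["executable = " + executable]  # hypothetical CPU-bound test script
    lines += ["+" + name + " = True" for name in attributes]
    lines.append("queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for label, attrs in QUEUE_ATTRIBUTES.items():
        print("# --- " + label + " ---")
        print(submit_description(attrs), end="")
```

Each generated description can be saved to its own file and submitted from the login node that is running the test policy.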
Update the T2 policy
Once you are done testing, switch the policy on the nodes you tested back to the T2 policy, then update the T2 policy as described here:
https://www.aglt2.org/wiki/bin/view/AGLT2/CfenginePolicyWorkflow
--
ForrestPhillips - 10 Sep 2019