Setup and Configuration of USATLAS Tier-3 Queue
Background
To assist Fred Luehring in testing remote Tier-3 job queues for USATLAS, we have set up a test PanDA queue at AGLT2. This queue is backed by 40 dedicated job slots (5 PE1950 nodes of 8 cores each). No other jobs will run on these slots, and the queue will provide jobs ONLY to those slots.
The setup has three primary pieces.
- Modifications to condor_config.local on the Worker Nodes
- Modifications to the job submission via condor.pm
- SchedConfig additions for the new queue
condor_config.local changes
On each Worker Node, the slot advertisements must state that the slots are Tier-3 Test Queue destinations, and must require that any job matched to them is a Tier-3 Test Job.
# Special setups for Tier3 Test nodes
#
IS_TIER3_TEST_QUEUE = True
STARTD_ATTRS = $(STARTD_ATTRS), IS_TIER3_TEST_QUEUE
IsT3TJob = ( TARGET.IsTier3TestJob =?= True )
# Basic START rule for the Tier3 Test Queue
#
START = $(IS_T2_USER_GRP)
# All slots will be for the testing
StartSLOTB = ( $(IsT3TJob) && (( RemoteWallClockTime < ( $(BaseTime) - $(OffTime)) ) =!= False) )
START = ((SlotID == 1) && ($(StartSLOTB)) && ($(START))) || \
etc
The five PE1950 nodes backing this queue at AGLT2 are:
- c-2-37.aglt2.org
- c-4-19.aglt2.org
- c-5-39.aglt2.org
- c-7-18.aglt2.org
- c-7-23.aglt2.org
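Once the worker nodes have been reconfigured, it can be useful to confirm that the expected slots advertise the new attribute. The following is a minimal shell sketch and not part of the original setup: the function names are hypothetical, and it assumes condor_status is available on a host that can query the pool's collector. The expected count of 40 comes from the 5 nodes of 8 cores each described above.

```shell
#!/bin/sh
# Sketch only: the helper names below are illustrative, not part of the AGLT2 setup.

# 5 PE1950 nodes x 8 cores each, as described in the text.
EXPECTED_SLOTS=$((5 * 8))

# ClassAd constraint that selects slots advertising the Tier-3 test attribute.
tier3_constraint() {
    printf '%s\n' 'IS_TIER3_TEST_QUEUE =?= True'
}

# Count non-empty lines on stdin (one slot name per line).
count_slots() {
    grep -c . || true
}

# Print the names of all slots that advertise IS_TIER3_TEST_QUEUE.
# Requires a working HTCondor client; nothing here calls it automatically.
list_tier3_slots() {
    condor_status -constraint "$(tier3_constraint)" -format '%s\n' Name
}

# Compare the number of advertised slots against the expected 40.
check_tier3_slots() {
    found=$(list_tier3_slots | count_slots)
    if [ "$found" -eq "$EXPECTED_SLOTS" ]; then
        echo "OK: $found Tier-3 test slots advertised"
    else
        echo "MISMATCH: found $found slots, expected $EXPECTED_SLOTS"
        return 1
    fi
}
```

Running check_tier3_slots after the startd daemons have picked up the new condor_config.local should report all 40 slots. A lower count suggests a node that has not yet been reconfigured.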
condor.pm file changes
The Condor job must advertise that it is a Tier-3 Test Job, and require that it match only with a Tier-3 Test Queue destination. All of these changes are within the "submit" subroutine of condor.pm.
my $doQue = $queue[0];
...
} elsif ($doQue eq 'Tier3Test') {
    $rc = print SCRIPT_FILE "+IsTier3TestJob = True\n";
    if (!$rc)
    {
        return $self->respond_with_failure_extension(
            "print: $script_filename: $!",
            Globus::GRAM::Error::TEMP_SCRIPT_FILE_FAILED());
    }
    push (@requirements, " IS_TIER3_TEST_QUEUE =?= True ");
NOTE: we are modifying the job requirements here. Doing this from within condor.pm means the structure of condor.pm must be changed to accommodate the additional requirement, because in the unmodified file the requirements have already been written out by the time this code runs. The original write is therefore commented out:
#
# Changed from original, BB 7/28/2012
#
### if (scalar(@requirements) > 0)
### {
### $rc = print SCRIPT_FILE "Requirements = ", join(" && ", @requirements) ."\n";
### }
### if (!$rc)
### {
### return $self->respond_with_failure_extension(
### "print: $script_filename: $!",
### Globus::GRAM::Error::TEMP_SCRIPT_FILE_FAILED());
### }
Then, after the queue-specific additions, the complete set of requirements can be written out successfully.
#
# Write out the deferred requirements statement
#
if (scalar(@requirements) > 0)
{
    $rc = print SCRIPT_FILE "Requirements = ", join(" && ", @requirements) ."\n";
}
if (!$rc)
{
    return $self->respond_with_failure_extension(
        "print: $script_filename: $!",
        Globus::GRAM::Error::TEMP_SCRIPT_FILE_FAILED());
}
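The ordering matters because the queue-specific clause has to be collected before the single Requirements line is emitted. The following shell sketch mirrors that logic outside of condor.pm so the effect on the submit file is easy to see. It is an illustration only: the function names are hypothetical, and the Arch clause is an assumed stand-in for whatever requirements condor.pm has already collected.

```shell
#!/bin/sh
# Sketch only: mimics the deferred write of the Requirements line.

# Append one clause to the accumulated requirements, joined by " && ".
add_requirement() {
    if [ -z "$REQUIREMENTS" ]; then
        REQUIREMENTS="$1"
    else
        REQUIREMENTS="$REQUIREMENTS && $1"
    fi
}

# Print the submit-file lines that would be generated for a given queue name.
write_submit_fragment() {
    queue_name="$1"
    REQUIREMENTS=""

    # Hypothetical clause standing in for the pre-existing requirements.
    add_requirement '(Arch == "X86_64")'

    # Queue-specific additions, as in the 'Tier3Test' branch of condor.pm.
    if [ "$queue_name" = "Tier3Test" ]; then
        echo '+IsTier3TestJob = True'
        add_requirement 'IS_TIER3_TEST_QUEUE =?= True'
    fi

    # Deferred write: emitted only after every clause has been collected.
    if [ -n "$REQUIREMENTS" ]; then
        echo "Requirements = $REQUIREMENTS"
    fi
}
```

Calling write_submit_fragment Tier3Test prints the +IsTier3TestJob attribute followed by one Requirements line that includes the Tier-3 clause. If the Requirements line were written before the queue check, as in the original ordering, the extra clause would never reach the submit file.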
Setting up the PanDA Resource within SchedConfig/AGIS
A new resource must be defined within AGIS; otherwise, standard production or analysis jobs will be mixed into the queue. To set this up, navigate to the AGIS page, logging in with your Grid Certificate. In the list of actions, find "Define PANDA Resource" and click that link.
- In the "PANDA Site" box, specify where you are (e.g., GreatLakesT2). A popup of possibilities appears once you begin typing.
- Enter the resource name in the "Name of PANDA resource" box, e.g., ANALY_AGLT2_TIER3_TEST
- Select GRID as the "Resource type" from the pull-up list.
- Click the "Check input data" button.
- If all is well, a new "Save PANDA Resource" button will appear. Click it and the resource will be created.
Note that the resource description states "No SW releases on this site". Hopefully this will not be an issue.
Setting up the Queue within SchedConfig/AGIS
The ATLAS PanDA queue set up for this endeavor is called ANALY_AGLT2_TIER3_TEST. It is a clone of the ANALY_AGLT2-condor queue, with some modifications. To set this up, navigate to the AGIS page, logging in with your Grid Certificate. In the list of actions, find "Define PANDA queue" and click that link.
- In the "Specify PANDA Queue" box, enter the name of the queue to clone
  - ANALY_AGLT2-condor for us
- Click the "clone" button to the right of this box.
- Change the PANDA Resource Name
- Change the PANDA Queue Name
  - ANALY_AGLT2_TIER3_TEST-condor
- Specify the Type of the queue via the pull-down choices
- Change the value of the jdl field
  - ANALY_AGLT2_TIER3_TEST-condor
  - Pass the desired queue name in the jdl sent to the gatekeeper via globus. This is the name that condor.pm will look for.
- Click the "Save and continue" button at the bottom of the form
Add the queue to the CE
The new queue has to be associated with a gatekeeper. Find the Computing Element screen (ours is AGLT2-CE-gate04.aglt2.org) in the AGIS pages.
- Click the "Add Queue" link in the lower right of the screen
- Fill in the Queue Name (ANALY_AGLT2_TIER3_TEST-condor)
- Leave the Max cputime and Max wallclocktime as zero
WARNING: THE STEPS ABOVE WERE A MISTAKE AND MUST BE UNDONE. How to undo them was not known at the time of writing.
The correct steps are:
- From the AGIS list of PandaSites, click your own (for example, AGLT2)
- Expand the PANDA items, and select your site name (e.g., GreatLakesT2)
  - Note: if clicking the link does not open it, right-click the link, copy it, open a new tab, and paste
- Click the name of the "Panda queue" in the SECOND COLUMN that you want to associate with a CE
- Click on the "Find and associate another CE/Queue" link
- Type in part of the name of your gatekeeper (e.g., gate04.aglt2) and click the "Search" button
- When the list comes up showing your CE, click the "default" entry to add the queue to the CE
There may be more than one way to find your site name in that first step.
Activating the PanDA Queue
Wait 20 minutes or so for all parameters set in SchedConfig/AGIS to take effect, then prepare the queue for software validation.
- curl -k --cert /tmp/x509up_u`id -u` 'https://panda.cern.ch:25943/server/controller/query?tpmes=setmanual&queue=ANALY_AGLT2_TIER3_TEST-condor'
- curl -k --cert /tmp/x509up_u`id -u` 'https://panda.cern.ch:25943/server/controller/query?tpmes=setbrokeroff&queue=ANALY_AGLT2_TIER3_TEST-condor'
Next, request that John Hover and/or Jose Caballero activate an APF factory for the queue.
Now, you must ask Alessandro De Salvo to send the tag synchronization for the Software Distributions. This will take 1-2 hours. The assumption is that the queue is backed by just a subset of the worker nodes at your site, separated out for this purpose. If that is not the case, you will need a unique gatekeeper and a full validation of the queue. Once that process is complete, the queue can be set online.
- curl -k --cert /tmp/x509up_u`id -u` 'https://panda.cern.ch:25943/server/controller/query?tpmes=setonline&queue=ANALY_AGLT2_TIER3_TEST-condor'
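The three curl calls above differ only in the tpmes value, so they can be wrapped in a small helper. The following is a sketch under the assumption that the controller URL and proxy location are exactly as shown in the commands above. The function names are hypothetical and are not part of any PanDA tooling.

```shell
#!/bin/sh
# Sketch only: illustrative wrappers around the curl calls shown above.

PANDA_CONTROLLER='https://panda.cern.ch:25943/server/controller/query'

# Location of the grid proxy for the current user, as used in the text.
proxy_path() {
    printf '/tmp/x509up_u%s\n' "$(id -u)"
}

# Build the controller URL for one action: setmanual, setbrokeroff or setonline.
panda_queue_url() {
    action="$1"
    queue="$2"
    case "$action" in
        setmanual|setbrokeroff|setonline) ;;
        *) echo "unknown action: $action" >&2; return 1 ;;
    esac
    printf '%s?tpmes=%s&queue=%s\n' "$PANDA_CONTROLLER" "$action" "$queue"
}

# Send the request, authenticating with the user's proxy certificate.
# Nothing in this file calls it automatically.
panda_queue_set() {
    url=$(panda_queue_url "$1" "$2") || return 1
    curl -k --cert "$(proxy_path)" "$url"
}
```

With these definitions, panda_queue_set setmanual ANALY_AGLT2_TIER3_TEST-condor followed by panda_queue_set setbrokeroff ANALY_AGLT2_TIER3_TEST-condor corresponds to the preparation step, and panda_queue_set setonline ANALY_AGLT2_TIER3_TEST-condor to the final activation once the tag synchronization is done.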
What to do if you mess up the AGIS configuration
The AGIS group has been very helpful. They can do far more than can be done from the SchedConfig/AGIS GUI. They can be contacted at the following email addresses:
- AlessandroDOTDiDOTGirolamoATcernDOTch
- atlas-adc-agisATcernDOTch
--
BobBall - 13 May 2013