Tier3ForBeginners < AGLT2

(22 Oct 2009, JamesKoll)

0. Useful webpages
1. Very easy job
2. Easy job with a loop, running in parallel on more than one machine
3. Longer jobs
4. Example of longer job
Scheduling downtime

0. Useful webpages

This page is a mix of tutorial and reference. If you only want a quick overview of useful Condor commands, search the web for "condor useful commands" and you will find plenty of pages. If I stumble across a particularly useful one, I will list it here.

The original condor manual: http://www.cs.wisc.edu/condor/manual/

1. Very easy job

Let's start: Log onto the tier by typing

ssh msu3.aglt2.org

It will then ask you for your password; type it in.
We are on the tier now, in our home directory.
I suggest creating a folder with

mkdir foldername

and changing into that folder with

cd foldername

Now open vi by typing

vi testScript.sh

(You can use emacs, too, but it will be a bit slower).
This will be our little script:

#!/usr/bin/env bash
echo Hello Sarah! >>test.txt

Very easy. We should make it executable by typing:

chmod +x testScript.sh

Now we have to create a cmd file for submission. Open vi again, this time for example by typing

vi testSarah.cmd

Our cmd file will also be very easy:

universe = vanilla
executable = testScript.sh
queue 1

Now we can submit it:

condor_submit testSarah.cmd

It will tell you that you submitted a job. Once the job has run, you will find the output file test.txt that we specified in our bash script, with "Hello Sarah!" written into it.

2. Easy job with a loop, running in parallel on more than one machine

The idea of the tier 3 is to save time by running a lot of jobs in parallel on a lot of machines.
We do that by writing a bash script that loops over the jobs to be done and produces a cmd file for each of them.
Of course we have to be careful about how we store the output, so that it doesn't get overwritten every time.

We need two scripts for this.
Let's start with the script that gives us an output into a file.
Let's call it "output.sh".

#!/usr/bin/env bash
X=${1:-1}
echo Sarah is number ${X} >>test${X}.txt

Our script is made to run with different values of X, writing different output into different text files
(for example, "Sarah is number 1" into the text file test1.txt).
The line

X=${1:-1}

means: take the first argument from the command line and, if there is none, use 1 as the default value.
A second argument would be read like this:

Y=${2:-1}

if you want 1 as your default value again.
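A quick way to convince yourself how these defaults behave is a small function that uses the same expansion. Inside a function, $1 and $2 refer to the function's own arguments, just as they refer to the script's arguments at the top level. The function name describe is made up for this illustration.

```shell
#!/usr/bin/env bash
# ${N:-default} means: use positional argument N, or "default" if that
# argument is missing or empty. "describe" is a made-up name.
describe() {
    local X=${1:-1}   # first argument, default 1
    local Y=${2:-1}   # second argument, default 1
    echo "X=${X} Y=${Y}"
}

describe          # no arguments: both defaults apply
describe 7        # only X given, Y falls back to 1
describe 7 3      # both given
```

Note that the colon in ${1:-1} makes an empty argument ("") count as missing, too.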
Now the other script, which will loop and produce a cmd file for EACH X.
Let's call it testSarah.sh

#!/usr/bin/env bash
for ((i=0;i<5;i+=1)); do
let X=0+$i
echo $X
cat >testLoop${X}.cmd <<EOF
universe = vanilla
executable = output.sh
arguments = ${X}
queue 1
EOF
condor_submit testLoop${X}.cmd
done

So now we can run the script (don't forget to make both scripts executable with "chmod +x"!)

./testSarah.sh

The output should look like:

0
Submitting job(s).
1 job(s) submitted to cluster 123365.
1
Submitting job(s).
1 job(s) submitted to cluster 123366.
2
Submitting job(s).
1 job(s) submitted to cluster 123367.
3
Submitting job(s).
1 job(s) submitted to cluster 123368.
4
Submitting job(s).
1 job(s) submitted to cluster 123369.

You should also find the different text files in your directory.
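As an aside, Condor can also number the jobs itself: with "queue N" in a single cmd file, the macro $(Process) takes the values 0 to N-1, one per job. The sketch below writes such a file. The file name testProcess.cmd is made up for this example, and the heredoc delimiter is quoted ('EOF') so that bash leaves $(Process) alone instead of trying to run a command called Process.

```shell
#!/usr/bin/env bash
# Write one cmd file that queues five jobs. Condor replaces $(Process)
# with 0, 1, 2, 3, 4, so output.sh receives the same arguments as in the
# loop above. testProcess.cmd is a hypothetical file name.
cat > testProcess.cmd <<'EOF'
universe = vanilla
executable = output.sh
arguments = $(Process)
queue 5
EOF

# Submit by hand once you are happy with the file:
#   condor_submit testProcess.cmd
```

All five jobs then share one cluster number, which makes them easier to find in condor_q.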

3. Longer jobs

1. Check status

If you submit a longer, more complicated job, you might want to check its status. You can see all the submitted jobs by typing:

condor_q

and only your jobs by typing:

condor_q -submitter yourUserName

2. Output/Error Messages

If you include in your cmd file

output=log.stdout

the output of your job (stdout) is written to the file log.stdout, and you can check it later.

error=log.stderr

gives you the error messages (stderr) if something goes wrong.

log=log

tells you what happened to your program on the tier (start time, end time, status).
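Put together, a cmd file with all three settings might look like the sketch below. It is written from a heredoc in the same way as in section 2. The function name write_logged_cmd, the file name testLog${X}.cmd and the use of X in the log file names are assumptions for this illustration, chosen so that parallel jobs do not overwrite each other's logs.

```shell
#!/usr/bin/env bash
# write_logged_cmd X: write testLog<X>.cmd, which keeps stdout, stderr and
# the Condor log of job X in separate files. All names are illustrative.
write_logged_cmd() {
    local X=${1:-1}
    cat > "testLog${X}.cmd" <<EOF
universe = vanilla
executable = output.sh
arguments = ${X}
output = log.stdout.${X}
error = log.stderr.${X}
log = log.${X}
queue 1
EOF
}

write_logged_cmd 1
# Then submit by hand:
#   condor_submit testLog1.cmd
```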

3. Killing jobs

To kill a job, first find out its ID by checking the status (see 3.1), then type

condor_rm jobID
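If you submit many jobs from a loop, it can help to note down the cluster numbers at submission time, so you do not have to search for them in condor_q later. The sketch below pulls the number out of the "submitted to cluster" line that condor_submit prints (see the sample output in section 2). The function name cluster_ids and the file name submitted.txt are made up for this example.

```shell
#!/usr/bin/env bash
# cluster_ids: read condor_submit output on stdin and print only the
# cluster numbers, one per line. Hypothetical helper.
cluster_ids() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\)\..*/\1/p'
}

# Possible use on the tier (not run here):
#   condor_submit testSarah.cmd | cluster_ids >> submitted.txt
#   condor_rm $(cat submitted.txt)
```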

4. Large storage requirement

If you are producing events, for example, and need more than the 10 GB in your home directory,
you can store your data in

/msu/data/dzero

5. The idea of scratch

If you are running a lot of jobs and they all write into files, there will be a lot of files open at once on one disk,
which is not desirable. So the idea is to copy everything you need for running your program (or make symbolic links to it)
into the scratch directory, which is on the local disk of the node your job is actually running on.
Once your program has finished, the script should copy everything you want to keep to your home directory or your data directory on /msu/data/dzero, as the scratch directory is removed once the job terminates.
The scratch directory can be accessed by the script that is called from the cmd file, but only as long as the job is running.
For example, you can change into this directory by:

cd $_CONDOR_SCRATCH_DIR
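The whole pattern (go to scratch, do the work there, copy the result away before the job ends) can be sketched as follows. The function name run_in_scratch, the file result.txt and the destination directory are placeholders. The fallback to a temporary directory is only there so that you can try the function by hand outside of a Condor job, where _CONDOR_SCRATCH_DIR is not set.

```shell
#!/usr/bin/env bash
# run_in_scratch DEST: work on the local scratch disk, then copy the
# result to DEST (give an absolute path). Names are placeholders.
run_in_scratch() {
    local dest=$1
    # Condor sets _CONDOR_SCRATCH_DIR inside a running job; outside of a
    # job we fall back to a fresh temporary directory.
    local scratch=${_CONDOR_SCRATCH_DIR:-$(mktemp -d)}
    (
        cd "$scratch" || exit 1
        echo "some output" > result.txt    # stand-in for the real program
        mkdir -p "$dest" && cp result.txt "$dest/"
    )
}

# Possible use in a job script (the directory is an example only):
#   run_in_scratch "$HOME/scratchTest"
```

The parentheses run the work in a subshell, so the cd does not change the directory of the calling script.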

6. Running an Executable that Needs ROOT

If you want to run the analysis package SingleTopRootAnalysis, you will need to have condor recognize what version of ROOT you are using. The easiest way to do this is:

Have a .bashrc file in your home directory on the tier 3 that does not mention any directories that are symbolic links. For example, use export ROOTSYS=/msu/opt/cern/root/v5.24.00_64/, NOT export ROOTSYS=/msu/cern/root/pro_64/.
Source the .bashrc file in the shell from which you will submit your job to condor.
Include the line getenv = true in your cmd file. This tells condor to use ALL of the environment variable settings in the current shell. To see all of these, you can type env in the shell window.

If you do these things, you should be able to run programs without having to set environment variables like ROOTSYS in your shell script later (which is the other option). This may not work well when using the scratch directory; updates to come. NOTE: To avoid sourcing your .bashrc file every time you log onto the tier 3, create a file in your home directory called ".bash_profile" and write one line in it: "source ~/.bashrc". This file is read automatically when you log in.
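If you are not sure whether a directory in your .bashrc goes through a symbolic link, you can compare it with its fully resolved form. The sketch below assumes that readlink -f is available (it is part of GNU coreutils on Linux), and the function name resolves_to_itself is made up. It does a plain string comparison, so a path written with ".." or doubled slashes is also reported as not link-free.

```shell
#!/usr/bin/env bash
# resolves_to_itself PATH: succeed if PATH contains no symbolic links,
# i.e. its canonical form (readlink -f) is the same string as PATH.
resolves_to_itself() {
    local path=${1%/}          # ignore one trailing slash, if any
    local real
    real=$(readlink -f "$path") || return 1
    [ "$real" = "$path" ]
}

# Check ROOTSYS if it is set in the current shell.
if [ -n "${ROOTSYS:-}" ]; then
    if resolves_to_itself "$ROOTSYS"; then
        echo "ROOTSYS is link-free: $ROOTSYS"
    else
        echo "ROOTSYS goes through a link; real path: $(readlink -f "$ROOTSYS")"
    fi
fi
```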

7. Putting the Analysis Package in Your Home Directory

To work with the single-top Monte Carlo on the tier 3, you will need the analysis package (SingleTopRootAnalysis) accessible to you there. Here are the steps to follow:

Make sure you have a home directory to work in (see Tom if you don't). You should have a local installation of this package in case you generate files with different settings or classes than other users.
Put a .bashrc file in your home directory containing lines like the following:

export ROOTSYS=/msu/opt/cern/root/v5.24.00_64/
export PATH=$ROOTSYS/bin:$PATH
export CVSROOT=:ext:YOURCERNNAME@atlas-sw.cern.ch:/atlascvs
export CVS_RSH=ssh
export LD_LIBRARY_PATH=$ROOTSYS/lib:$LD_LIBRARY_PATH:./lib/:

The first line points to a 64-bit version of ROOT
- Be sure to list this directory without any links (see section 6)
The third line will allow you to get the files from CVS using your CERN password
- YOURCERNNAME should be replaced with your CERN username
Source the .bashrc file
Type: cvs checkout groups/SingleTopRootAnalysis
Change into the SingleTopRootAnalysis directory and compile (make)

If you do all these things, you should have a working copy of the analysis package. If you run into trouble after getting the files from CVS, try sourcing the .bashrc file again in the new directory, or typing make clean and then compiling one more time.

8. Example for Running the Analysis

I used three files to run the analysis code. This is probably not the most efficient way, but it works. File 1, command file:

universe = vanilla
getenv = true 
executable = /home/jenny/groups/SingleTopRootAnalysis/scripts/run_1451_t3_2.sh
output=/home/jenny/groups/SingleTopRootAnalysis/log.stdout
error=/home/jenny/groups/SingleTopRootAnalysis/log.stderr
log=/home/jenny/groups/SingleTopRootAnalysis/log
queue 1

File 2, shell script to change to scratch, execute tcsh script, and move file to home directory:

#!/usr/bin/env bash
cd $_CONDOR_SCRATCH_DIR
#symbolic links to all the necessary stuff
ln -s /home/jenny/groups/SingleTopRootAnalysis/bin bin
ln -s /home/jenny/groups/SingleTopRootAnalysis/build build
ln -s /home/jenny/groups/SingleTopRootAnalysis/cmt cmt
ln -s /home/jenny/groups/SingleTopRootAnalysis/config config
ln -s /home/jenny/groups/SingleTopRootAnalysis/dep dep
ln -s /home/jenny/groups/SingleTopRootAnalysis/lib lib
ln -s /home/jenny/groups/SingleTopRootAnalysis/lists lists
ln -s /home/jenny/groups/SingleTopRootAnalysis/obj obj
ln -s /home/jenny/groups/SingleTopRootAnalysis/SingleTopRootAnalysis SingleTopRootAnalysis
ln -s /home/jenny/groups/SingleTopRootAnalysis/src src
ln -s /home/jenny/groups/SingleTopRootAnalysis/tmp tmp
ln -s /home/jenny/groups/SingleTopRootAnalysis/CVS CVS
ln -s /home/jenny/groups/SingleTopRootAnalysis/scripts/run_1451_t3_link.sh run_1451_t3_link.sh

./run_1451_t3_link.sh

mv SingleTop.5500.BTag.electron2.root /home/jenny/groups/SingleTopRootAnalysis/SingleTop.5500.BTag.electron2.root

File 3, tcsh script:

#!/bin/tcsh
bin/BTag_analysis.x -config config/SingleTop.BTag.14051.Recon.Electron.config -inlist lists/v14051/SingleTop.14051.5500.t3.list -hfile SingleTop.5500.BTag.electron2.root -MCatNLO  -bTagAlgo default

To run these files, you will need to have the last two files in the scripts directory. Make sure they are executable. Also, you will clearly have to change the directory names to match your own.
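The long list of ln -s lines in File 2 could also be written as a loop. The following is only a sketch of that idea, not what was actually run. The function name link_all is made up, and the last link (run_1451_t3_link.sh) stays a separate line because it lives in the scripts subdirectory.

```shell
#!/usr/bin/env bash
# link_all SRC NAME...: create a symbolic link in the current directory
# for each NAME under SRC. Hypothetical helper, not part of the package.
link_all() {
    local src=$1
    shift
    local name
    for name in "$@"; do
        ln -s "$src/$name" "$name" || return 1
    done
}

# Inside the job script this would replace the first twelve ln -s lines:
#   cd $_CONDOR_SCRATCH_DIR
#   link_all /home/jenny/groups/SingleTopRootAnalysis \
#       bin build cmt config dep lib lists obj SingleTopRootAnalysis src tmp CVS
```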

9. Example for Running Athena

AthenaOnCondor

4. Example of longer job

My example is the production of events (I want 100 million in the end) with the onetop generator.
I will produce them in packages of 500,000, so I will have to run 200 jobs,
each with a different random number seed.
The first script should look familiar:

#!/usr/bin/env bash
mkdir -p NTuplesTop

for ((i=0;i<200;i+=1)); do
let ix=$i
echo $ix
mkdir -p NTuplesTop/N_${ix}
cat >NTuplesTop/N_${ix}/top_${ix}.cmd <<EOF
universe = vanilla
executable = /home/sheim/100Mio/stnlo_ctq6.6_top_tier/run100Mio.sh
arguments = ${ix}
error=/home/sheim/100Mio/stnlo_ctq6.6_top_tier/NTuplesTop/N_${ix}/log.stderr
log=/home/sheim/100Mio/stnlo_ctq6.6_top_tier/NTuplesTop/N_${ix}/log
queue 1
EOF

condor_submit NTuplesTop/N_${ix}/top_${ix}.cmd
done

That is basically the same loop as before, with the exception that I got a little more careful and hardcoded absolute pathnames
instead of relying on the tier to figure them out.

In a second script (run100Mio.sh) I call "batch_gent.com" which is the script that calls my executable, and in which I can choose things like top/antitop,
random number seed, Tevatron or LHC setting...
Notice that I change into the scratch directory and create soft links to the executable (stnlo.a) and the other files needed. After my program has run with a certain random number seed (that is the command line argument), I do some more processing and then copy the final result over to my directories in /msu/data/dzero.

#!/usr/bin/env bash
#command line argument
ix=${1:-1}
#go to local scratch disk
cd $_CONDOR_SCRATCH_DIR
#symbolic links to all the necessary stuff
ln -s /home/sheim/100Mio/stnlo_ctq6.6_top_tier/inp_pdf inp_pdf
ln -s /home/sheim/100Mio/stnlo_ctq6.6_top_tier/grids grids
ln -s /home/sheim/100Mio/stnlo_ctq6.6_top_tier/ct6c0a.tbl ct6c0a.tbl 
ln -s /home/sheim/100Mio/stnlo_ctq6.6_top_tier/ct6c0b.tbl ct6c0b.tbl
ln -s /home/sheim/100Mio/stnlo_ctq6.6_top_tier/stnlo.a stnlo.a

#run batch_gent with command line argument
/home/sheim/100Mio/stnlo_ctq6.6_top_tier/batch_gent.com ${ix}

#just to make sure...
cp schan_51.check /home/sheim/100Mio/stnlo_ctq6.6_top_tier/NTuplesTop/N_${ix}/schan_51.check
#test if there is a NAN in schan_51.check; otherwise convert ntuples and copy to /msu/data...

if grep nan schan_51.check
then
echo invalid numbers nan > /home/sheim/100Mio/stnlo_ctq6.6_top_tier/NTuplesTop/N_${ix}/log.stderr

else
#convert ntuples to root
#export ROOTSYS=/cern/root
#export LD_LIBRARY_PATH=lib:$ROOTSYS/lib:/home/sheim/cernlib/2005/lib:/cern/2005/lib  
export ROOTSYS=/msu/data/dzero/stop/myroot/v5_12_00-gcc344-x86_64-opt
export LD_LIBRARY_PATH=lib:$ROOTSYS/lib:/home/sheim/cernlib/2005/lib:/cern/2005/lib  
/msu/data/dzero/stop/myroot/v5_12_00-gcc344-x86_64-opt/bin/h2root stre.ntuple
...
mkdir /msu/data/dzero/schannel/N_${ix}
#copy root files to /msu/data/...
cp stre.root /msu/data/dzero/schannel/N_${ix}/stre.root
...
#also copy schan_51.check file
cp schan_51.check /msu/data/dzero/schannel/N_${ix}/schan_51.check
fi
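With 200 jobs it is easy to miss one that failed. Since every successful job leaves a stre.root file in its own N_${ix} directory, a small loop can list the job numbers whose output is missing. This is a sketch under that assumption; the function name missing_jobs is made up.

```shell
#!/usr/bin/env bash
# missing_jobs DIR COUNT: print every job number from 0 to COUNT-1 for
# which DIR/N_<number>/stre.root does not exist. Hypothetical helper.
missing_jobs() {
    local dir=$1 count=$2 ix
    for ((ix = 0; ix < count; ix += 1)); do
        [ -f "$dir/N_${ix}/stre.root" ] || echo "$ix"
    done
}

# Possible use after all jobs have left the queue:
#   missing_jobs /msu/data/dzero/schannel 200
```

The numbers it prints are the seeds you would have to resubmit.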

Scheduling downtime

You should schedule a maintenance for your site in OIM: https://oim.grid.iu.edu/

You should email the panda shift list: pandashift@hepmail.uta.edu

You should open a ticket in RT: https://rt-racf.bnl.gov/rt/

curl -k --cert /tmp/x509up_u$(id -u) 'https://panda.cern.ch:25943/server/controller/query?tpmes=setoffline&queue=QUEUENAME'

-- JamesKoll - 22 Oct 2009 -- JennyHolzbauer - 25 Aug 2009 -- JennyHolzbauer - 19 Aug 2009 -- SarahHeim - 23 Feb 2009

Topic revision: r19 - 22 Oct 2009, JamesKoll
