0. Useful webpages
This webpage is a mix between a tutorial and reference. If you are just interested in a quick overview of useful condor commands,
just google "condor useful commands" and you will find tons of pages. If I stumble across a particularly useful one, I will list it here.
The original condor manual:
http://www.cs.wisc.edu/condor/manual/
1. Very easy job
Let's start: Log onto the tier by typing
ssh msu3.aglt2.org
It will then ask for your password; type it in.
We are on the tier now, in our home directory.
I suggest creating a folder with
mkdir foldername
and changing into that folder with
cd foldername
Now open vi by typing
vi testScript.sh
(You can use emacs, too, but it will be a bit slower).
This will be our little script:
#!/usr/bin/env bash
echo Hello Sarah! >>test.txt
Very easy.
We should make it executable by typing:
chmod +x testScript.sh
Now we have to create a cmd file for submission. Open vi again, this time for example by typing
vi testSarah.cmd
Our cmd file will also be very easy:
universe = vanilla
executable = testScript.sh
queue 1
Now we can submit it:
condor_submit testSarah.cmd
Condor will tell you that you submitted a job. Once the job has run, you will find the output file test.txt that we specified in our bash script, with "Hello Sarah!" written into it.
2. Easy job with a loop, running in parallel on more than one machine
The idea of the tier3 is to save time by running a lot of jobs in parallel on a lot of machines.
We do that by writing a bash script that loops over the jobs to be done and produces a cmd file for each of them.
Of course we have to be careful as to how to store the output so it doesn't get overwritten every time.
We need two scripts for this.
Let's start with the script that gives us an output into a file.
Let's call it "output.sh".
#!/usr/bin/env bash
X=${1:-1}
echo Sarah is number ${X} >>test${X}.txt
Our script is made for running with different X, producing different output into different text files
(e.g. "Sarah is number 1" into text file test1.txt).
The
X=${1:-1}
means: Take the first argument from the command line and if there is none, take as default value 1.
The second argument would be read like this:
Y=${2:-1}
if you want 1 as your default value again.
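As a quick check of this default-value syntax, the following sketch wraps the same expansion in a shell function. The function name show_args is made up for illustration; inside a function, $1 and $2 refer to the function's own arguments in the same way that they refer to the command line arguments in a script.

```shell
#!/usr/bin/env bash
# show_args is a hypothetical helper, used only to illustrate ${N:-default}.
show_args() {
    local X=${1:-1}   # first argument, or 1 if none was given
    local Y=${2:-1}   # second argument, or 1 if none was given
    echo "X=${X} Y=${Y}"
}

show_args        # no arguments: both defaults are used
show_args 7      # only the first argument is given
show_args 7 3    # both arguments are given
```

The three calls print "X=1 Y=1", "X=7 Y=1" and "X=7 Y=3".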
Now the other script, that will loop and produce a cmd file for EACH X.
Let's call it testSarah.sh
#!/usr/bin/env bash
for ((i=0;i<5;i+=1)); do
let X=0+$i
echo $X
cat >testLoop${X}.cmd <<EOF
universe = vanilla
executable = output.sh
arguments = ${X}
queue 1
EOF
condor_submit testLoop${X}.cmd
done
So now we can run the script (don't forget to make both scripts executable with "chmod +x"!)
./testSarah.sh
The output should look like:
0
Submitting job(s).
1 job(s) submitted to cluster 123365.
1
Submitting job(s).
1 job(s) submitted to cluster 123366.
2
Submitting job(s).
1 job(s) submitted to cluster 123367.
3
Submitting job(s).
1 job(s) submitted to cluster 123368.
4
Submitting job(s).
1 job(s) submitted to cluster 123369.
You should also find the different text files in your directory.
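The loop above writes one cmd file per job. As a possible alternative, Condor can queue several jobs from a single cmd file, replacing the $(Process) macro by the job number (0, 1, 2, ...) within the cluster. The sketch below only writes such a file; the file name testQueue.cmd is an assumption chosen for this example, and the submit step is left as a comment so the snippet can be tried without touching the queue.

```shell
#!/usr/bin/env bash
# Write a single cmd file that queues 5 jobs.
# The quoted 'EOF' stops bash from treating $(Process) as a command
# substitution, so Condor sees the macro unchanged.
cat >testQueue.cmd <<'EOF'
universe = vanilla
executable = output.sh
arguments = $(Process)
queue 5
EOF

cat testQueue.cmd
# To submit: condor_submit testQueue.cmd
```

With this file all five jobs share one cluster number, and output.sh receives 0 to 4 as its first argument, just as in the loop.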
3. Longer jobs
1. Check status
If you submit a longer, more complicated job, you might want to check the status.
You can see all the submitted jobs by typing:
condor_q
and only your jobs by typing:
condor_q -submitter yourUserName
2. Output/Error Messages
If you include in your cmd file
output=log.stdout
the output is written to the file log.stdout and you can check later.
error=log.stderr
gives you error messages, if something goes wrong.
log=log
tells you what happened to your program on the tier (start time, end time, status).
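Put together, a cmd file with all three settings might look like the sketch below. The $(Cluster) and $(Process) macros are expanded by Condor to the job's cluster and process numbers, so that several jobs do not overwrite each other's files. The file names job.* and testLog.cmd are assumptions for this example.

```shell
#!/usr/bin/env bash
# Write a cmd file that keeps stdout, stderr and the Condor log per job.
# The quoted 'EOF' keeps bash from expanding $(Cluster) and $(Process).
cat >testLog.cmd <<'EOF'
universe = vanilla
executable = testScript.sh
output = job.$(Cluster).$(Process).stdout
error = job.$(Cluster).$(Process).stderr
log = job.$(Cluster).log
queue 1
EOF

cat testLog.cmd
```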
3. Killing jobs
To kill a job, first find out its ID by checking the status (see 3.1), then type
condor_rm jobID
4. Large storage requirement
If you are producing events, for example, and need more than the 10 GB in your home directory,
you can store your data in
/msu/data/dzero
5. The idea of scratch
If you are running a lot of jobs and they are all writing to files, there will be a lot of files open on one disk,
which is not desirable. So the idea is to copy everything you need for running your program (or make a symbolic link to it)
into the scratch directory, which is on the local disk of the node your job is actually running on.
After the job is finished, it should copy everything you want to keep to your home directory or your data directory on /msu/data/dzero,
as the scratch directory is removed once the job terminates.
The scratch directory can be accessed by your script that is called by the cmd file, but only as long as the job is running.
For example, you can change into this directory by:
cd $_CONDOR_SCRATCH_DIR
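The whole pattern described in this section (work in scratch, then copy the results out) might look like the following sketch. RESULT_DIR is a hypothetical destination and the echo line is a placeholder for the real program. When run outside Condor, where _CONDOR_SCRATCH_DIR is not set, the script falls back to a temporary directory so that it can be tried interactively.

```shell
#!/usr/bin/env bash
# RESULT_DIR is a hypothetical destination; on the tier it would be a
# directory in your home area or under /msu/data/dzero.
RESULT_DIR=${RESULT_DIR:-$PWD/results}
mkdir -p "$RESULT_DIR"

# Inside a Condor job _CONDOR_SCRATCH_DIR is set; otherwise use a temp dir.
SCRATCH=${_CONDOR_SCRATCH_DIR:-$(mktemp -d)}
cd "$SCRATCH" || exit 1

# Placeholder for the real program: write output to the local disk.
echo "some result" >result.txt

# Copy what should be kept before the job ends and scratch is removed.
cp result.txt "$RESULT_DIR/result.txt"
cd "$OLDPWD" || exit 1
```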
6. Running an Executable that Needs ROOT
If you want to run the analysis package SingleTopRootAnalysis, you will need to have condor recognize what version of ROOT you are using. The easiest way to do this is:
- Have a .bashrc file in your home directory on the tier 3 that does not mention any directories that are symbolic links. For example, say export ROOTSYS=/msu/opt/cern/root/v5.24.00_64/, NOT export ROOTSYS=/msu/cern/root/pro_64/.
- Source the .bashrc file in the shell that you will submit your job to condor from
- Include the line getenv = true in your cmd file. This tells condor to use ALL of the environment variable settings in the current shell. To see all of these, you can type env in the shell window.
If you do these things, you should be able to run programs without having to specify environment variables like ROOTSYS in your shell script later (which is the other option). This may not work well when using the scratch directory; updates to come.
NOTE: To avoid sourcing your .bashrc file every time you log onto the tier 3, create a file in your home directory called ".bash_profile" and write one line in it: "source ~/.bashrc". This file will run immediately when you log in.
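To find the link-free form of a path for the .bashrc file, one option is readlink -f (part of GNU coreutils), which resolves every symbolic link in a path. The sketch below demonstrates this on a throwaway directory; the names real_root and pro are made up, and on the tier you would run readlink -f on the ROOT directory itself.

```shell
#!/usr/bin/env bash
# Build a throwaway directory containing a real directory and a link to it.
demo=$(mktemp -d)
mkdir "$demo/real_root"
ln -s "$demo/real_root" "$demo/pro"

# readlink -f prints the path with all symbolic links resolved.
resolved=$(readlink -f "$demo/pro")
echo "$resolved"

# This resolved path is the kind of value to use in: export ROOTSYS=...
```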
7. Putting the Analysis Package in Your Home Directory
To work with the single-top monte carlo within the tier 3, you will need the analysis package (SingleTopRootAnalysis) accessible to you there. Here are the steps to follow to make this happen:
- Make sure you have a home directory to work in (see Tom if you don't). You should have a local installation of this package in case you generate files with different settings or classes than other users.
- Put a .bashrc file in your home directory containing lines like the following:
export ROOTSYS=/msu/opt/cern/root/v5.24.00_64/
export PATH=$ROOTSYS/bin:$PATH
export CVSROOT=:ext:YOURCERNNAME@atlas-sw.cern.ch:/atlascvs
export CVS_RSH=ssh
export LD_LIBRARY_PATH=$ROOTSYS/lib:$LD_LIBRARY_PATH:./lib/:
- The first line will point towards a 64 bit version of ROOT
- Be sure to list this directory without any links (see section 6)
- The third line will allow you to get the files from CVS using your CERN password
- YOURCERNNAME should be replaced with your CERN username
- Source the .bashrc file
- Type: cvs checkout groups/SingleTopRootAnalysis
- Change into the SingleTopRootAnalysis directory and compile (make)
If you do all these things, you should have a working copy of the analysis package. If you run into trouble after getting the files from CVS, try sourcing the .bashrc file again in the new directory, or typing make clean and then compiling one more time.
8. Example for Running the Analysis
I used three files to run the analysis code. This is probably not the most efficient way, but it works.
File 1, command file:
universe = vanilla
getenv = true
executable = /home/jenny/groups/SingleTopRootAnalysis/scripts/run_1451_t3_2.sh
output=/home/jenny/groups/SingleTopRootAnalysis/log.stdout
error=/home/jenny/groups/SingleTopRootAnalysis/log.stderr
log=/home/jenny/groups/SingleTopRootAnalysis/log
queue 1
File 2, shell script to change to scratch, execute tcsh script, and move file to home directory:
#!/usr/bin/env bash
cd $_CONDOR_SCRATCH_DIR
#symbolic links to all the necessary stuff
ln -s /home/jenny/groups/SingleTopRootAnalysis/bin bin
ln -s /home/jenny/groups/SingleTopRootAnalysis/build build
ln -s /home/jenny/groups/SingleTopRootAnalysis/cmt cmt
ln -s /home/jenny/groups/SingleTopRootAnalysis/config config
ln -s /home/jenny/groups/SingleTopRootAnalysis/dep dep
ln -s /home/jenny/groups/SingleTopRootAnalysis/lib lib
ln -s /home/jenny/groups/SingleTopRootAnalysis/lists lists
ln -s /home/jenny/groups/SingleTopRootAnalysis/obj obj
ln -s /home/jenny/groups/SingleTopRootAnalysis/SingleTopRootAnalysis SingleTopRootAnalysis
ln -s /home/jenny/groups/SingleTopRootAnalysis/src src
ln -s /home/jenny/groups/SingleTopRootAnalysis/tmp tmp
ln -s /home/jenny/groups/SingleTopRootAnalysis/CVS CVS
ln -s /home/jenny/groups/SingleTopRootAnalysis/scripts/run_1451_t3_link.sh run_1451_t3_link.sh
./run_1451_t3_link.sh
mv SingleTop.5500.BTag.electron2.root /home/jenny/groups/SingleTopRootAnalysis/SingleTop.5500.BTag.electron2.root
File 3, tcsh script:
#!/bin/tcsh
bin/BTag_analysis.x -config config/SingleTop.BTag.14051.Recon.Electron.config -inlist lists/v14051/SingleTop.14051.5500.t3.list -hfile SingleTop.5500.BTag.electron2.root -MCatNLO -bTagAlgo default
To run these files, you will need to have the last two files in the scripts directory. Make sure they are executable. Also, you will clearly have to change the directory names to match your own.
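The long list of ln -s lines in file 2 could also be written as a loop. This is a sketch, not the script that was actually used: PACKAGE_DIR, WORK_DIR and the shortened directory list are assumptions, and the demo creates a throwaway package directory so that it runs anywhere.

```shell
#!/usr/bin/env bash
# For the demo, create a fake package directory with a few subdirectories.
# On the tier, PACKAGE_DIR would point at your SingleTopRootAnalysis copy.
PACKAGE_DIR=$(mktemp -d)
mkdir "$PACKAGE_DIR/bin" "$PACKAGE_DIR/config" "$PACKAGE_DIR/lists"

# Stand-in for the scratch directory.
WORK_DIR=$(mktemp -d)
cd "$WORK_DIR" || exit 1

# Link every needed directory into the working directory.
for name in bin config lists; do
    ln -s "$PACKAGE_DIR/$name" "$name"
done

ls -l
cd "$OLDPWD" || exit 1
```

In a real job the list after "for name in" would contain all the directories linked in file 2, and WORK_DIR would be $_CONDOR_SCRATCH_DIR.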
9. Example for Running Athena
AthenaOnCondor
4. Example of longer job
My example is the production of events (I want it to be 100 million in the end) with the onetop generator.
I will produce them in packages of 500,000, so I will have 200 jobs with different random number seeds
that I will have to run.
The first script should look familiar:
#!/usr/bin/env bash
mkdir -p NTuplesTop
for ((i=0;i<200;i+=1)); do
let ix=$i
echo $ix
mkdir -p NTuplesTop/N_${ix}
cat >NTuplesTop/N_${ix}/top_${ix}.cmd <<EOF
universe = vanilla
executable = /home/sheim/100Mio/stnlo_ctq6.6_top_tier/run100Mio.sh
arguments = ${ix}
error=/home/sheim/100Mio/stnlo_ctq6.6_top_tier/NTuplesTop/N_${ix}/log.stderr
log=/home/sheim/100Mio/stnlo_ctq6.6_top_tier/NTuplesTop/N_${ix}/log
queue 1
EOF
condor_submit NTuplesTop/N_${ix}/top_${ix}.cmd
done
That is basically the same loop as before, with the exception that I got a little more careful and hardcoded more absolute pathnames,
instead of relying on the tier to figure it out.
In a second script (run100Mio.sh) I call "batch_gent.com" which is the script that calls my executable, and in which I can choose things like top/antitop,
random number seed, Tevatron or LHC setting...
Notice that I change into the scratch directory and create soft links to the executable (stnlo.a) and other files needed.
After my program ran with a certain random number seed (that is the command line argument), I do some more processing,
and then copy the final result over to my directories in /msu/data/dzero.
#!/usr/bin/env bash
#command line argument
ix=${1:-1}
#go to local scratch disk
cd $_CONDOR_SCRATCH_DIR
#symbolic links to all the necessary stuff
ln -s /home/sheim/100Mio/stnlo_ctq6.6_top_tier/inp_pdf inp_pdf
ln -s /home/sheim/100Mio/stnlo_ctq6.6_top_tier/grids grids
ln -s /home/sheim/100Mio/stnlo_ctq6.6_top_tier/ct6c0a.tbl ct6c0a.tbl
ln -s /home/sheim/100Mio/stnlo_ctq6.6_top_tier/ct6c0b.tbl ct6c0b.tbl
ln -s /home/sheim/100Mio/stnlo_ctq6.6_top_tier/stnlo.a stnlo.a
#run batch_gent with command line argument
/home/sheim/100Mio/stnlo_ctq6.6_top_tier/batch_gent.com ${ix}
#just to make sure...
cp schan_51.check /home/sheim/100Mio/stnlo_ctq6.6_top_tier/NTuplesTop/N_${ix}/schan_51.check
#test if there is a nan in schan_51.check; otherwise convert ntuples and copy to /msu/data...
if grep nan schan_51.check
then
echo invalid numbers nan > /home/sheim/100Mio/stnlo_ctq6.6_top_tier/NTuplesTop/N_${ix}/log.stderr
else
#convert ntuples to root
#export ROOTSYS=/cern/root
#export LD_LIBRARY_PATH=lib:$ROOTSYS/lib:/home/sheim/cernlib/2005/lib:/cern/2005/lib
export ROOTSYS=/msu/data/dzero/stop/myroot/v5_12_00-gcc344-x86_64-opt
export LD_LIBRARY_PATH=lib:$ROOTSYS/lib:/home/sheim/cernlib/2005/lib:/cern/2005/lib
/msu/data/dzero/stop/myroot/v5_12_00-gcc344-x86_64-opt/bin/h2root stre.ntuple
...
mkdir /msu/data/dzero/schannel/N_${ix}
#copy root files to /msu/data/...
cp stre.root /msu/data/dzero/schannel/N_${ix}/stre.root
...
#also copy schan_51.check file
cp schan_51.check /msu/data/dzero/schannel/N_${ix}/schan_51.check
fi
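The nan test can be tried on its own. The sketch below uses grep -qi, which, unlike the plain "grep nan" in the script, also matches NaN and NAN and prints nothing to the screen. The function name has_nan and the two sample files are made up for the demonstration.

```shell
#!/usr/bin/env bash
# has_nan is a hypothetical helper: succeeds if the file contains "nan"
# in any capitalisation (nan, NaN, NAN).
has_nan() {
    grep -qi 'nan' "$1"
}

# Two throwaway sample files, one clean and one with an invalid number.
tmp=$(mktemp -d)
printf 'sigma = 1.23\n' >"$tmp/good.check"
printf 'sigma = NaN\n'  >"$tmp/bad.check"

for f in "$tmp/good.check" "$tmp/bad.check"; do
    if has_nan "$f"; then
        echo "$(basename "$f"): invalid numbers (nan)"
    else
        echo "$(basename "$f"): ok"
    fi
done
```

Note that this pattern matches the letters "nan" anywhere in a line, so it would also flag a file containing a word such as "finance"; for a file of pure numbers that is not a concern.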
Scheduling downtime
--
JamesKoll - 22 Oct 2009
--
JennyHolzbauer - 25 Aug 2009
--
JennyHolzbauer - 19 Aug 2009
--
SarahHeim - 23 Feb 2009