Overview of MSU's Tier3 HTCondor Setup

Types of Machines on HTCondor

HTCondor consists of three types of machines: submit nodes, worker nodes, and the central manager. Each of these machines serves a specific function in the distribution and running of jobs. While it is possible for a single machine to fill multiple roles, MSU's Tier3 is set up so that each machine fills only one of these roles.

  • Submit Nodes (heracles, alpheus, and maron): These machines maintain a queue of jobs that users submit. Users log in to these machines to submit said jobs.
  • Worker Nodes (cc-113-* and cc-115-*): These machines run the jobs that have been submitted to the submit machines. Each of these machines hosts some number of job slots (usually one per core).
  • Central Manager (msut3-condor, which is a virtual machine): There is only one of these and it is responsible for the matching process of finding a job slot for a job to run on.
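
A quick way to confirm which machines are filling each role is to query the pool with condor_status (these are standard condor_status options and can be run from any machine in the pool):

$ condor_status              # job slots advertised by the worker nodes
$ condor_status -schedd      # the submit nodes (schedds)
$ condor_status -negotiator  # the negotiator on the central manager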

ClassAds

Each job and job slot has some information attached to it that describes what resources it needs (in the case of jobs) or offers (in the case of job slots); this information is called its ClassAds. It can range from how much memory or storage a job needs or a job slot offers to what kind of operating system or programs a job requires or a job slot provides. These ClassAds are used by the Central Manager to match jobs to potential job slots.

To view a job's ClassAds, use the command:
$ condor_q -l <JobID>

To view a job slot's ClassAds, use the command:
$ condor_status -l <JobSlotID>
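
The full -l listing is long. To pull out just a few attributes, the standard -af (autoformat) option can be used; the attribute names below (RequestMemory, RequestCpus, Cpus, Memory) are common built-in ClassAd attributes and are shown only as examples:

$ condor_q -af ClusterId ProcId RequestMemory RequestCpus
$ condor_status -af Name Cpus Memory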

HTCondor Daemons

Each type of machine (submit, worker, and central manager) has specific HTCondor daemons running on it to handle different aspects of the job submission, job running, and job-to-job-slot matchmaking process. First, an overview of which daemons each type of machine needs to run:
  • Submit Nodes: condor_master, condor_schedd, condor_shadow
  • Worker Nodes: condor_master, condor_startd, condor_starter
  • Central Manager: condor_master, condor_collector, condor_negotiator
Each of these daemons handles a specific aspect of job management:
  • condor_master: This is the first process started when HTCondor starts. It is responsible for starting and maintaining the other HTCondor daemons.
  • condor_schedd: Manages the job queue on a specific submit node. When a job is submitted it is sent to this daemon.
  • condor_shadow: A process that is created for each running job. This process runs on the submit node from which the job was submitted and periodically queries the worker node for information about the job.
  • condor_collector: Responsible for collecting all the information (ClassAds) about the HTCondor pool.
  • condor_negotiator: Responsible for matchmaking within the HTCondor pool. Periodically starts a negotiation cycle, where it queries condor_collector for the current state of all resources in the pool, contacts each condor_schedd that has jobs requesting resources, and tries to match available resources (job slots) with those requests.
  • condor_startd: Represents a machine that can run jobs in the HTCondor pool. It advertises attributes of the machine and its job slots to the collector.
  • condor_starter: The daemon that actually spawns an HTCondor job on a given machine. It sets up the execution environment and monitors the job once it's running. When a job completes, the starter notices and sends any status information back to the relevant submit node before exiting.
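
To check which daemons a machine is configured to run, and which are actually running, the following standard commands can be run on that machine:

$ condor_config_val DAEMON_LIST
$ ps ax | grep condor_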

The Matchmaking Process

The matchmaking process is handled by the Central Manager (msut3-condor). When the Central Manager first boots, it has no information about the state of the submit nodes and their jobs or the worker nodes and their job slots. Likewise, whatever information it does have becomes outdated as new jobs are submitted or old jobs complete. So periodically the Central Manager collects information and uses it to match jobs to job slots. A somewhat in-depth description of this process follows:
  1. The collector daemon on the central manager queries the submit nodes and worker nodes for information about their jobs and job slots.
  2. The collector daemon receives that information and now has an up-to-date view of the HTCondor system/pool.
  3. The negotiator on the central manager looks at the ClassAds of the jobs and job slots, attempting to find job slots for every job to run on.
  4. Once the negotiator finds a match it:
    1. Tells the relevant job slot that it has been matched and should not start running any other jobs.
    2. Tells the scheduler on the submit node that it has found a job slot for job X to run on.
    3. The scheduler then sends the job info to the reserved job slot.
    4. A new starter process is spawned on the worker node and the job is started on the reserved job slot.
    5. The starter then tells the scheduler that the job is running and a shadow process is spawned on the submit node.
  5. At this point, the following is done periodically (every 15 minutes) to keep the information in the scheduler's job queue up to date.
    1. The shadow process for each job on the submit node asks the relevant starter process for information about the job it is running.
    2. The starter then sends that information back to the shadow process of the relevant job.
  6. Once a job completes, the starter tells the scheduler (or perhaps the shadow, which then tells the scheduler) and the job is removed from the queue.
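
If a job stays idle and never matches, the standard analysis options of condor_q can show which slot requirements failed and why:

$ condor_q -analyze <JobID>
$ condor_q -better-analyze <JobID>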

Configuration Files

When the condor_master starts up on a machine, it checks /etc/condor/config.d for configuration files. Any files in this directory are loaded in alphabetical order, so variables/macros in config files loaded later override or modify variables/macros set in config files loaded earlier.
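
To check the final value of a macro after all of the config files have been applied, and to see which file set it, condor_config_val can be used (CLAIM_WORKLIFE is just an example macro taken from the configs below):

$ condor_config_val CLAIM_WORKLIFE
$ condor_config_val -verbose CLAIM_WORKLIFE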

The different functions of HTCondor have been put in different config files to discretize the setup and allow more customization of each machine. The config files used at the time of writing are: 10-site.conf, 20-job-rules.conf, 25-user-rules.conf, 50-host-defs.conf, 55-host.conf, 58-host-central.conf, and 58-host-submit.conf.

Each machine includes some fraction of the above files in its /etc/condor/config.d/ directory.
  • All machines contain 10-site.conf, 20-job-rules.conf, 25-user-rules.conf, and 50-host-defs.conf.
  • Only submit nodes contain 58-host-submit.conf.
  • Only the central manager contains 58-host-central.conf.
  • Only worker nodes contain 55-host.conf.

10-site.conf

This config defines the specifics of the site, i.e., MSU Tier3. It is on every machine.

# 10-site.conf

# MSU Tier3 Pool with msu-condor Collector

# Pool Info
#
COLLECTOR_NAME = MSU Tier3
CONDOR_ADMIN = rrdrake@pa.msu.edu
CONDOR_DEVELOPERS = NONE
CONDOR_DEVELOPERS_COLLECTOR = NONE
CONDOR_HOST = msut3-condor.msulocal
CONDOR_IDS = 602.602

# Basic Network Info
#
FILESYSTEM_DOMAIN = msulocal
UID_DOMAIN = msulocal
USE_AFS = False
USE_NFS = True

ENABLE_IPV6 = False
ENABLE_IPV4 = True

# Permissions
#
ALLOW_WRITE = *.msulocal
SEC_ADMINISTRATOR_AUTHENTICATION = PREFERRED
SEC_CLIENT_AUTHENTICATION = OPTIONAL
ALLOW_ADMINISTRATOR = $(CONDOR_HOST), condor@msulocal/$(IP_ADDRESS)

#
# Security issue
SENDMAIL=/usr/sbin/sendmail

20-job-rules.conf

This config defines the default behavior for running jobs, as well as the concurrency limits. It is on every machine.

# 20-job-rules.conf

# Defaults for Jobs
#
START = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE = False
CLAIM_WORKLIFE = 1200
PREEMPTION_REQUIREMENTS = False
PREEMPTION_RANK = 0

# Queries
#
CONDOR_Q_ONLY_MY_JOBS = False
CONDOR_Q_DASH_BATCH_IS_DEFAULT = False

# Limits
#
# These are followed by default by condor_drain
# No known way to make condor_drain wait indefinitely
#  (as condor_off -peaceful does); set long retirement time
MaxJobRetirementTime = (240 * $(HOUR))
#MaxVacateTime = (120 * $(HOUR))

# Job Wrapper
#
# Set a script to wrap job in
# Allows limits etc. to be applied to job
USER_JOB_WRAPPER = /etc/condor/user_job_wrapper.sh

# Priorities
#
# Set default prio factor larger than 1, so that we can
# manually improve a user's priority compared to default
DEFAULT_PRIO_FACTOR = 40

# Concurrency Limits
#
# Disk areas.  Each given 10000 units, jobs should require more or less
#   of the resource as determined based on tuning.  100 units (resulting
#   in 100 jobs at once) is a starting point.
#
DISK_T3FAST1_LIMIT = 10000
DISK_T3FAST2_LIMIT = 10000
DISK_T3FAST3_LIMIT = 10000
DISK_T3FAST4_LIMIT = 10000
DISK_T3WORK1_LIMIT = 10000
DISK_T3WORK2_LIMIT = 10000
DISK_T3WORK3_LIMIT = 10000
DISK_T3WORK4_LIMIT = 10000
DISK_T3WORK5_LIMIT = 10000
DISK_T3WORK6_LIMIT = 10000
DISK_T3WORK7_LIMIT = 10000
DISK_T3WORK8_LIMIT = 10000
DISK_MARTIN_LIMIT = 10000
DISK_XROOTD_LIMIT = 10000
DISK_HOME_LIMIT = 10000

APPEND_RANK = 10000 * ((SlotID==13) || (SlotID==14))
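
A job opts into one of these limits with the concurrency_limits command in its submit file (standard HTCondor submit syntax; the limit name is the config macro name without the _LIMIT suffix, and the number is how many units one job consumes). A minimal, illustrative sketch for a job working against the t3work1 disk area; the executable name is a placeholder:

# excerpt of a submit description file (illustrative)
executable         = myjob.sh
concurrency_limits = DISK_T3WORK1:100
queue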

25-user-rules.conf

This config defines special users and is probably deprecated. It is on every machine.

# 25-user-rules.conf

# User Rules
#
BULK_USER   =   (( Owner == "trocsanyi" ) || \
                 ( Owner == "samgrid" )   || \
                 ( Owner == "yuntse" ))

NOT_BULK_USER = ( $(BULK_USER) == False )

50-host-defs.conf

This config defines the different slot types based on number of CPUs and amount of memory. It also defines the default daemons for the machines to run (it basically sets them all up as worker nodes) and the location where temporary files are stored. It is on every machine.

# 50-host-defs.conf

# Default host configuration

# Where temporary files should be stored
#
LOCAL_DIR=/var/
EXECUTE=/tmp/condor/execute
# Daemons to run
#
DAEMON_LIST = MASTER, STARTD
START_DAEMONS = True

# Networking
#
BIND_ALL_INTERFACES = False
NETWORK_INTERFACE = 10.10.*

# Our standard slot types
#
SLOT_TYPE_1 = cpus=1 memory=1024
SLOT_TYPE_2 = cpus=1 memory=2048
SLOT_TYPE_3 = cpus=1 memory=4096
SLOT_TYPE_4 = cpus=1 memory=8192

# SlotID based start rules
#
START_12_SLOT_MSU_RESERVE = \
  ( \
    (( SLOTID == 1  ) && TRUE ) || \
    (( SLOTID == 2  ) && TRUE ) || \
    (( SLOTID == 3  ) && TRUE ) || \
    (( SLOTID == 4  ) && TRUE ) || \
    (( SLOTID == 5  ) && TRUE ) || \
    (( SLOTID == 6  ) && TRUE ) || \
    (( SLOTID == 7  ) && TRUE ) || \
    (( SLOTID == 8  ) && TRUE ) || \
    (( SLOTID == 9  ) && $(NOT_BULK_USER)) || \
    (( SLOTID == 10 ) && $(NOT_BULK_USER)) || \
    (( SLOTID == 11 ) && $(NOT_BULK_USER)) || \
    (( SLOTID == 12 ) && $(NOT_BULK_USER)) \
  )

START_14_SLOT_MSU_RESERVE_SHORT = \
  ( \
    (( SLOTID == 1  ) && TRUE ) || \
    (( SLOTID == 2  ) && TRUE ) || \
    (( SLOTID == 3  ) && TRUE ) || \
    (( SLOTID == 4  ) && TRUE ) || \
    (( SLOTID == 5  ) && TRUE ) || \
    (( SLOTID == 6  ) && TRUE ) || \
    (( SLOTID == 7  ) && TRUE ) || \
    (( SLOTID == 8  ) && TRUE ) || \
    (( SLOTID == 9  ) && $(NOT_BULK_USER)) || \
    (( SLOTID == 10 ) && $(NOT_BULK_USER)) || \
    (( SLOTID == 11 ) && $(NOT_BULK_USER)) || \
    (( SLOTID == 12 ) && $(NOT_BULK_USER)) \
  )

55-host.conf

Defines what kinds of slots are on each worker node. It also controls how many bypass, short, medium, and long jobs can run on the worker node.

Config specific to the PE 1950s:
# 55-host.conf

# from 55-host-pe1950.conf

NUM_CPUS = 8
MEMORY = 16384 + 1024
NUM_SLOTS_TYPE_1 = 5
NUM_SLOTS_TYPE_3 = 3

IsUserBypassJob = (TARGET.IsBypassJob =?= True)
IsUserMediumJob = (TARGET.IsMediumJob =?= True && !$(IsUserBypassJob) )
IsUserLongJob   = (TARGET.IsLongJob   =?= True && !$(IsUserBypassJob) )
IsUserShortJob  = ( !$(IsUserMediumJob) && !$(IsUserLongJob) && !$(IsUserBypassJob) )

# short jobs should run on all slots, medium jobs all but one, and long jobs only one.
START = $(START) && ( \
  $(IsUserShortJob) || $(IsUserBypassJob) ||  \
  ( (SlotID != 8) && $(IsUserMediumJob) ) || \
  ( (SlotID == 8) && $(IsUserLongJob) ) )

Config specific to the R610s:
# 55-host.conf

# from 55-host-r610.conf

NUM_CPUS = 12
NUM_SLOTS_TYPE_3 = 12
MEMORY = $(NUM_CPUS) * 4096

IsUserBypassJob = (TARGET.IsBypassJob =?= True)
IsUserMediumJob = (TARGET.IsMediumJob =?= True && !$(IsUserBypassJob) )
IsUserLongJob   = (TARGET.IsLongJob   =?= True && !$(IsUserBypassJob) ) 
IsUserShortJob  = ( !$(IsUserMediumJob) && !$(IsUserLongJob) && !$(IsUserBypassJob) )

# short and medium jobs should run on all slots, while long jobs should run on only one.
START = $(START) && $(START_14_SLOT_MSU_RESERVE_SHORT) && ( \
  $(IsUserShortJob) || $(IsUserMediumJob) || $(IsUserBypassJob) || \
  ( (SlotID == 12) && $(IsUserLongJob) ) )
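
Judging from the job attributes these START expressions test (IsBypassJob, IsMediumJob, IsLongJob), a user selects a queue by setting the corresponding custom attribute in the submit file with the standard "+" syntax; a job that sets none of them is treated as a short job. An illustrative sketch (the executable name is a placeholder):

# excerpt of a submit description file (illustrative)
executable   = myjob.sh
+IsMediumJob = True
queue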

58-host-central.conf

Only the central manager has this config. It simply changes the list of daemons to run on the machine to that of a central manager and turns off the ability to run jobs.

# 58-host-central.conf

DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
START = False
SEC_CLIENT_AUTHENTICATION = Never

58-host-submit.conf

Only submit nodes have this config. It simply changes the list of daemons to run on the machine to that of a submit node and turns off the ability to run jobs. It also sets up the time limits for the bypass, short, medium, and long queues.

# 58-host-submit.conf

DAEMON_LIST = MASTER, SCHEDD
START = False

IsUserBypassJob = (IsBypassJob =?= true)
IsUserMediumJob = (IsMediumJob =?= true && !$(IsUserBypassJob) )
IsUserLongJob = (IsLongJob =?= true && !$(IsUserBypassJob) )
IsUserShortJob = ( !$(IsUserMediumJob) && !$(IsUserLongJob) && !$(IsUserBypassJob) )


SYSTEM_PERIODIC_HOLD = ( JobStatus == 2 ) && ( \
                        ( $(IsUserShortJob) && RemoteUserCpu > 3*60*60 ) || \
                        ( $(IsUserMediumJob) && RemoteUserCpu > 2*24*60*60 ) || \
                        ( $(IsUserLongJob) && RemoteUserCpu > 7*24*60*60 ) )

SYSTEM_PERIODIC_HOLD_REASON = ifThenElse($(IsUserShortJob) && RemoteUserCpu > 3*60*60, "Job exceeded short queue time of 3 cpu hours.", \
                                ifThenElse($(IsUserMediumJob) && RemoteUserCpu > 2*24*60*60, "Job exceeded medium queue time of 48 cpu hours.", \
                                ifThenElse($(IsUserLongJob) && RemoteUserCpu > 7*24*60*60, "Job exceeded long queue time of 7 cpu days.", "Unknown periodic hold reason") ) )

SYSTEM_PERIODIC_REMOVE = ( JobStatus == 5) && (CurrentTime - EnteredCurrentStatus > 24*60*60)
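
When one of these expressions puts a job on hold, the recorded reason can be inspected with the standard condor_q options:

$ condor_q -hold
$ condor_q -af ClusterId ProcId HoldReason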

Commands to use and configure HTCondor

The commands to use HTCondor are described here. Commands for configuring and restarting HTCondor follow.

condor_reconfig

Use the "condor_reconfig" command to reconfigure all of the HTCondor daemons on the current machine.
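
For example (the -name and -all forms are standard options of the HTCondor administrative tools; <machine> stands for any host in the pool):

$ condor_reconfig                    # reconfigure the daemons on this machine
$ condor_reconfig -name <machine>    # reconfigure a specific machine
$ condor_reconfig -all               # reconfigure every machine in the pool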

condor_restart

If any of the following HTCondor variables/macros are edited, then a restart of HTCondor using "condor_restart" is necessary.
  • DAEMON_LIST
  • BIND_ALL_INTERFACES
  • FetchWorkDelay
  • MAX_NUM_CPUS
  • MAX_TRACKING_GID
  • MIN_TRACKING_GID
  • NETWORK_INTERFACE
  • NUM_CPUS
  • PREEMPTION_REQUIREMENTS_STABLE
  • PRIVSEP_ENABLED
  • PROCD_ADDRESS
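
The restart can be done in place; the standard -peaceful option waits for running jobs to finish before restarting the daemons:

$ condor_restart
$ condor_restart -peaceful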

Miscellaneous

Job Statuses

  • 1: Idle
  • 2: Running
  • 3: Removed
  • 4: Completed
  • 5: Held
  • 6: Transferring Output
  • 7: Suspended
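
These codes appear in each job's JobStatus attribute and can be listed directly, e.g.:

$ condor_q -af ClusterId ProcId JobStatus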

Job Slot States

  • Owner*
  • Unclaimed
  • Matched
  • Claimed
  • Preempting*
  • Backfill*
  • Drained
Note (*): These do not apply to the way the Tier3 is set up. They only apply to an HTCondor pool with checkpointing set up.
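
The current state of every slot in the pool can be listed with a standard condor_status autoformat query:

$ condor_status -af Name State Activity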

-- ForrestPhillips - 23 Sep 2019