Planning Condor Configuration Updates for AGLT2
Now that AGLT2 is running on an SL6.4 OS, we can plan to implement some new Condor features that take advantage of OS-specific methods for job isolation.
Brian Bockelman has a nice presentation at https://indico.fnal.gov/getFile.py/access?contribId=25&sessionId=6&resId=0&materialId=slides&confId=5109 which describes configuring Condor with cgroups.
There is documentation from RedHat on cgroups in RHEL6 at
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide
(or try
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch01.html)
For AGLT2 I am proposing that we implement two changes to our Condor configuration:
- We use the MOUNT_UNDER_SCRATCH option described in section 3.3 of the Condor 7.8.8 manual (page 209)
- We implement cgroups on all our worker nodes and configure Condor to use it
Filesystem Isolation
Having each job "isolated" in its own workspace has pros and cons:
Benefits
- Information Security: Jobs see only their local workspace. They are not able to see other files or grid-credentials from other jobs, even if they have the same UID. Having this feature removes the need to run glexec.
- Host Security: Jobs are only able to write into a well-defined, controlled location. They cannot leave persistent code or applications hidden on the host.
- Resource Management: All files and output from the job are in a well-defined location. No matter if the job fails or succeeds, we can easily clean up all used space from the run and prevent the work area from filling.
Disadvantages
- Job Recovery: It will be difficult to "recover" from job failures. Specifically, if a job completes its work but fails to get its outputs sent to their destination, the current situation would let a follow-on job "recover" the data. If the job workspace is always cleaned up when the job is finished (whether or not there is an error), there will be nothing to recover.
- Caching: Currently we run 'pcache' on our worker nodes. Files are hard-linked to a /tmp subdirectory. Another request for the file on the same node can use this copy. If jobs are isolated in a 'chroot' equivalent location, they cannot see/access the 'pcache'.
Condor as of 7.7 supports using the MOUNT_UNDER_SCRATCH option for RHEL 5+. From the Condor Manual:
MOUNT_UNDER_SCRATCH A comma separated list of directories. For each directory in the list, HTCondor creates a directory in the job's temporary scratch directory, but makes it available at the given name using bind mounts. This is available on Linux systems which provide bind mounts and per-process tree mount tables, such as Red Hat Enterprise Linux 5. As an example:
MOUNT_UNDER_SCRATCH = /tmp,/var/tmp
The job will think that it sees the usual /tmp and /var/tmp directories, but they will actually exist as subdirectories in the job's temporary scratch directory. This is useful, because the job's scratch directory will be cleaned up after the job completes, and because jobs will not be able to fill up the real /tmp directory. The default value if not defined will be that no directories are mounted in the job's temporary scratch directory.
This will ensure each Condor job has its own directory area. The space will be used from the _CONDOR_SCRATCH_DIR area and we need to verify this is where we want it to be.
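One way to check where the scratch space actually lives is to run a small test job that reports it. The following is only a sketch: it assumes the job environment exports _CONDOR_SCRATCH_DIR, and the function name describe_scratch is made up for illustration.

```python
import os
import shutil


def describe_scratch(env=None):
    """Report the job's scratch directory and the space on its filesystem.

    Returns None when _CONDOR_SCRATCH_DIR is not set, e.g. when the
    script is run by hand rather than by a Condor starter.
    """
    env = os.environ if env is None else env
    scratch = env.get("_CONDOR_SCRATCH_DIR")
    if scratch is None:
        return None
    usage = shutil.disk_usage(scratch)
    return {
        "scratch_dir": os.path.realpath(scratch),
        "free_gb": usage.free / 1024**3,
        "total_gb": usage.total / 1024**3,
    }


if __name__ == "__main__":
    print(describe_scratch())
```

Submitting this as a short test job on a worker node shows which partition the per-job /tmp and /var/tmp will draw from.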
Using 'cgroups'
The Condor 8.x manual describes the needed configuration for RHEL6+ systems in sections 3.2.12 and 3.2.14. In addition we need to make sure that on each worker node we have:
- The libcgroup.x86_64 package installed
- The /etc/cgconfig.conf file properly configured
- The cgconfig service enabled
This will set up the OS to allow it to use cgroups. We then need to configure Condor to also use cgroups (See 3.12.14 above).
[Example] Basic Configuration of a cgroup
This example shows how to start a job on a machine while limiting its memory usage to 2GB and locking the job to a specific core and memory node.
- Create a mount point. This will also be the name of the cgroup hierarchy. (Best to name it after the subsystems attached to it.)
- Create a cgroup within the hierarchy using the cgcreate command.
- Use cgset to assign values to the desired parameters. This can also be done by echoing a value into the parameter's pseudo-file.
- Use cgexec to start a process inside a cgroup.
Command Syntax
cgcreate [-t uid:gid] [-a uid:gid] -g [subsystems]:[path to cgroup relative to root hierarchy]
Taken from RedHat's cgcreate description:
- -t (optional): specifies a user (by user ID, uid) and a group (by group ID, gid) to own the tasks pseudo-file for this cgroup. This user can add tasks to the cgroup.
- -a (optional): specifies a user (by user ID, uid) and a group (by group ID, gid) to own all pseudo-files other than tasks for this cgroup. This user can modify the access that the tasks in this cgroup have to system resources.
cgset -r [parameter]=[value] [path to cgroup relative to root hierarchy]
cgexec -g [subsystems]:[path to cgroup relative to root hierarchy] [command] [arguments]
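The steps above can be sketched as a small helper that only builds the three command lines, without running anything. This is an illustration under assumptions: the cgroup name limited_job, the job command, and the choice of core 0 and memory node 0 are hypothetical, and the memory and cpuset subsystems are assumed to be mounted already.

```python
import shlex


def cgroup_commands(group, subsystems, params, command):
    """Return the cgcreate, cgset and cgexec command lines as argv lists."""
    spec = ",".join(subsystems) + ":" + group
    cmds = [["cgcreate", "-g", spec]]
    for name, value in params.items():
        # One cgset call per parameter keeps failures easy to attribute.
        cmds.append(["cgset", "-r", f"{name}={value}", group])
    cmds.append(["cgexec", "-g", spec] + list(command))
    return cmds


if __name__ == "__main__":
    cmds = cgroup_commands(
        group="limited_job",                     # hypothetical cgroup name
        subsystems=["memory", "cpuset"],
        params={
            "memory.limit_in_bytes": 2 * 1024**3,  # 2GB memory limit
            "cpuset.cpus": 0,                      # assumed core
            "cpuset.mems": 0,                      # assumed memory node
        },
        command=["./my_job", "--input", "data"],   # hypothetical job
    )
    for argv in cmds:
        print(" ".join(shlex.quote(a) for a in argv))
```

Printing the commands first, rather than executing them, makes it easy to review them before running them as root on a worker node.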
Condor Configuration for cgroups
The condor_config needs to have a new entry so the condor_startd will use cgroups.
BASE_CGROUP = htcondor
From the Condor manual:
To enable cgroup tracking in HTCondor, once cgroups have been enabled in the operating system, set the BASE_CGROUP configuration variable to the string that matches the group name specified in the /etc/cgconfig.conf. In the example above, "htcondor" is a good choice. There is no default value for BASE_CGROUP, and if left unset, cgroup tracking will not be used.
Kernel cgroups are named in a virtual file system hierarchy. HTCondor will put each running job on the execute node in a distinct cgroup. The name of this cgroup is the name of the execute directory for that condor_starter, with slashes replaced by underscores, followed by the name and number of the slot. So, for the memory controller, a job running on slot1 would have its cgroup located at /cgroup/memory/htcondor/condor_var_lib_condor_execute_slot1/. The tasks file in this directory will contain a list of all the processes in this cgroup, and many other files in this directory have useful information about resource usage of this cgroup. See the kernel documentation for full details.
Once cgroup-based tracking is configured, usage should be invisible to the user and administrator. The condor_procd log, as defined by configuration variable PROCD_LOG, will mention that it is using this method, but no user visible changes should occur, other than the impossibility of a quickly-forking process escaping from the control of the condor_starter, and the more accurate reporting of memory usage.
We will have to test on Condor 7.8.8 and see how this works in practice.
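For that testing, a helper along the following lines could locate a job's cgroup and list its processes. It is a sketch, not Condor's own logic: the leading "condor" in the directory name is inferred from the manual's example path, and the /cgroup mount root and the function names are assumptions.

```python
import os


def job_cgroup_dir(execute_dir, slot, controller="memory",
                   base_cgroup="htcondor", cgroup_root="/cgroup"):
    """Build the per-job cgroup directory following the manual's example.

    Assumption: the name is "condor" + the execute directory with
    slashes replaced by underscores + "_" + the slot name.
    """
    name = "condor" + execute_dir.rstrip("/").replace("/", "_") + "_" + slot
    return os.path.join(cgroup_root, controller, base_cgroup, name)


def job_pids(cgroup_dir):
    """Return the PIDs listed in the cgroup's tasks file."""
    with open(os.path.join(cgroup_dir, "tasks")) as f:
        return [int(line) for line in f if line.strip()]


if __name__ == "__main__":
    print(job_cgroup_dir("/var/lib/condor/execute", "slot1"))
```

On a worker node, job_pids(job_cgroup_dir("/var/lib/condor/execute", "slot1")) should then list every process belonging to the job in slot1, if the naming assumption holds.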
Condor 8.x improvements
In Condor 7.8.x we can use cgroups as above. However, we don't have fine-grained control of the resource usage. In Condor 8.x we have more options. From the Condor manual:
Once cgroups is configured, the condor_starter will create a cgroup for each job, and set two attributes in that cgroup which control resource usage therein. These two attributes are the cpu.shares attribute in the cpu controller, and one of two attributes in the memory controller, either memory.limit_in_bytes, or memory.soft_limit_in_bytes. The configuration variable CGROUP_MEMORY_LIMIT_POLICY controls whether the hard limit (the former) or the soft limit will be used. If CGROUP_MEMORY_LIMIT_POLICY is set to the string hard, the hard limit will be used. If set to soft, the soft limit will be used. Otherwise, no limit will be set if the value is none. The default is none. If the hard limit is in force, then the total amount of physical memory used by the sum of all processes in this job will not be allowed to exceed the limit. If the processes try to allocate more memory, the allocation will succeed, and virtual memory will be allocated, but no additional physical memory will be allocated. The system will keep the amount of physical memory constant by swapping some page from that job out of memory. However, if the soft limit is in place, the job will be allowed to go over the limit if there is free memory available on the system. Only when there is contention between other processes for physical memory will the system force physical memory into swap and push the physical memory used towards the assigned limit. The memory size used in both cases is the machine ClassAd attribute Memory. Note that Memory is a static amount when using static slots, but it is dynamic when partitionable slots are used.
In addition to memory, the condor_starter can also control the total amount of CPU used by all processes within a job. To do this, it writes a value to the cpu.shares attribute of the cgroup cpu controller. The value it writes is copied from the Cpus attribute of the machine slot ClassAd. Again, like the Memory attribute, this value is fixed for static slots, but dynamic under partitionable slots. This tells the operating system to assign cpu usage proportionally to the number of cpus in the slot. Unlike memory, there is no concept of soft or hard, so this limit only applies when there is contention for the cpu. That is, on an eight core machine, with only a single, one-core slot running, and otherwise idle, the job running in the one slot could consume all eight cpus concurrently with this limit in play, if it is the only thing running. If, however, all eight slots were running jobs, with each configured for one cpu, the cpu usage would be assigned equally to each job, regardless of the number of processes in each job.
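The proportional behaviour described above can be illustrated with a toy model. This is not how the kernel scheduler is implemented; it assumes every busy slot has enough runnable processes to use whatever CPU it is given, and the function name cpu_fractions is made up for illustration.

```python
def cpu_fractions(slot_cpus, busy):
    """Fraction of the machine's CPU each busy slot gets under contention.

    slot_cpus maps a slot name to its Cpus value (used as the share weight);
    busy is the set of slots that currently have runnable work.
    """
    total = sum(slot_cpus[s] for s in busy)
    return {s: slot_cpus[s] / total for s in busy}


if __name__ == "__main__":
    slots = {f"slot{i}": 1 for i in range(1, 9)}  # eight one-core slots
    # Only slot1 is busy: nothing competes, so it may use the whole machine.
    print(cpu_fractions(slots, {"slot1"}))
    # All eight busy: each gets an equal eighth of the machine.
    print(cpu_fractions(slots, set(slots)))
```

The same model shows why an MP8 job with Cpus = 8 competing against a single one-core job would get eight ninths of the CPU under full contention.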
Implementation at AGLT2
First, turn on cgroups for the WN. This takes two steps.
- Start the cgconfig service and enable it at boot with chkconfig
- Add a configuration file to the /etc/condor/config.d directory
[root@bl-5-1 config.d]# cat 30-cgroups.conf
# cgroups.conf
#
# We will set the 4GB memory limit as soft. PE1950 have 2GB/core,
# plus 2GB/core in swap. Just seems better to limit this way
#
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = soft
The installation of the HTCondor rpm provides a startup file for htcondor cgroups. We extended this by adding net_cls for future expansion in the /etc/cgconfig.conf file. The relevant htcondor section then looks like so:
############# This part added for htcondor ###########
#
group htcondor {
    cpu {}
    cpuacct {}
    memory {}
    freezer {}
    blkio {}
    net_cls {}
}
Other needed changes
We unexpectedly found that Condor uses a very small default value for requested memory for jobs running under cgroups. This puts Worker Nodes into a situation where swap space is heavily used and disk utilization on the nodes essentially goes to 100%. We resolved this by adding parameters to the submitted job files.
- Single core tasks, RequestMemory = 3968
- MP8 tasks, RequestMemory = 32640
These values are 128MB lower than the maximum allowed in each kind of slot. 128MB is the default granularity for memory allocation.
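The arithmetic behind these two values can be written out as follows. The slot maxima of 4096MB and 32768MB are not stated explicitly above; they are derived here by adding 128MB back to the quoted RequestMemory values, and the function name is illustrative.

```python
GRANULARITY_MB = 128  # default memory allocation granularity


def request_memory_mb(slot_max_mb, granularity_mb=GRANULARITY_MB):
    """Return a RequestMemory value one granularity step below the slot maximum."""
    return slot_max_mb - granularity_mb


if __name__ == "__main__":
    # Slot maxima derived from the values quoted in the text.
    for label, slot_max in (("single core", 4096), ("MP8", 32768)):
        print(f"{label}: RequestMemory = {request_memory_mb(slot_max)}")
```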
An unexpected gotcha
Before and while the RequestMemory issue was being diagnosed, we thought we had a solution. We did, but it did not take effect: when RequestMemory was increased, cgroups somehow did not pick up the new value. The answer was to stop Condor, restart the cgconfig service, then start Condor again.
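A quick way to confirm that a job's cgroup picked up the new value is to read the limit file directly. The sketch below assumes the limit is stored in bytes and corresponds to the slot's Memory value in MB; the function names are made up for illustration, and the cgroup directory must be supplied by the caller.

```python
import os

# Limit file used by each CGROUP_MEMORY_LIMIT_POLICY setting.
LIMIT_FILES = {
    "hard": "memory.limit_in_bytes",
    "soft": "memory.soft_limit_in_bytes",
}


def cgroup_limit_mb(cgroup_dir, policy="soft"):
    """Read the memory limit of a cgroup and return it in whole MB."""
    with open(os.path.join(cgroup_dir, LIMIT_FILES[policy])) as f:
        return int(f.read().strip()) // (1024 * 1024)


def limit_matches(cgroup_dir, expected_mb, policy="soft"):
    """True if the cgroup's limit equals the expected slot memory in MB."""
    return cgroup_limit_mb(cgroup_dir, policy) == expected_mb
```

If limit_matches returns False for a freshly started job after RequestMemory was changed, that points to the stale state described above, and the stop, restart cgconfig, start sequence is worth trying.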
--
ShawnMcKee - 09 Aug 2013