
Configuration of the UM CERN Computing Cluster in BAT 188

In November 2014, the UM CERN Computing Cluster was upgraded to SLC6. Some old hardware was retired, and new hardware was installed and configured. The details of how the systems were configured for use are shown in later sections. Since then an install script has been written to incorporate these actions. New systems should use the install script if possible, and new actions should be incorporated into that script on an ongoing basis.

As of November 2018, we are now installing CC7 on some user machines around CERN. This requires updates to the provisioning scripts (see below).

System Install Script

The script atint01:/export/scripts/setup/um-cern-setup.sh can take care of most basic setup on SL6. For CC7, we created a new atint01:/export/scripts/setup/um-cern-setup-CC7.sh. From the README (for SL6) in the same directory:
The um-cern-setup.sh script will do some basic setup for systems at CERN.

Including but not limited to:
- add UM/USTC users to passwd (kerberos does auth). See addusers.sh script for list of users.  
- Modify /etc/passwd to point to /net/s3_data_home/ if user has directory there.  Also checks /net/ustc_home.   
        * Modify EXCEPTIONS in addusers.sh script to skip this for users
- DOES NOT run /net/share/scripts/create_local_home_users.sh to create 
  symlinks in /home to /net homes  (if user has NFS home the full path will be put into passwd).  
- NFS automounts
- add B188 cluster root key to root authorized_keys
- install our krb5.conf with ATLAS realm in addition to CERN
- install UMATLAS repo
- install FusionInventory Agent (remove CERN OcsInventory)
- install some basic core packages (emacs,vi,nano,gcc,screen,etc).  Also installs CERN libs/packages documented in following sections.
- if hostname matches *pcatum* install a bunch of additional useful packages
- Install iptables with openings for condor
- Install/configure cvmfs (runs /net/share/cvmfs/install.sh)
- Install/configure condor, starts condor service.  Installs RPM from /net/share/condor-8.2.10-345812.rhel6.7.x86_64.rpm, runs install scripts:
        * /net/share/condor_config_files/nodeinfo/config_condor_prio.sh
        * /net/share/condor_config_files/set_condor_dirs_and_files.sh
- Add sysctls as documented in following sections
- Run addprinters.sh to add printers for building as mapped by hostname in file 'locationmap'.
  Info from https://network.cern.ch/sc/fcgi/sc.fcgi?Action=SearchForDisplay&DeviceName=pcatum*


To use, copy directory contents to target system and run um-cern-setup.sh <primary user>.  
Primary user will be given sudo access but is an optional argument.

To automate this, you can try 'setuphost.sh <host> <primary user>' (from atintXX as root).
The script will copy everything and run the setup script over ssh.
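
For example, a typical invocation from atint01 might look like the following (the hostname and username here are hypothetical):

cd /export/scripts/setup
./setuphost.sh pcatum99 jdoe     # provision pcatum99 and give user jdoe sudo access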

Backups

Backups for servers (interactive machines and file servers)

The /etc and /root areas of all servers, and /export/share on umint01, are backed up via:

/etc/cron.d/sysbackup
0 10 * * * root /root/tools/sysbakup.sh 

The backups are stored in
[root@atums1 ~]# ls /net/s3_data_home/sysbackup/
atint01  atint02  atums1  atustc-s01  atustc-s02
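
A minimal sketch of what the backup script might look like (an assumption; the real /root/tools/sysbakup.sh may differ) is:

#!/bin/bash
# Hypothetical sketch: archive /etc and /root (plus /export/share on umint01)
# into the central sysbackup area, one subdirectory per host.
HOST=$(hostname -s)
DEST=/net/s3_data_home/sysbackup/$HOST
mkdir -p "$DEST"
tar -czf "$DEST/etc-$(date +%F).tgz" /etc
tar -czf "$DEST/root-$(date +%F).tgz" /root
if [ "$HOST" = "umint01" ]; then
    tar -czf "$DEST/export-share-$(date +%F).tgz" /export/share
fi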

Additional/duplicated backup for atint01

Important directories on atint01 are backed up nightly with the following entries in /etc/cron.d/backup. A one-time 'fallback' backup of all three was also taken at the time of this writing.
# backup useful filesystems nightly
0 1 * * * root tar -cf /net/s3_datad/atint01/atint01-export-nightly.tgz /export
3 1 * * * root tar -cf /net/s3_datad/atint01/atint01-root-nightly.tgz /root
4 1 * * * root tar -cf /net/s3_datad/atint01/atint01-etc-nightly.tgz /etc
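
Should a file need to be recovered, the archive members can be listed and extracted to a scratch area, for example (the member path shown is illustrative; GNU tar strips the leading slash when creating the archive):

tar -tf /net/s3_datad/atint01/atint01-etc-nightly.tgz | less
mkdir -p /tmp/restore
tar -xf /net/s3_datad/atint01/atint01-etc-nightly.tgz -C /tmp/restore etc/exports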

Backup for user home directories

UM User Home:

atums1:/export/data_home/ is backed up to /net/ustc_01/home_backup_um

It is done by atums1:/root/tools/backup_home.sh
[root@atums1 ~]# more /etc/cron.d/backup_home
 3 2 * * * root /bin/bash /root/tools/backup_home.sh

USTC user home

atustc-s02:/net/ustc_home is backed up to /net/s3_dataa/home_backup_ustc

It is done by atustc-s02:/root/tools/backup_home.sh
[root@atustc-s02 ~]# more /etc/cron.d/backup_home
 3 2 * * * root /bin/bash /root/tools/backup_home.sh
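
Both home-directory backups follow the same pattern; a minimal sketch of what such a backup_home.sh might contain (an assumption, the real scripts may differ) is:

#!/bin/bash
# Hypothetical sketch: mirror the home area to the backup volume and log the run.
# On atums1:      SRC=/export/data_home/   DEST=/net/ustc_01/home_backup_um/
# On atustc-s02:  SRC=/net/ustc_home/      DEST=/net/s3_dataa/home_backup_ustc/
SRC=/export/data_home/
DEST=/net/ustc_01/home_backup_um/
LOG=/var/log/backup_home.log

echo "==== $(date) starting home backup ====" >> "$LOG"
rsync -a --delete "$SRC" "$DEST" >> "$LOG" 2>&1
echo "==== $(date) finished (rc=$?) ====" >> "$LOG"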

Cookbook steps

  • Re-establish the /root/.ssh directory (700) and its saved content, authorized_keys (600)
  • Set up /etc/auto.master and /etc/auto.net from saved configurations
    • ustc-lin01 disks are no longer mounted
  • Configure iptables (see below)
    • service iptables restart
  • On atint01 only, restore /export directory from saved locations
    • Modify /etc/exports and restart nfs service
  • useraddcern roball diehl mckee bmeekhof qianj zzhao
    • These use the default AFS directories, so leave them that way
  • Install what should be the set of ATLAS computing rpms
  • On interactive machines only
    • yum -y install screen
  • Pre-set condor and cvmfs account information. Note that the initial set of machines installed used group 492 for cvmfs, but later machines have the sfcb group at this group ID, so these directions are for future machines. ALWAYS check first which 49X group IDs are really in use before inserting these definitions (see the check sketched after the listing below).

[root@atint01 ~]# grep -e fuse -e cvmfs -e condor /etc/passwd
cvmfs:x:496:490:CernVM-FS service account:/var/lib/cvmfs:/sbin/nologin
condor:x:495:491:Owner of Condor Daemons:/var/lib/condor:/sbin/nologin
[root@atint01 ~]# grep -e fuse -e cvmfs -e condor /etc/group
fuse:x:493:cvmfs
cvmfs:x:490:
condor:x:491:
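
A quick way to perform that check, plus a hand-creation sketch using the IDs from the listing above (the install script or the cvmfs/condor RPMs may create these accounts for you; adjust the IDs if any are already taken):

# See which 49X group and user IDs are already in use
getent group  | awk -F: '$3 >= 490 && $3 < 500'
getent passwd | awk -F: '$3 >= 490 && $3 < 500'

# Pre-create the accounts with the IDs shown above
groupadd -g 490 cvmfs
groupadd -g 491 condor
useradd -r -u 496 -g cvmfs  -d /var/lib/cvmfs  -s /sbin/nologin -c "CernVM-FS service account" cvmfs
useradd -r -u 495 -g condor -d /var/lib/condor -s /sbin/nologin -c "Owner of Condor Daemons" condor
# The fuse group (493 above) normally comes from the fuse package; cvmfs is added to it
usermod -a -G fuse cvmfs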

  • Install cvmfs using the default partition name for its cache
    • rpm -Uvh http://cvmrepo.web.cern.ch/cvmrepo/yum/cvmfs/EL/6/`uname -i`/cvmfs-release-2-4.el6.noarch.rpm
      • Edit cernvm.repo and modify [cernvm] to enabled=0 so it must be explicitly enabled for use.
      • sed -i s/enabled=1/enabled=0/ /etc/yum.repos.d/cernvm.repo
    • yum -y --enablerepo=cernvm install cvmfs-2.1.19-1.el6 cvmfs-auto-setup-1.5-1 cvmfs-init-scripts-1.0.20-1 cvmfs-keys-1.5-1
    • Create /etc/cvmfs/default.local with the content below
    • Create /etc/security/limits.d/cvmfs.conf with the content below
    • service autofs restart
    • Create the cvmfs setup files for ATLAS software
      • mkdir -p /usr/local/bin/setup
      • Copy cvmfs_atlas.sh and cvmfs_atlas.csh from the /net/share/cvmfs directory (cp -p)
  • Test cvmfs in various ways
    • cvmfs_config chksetup (does not like cernvmfs.gridpp.rl.ac.uk, but this is a CERN issue)
    • cvmfs_config probe
    • cvmfs_config status
  • Locally install the cern libs
    • ln -s /usr/libexec/CERNLIB /cern
    • cd /net/share; tar cf - cern|(cd /usr/libexec; tar xf -); mv /usr/libexec/cern /usr/libexec/CERNLIB
  • Create all the local user accounts based upon the maintained list
    • /net/share/scripts/create_local_home_users.sh
    • This uses the list in /net/share/scripts/user_list_local_home.txt
  • Install and configure Condor
    • yum -y localinstall /net/share/condor-8.2.3-274619.rhel6.5.x86_64.rpm
    • /net/share/condor_config_files/nodeinfo/config_condor_prio.sh
      • Run this any time the Condor configuration changes
    • /net/share/condor_config_files/set_condor_dirs_and_files.sh
      • This should only be run once, but it is non-destructive to do it again
  • Edit sysctl.conf on the interactive machine, and manually update the changed parameters
    • echo 1000 > /proc/sys/net/core/somaxconn
    • echo 4194303 > /proc/sys/kernel/pid_max

# Increase the PID max from the default 32768
kernel.pid_max = 4194303

# Increase the connection backlog from the default 128
net.core.somaxconn = 1000
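
After adding these lines to /etc/sysctl.conf, the settings can also be applied and verified without a reboot:

sysctl -p                                   # reload /etc/sysctl.conf
sysctl kernel.pid_max net.core.somaxconn    # confirm the new values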

Update Condor Configuration files

Condor configuration files are kept centrally in the /net/share/condor_config_files directory, so any updates to the configuration files should first be made in the corresponding subdirectory for the machine type:
atint01  
atint02
atum-16Core  (for work nodes with 16 cores)
atum-32Core  (for work nodes with 32 cores)
atum-8Core  (for work nodes with 8 cores)
pcatum_1Core (for desktops with 2 cores)
pcatum_6Core  (for desktops with 8 cores)

An example of running it on the interactive machine:
[root@atint01 scripts]# sh condor_node_update_config.sh 
atint01
/net/share/condor_config_files/atint01
comparing /net/share/condor_config_files/atint01/14_runs_jobs.conf /etc/condor/config.d/14_runs_jobs.conf
/etc/condor/config.d/14_runs_jobs.conf is up to date
comparing /net/share/condor_config_files/atint01/20_groups.conf /etc/condor/config.d/20_groups.conf
/etc/condor/config.d/20_groups.conf is up to date
comparing /net/share/condor_config_files/atint01/25_daemons_ads.conf /etc/condor/config.d/25_daemons_ads.conf
/etc/condor/config.d/25_daemons_ads.conf is up to date
comparing /net/share/condor_config_files/atint01/50_atint01.conf /etc/condor/config.d/50_atint01.conf
/etc/condor/config.d/50_atint01.conf is up to date
comparing /net/share/condor_config_files/common_conf/10_common.conf /etc/condor/config.d/10_common.conf
/etc/condor/config.d/10_common.conf is up to date
comparing /net/share/condor_config_files/common_conf/15_defaults.conf /etc/condor/config.d/15_defaults.conf
/etc/condor/config.d/15_defaults.conf is up to date
comparing /net/share/condor_config_files/common_conf/22_user_groups.conf /etc/condor/config.d/22_user_groups.conf
/etc/condor/config.d/22_user_groups.conf is up to date
comparing /net/share/condor_config_files/set_condor_acc_grp /usr/local/bin/set_condor_acc_grp
/usr/local/bin/set_condor_acc_grp is up to date
comparing /net/share/condor_config_files/condor_submit_unlimited /usr/local/bin/condor_submit_unlimited
/usr/local/bin/condor_submit_unlimited is up to date
comparing /net/share/condor_config_files/condor_submit_short /usr/local/bin/condor_submit_short
/usr/local/bin/condor_submit_short is up to date
comparing /net/share/condor_config_files/condor_submit /usr/local/bin/condor_submit
/usr/local/bin/condor_submit is up to date

Then one can push the configuration to all the Condor nodes with this command on atint01:
[root@atint01 ~]#  /root/tools/push_cmd.sh -f /root/tools/machines/condornodes.txt "/net/share/scripts/condor_node_update_config.sh"
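
For reference, a minimal sketch of the compare-and-copy behaviour suggested by the output above (an assumption; the real /net/share/scripts/condor_node_update_config.sh also handles the helper scripts installed under /usr/local/bin and the per-core-count directories):

#!/bin/bash
# Hypothetical sketch: copy any central config file that differs from the local copy.
SRC=/net/share/condor_config_files/$(hostname -s)
changed=0
for f in "$SRC"/*.conf /net/share/condor_config_files/common_conf/*.conf; do
    dst=/etc/condor/config.d/$(basename "$f")
    echo "comparing $f $dst"
    if cmp -s "$f" "$dst"; then
        echo "$dst is up to date"
    else
        cp -p "$f" "$dst"
        changed=1
    fi
done
[ "$changed" -eq 1 ] && condor_reconfig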

Pushing commands to the cluster machines

A "push_cmd.sh" script, closely emulating the cluster_control suite at UM (but modified to not use the cluster_control DB), is in place on atint01 in /root/tools. Many of the commands below utilize this script, and its list of machines held in /root/tools/acct_machines.

Creating a new user account (both UM and USTC)

Create the user NFS directory

New user accounts on the BAT188 cluster rely upon the existence of that user's CERN account. Credentials are copied via "useraddcern", and then the home directory is modified in the passwd file to place the user's home directly on the cluster instead of in AFS space, although the latter is still available to the user. The script atint01:/net/share/scripts/mk_nfs_homed_account.sh is used for this purpose.

This script also utilizes the "push_cmd.sh" script in /root/tools, and the subdirectory there that contains lists of active machines.

[root@atint01 scripts]# /net/share/scripts/mk_nfs_homed_account.sh
 Usage: ./mk_nfs_homed_account.sh username ustc [1|2|3] or ./mk_nfs_homed_account.sh username um [a|b|c|d]

[root@atint01 scripts]# /net/share/scripts/mk_nfs_homed_account.sh ahadef ustc 2
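
Based on the description above, the script performs roughly the following steps (a hypothetical sketch; the real script also pushes the passwd change to all cluster machines with push_cmd.sh, and the NFS path layout shown is an assumption):

#!/bin/bash
# Hypothetical sketch of mk_nfs_homed_account.sh
user=$1; site=$2                              # the third argument selects the
                                              # file-server volume; omitted here
useraddcern "$user"                           # copy the CERN account into the local passwd

case "$site" in
    um)   home=/net/s3_data_home/$user ;;     # assumption: UM homes under s3_data_home
    ustc) home=/net/ustc_home/$user ;;        # assumption: USTC homes under ustc_home
    *)    echo "Usage: $0 username {um|ustc} volume" >&2; exit 1 ;;
esac

mkdir -p "$home"
chown "$user": "$home"
usermod -d "$home" "$user"                    # point the passwd entry at the NFS home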

Following steps

Three final steps are required.

1. Add the user to the correct Condor account group.
Update /net/share/condor_config_files/set_condor_acc_grp

then run /net/share/scripts/condor_node_update_config.sh on the interactive machines: atint01/02

2. On the file server, modify the /net/ustc_home disk quota, making sure files
have soft/hard limits set to 51000000 55000000
#edquota -F xfs -f /export/home -u ahadef

3. If this is a USTC user, add the account to the USTC definitions
in /net/share/condor_config_files/common_conf/22_user_groups.conf
then run /net/share/scripts/condor_node_update_config.sh on all the condor nodes
/root/tools/push_cmd.sh -f /root/tools/machines/condornodes.txt  /net/share/scripts/condor_node_update_config.sh
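
The quota set in step 2 can be checked afterwards on the file server, for example:

# Report per-user block quotas on the XFS home filesystem
xfs_quota -x -c 'report -u -h' /export/home
# Or check just the new user
xfs_quota -x -c 'quota -u -h ahadef' /export/home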

File content as referenced above

iptables content
=========================================
iptables on atint01 needs the following additions:
# Accept UMATLAS muon cluster
-A INPUT -s 137.138.94.64/26 -j ACCEPT
# Also accept ustclin0N machines
-A INPUT -s 137.138.100.0/24 -j ACCEPT
#
# For Condor
#
-A INPUT -m udp -p udp --dport 9600:9700 -j ACCEPT
-A INPUT -m tcp -p tcp --dport 9600:9700 -j ACCEPT
-A INPUT -m udp -p udp --dport 33000:35000 -j ACCEPT
-A INPUT -m tcp -p tcp --dport 33000:35000 -j ACCEPT
#
# Open up the NFS ports needed to mount all the volumes
# NFS ports
-A INPUT -m state --state NEW -m udp -p udp --dport 875 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 875 -j ACCEPT
-A INPUT -m state --state NEW -m udp -p udp --dport 32769 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 32803 -j ACCEPT
-A INPUT -m state --state NEW -m udp -p udp --dport 892 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 892 -j ACCEPT
-A INPUT -m state --state NEW -m udp -p udp --dport 111 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 111 -j ACCEPT
-A INPUT -m state --state NEW -m udp -p udp --dport 2049 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 2049 -j ACCEPT

=======================================
iptables on atums1 needs just this set:
# Accept UMATLAS muon cluster
-A INPUT -s 137.138.94.64/26 -j ACCEPT
# Open up the NFS ports needed to mount all the volumes

# NFS ports
-A INPUT -m state --state NEW -m udp -p udp --dport 875 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 875 -j ACCEPT
-A INPUT -m state --state NEW -m udp -p udp --dport 32769 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 32803 -j ACCEPT
-A INPUT -m state --state NEW -m udp -p udp --dport 892 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 892 -j ACCEPT
-A INPUT -m state --state NEW -m udp -p udp --dport 111 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 111 -j ACCEPT
-A INPUT -m state --state NEW -m udp -p udp --dport 2049 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 2049 -j ACCEPT

=========================================================
Any other worker node (WN) needs just this set:

#
# Accept UMATLAS muon cluster
-A INPUT -s 137.138.94.64/26 -j ACCEPT
# Also accept ustclin0N machines
-A INPUT -s 137.138.100.0/24 -j ACCEPT
#
# For Condor
#
-A INPUT -m udp -p udp --dport 33000:35000 -j ACCEPT
-A INPUT -m tcp -p tcp --dport 33000:35000 -j ACCEPT
#

====================================

Content of /etc/cvmfs/default.local
======================================
# This is /etc/cvmfs/default.local

# this files overrides and extends the values contained
# within the default.conf file.

# Use 0.875*partition size for the quota limit
CVMFS_QUOTA_LIMIT='28450'
CVMFS_HTTP_PROXY="http://ca-proxy.cern.ch:3128;http://ca-proxy1.cern.ch:3128|http://ca-proxy2.cern.ch:3128|http://ca-proxy3.cern.ch:3128|http://ca-proxy4.cern.ch:3128|http://ca-proxy5.cern.ch:3128"
CVMFS_CACHE_BASE='/var/lib/cvmfs'

# the repos available
CVMFS_REPOSITORIES="\
atlas.cern.ch,\
atlas-condb.cern.ch,\
atlas-nightlies.cern.ch,\
sft.cern.ch"

===============================

The content of /etc/security/limits.d/cvmfs.conf

cvmfs soft nofile 32768
cvmfs hard nofile 32768

Backup of management scripts

All the management scripts are stored in atint01:/export or /net/share/; they are checked into CERN GitLab at

https://:@gitlab.cern.ch:8443/wwu/umt3cern.git

In order to check in any new changes:
1) kinit wwu@CERN.CH
2) klist    # check the Kerberos ticket
   (git status shows which files are modified, deleted, or not yet tracked; after modifying a file it needs to be added again with git add)
3) git add filenames
4) git commit -m "comments"
5) git push -u origin master
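
If starting from a fresh machine, the repository can be cloned with the same Kerberos credentials, for example:

kinit wwu@CERN.CH
git clone https://:@gitlab.cern.ch:8443/wwu/umt3cern.git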
-- BobBall - 19 Nov 2014