Transition from CFEngine v2 to v3, and Build dCache Pool Servers

Introduction

As documented elsewhere in this Wiki, cfengine2 is currently (Oct 2012) in use to configure service machines such as dCache pool servers, gatekeepers, etc. We are in the process of transitioning this to cfengine3, just as the Worker Nodes already use cfengine3, based upon the same svn trunk code.

The initial svn branch for this work is in svn at repos/cfengine/branches/masterfiles-storage, as it is first targeted for use with dCache pool servers. Beyond that first target, many classes and hooks are also being set in place to handle other machine configurations.

The Policy Server node for cfengine2 is manage.aglt2.org. For cfengine3 this will become umcfe.aglt2.org at UM, and msucfe.aglt2.org at MSU.

Basic Plan

The basic plan for this transition is to reproduce, using cfengine3, the effect that cfengine2 currently has on a given machine. Many of the promises made in version 2 will be merged with those from version 3, and others will become new promises. In all cases, full compatibility with the code in use for Worker Nodes will be maintained.

The policy will be served from "/var/cfengine/masterfiles-T2"
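
As a rough illustration only (not the actual deployment procedure), the branch could be staged on umcfe/msucfe with an svn checkout directly into that directory; the svn server URL below is a placeholder, not the real repository location.

    # Hedged sketch: stage the T2 policy on a policy server (umcfe/msucfe).
    # "svn.example.aglt2.org" is a placeholder for the real svn server.
    svn checkout \
        https://svn.example.aglt2.org/repos/cfengine/branches/masterfiles-storage \
        /var/cfengine/masterfiles-T2

    # Subsequent policy refreshes of the same checkout:
    svn update /var/cfengine/masterfiles-T2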

Implementation on dCache Servers

The procedure for building a dCache server (pool or door, admin not yet ready) consists of the following steps.
  • If this is a rebuild, SAVE THE METAFILE DATA.
    • cron does this automatically to afs every 15 minutes
    • A final sync should be made using /root/tools/save_final_dcache_metadata.sh
      • This will remove the cron entry.
  • Add Mac information for PXEboot to the appropriate Rocks head node. See RocksPXEbootServers
  • Do bare metal build of the machine
    • Includes our aglt2-specific cfengine rpm
  • Create the file /root/monit_secret, perms 600, containing the monit PW for httpd access
  • run cf-agent for the first time
    • May have already run, but an initial run takes 7 minutes or more to complete. LOOK FOR IT.
    • Scripts will be placed in the /root/tools directory to assist with the following tasks (all TBD)
      • Creating vdisks on the MD3260 storage units (md3260-setup.sh)
        • For example, md3260-setup.sh MD3260_UMFS09_1 umfs09 6
        • Comments elsewhere indicated that rescan_dm_devs should be run at this point
        • It may be the case that the first reboot following the first cf-agent run will run this automatically; no answer available
        • This took 20 minutes on umfs09
      • Show vdisks, etc, on all attached storage (storage_details.sh)
      • Creating RAID-6 arrays for MD1000 and MD1200 pools (setup-raid-R6.sh)
        • First runs storage_details.sh
      • Create file systems on attached storage (setup-raid-R6.sh)
        • For MD1000/MD1200, the vdisk is first created
      • Creation of fstab entries based upon shelf labels (remake_xfs_fstab.sh); see the sketch following this procedure
        • If not already mounted, do "mount -av; service set_readAhead start"
      • Relocate dCache metadata from the storage disks to the "/var/dcache/meta" directory (relocate_meta.sh)
      • Assist in setting up NIC bonding (not yet written)
      • restore any saved meta-directory from the afs copy. (restore_dcache_metadata.sh)
  • Re-boot to run the UL kernel, and (possibly) run rescan_dm_devs automatically
  • run cf-agent for the second time following all appropriate setups.
    • Following the first cf-agent run, this will now happen on the regular AGLT2 schedule
    • It would not hurt to run it a third time as well
Carefully examine the machine state. It should now be ready for the startup of the dCache services.
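
For reference, remake_xfs_fstab.sh builds label-based XFS mount entries. The sketch below shows the general idea only, not the script's actual contents; the mount-point layout under /dcache is assumed here for illustration.

    # Hedged sketch of the label-based approach (not the real remake_xfs_fstab.sh):
    # find XFS file systems by label and emit fstab entries that mount by LABEL=.
    for dev in $(blkid -t TYPE=xfs -o device); do
        label=$(blkid -s LABEL -o value "$dev")
        [ -n "$label" ] || continue
        echo "LABEL=$label  /dcache/$label  xfs  defaults,noatime  0 0"
    done >> /etc/fstab

    # Then mount everything and start the readahead tuning service:
    mount -av
    service set_readAhead start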

Note: The interaction between monit and an incomplete dCache install on a door machine was initially unclear; the worry was that, prior to full setup of the door, monit would spam Email over dcap service restarts. It turns out that monit continuously starts the dcap door before we are ready to run the door machines. The solution is therefore:
  • monit is not chkconfig'd on or started automatically (on both pools and doors)
  • When we are finally ready, do both manually, as shown below
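
When that time comes, the manual enable and start are just the usual service commands (SL6 syntax shown; use the systemctl equivalents on SL7):

    # Enable and start monit by hand once the pool/door is fully set up (SL6)
    chkconfig monit on
    service monit start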

Metafile Sync for dCache Pools

A lot of care has been taken to make it hard to overwrite existing metafile data from the dCache pools. Details for saving the last set before a dCache pool server is rebuilt are shown above.

Cron tasks will not run to sync the metafile data into afs UNTIL the script (restore_dcache_metadata.sh) is first run. This creates the "/var/dcache" directory and its children. CFE3 is keyed on this, and once the directory exists, the cron task will be placed on the machine and begin to run at its regular intervals.

For new pools, or the first build of new pool servers, the script "relocate_meta.sh" should instead be run AFTER the pools are created and BEFORE dcache is started. This will also create the "/var/dcache" directory as above, and initiate the sync of the metadata into afs.
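
The gating can be pictured with a small shell sketch. This is an illustration of the logic only, not the deployed cron job; the AFS destination below is a placeholder.

    # Hedged illustration of the keying: the sync only ever runs once /var/dcache
    # exists, i.e. after restore_dcache_metadata.sh or relocate_meta.sh has been run.
    AFS_DEST="/afs/PLACEHOLDER/dcache_meta/$(hostname -s)"   # placeholder path
    if [ -d /var/dcache ]; then
        rsync -a --delete /var/dcache/meta/ "$AFS_DEST/"
    else
        echo "dCache metadata area not initialized yet; skipping sync" >&2
    fi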

NIC bonding at UM

Switch port configuration must match the OS expectation. In general, we will have eth0, the PXE boot NIC, always on a Native VLAN 4010 port. It should no longer be used in bonding. Such machines should have second 10Gb NICs installed to the extent possible.

The NICs used for bonding will then all be 10Gb capable.
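
For reference, a typical RHEL/SL bonding layout for the 10Gb NICs looks like the sketch below. The interface names, bond mode, netmask, and address are illustrative only; the per-host values come from the saved network data (restore_HOSTNAME.tar, /etc/modprobe.d/bonding.conf) described later in this page.

    # /etc/sysconfig/network-scripts/ifcfg-bond0   (illustrative values only)
    DEVICE=bond0
    ONBOOT=yes
    BOOTPROTO=static
    IPADDR=192.41.230.XXX          # placeholder; use the host's real public address
    NETMASK=255.255.255.0          # assumed for illustration
    BONDING_OPTS="mode=802.3ad miimon=100"   # bond mode/options are site-specific

    # /etc/sysconfig/network-scripts/ifcfg-eth2   (one such file per 10Gb slave NIC)
    DEVICE=eth2
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none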

Formatting dCache storage

It was decided long ago that any kind of automatic disk reformatting was too dangerous for us to consider. Affirming that, scripts useful to that end will be placed in the /root/tools directory of the newly configured machines, to assist in those later, manual operations. See, for example, /root/tools/setup-raid-R6.sh as detailed above.

The "storage_summary.sh" script can be run at any time, and gives a complete snapshot of attached storage. See the end of this document for examples of this information taken from both MD3260 and MD1000/1200 storage based systems.

A note on host certificates

We update host certificates from aglbatch, and the directory /root/hostcert there contains all the protected files. These are not placed in svn. The osg-pki-tools rpm from OSG is installed on aglbatch so that new host certificates can be generated there and propagated into cf3. A tarfile of the "hostcert" directory (cd /root; tar cf hostcert.tar hostcert) is placed at /root on umcfe and msucfe for propagation into the T2 policy.

The certs directory is, instead, copied as /var/cfengine/policy/T2/stash/hostcert, with permissions 0400 on the hostcert directory and all contained files.
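
Spelled out, the hand-off from aglbatch to a policy server looks roughly like the sketch below. The host names and paths are those given above; the transfer step (scp) and exact commands are assumptions, not the recorded procedure.

    # On aglbatch: bundle the protected certificate files
    cd /root
    tar cf hostcert.tar hostcert

    # Copy the bundle to each policy server (transfer method assumed here)
    scp /root/hostcert.tar root@umcfe.aglt2.org:/root/
    scp /root/hostcert.tar root@msucfe.aglt2.org:/root/

    # On umcfe/msucfe: unpack, copy into the T2 policy stash, then lock down
    # permissions (0400 on the directory and all files, per the policy above)
    cd /root && tar xf hostcert.tar
    cp -a /root/hostcert /var/cfengine/policy/T2/stash/
    chmod -R 0400 /var/cfengine/policy/T2/stash/hostcert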

Table of promise files

The following table details the cfengine2 policy files in use, indicates whether each will be merged with current policy (giving that file name) or will become a new policy promise, and notes completion. Version 2 files are all named like "cf.*", whereas version 3 files are all named like "*.cf".

| Policy File | Merge With | New File | Comments | Completed |
| definitions | promises | | Moving as needed | |
| groups | promises | | Moving as needed | |
| main | | | | |
| ignore | | | | |
| passwd | | passwd_group | Scripted | X |
| securitychecks | | | Look for compromised systems | |
| sshkeys | ssh-keys | | | X |
| profile | | | Drop | X |
| locate | | | locate command DB | |
| krb5 | krb5_client | | | X |
| resolv | resolv | | | X |
| selfmanage | | | Drop | X |
| ntp | ntp | | | X |
| | iostat | | | X |
| snmp | snmpd | | | X |
| banner | | | defer | |
| printers | | | defer | |
| amanda | | | defer | |
| svnverify | | | defer | |
| dcache | | dcache | | X |
| | storage_service | | | X |
| | bond_check | | | X |
| | monit | | | X |
| | osg_wn_client | | Extend certificate sync for dcache | X |
| | | hostcert | Complete host certificate install | X |
| ocsinventory | | | | X |
| | | cfengine3 | Configure cfengine3 | X |
| | | serial_console | Set up serial console | X |
| syslogng | | syslogng | | X |
| iptables | iptables | | More to add for other systems | X |
| timestamp | | | | X |
| nscd | | | | |
| network | | network | | X |
| auth | | | | |
| sudo | sudo | | | X |
| nfs | | | | |
| automount | | | | X |
| rpmconfig | | | Drop | X |
| yum | yum_repos | | | X |
| openafs | openafs | | | X |
| sysctl | sysctl | | | X |
| packagechecks | | yum_packages | | X |
| lustre | | | | |
| cvmfs | | | | |
| atlasdev | | | | |
| kernel | | kernel | | X |
| ssh | sshd | | | X |
| ganglia | ganglia | | | X |
| | | limits | | X |
| | | updatedb | | X |
Note that cf.iptables was very complex. Most machine types of T2 relevance were included in the iptables.cf. Notable exclusions are:
  • Rocks head nodes
  • Gatekeepers
  • UM T3 desktops
There is no adaptation at all for any machine not in Michigan.

Cfengine promises for System Service machine class

The following promises, in the given order, are the set from which the machines are to be configured.
  • Commented lines may or may not eventually be modified for this machine class
  • Uncommented lines are set to work
  • The most recent kernel versions are hard-coded, which is not the best idea.

    SystemServices::
      "my_bundseq"
        slist => {
          "update",
          "tidy",
          "yum_packages",
          "yum_repos",
          "openafs",
          "kernel",
          "atd",
          "automount",
          "bond_check",
          "cfengine3",
          "crond",
#          "cvmfs",
          "dell_omsa",
          "iostat",
          "iptables",
          "krb5_client",
          "ganglia",
          "limits",
#          "logwatch",
          "network",
          "ntp",
          "ocsinventory",
          "osg_wn_client",
           "passwd_group",
          "resolv",
          "snmpd",
          "sshd",
          "ssh_keys",
          "storage_service",
          "sudo",
          "hostcert",
          "dcache",
          "monit",
          "serial_console",
          "sysctl",
          "syslogng",
          "timestamp",
#          "tools",
#          "tmpwatch",
          "updatedb",
        };   

Notes on yum_packages and yum_repos

As Ben points out, some repos are configured in yum_packages, and some in yum_repos. The primary reason these two are separated is so that I could SEQUENCE between the two promise sets instead of looping through them. It may be that this is an unneeded distinction, but it is as good a starting point as any, and can be modified later.

Local repos are set in yum_repos.cf instead of the public repos; in particular, the UM repos are referenced. I believe Tom mentioned MSU local repos, so changes may be needed in this regard as well.
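
As a purely hypothetical illustration of what "local repos" means in practice, a repo file written by the policy differs from the stock one mainly in its baseurl. Both the file name and the mirror URL below are placeholders, not the real UM or MSU mirrors.

    # Placeholder sketch only: /etc/yum.repos.d/sl6-local.repo pointing at a
    # local mirror instead of the public SL servers (the URL is NOT a real host)
    [sl6-local]
    name=Scientific Linux 6 - local mirror (placeholder)
    baseurl=http://mirror.example.aglt2.org/sl6/$releasever/$basearch/os/
    enabled=1
    gpgcheck=1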

Items to be included, not yet built in cf3

  • The SL6 repos should be changed to point to our local copies.
  • Add various rpms. See, for example, GLPI ticket #3450. These will include:
    • iqvlinux
NOTE: The dell_omsa.cf in the trunk has a defined class that should be in promises.cf, as it is in this masterfiles-storage svn branch. Upon merge, that class definition can be removed.

Sample Output from the "storage_details.sh" Script

The PCI devices show up on the MD3260, but omreport does not see the details.

MD3260 Sample Output

[root@umfs09 ~]# tools/storage_details.sh

======= Controller 0 Information =======
 Controller  PERC H310 Mini (Embedded)
Details of Enclosure BP12G_           on Controller PERC H310 Mini
ID                   : 0:1
Slot Count           : 8
--- There are 1 active vdisks on this controller
    VD 0 Named: OS is 0:1:0 0:1:1

======= Controller 1 Information =======
 Controller  6Gbps SAS HBA (Slot 5)
--- There are 0 active vdisks on this controller

======= Controller 2 Information =======
 Controller  6Gbps SAS HBA (Slot 6)
--- There are 0 active vdisks on this controller

======= Controller 3 Information =======
 Controller  6Gbps SAS HBA (Slot 7)
--- There are 0 active vdisks on this controller

======= Controller 4 Information =======
 Controller  6Gbps SAS HBA (Slot 4)
--- There are 0 active vdisks on this controller

======= Multipath Information =======
 Device mpatha on MD3260_UMFS09_1, labeled umfs09_1 contains NO File System
 Device mpathb on MD3260_UMFS09_1, labeled umfs09_2 contains umfs09_2
 Device mpathj on MD3260_UMFS09_2, labeled umfs09_10 contains umfs09_10
 Device mpathl on MD3260_UMFS09_2, labeled umfs09_11 contains umfs09_11
 Device mpathk on MD3260_UMFS09_2, labeled umfs09_12 contains umfs09_12
 Device mpathc on MD3260_UMFS09_1, labeled umfs09_3 contains umfs09_3
 Device mpathd on MD3260_UMFS09_1, labeled umfs09_4 contains umfs09_4
 Device mpathf on MD3260_UMFS09_1, labeled umfs09_5 contains umfs09_5
 Device mpathe on MD3260_UMFS09_1, labeled umfs09_6 contains umfs09_6
 Device mpathg on MD3260_UMFS09_2, labeled umfs09_7 contains umfs09_7
 Device mpathh on MD3260_UMFS09_2, labeled umfs09_8 contains umfs09_8
 Device mpathi on MD3260_UMFS09_2, labeled umfs09_9 contains umfs09_9

MD1000/MD1200 Sample Output

[root@umfs01 ~]# bash tools/storage_details.sh

======= Controller 0 Information =======
 Controller  PERC H800 Adapter (Slot 2)
Details of Enclosure MD1200 on Controller PERC H800 Adapter
ID                   : 0:0
Slot Count           : 12
Details of Enclosure MD1200 on Controller PERC H800 Adapter
ID                   : 0:1
Slot Count           : 12
--- There are 2 active vdisks on this controller
    VD 1 Named: umfs01_2 is 0:0:0 0:0:1 0:0:2 0:0:3 0:0:4 0:0:5 0:0:6 0:0:7 0:0:8 0:0:9 0:0:10 0:0:11
    VD 2 Named: umfs01_3 is 0:1:0 0:1:1 0:1:2 0:1:3 0:1:4 0:1:5 0:1:6 0:1:7 0:1:8 0:1:9 0:1:10 0:1:11

======= Controller 1 Information =======
 Controller  PERC H800 Adapter (Slot 1)
Details of Enclosure MD1200 on Controller PERC H800 Adapter
ID                   : 0:0
Slot Count           : 12
Details of Enclosure MD1200 on Controller PERC H800 Adapter
ID                   : 0:1
Slot Count           : 12
--- There are 2 active vdisks on this controller
    VD 0 Named: umfs01_4 is 0:0:0 0:0:1 0:0:2 0:0:3 0:0:4 0:0:5 0:0:6 0:0:7 0:0:8 0:0:9 0:0:10 0:0:11
    VD 1 Named: umfs01_5 is 0:1:0 0:1:1 0:1:2 0:1:3 0:1:4 0:1:5 0:1:6 0:1:7 0:1:8 0:1:9 0:1:10 0:1:11

======= Controller 2 Information =======
 Controller  PERC H800 Adapter (Slot 4)
Details of Enclosure MD1200 on Controller PERC H800 Adapter
ID                   : 0:0
Slot Count           : 12
Details of Enclosure MD1200 on Controller PERC H800 Adapter
ID                   : 0:1
Slot Count           : 12
--- There are 2 active vdisks on this controller
    VD 0 Named: umfs01_6 is 0:1:0 0:1:1 0:1:2 0:1:3 0:1:4 0:1:5 0:1:6 0:1:7 0:1:8 0:1:9 0:1:10 0:1:11
    VD 1 Named: umfs01_1 is 0:0:0 0:0:1 0:0:2 0:0:3 0:0:4 0:0:5 0:0:6 0:0:7 0:0:8 0:0:9 0:0:10 0:0:11

======= Controller 3 Information =======
 Controller  PERC H200 Integrated (Embedded)
Details of Enclosure BACKPLANE        on Controller PERC H200 Integrated
ID                   : 0:0
Slot Count           : 8
--- There are 1 active vdisks on this controller
    VD 0 Named: Virtual is 0:0:0 0:0:1

Example: Building (and rebuilding) umfs09

The following steps brought umfs09 to a state where we can bring it into dCache. Note that the volumes were already created and file systems placed upon them during earlier iterations of this machine. Often "cf-agent -K" is run twice. This hurts nothing, and often catches items that did not flow during the first pass, e.g., classes set within common promises, which do not change during the course of a single run even when promises fulfilled during that run would result in those class changes.

These steps were run prior to the rebuild
  • On head01, set all disk pools read only, then wait 15 minutes before beginning, for example

[root@head01 ~]# ssh -c blowfish -p 22223 -l admin localhost

    dCache Admin (VII) (user=admin)

[head01.aglt2.org] (local) admin > cd PoolManager
[head01.aglt2.org] (PoolManager) admin > psu set pool msufs14* rdonly
4 pools updated

  • dcache stop
    • Make sure that all java processes have properly quit running
  • kinit admin
  • aklog
  • /root/tools/save_final_dcache_metadata.sh
    • When this is run, stay away from 15 minutes after the hour, when the cron task runs, to avoid any conflict
    • This will remove the cron task
  • The network info, for systems that use bonded NICs, should be saved to simplify later re-establishment of the network
    • This has been done on 12/31/2012. See tar files named /atlas/data08/ball/admin/fs_saves_dir/restore_HOSTNAME.tar
  • Save anything else you want, as the disks will be reformatted
These steps implement the rebuild
The PXE details have changed with the implementation of Cobbler at both sites
  • Make sure the hard disk is first in the boot order, followed by the onboard NIC
  • Bare metal build of umfs09, PXE booting from umrocks/msurx
    • If PXE refuses to build on the disk specified via umrocks/msurx, use Alt-F2 to get a console prompt, then
    • "fdisk -l" to find the correct device to use
    • rocks set host attr umfs09 sysdevice sdaq
    • rocks sync config
    • Now, redo the PXE boot
    • cf-execd runs via its cron schedule, but does not (apparently?) install new rpms, so....
  • service cfengine3 stop
  • chkconfig cfengine3 off
  • Set the root pw to the standard one, replacing that used for PXE builds
  • Make sure the public NIC is up. This may be a good time to get the restore_HOSTNAME.tar and examine the content (unpack it into /root)
    • This should be unnecessary on the msufsNN machines. Should be....
    • Check the /etc/sysconfig/network file. Make sure it looks something like this:

          NETWORKING=yes
          HOSTNAME=umfs09.aglt2.org
          GATEWAY=192.41.230.1

At MSU, we would want
          GATEWAY=192.41.236.1

    • If at MSU, after networking is up, ensure there is a route to the UM private network, 10.10.0.0, eg,
      • 10.10.0.0 10.10.128.1 255.255.240.0 UG 0 0 0 em1
  • cf-agent -f failsafe.cf
  • cf-agent -K
  • cf-agent -K (yes, run it twice)
  • chkconfig dcache-server off
  • Reboot into UL kernel
  • cf-agent -K
    • afs may be chkconfig off here. If so, chkconfig it on and start the service
  • tools/remake_xfs_fstab.sh
  • mount -av
  • service set_readAhead start
  • kinit admin
  • aklog
  • /root/tools/restore_dcache_metadata.sh
  • cf-agent -K
  • cf-agent -K
  • Make sure /dcache/pool/meta is properly linked to /var/dcache/meta/dcache. If one is correct, all mounts should be correct at this point
  • chkconfig cfengine3 on
  • service cfengine3 start
  • Make sure that /etc/grid-security/certificates exists. If not
    • /bin/bash /root/tools/rsync-certificates.sh
  • If /tmp/dell-storage-alerts.pl.debug does not exist (CF3 Catch-22), then set up the Dell alerts
    • /root/tools/configure_dell_alerts.sh
  • Manually set up shelf consistency check crons as the disks were not yet mounted when CF3 installed the tools for this
    • /root/tools/consistency.sh

  • Run one final yum update, which seems to fix some minor rpm issues not handled by CFEngine3; the machine is now ready for dCache use.
    • dcache start
  • Place the shelves back to read/write mode, ie, "notrdonly" instead of "rdonly" in the head01 command sequence above.
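
For completeness, that final step mirrors the rdonly sequence at the top of this example, substituting the pool name pattern of the host just rebuilt:

[root@head01 ~]# ssh -c blowfish -p 22223 -l admin localhost

[head01.aglt2.org] (local) admin > cd PoolManager
[head01.aglt2.org] (PoolManager) admin > psu set pool umfs09* notrdonly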

Example: Building (and rebuilding) umfs03 to SL7.3

  • On head01, set all disk pools read only, then wait 15 minutes before beginning, for example

[root@head01 ~]# ssh -c blowfish -p 22223 -l admin localhost

    dCache Admin (VII) (user=admin)

[head01.aglt2.org] (local) admin > \c PoolManager
[head01.aglt2.org] (PoolManager) admin > psu set pool umfs03* rdonly
4 pools updated
[head01.aglt2.org] (PoolManager) admin > \q

  • dcache stop
    • Make sure that all java processes have properly quit running
  • kinit admin
  • aklog
  • service cfengine3 stop
  • /root/tools/save_final_dcache_metadata.sh
    • When this is run, stay away from 15 minutes after the hour, when the cron task runs, to avoid any conflict
    • This will remove the cron task
  • The network info, for systems that use bonded NICs, should be saved to simplify later re-establishment of the network
    • Cobbler now seems to do all of this correctly, but this command may be useful (save the resulting tar file off-machine)
      • cd /etc/sysconfig/network-scripts
      • tar cf /root/`hostname -s`_net_save.tar ../network ifcfg* route*
  • Save anything else you want, as the disks will be reformatted, eg
    • /var/tmp/consistency_data.txt
    • /etc/fstab
    • /etc/modprobe.d/bonding.conf

Now, PXE build the system via Cobbler. Upon boot and completion of the initial cfe runs do the following.

  • systemctl stop cfengine3
  • systemctl disable cfengine3
  • Set the root pw to the standard one, replacing that used for PXE builds
  • systemctl disable dcache-server
  • /bin/rm -rf /dcache
    • This is just a directory that let cf3 run during the initial run set, so clean it away now
  • Restore files. From the saved list above, this means only
    • /var/tmp/consistency_data.txt
  • tools/remake_xfs_fstab.sh
  • mount -av
  • systemctl start set_readAhead
  • kinit admin
  • aklog
  • /root/tools/restore_dcache_metadata.sh
  • Update firmware (optional)
    • /root/tools/update_dell_firmware.sh
  • reboot
    • Rebooting makes sure all services are started properly, and applies the firmware updates
  • cf-agent -K
  • cf-agent -K
  • Make sure /dcache/pool/meta is properly linked to /var/dcache/meta/dcache. If one is correct, all mounts should be correct at this point
  • systemctl enable cfengine3
  • systemctl start cfengine3
  • Make sure that /etc/grid-security/certificates exists. If not
    • /root/tools/rsync-certificates.sh
  • If /tmp/dell-storage-alerts.pl.debug does not exist (CF3 Catch-22), then set up the Dell alerts
    • /root/tools/configure_dell_alerts.sh
  • Manually set up shelf consistency check crons as the disks were not yet mounted when CF3 installed the tools for this
    • /root/tools/consistency.sh

  • Run one final yum update, which seems to fix some minor rpm issues not handled by CFEngine3; the machine is now ready for dCache use.
    • systemctl enable dcache-server
    • dcache start
  • Place the shelves back to read/write mode, ie, "notrdonly" instead of "rdonly" in the head01 command sequence above.

Problems on SL7

See SL7 Issues for details

-- BobBall - 18 Oct 2012