Updating and Upgrading dCache Headnodes for AGLT2 June 2013
The dCache headnodes head01.aglt2.org and head02.aglt2.org were transitioned to VMware VMs in 2012. As of the beginning of June they are running:
- dCache 2.2.8
- Scientific Linux 5.8 64-bit
- Postgresql 9.0.10
We wish to transition these instances to:
- dCache 2.2.12 --- Most recent version with minor fixes
- Scientific Linux 6.4 64-bit --- We are moving to all SL6 systems
- Postgresql 9.2.4 --- Supports concurrent reindexing and is needed to minimize downtime when upgrading dCache to 2.6.x
To help with the process we have set up the original physical headnodes with new SSD disk areas and cleanly installed them with Scientific Linux 6.4, Postgresql 9.2 and dCache 2.2.12.
Plan for transitioning to updated version
We have installed Slony on both the original VM headnodes and the new physical nodes. See
https://www.aglt2.org/wiki/bin/view/AGLT2/SlonyReplicationdCache for details. Using Slony allows us to replicate between Postgresql and OS versions. Once Slony was installed, all tables needed for dCache on both head01 and head02 are being replicated to n-head01 and n-head02 (the physical systems with SSDs). Here is the plan for transitioning the headnodes with minimal disruption:
- Create needed scripts to help with the transition.
- On Head01 create /root/dcache-local-scripts and move the existing head01 scripts (perl, bash) into this area
- Create 'save-dcache-config.sh' to capture all the needed configuration details for:
- dCache
- OS (any grid-security, crontabs or customizations in the OS areas that need preservation)
- Create 'save-postgresql-config.sh' to capture all the needed Postgresql and Slony config info
- Create 'restore-dcache-config.sh' to reverse the process, renaming existing files suitably before overwriting.
- Create 'restore-postgresql-config.sh' to reverse the process for 'save-postgresql-config.sh', renaming files suitably before overwriting.
- Create 'rename-node.sh' to migrate the node OS configuration from oldname and oldIP to newname and newIP
- On Head02 create /root/dcache-local-scripts and move existing head02 scripts (perl, bash) into this area
- Copy the 'save-dcache-config.sh' created for head01 to head02 and update appropriately for head02
- Copy the 'restore-dcache-config.sh' created for head01 to head02 and update appropriately for head02
- Copy the 'save-postgresql-config.sh' created for head01 to head02 and update appropriately for head02
- Copy the 'restore-postgresql-config.sh' created for head01 to head02 and update appropriately for head02
- Copy the 'rename-node.sh' created for head01/n-head01 to head02 and update appropriately for head02
- Prepare new VMs from existing n-head01 and n-head02
- Use VMware to P2V from existing n-head01/02 before proceeding further. Don't copy the database in the P2V process to save time/space.
- Stop postgresql hot-standby replication to t-head01 and t-head02 and upgrade those nodes
- Update the Postgresql installation to 9.2.4
- Clear the existing database
- This removes a copy of the database but will allow us to much more quickly restore a replica after the transition.
- Verify Slony replication is running on head01, n-head01, head02 and n-head02
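One way to verify that Slony is keeping up (both here and in the 'check slony' steps further down) is to query the sl_status view in the replication schema on the origin node. A minimal sketch for head01 follows; the database and schema names are assumptions that should match the slon_tools configuration (head02 would use the chimera/rephot equivalents):
# On head01, as the postgres user
su - postgres
psql -d dcache -c "select st_received, st_lag_num_events, st_lag_time from _rep_dcache.sl_status;"
psql -d billing -c "select st_received, st_lag_num_events, st_lag_time from _rep_billing.sl_status;"
# st_lag_num_events should be at (or near) 0 and st_lag_time small when the subscriber is caught up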
At this point we have the tools in place to migrate from the VMs to the new physical nodes.
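The very next step uses 'save-dcache-config.sh'; a minimal sketch of what that script might contain on head01 is shown here. The file list and paths are assumptions based on a standard dCache/SL6 layout and will need adjusting per node:
#!/bin/bash
# Sketch of save-dcache-config.sh for head01 -- adjust the file list per node
DEST=/tmp/dcache-conf
rm -rf $DEST && mkdir -p $DEST
cp -a /etc/dcache $DEST/etc-dcache                            # dCache config and layout files
cp -a /etc/grid-security $DEST/grid-security                  # host cert/key and certificates area
cp -a /etc/cron.d $DEST/cron.d                                # crontabs that need preserving
cp -a /root/dcache-local-scripts $DEST/dcache-local-scripts   # local perl/bash scripts
tar cjf /tmp/head01-dcache.bz2 -C /tmp dcache-conf            # tarball copied to n-head01 below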
- Start with head01: capture all relevant configuration files for dCache and Postgresql, plus the needed local system files and crons, using the 'save-dcache-config.sh' script in /root/dcache-local-scripts.
- Running this saves files in /tmp/dcache-conf/..
- Output is two tarballs: /tmp/head01-dcache.bz2 and /root/billing.bz2
- Copy the saved tarballs to n-head01
- Unpack the head01-dcache.bz2 into the local /tmp area
- Verify Slony is properly replicating the 'dcache' and 'billing' DBs
- In OIM put an "at-risk" notice for the AGLT2_SE noting that the head01.aglt2.org services will be transitioning to a new host
- On head01 (VM), stop dCache: dcache stop
- Check slony to verify all records to billing and dcache DBs are completely replicated to n-head01
- Configure network to NOT start if rebooted
- Shutdown head01
- Move to n-head01 for the following steps
- Stop postgresql
- Run 'rename-node.sh' to readdress the node from n-head01 to head01 (see the sketch after this list)
- Verify host is renamed and reachable on head01.aglt2.org and head01.local
- Run 'restore-dcache-config.sh' to put the correct dCache, postgresql and OS configs in place
- Start postgresql and verify it starts OK
- Start dcache and verify proper operation
- If services are working OK, remove OIM at-risk. We have completed head01 update
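A minimal sketch of what 'rename-node.sh' does on an SL6 host is given below. The hostnames, IP address and interface name (eth0) are placeholders/assumptions; note also the caveat recorded under 'Things I Forgot' that the network restart hangs when run via ssh, so run it out-of-band:
#!/bin/bash
# Sketch of rename-node.sh: readdress an SL6 node from n-head01 to head01
OLDNAME=n-head01.aglt2.org
NEWNAME=head01.aglt2.org
NEWIP=10.10.1.41                                   # placeholder, substitute the real head01 address
sed -i "s/^HOSTNAME=.*/HOSTNAME=$NEWNAME/" /etc/sysconfig/network
hostname $NEWNAME
sed -i "s/^IPADDR=.*/IPADDR=$NEWIP/" /etc/sysconfig/network-scripts/ifcfg-eth0
sed -i "s/$OLDNAME/$NEWNAME/g" /etc/hosts
service network restart                            # hangs over ssh -- run from the console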
Next we can repeat the same basic steps above for head02 migration:
- On head02, capture all relevant configuration files for dCache and Postgresql, plus the needed local system files and crons, using the 'save-dcache-config.sh' script in /root/dcache-local-scripts.
- Running this saves files in /tmp/dcache-conf/..
- Output is two tarballs: /tmp/head02-dcache.bz2 and /root/billing.bz2
- Copy the saved tarballs to n-head02
- Unpack the head02-dcache.bz2 into the local /tmp area
- Verify Slony is properly replicating the 'dcache' and 'billing' DBs
- In OIM put an "at-risk" notice for the AGLT2_SE noting that the head02.aglt2.org services will be transitioning to a new host
- On head02 (VM), stop dCache: dcache stop
- Check slony to verify all records to billing and dcache DBs are completely replicated to n-head02
- Configure network to NOT start if rebooted
- Shutdown head02
- Move to n-head02 for the following steps
- Stop postgresql
- Run 'rename-node.sh' to readdress the node from n-head02 to head02
- Verify host is renamed and reachable on head02.aglt2.org and head02.local
- Run 'restore-dcache-config.sh' to put the correct dCache, postgresql and OS configs in place
- Start postgresql and verify it starts OK
- Start dcache and verify proper operation (a quick check is sketched after this list)
- If services are working OK, remove OIM at-risk. We have completed head02 update
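A quick sketch of how "verify proper operation" might be done after starting dCache on either headnode; the log path and example port are assumptions based on a default dCache 2.2 layout:
dcache status                       # every domain defined in the layout should be listed as running
tail -n 50 /var/log/dcache/*.log    # look for startup errors in the domain logs
ss -ltn                             # confirm the expected doors are listening (e.g. 8443 for SRM on head01)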
At this point we should have two physical nodes running the head01 and head02 dCache services. They should be able to replicate via Postgresql to t-head01 and t-head02 respectively
once those nodes are updated. Until t-head01 and t-head02 are updated to Postgresql 9.2.4 we have no replica of the dCache DB and are at risk!
Restoring Hot-Standby or Replication
There are two choices:
- We can create two new VMs called n-head01 and n-head02, built just like we built the new physical nodes (see the step above). We can then set up Slony replication to these nodes right after head01 and head02 transition to their new hosts.
- We can update t-head01 and t-head02 to match the same version of Postgresql and restart the hot-standby process (sketched below). In fact this can be done before the migration.
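For the second choice, re-seeding a hot standby is roughly the standard Postgresql 9.2 procedure. The sketch below runs on t-head01 against the new head01 (repeat for t-head02/head02); the 'replicator' user and data directory are assumptions, and the master's pg_hba.conf must allow replication connections from the standby:
service postgresql-9.2 stop
rm -rf /var/lib/pgsql/9.2/data/*
su - postgres -c "pg_basebackup -h head01.aglt2.org -U replicator -D /var/lib/pgsql/9.2/data -X stream -P"
# Minimal recovery.conf for a streaming standby; an archive-based standby could instead
# use a restore_command pointing at the /atlas/data08/postgres/archive area noted below
cat > /var/lib/pgsql/9.2/data/recovery.conf <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=head01.aglt2.org user=replicator'
EOF
chown postgres:postgres /var/lib/pgsql/9.2/data/recovery.conf
# hot_standby = on must also be set in the standby's postgresql.conf
service postgresql-9.2 start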
Things I Forgot
On head01 I forgot:
- The /root/.ssh area
- The /etc/ssh files (need to especially keep the keys)
- The /var/lib/pgsql/9.2/data/postgresql.conf must NOT simply reuse the postgresql.conf from the original 9.0 instance. Instead edit the 9.2 instance's file and set
archive_mode=on
archive_command = 'cp -i %p /atlas/data08/postgres/archive/%f </dev/null'
vacuum_defer_cleanup_age = 1000
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
- The /etc/fstab needs a mount for /pnfs
head02.aglt2.org:/pnfs /pnfs nfs rw,hard,nfsvers=3 0 0
- Make sure /pnfs exists
- Make sure the /etc/grid-security area has the certificates directory and it is being updated by an /etc/cron.d/rsync-certificates.cron entry
- Make sure the /etc/grid-security/hostkey.pem is owned by dcache
- Copy the /var/lib/dcache/.pgpass file and make sure is it mode 600 and owned by dcache
- The network restart in rename-node.sh hangs when run via ssh. Need to run it out-of-band
- Make sure Slony is stopped and the Slony schema is removed on the new headnode before activating dCache. Otherwise the tables are locked.
su - postgres
cd /usr/pgsql-9.2/bin
./slon_kill --config /etc/slon_tools-dcache.conf
./slon_kill --config /etc/slon_tools-billing.conf
psql
\c dcache
\dnS+
drop schema _rep_dcache cascade;
\c billing
\dnS+
drop schema _rep_billing cascade;
\q
- Need to suitably update monit on nodes
- Make sure /etc/security/limits.d/90-nproc.conf is empty and chattr +i (commands are sketched after this list)
- Make sure /etc/security/limits.conf has the following settings:
* soft nofile 32000
* hard nofile 42000
* soft nproc 62000
* hard nproc 64000
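The two limits items above can be applied with a few commands; a minimal sketch (the limits.conf lines are exactly those listed above):
: > /etc/security/limits.d/90-nproc.conf            # empty the SL6 per-package nproc override
chattr +i /etc/security/limits.d/90-nproc.conf      # make it immutable so updates cannot modify it
lsattr /etc/security/limits.d/90-nproc.conf         # verify the 'i' flag is set
cat >> /etc/security/limits.conf <<'EOF'
* soft nofile 32000
* hard nofile 42000
* soft nproc 62000
* hard nproc 64000
EOF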
For head02 a similar list:
- The /root/.ssh area
- The /etc/ssh files (need to especially keep the keys)
- The /etc/exports file (needed for dCache NFS)
- Make sure new head02 has Java (jdk) 1.7 installed
- The /var/lib/pgsql/9.2/data/postgresql.conf must NOT simply reuse the postgresql.conf from the original 9.0 instance. Instead edit the 9.2 instance's file and set
archive_mode=on
archive_command = 'cp -i %p /atlas/data08/postgres/archive/%f </dev/null'
vacuum_defer_cleanup_age = 1000
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
- Make sure the /etc/grid-security area has the certificates directory and it is being updated by an /etc/cron.d/rsync-certificates.cron entry
- Make sure the /etc/grid-security/hostkey.pem is owned by dcache
- The network restart in rename-node.sh hangs when run via ssh. Need to run it out-of-band
- Had a problem with the default gateway not being set up, which caused red health issues; check for the default route.
- Make sure Slony is stopped and the Slony schema is removed on the new headnode before activating dCache. Otherwise the tables are locked.
su - postgres
cd /usr/pgsql-9.2/bin
./slon_kill --config /etc/slon_tools-chimera.conf
./slon_kill --config /etc/slon_tools-rephot.conf
psql
\c chimera
\dnS+
drop schema _rep_chimera cascade;
\c rephot
\dnS+
drop schema _rep_rephot cascade;
\q
- Need to suitably update monit on nodes (as appropriate)
- Make sure /etc/security/limits.d/90-nproc.conf is empty and chattr +i
- Make sure /etc/security/limits.conf has the following settings:
* soft nofile 32000
* hard nofile 42000
* soft nproc 62000
* hard nproc 64000
- Make sure proper iptables is in place
--
ShawnMcKee - 06 Jun 2013