Restarting the MSU OSG Grid
How to restart the system after an outage.
Bring Up and Check Services
Cluster Services
General cluster services such as Kerberos, YP, NFS, and AFS must be up. The exact list depends somewhat on what is going on, but bringing up the standard cluster services is assumed and not covered here.
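Before touching dCache, it is worth a quick check that the base services are actually running. A minimal sketch (the init script names below are typical Red Hat ones and are assumptions; adjust for this cluster):

# Check the basic cluster services (init script names are assumptions)
service portmap status     # RPC portmapper, needed by NFS and pnfs
service ypbind status      # YP/NIS client
service nfs status         # NFS server, where applicable
service afs status         # AFS client, if the node uses AFS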
MSU2
msu2.aglt2.org is the dCache admin node and runs the pnfs server. Boot it first if it is down.
root@msu2 ~# service pnfs start
Starting pnfs services (PostgreSQL version):
Shmcom : Installed 8 Clients and 8 Servers
Starting database server for admin (/opt/pnfsdb/pnfs/databases/admin) ... O.K.
Starting database server for data1 (/opt/pnfsdb/pnfs/databases/data1) ... O.K.
Starting database server for test (/opt/pnfsdb/pnfs/databases/test) ... O.K.
Starting database server for dzero-cache (/opt/pnfsdb/pnfs/databases/dzero-cache) ... O.K.
Waiting for dbservers to register ... Ready
Starting Mountd : pmountd
Starting nfsd : pnfsd
The postgresql and pnfs services should start automatically at boot. /pnfs/fs should be mounted:
root@msu2 ~# df /pnfs/fs
Filesystem 1K-blocks Used Available Use% Mounted on
localhost:/fs 400000 80000 284000 22% /pnfs/fs
root@msu2 ~# mount | grep pnfs
localhost:/fs on /pnfs/fs type nfs (rw,udp,intr,noac,hard,nfsvers=2,addr=127.0.0.1)
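If either service did not come up on its own, the boot configuration can be checked and the services started by hand. A sketch using standard chkconfig/service commands (the init script names are assumptions based on the commands above):

# Are the services enabled at boot?
chkconfig --list postgresql
chkconfig --list pnfs
# Start them manually if needed
service postgresql start
service pnfs start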
Start the dCache services. Note that the replica manager is not currently in use, so you will see a message about it not starting:
Nov 2008: use "service dcache start" instead of the command below.
root@msu2 ~# /opt/d-cache/bin/dcache-core start
Starting dcache services:
Starting lmDomain Done (pid=7810)
Starting dCacheDomain Done (pid=7860)
Starting pnfsDomain Done (pid=7910)
Starting dirDomain Done (pid=7960)
Starting adminDoorDomain Done (pid=8025)
Starting httpdDomain 6 Done (pid=8080)
Starting utilityDomain Done (pid=8143)
Starting gPlazma-msu2Domain Done (pid=8204)
Starting infoProviderDomain Done (pid=8262)
Batch file doesn't exist : /opt/d-cache/config/replica.batch, can't continue ...
***TDR*** in dcache-srm start/stop script doing start
Using CATALINA_BASE: /opt/d-cache/libexec/apache-tomcat-5.5.20
Using CATALINA_HOME: /opt/d-cache/libexec/apache-tomcat-5.5.20
Using CATALINA_TMPDIR: /opt/d-cache/libexec/apache-tomcat-5.5.20/temp
Using JRE_HOME: /opt/d-cache/jdk1.6.0_03
Pinging srm server to wake it up, will take few seconds ...
Done
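To confirm that the domains listed above stayed up, check for the running Java processes and look at a domain log. A minimal sketch (the log location is an assumption and may differ on this install):

# Each dCache domain runs as a java process
ps auxw | grep [j]ava
# Check a domain log for startup errors (log path is an assumption)
tail -n 50 /var/log/dCacheDomain.log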
MSU4
msu4.aglt2.org provides a RAID pool area. Boot the system if it is down, then start the dcache-core and dcache-pool services:
Note: If msu4 reports that the pnfs service is not available, but it is running on msu2, suspect firewall issues.
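One way to separate firewall trouble from a real pnfs problem is to check that msu4 can see the pnfs services registered with msu2's portmapper. A minimal sketch (pnfs registers its mountd/nfsd dynamically, so query the portmapper rather than guessing ports):

# From msu4: can we see msu2's portmapper and registered RPC services?
rpcinfo -p msu2.aglt2.org
# On msu2: review the local firewall rules for anything blocking msu4
iptables -L -n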
The dcache services mount /pnfs/msu-t3.aglt2.org/ (this mount is not done in /etc/fstab). If they have been stopped, you can also drop the mount for a fresh start.
Nov 2008: use "service dcache start" instead of the command below.
root@msu4 ~# /opt/dcache/bin/dcache-core start
/pnfs/msu-t3.aglt2.org/ not mounted - going to mount it now ...
Starting dcache services:
Starting dcap-msu4Domain Done (pid=561037)
Starting gridftp-msu4Domain Done (pid=561104)
Starting gsidcap-msu4Domain Done (pid=561168)
root@msu4 ~# /opt/dcache/bin/dcache-pool start
start dcache pool: Starting msu4Domain Done (pid=561339)
The pool should be mounted and available:
root@msu4 ~# df -h | grep pool
/dev/sdb 4.1T 575G 3.6T 14% /dpool/pool1
root@msu4 ~# time ls -l /pnfs/msu-t3.aglt2.org/dzero/cache/upload/
...
real 0m34.674s
user 0m0.130s
sys 0m0.877s
You can look at
/var/log/dcache/msu4Domain.log
to see that the pool is healthy.
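For example, a routine look at the end of the log (no particular output is promised; just check for obvious errors):

# Recent pool activity
tail -n 100 /var/log/dcache/msu4Domain.log
# Any errors logged?
grep -i error /var/log/dcache/msu4Domain.log | tail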
msu-osg
msu-osg.aglt2.org provides the frontend services for the compute element; e.g., it is the grid gatekeeper. It runs as a VMware Server client on msu1.aglt2.org. If it needs to be booted, log in to msu1 and see the script /root/start-vmware-client.sh, which starts the clients from the command line in a way that detaches them from the terminal (you can start them, log out, and they continue to run). You can also start them from the VMware GUI console (the vmware command). You may need to start the VMware services first (service vmware start).
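If the script is not handy, the clients can in principle be started by hand. A sketch assuming the vmware-cmd utility that ships with VMware Server 1.x is available (check /root/start-vmware-client.sh for what is actually done on msu1; the .vmx path is taken from the process listing below):

# Make sure the VMware services are running first
service vmware start
# Start the MSU-OSG client from its config file (vmware-cmd availability is an assumption)
vmware-cmd /vmware-disks/MSU-OSG/MSU-OSG.vmx start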
The running client processes are called vmware-vmx:
[root@msu1 ~]# ps auxw | grep vmware-vmx
root 30590 77.3 1.2 328976 198644 ? S<sl 16:15 0:36 /usr/lib/vmware/bin/vmware-vmx -C /vmware-disks/MSUROX/MSUROX.vmx -@ ""
root 30609 92.3 0.8 272020 144600 ? S<sl 16:15 0:34 /usr/lib/vmware/bin/vmware-vmx -C /vmware-disks/MSU-OSG/MSU-OSG.vmx -@ ""
Condor
Log in to msu-osg and check that the condor server is running (this is just the three slots on the frontend, but when the workers are up, you'll see them as well):
[root@msu-osg ~]# condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@msu-osg.aglt LINUX X86_64 Unclaimed Idle 0.920 654 0+00:00:04
slot2@msu-osg.aglt LINUX X86_64 Unclaimed Idle 0.000 654 0+00:00:05
slot3@msu-osg.aglt LINUX X86_64 Unclaimed Idle 0.000 654 0+00:00:06
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 3 0 0 3 0 0 0
Total 3 0 0 3 0 0 0
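If condor_status returns nothing, check that the Condor daemons are actually running on msu-osg. A sketch assuming the standard Condor init script is installed:

# Is the condor_master process up?
ps -ef | grep [c]ondor_master
# Start Condor if it is not (init script name is an assumption)
service condor start
# The local job queue can be inspected with
condor_q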
Grid services
The Globus grid services don't need to be started; the gatekeeper is run from xinetd. You can check the log at /msu/opt/osg/globus/var/globus-gatekeeper.log for connections.
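A quick way to confirm the gatekeeper is being served by xinetd (the name of the xinetd config entry is an assumption; the log path is the one above):

# Is xinetd running, and is there a gatekeeper entry?
service xinetd status
grep -il gatekeeper /etc/xinetd.d/*
# Watch for incoming connections
tail -f /msu/opt/osg/globus/var/globus-gatekeeper.log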
Compute Nodes
The nodes dc2-102-22 to dc2-102-42 (20 nodes) are Dell DZero compute nodes currently in ROCKS. They can be reinstalled as needed; ROCKS is set up to give them their proper condor config and to put them into the ganglia "MSU OSG" cluster for monitoring.
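Once the workers are booted, they should report back into the Condor pool on msu-osg. A simple check from msu-osg (the hostname pattern comes from the node names above):

# Count the worker slots that have rejoined the pool
condor_status | grep dc2-102 | wc -l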
--
TomRockwell - 17 Jun 2008