Setting up gate02.grid.umich.edu as our AGL-Tier2 Gatekeeper
There are a number of steps we followed to get our new gatekeeper running.
Hardware
We had an Intel SE7520AF2 motherboard from VENUS.ultralight.org (at CERN, bldg 513) which we replaced. We used this motherboard as the basis for the new gatekeeper and purchased an Intel SC5300BASE chassis to house it. Initial installs failed (a problem with the motherboard), so we RMA'ed the board and received the replacement on September 6th. The system has dual Intel Xeon P4 3.6 GHz processors with 2 MB cache and 4 GB of DDR2-400 RAM.
Completed install of SLC V4.3 on Thursday, September 6th.
Software
After installing and upgrading (via YUM) SLC V4.3 x86_64, we tried to install Intel Server Management software V8.4. The install was mostly successful except for the IMB driver build, where the newest version of gcc (V3.4.6) complains:
gcc -O2 -DLINUX -D__KERNEL__ -DMODULE -DMODULES -I. -I/lib/modules/2.6.9-42.0.2.EL.cernsmp/build/include -DCONFIG_IA64_GENERIC -D__SMP__ -c -o imb_lin.o imb_lin.c
imb_lin.c: In function `IMBmmap':
imb_lin.c:891: error: dereferencing pointer to incomplete type
imb_lin.c:910: error: dereferencing pointer to incomplete type
imb_lin.c:910: error: dereferencing pointer to incomplete type
imb_lin.c:910: error: dereferencing pointer to incomplete type
imb_lin.c:910: error: dereferencing pointer to incomplete type
make: *** [imb_lin.o] Error 1
The older version (V3.4.5) doesn't have this problem.
We need to install and set up an alternate gcc on gate02 to allow us to compile.
September 13: No, the problem was that we were apparently using an older version of ipmidrvr.rpm. I copied the code from gate01 and things worked.
Setup Authentication using AFS/NIS/Kerberos
We needed to install our AFS software and configure NIS/Kerberos:
- We first copied the /etc/yum.conf and /etc/yum.repos.d from gate01 to gate02
- We installed openafs V1.4.1-1.4 from linat05 (our repository) after removing the CERN openafs
- We copied the /etc/krb5.conf from gate01
- We copied the /etc/nsswitch.conf from gate01
- We copied the /etc/pam.d/system-auth from gate01
At this point logins worked and users would get Kerberos TGTs but not AFS tokens. We needed to copy the /etc/ssh/sshd_config from gate01 and restart sshd. After this, users get both Kerberos TGTs and AFS tokens.
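The copy steps above can be collected into one small script. This is a minimal sketch only: the notes do not say how the files were transferred, so the use of scp as root and the short host name gate01 are assumptions. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: pull the authentication-related config files from gate01.
# Assumptions (not stated in the notes): files are transferred with
# scp as root, and gate01 resolves by its short host name.
SRC_HOST="${SRC_HOST:-gate01}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands

AUTH_FILES="/etc/krb5.conf /etc/nsswitch.conf /etc/pam.d/system-auth /etc/ssh/sshd_config"

copy_auth_files() {
    for f in $AUTH_FILES; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "scp -p root@${SRC_HOST}:${f} ${f}"
        else
            # -p keeps modification times and modes of the original file
            scp -p "root@${SRC_HOST}:${f}" "${f}" || return 1
        fi
    done
}

copy_auth_files
```

Running it with DRY_RUN=0 performs the copies. sshd still has to be restarted afterwards for the new sshd_config to take effect.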
Michigan node configuration
We have a number of tasks to get the node working as part of our cluster:
- Set up yum correctly
- Set up net-snmp
- Install the correct smartd and configure it to monitor local disks
- Add the node to Cacti
- Set up iptables
Setup OSG software
We had already installed the OSG software on gate01.grid.umich.edu. Our plan is to use this installation for gate02 as well. The software was installed in AFS at /afs/atlas.umich.edu/OSG. Details:
[root@gate02 src]# fs lsmount /afs/atlas.umich.edu/OSG64
'/afs/atlas.umich.edu/OSG64' is a mount point for volume '#OSG_041'
[root@gate02 src]# fs lsmount /afs/atlas.umich.edu/OSG32
'/afs/atlas.umich.edu/OSG32' is a mount point for volume '#OSG32_041'
[root@gate02 src]# fs lsmount /afs/atlas.umich.edu/OSG
'/afs/atlas.umich.edu/OSG' is a mount point for volume '%OSG32_041'
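In fs lsmount output the character in front of the volume name matters: '#' marks a regular mount point and '%' a read/write mount point. So OSG64 and OSG32 are regular mounts of OSG_041 and OSG32_041, while OSG is a read/write mount of OSG32_041. A small sketch for pulling the volume name and mount type out of such a line (the function name describe_mount is made up for illustration):

```shell
#!/bin/sh
# Classify one line of `fs lsmount` output by its volume prefix.
# '#' marks a regular mount point, '%' a read/write mount point.
describe_mount() {
    line="$1"
    # Extract the quoted volume reference that follows "for volume".
    vol=$(printf '%s\n' "$line" | sed -n "s/.*for volume '\(.*\)'\$/\1/p")
    case "$vol" in
        '#'*) echo "${vol#?} regular" ;;      # strip the prefix character
        '%'*) echo "${vol#?} read-write" ;;
        *)    echo "unrecognized" ;;
    esac
}

describe_mount "'/afs/atlas.umich.edu/OSG' is a mount point for volume '%OSG32_041'"
# prints: OSG32_041 read-write
```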
I copied a number of services from gate01 to gate02:
- MLD
- edg-crl-upgraded
- gris
- globus-ws
- mysql
I also copied gsiftp and globus-gatekeeper from gate01:/etc/xinetd.d to gate02.
The last change was to add the two xinetd services above to /etc/services:
gsiftp 2811/tcp # Added by the VDT
globus-gatekeeper 2119/tcp # Added by the VDT
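Since xinetd looks these services up by name, it is worth confirming both entries are present before restarting it. A minimal sketch, assuming the service names gsiftp and globus-gatekeeper used in /etc/xinetd.d above; the function name missing_services is made up for illustration, and the file path is a parameter so the check can be tried on a copy first:

```shell
#!/bin/sh
# Report which of the two xinetd service names are missing from a
# services file. Defaults to /etc/services when no file is given.
missing_services() {
    file="${1:-/etc/services}"
    for entry in "gsiftp 2811/tcp" "globus-gatekeeper 2119/tcp"; do
        name=${entry% *}    # text before the space
        port=${entry#* }    # text after the space
        # Match: name, whitespace, port/proto, then whitespace or end of line.
        if ! grep -Eq "^${name}[[:space:]]+${port}([[:space:]]|\$)" "$file"; then
            echo "$name"
        fi
    done
}
```

Calling missing_services with no argument checks /etc/services and prints nothing when both entries are in place.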
Reinstallation of OSG software
The OSG software needed an update, and running 'pacman -update' left it in a broken state. I decided to reinstall the OSG software on September 19, 2006 into /afs/atlas.umich.edu/OSG. I first moved the existing install:
cd /afs/atlas.umich.edu/OSG
mkdir OSG_Old_Install
mv * OSG_Old_Install
I then got Pacman and unpacked it into
/afs/atlas.umich.edu/OSG/pacman-3-18.5
I ran
script install_OSG_gate02.log
and then:
export VDT_PRETEND_32=1
pacman -allow save-setup
pacman -get OSG:ce
Started at 10:31 AM. Answered some trust questions and one about installing on this platform.
Details of the setup and configuration are on OSGInstallGate02.
Finished pacman install at 10:47 AM.
Configuration of Accounts/Permission/Authorization
Testing of access to gatekeeper
Installation of DQ2 Software
Setup Torque client software
We need to set up the Torque (OpenPBS) client for gate02. Part of a message from Andy Caird (CAC) follows:
You can download Torque from http://www.clusterresources.com/downloads/torque/ - we're running 2.1.2 on nyx. I've opened up the Torque ports on nyx to 141.211.43.122, so when you get it compiled, you should be able to type a command like:
qstat @nyx
and get some output.
The subnet that the nyx nodes will come from is 141.212.30.0/28 (141.212.30.0-141.212.31.255).
Let us know if you have any questions about Torque.
--andy
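Note that the prefix and the address range quoted in the message do not describe the same block: a /28 covers 16 addresses, while 141.212.30.0 through 141.212.31.255 is 512 addresses, which is a /23. The message does not say which one is intended, so this should be confirmed before writing iptables rules for the nyx nodes. A small sketch for checking what a given CIDR block covers (the function names are made up for illustration; it assumes a shell with 64-bit arithmetic):

```shell
#!/bin/sh
# Convert a 32-bit integer to dotted-quad notation.
to_dotted() {
    printf '%d.%d.%d.%d' $(( ($1 >> 24) & 255 )) $(( ($1 >> 16) & 255 )) \
                         $(( ($1 >> 8) & 255 ))  $(( $1 & 255 ))
}

# Print the first and last address covered by an IPv4 CIDR block.
cidr_range() {
    ip=${1%/*}      # address part, before the slash
    bits=${1#*/}    # prefix length, after the slash
    IFS=. read -r a b c d <<EOF
$ip
EOF
    addr=$(( (a << 24) | (b << 16) | (c << 8) | d ))
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    first=$(( addr & mask ))
    last=$(( first | (~mask & 0xFFFFFFFF) ))
    printf '%s - %s\n' "$(to_dotted "$first")" "$(to_dotted "$last")"
}

cidr_range 141.212.30.0/28   # prints: 141.212.30.0 - 141.212.30.15
cidr_range 141.212.30.0/23   # prints: 141.212.30.0 - 141.212.31.255
```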
I downloaded the source to gate02:/root/ and unpacked it in /usr/local/src:
[root@gate02 ~]# wget http://www.clusterresources.com/downloads/torque/torque-2.1.2.tar.gz
[root@gate02 ~]# cd /usr/local/src
[root@gate02 src]# tar -zxvf /root/torque-2.1.2.tar.gz
torque-2.1.2/
torque-2.1.2/contrib/
...
<verbatim>
I then tried to build and install the 32-bit version:
=# Similarly, 32bit builds on an x86_64 platform:=
=./configure CC="gcc -m32"=
=make=
*NOTE: this failed with the messages below:*
<verbatim>
gcc -m32 -DHAVE_CONFIG_H -I. -I. -I../../src/include -I../../src/include -I/usr/X11R6/include -DPBS_SERVER_HOME=\"/var/spool/torque\" -DPBSPD=\"/usr/local/bin/pbspd\" -g -O2 -D_LARGEFILE64_SOURCE -c `test -f 'qstat.c' || echo './'`qstat.c
qstat.c:107:23: tclExtend.h: No such file or directory
qstat.c: In function `attrlist':
qstat.c:1576: warning: passing arg 2 of `Tcl_Merge' from incompatible pointer type
qstat.c:1581: warning: passing arg 2 of `Tcl_Merge' from incompatible pointer type
qstat.c: In function `tcl_stat':
qstat.c:1673: warning: passing arg 2 of `Tcl_Merge' from incompatible pointer type
qstat.c:1678: warning: passing arg 2 of `Tcl_Merge' from incompatible pointer type
qstat.c:1682: warning: passing arg 2 of `Tcl_Merge' from incompatible pointer type
qstat.c: In function `tcl_run':
qstat.c:1709: warning: assignment discards qualifiers from pointer target type
make[2]: *** [qstat.o] Error 1
make[2]: Leaving directory `/usr/local/src/torque-2.1.2/src/cmds'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/usr/local/src/torque-2.1.2/src'
make: *** [all-recursive] Error 1
</verbatim>
I then reran the =configure= step with Tcl support disabled:
=./configure CC="gcc -m32" --without-tcl=
This worked. I then installed only the client commands:
=make install_clients=
After this =qstat= works:
<verbatim>
[gate02:torque-2.1.2]# /usr/local/bin/qstat @nyx.engin.umich.edu
Job id Name User Time Use S Queue
------------------- ---------------- --------------- -------- - -----
14436.nyx insds3 fiedler 473:11:2 R violi
14898.nyx BGO4W wangyi 123:36:4 R long
14899.nyx BGO4W wangyi 121:37:2 R long
...
</verbatim>
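The build failure above came from a missing =tclExtend.h=, and =--without-tcl= avoided it. A minimal sketch of choosing the configure arguments automatically follows. The function name =torque_configure_args= and the idea of probing include directories are illustrative assumptions, not something from the Torque documentation, and the check only addresses this one missing-header error.

```shell
#!/bin/sh
# Choose configure arguments for a 32-bit Torque client build.
# If tclExtend.h is not found in any of the given include directories,
# add --without-tcl, the option that made the build above succeed.
torque_configure_args() {
    args='CC="gcc -m32"'
    found=0
    for dir in "$@"; do
        if [ -f "$dir/tclExtend.h" ]; then
            found=1
        fi
    done
    if [ "$found" != "1" ]; then
        args="$args --without-tcl"
    fi
    echo "$args"
}
```

One way to use it would be =eval ./configure "$(torque_configure_args /usr/include /usr/local/include)"=, where the two directories are guesses at where the header might live.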
---+++ Testing NYX (Opteron, North Campus) Access
---+++ End-to-End Testing of PANDA Submission
-- Main.ShawnMcKee - 11 Sep 2006
</verbatim>