Getting the MSU site up in ROCKS
Had some initial difficulty getting the compute nodes installed. Restarted from scratch with the following plan:
- Clean-up /home/install/site-profiles
- Clean-up /home/install/contrib
- Clean-up DB
- Put U-M configs and files in place
- Add appliance to DB
- Install a node
- Inventory hardware
This actually went pretty smoothly and we successfully installed a compute node. We then added the rest of the compute nodes.
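A sketch of what the clean-up steps might look like on the frontend (the ".old" directory names are made up here for illustration; the DB clean-up can be done with insert-ethers --remove, which is used later in these notes):
mv /home/install/site-profiles /home/install/site-profiles.old   # set aside the old profiles before putting the U-M versions in place
mv /home/install/contrib /home/install/contrib.old               # likewise for the contrib area
insert-ethers --remove dc2-102-20                                 # drop stale node entries from the cluster database, one per node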
Doing the Compute Node Mass Install
Put a description here.
Post Install Config
Set the BIOS settings that I wanted...
Resolved Issues
Node dc2-102-??? didn't install, needed to redo.
Repeatedly failed to get dc2-102-20 to install. The default ROCKS partition table kept being recreated, then I'd get the message "failed to find /" (note that this is from memory) and the install would exit. I'd then go to the terminal in the installer and use dd to nuke the partitions. I also tried doing "insert-ethers --remove dc2-102-20". Finally I physically swapped the two drives in the machine (reversed their positions) and then the install worked. This node was one that I had been doing test installs on, so it had some partitions on it before we rolled out the mass install.
You have to make sure partitions don't have a .rocks-release file if you want to repartition. You can force a repartition by changing extend-auto-partition.xml. -SPM
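A sketch of clearing old partitioning from the installer shell (the device names /dev/sda and /dev/sda1 are assumptions for this node; the dd step is along the lines of what was done above, and removing .rocks-release is the alternative SPM describes):
dd if=/dev/zero of=/dev/sda bs=512 count=1   # wipe the MBR, including the partition table
# or, keep the partitions but let the installer repartition them:
mount /dev/sda1 /mnt
rm -f /mnt/.rocks-release                    # without this marker, ROCKS will not preserve the partition
umount /mnt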
Outstanding Issues
Bob raised these issues after looking things over:
I just logged into the machine dc2-102-1. Looks good! /etc/sysconfig/network and /etc/resolv.conf are inconsistent, but that is no big deal, as eth1 is needed to really make this work right.
Hmmm, no 411 service running on msurox, so /etc/hosts is not distributed. With that being the case, you can't find host msurox.local, so automount of /home/install isn't possible, so that explains at least part of the first paragraph.
Condor does not configure either because of this.
Condor is not installed on msurox. So, no setups for condor in /etc/profile.d, and so on.
Comments on Bob's issues
Need to setup eth1 and also change the gateway (in /etc/sysconfig/network).
/etc/resolv.conf is broken --- UM has customized it, but they use ROCKS 4.2 and DNS works differently in ROCKS 4.3. This is an issue with 411 and /etc/hosts as well --- /etc/hosts is no longer distributed by 411.
Condor config: will have to reinstall a node with the new /etc/resolv.conf to see how far condor configuration gets...
Don't intend to install Condor on the ROCKS headnode.
Other stuff:
- need to fix netmask on private network to 255.255.254.0.
- /etc/resolv.conf has perms 444. Doesn't have the headnode for DNS. Has search items that we may not want; search also doesn't work as expected wi...
- added headnode as first nameserver and can look up other compute nodes...
- due to misuse of the insert-ethers --baseip option, dc1-102-1 is at 10.10.2.253 instead of .254; since its public IP is already registered as 192.41.231.254, I might change this... How-to? (see the sketch after this list)
+SPM: remove the node using insert-ethers, then re-register it...-SPM
- similarly, the nodes in rack 104 have private IPs starting with .233 instead of .214.
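A sketch of re-registering the mis-addressed node at the intended IP, per SPM's note above (the exact --baseip syntax is an assumption; check the insert-ethers docs):
insert-ethers --remove dc1-102-1     # drop the mis-numbered node from the database
insert-ethers --baseip=10.10.2.254   # re-register, starting at the address we actually want
# then reinstall the node so it picks up the new address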
Modifying Network Config
Had set up the private network as 10.10.0.0/255.255.252.0. Want to change this to 10.10.2.0/255.255.254.0.
It used to be that the automatically set up phpMyAdmin page allowed you write access to the database, but in ROCKS 4.3 it is read-only. However, if you attempt an edit, the SQL command will be shown on the webpage along with the permission error message. You can then use this SQL command in the command-line tool "mysql" to modify the database. Here is how I changed the private network in the database by hand; note that this can be done more easily with the rocks command:
$ mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 330 to server version: 4.1.20
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> use cluster;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> update subnets set netmask = '255.255.254.0' where ID = 1 limit 1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> update subnets set subnet = '10.10.2.0' where ID = 1 limit 1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select * from subnets where name = "private";
+----+---------+-----------+---------------+
| ID | name | subnet | netmask |
+----+---------+-----------+---------------+
| 1 | private | 10.10.2.0 | 255.255.254.0 |
+----+---------+-----------+---------------+
1 row in set (0.00 sec)
mysql> exit
Bye
The way to do this in ROCKS 4.3 is:
$ rocks set network netmask private 255.255.254.0
$ rocks set network subnet private 10.10.2.0
Change a node's IP address:
$ rocks set host interface ip dc2-102-1 eth0 10.10.2.254
then reinstall the node.
Noticed that the private network netmask is also in the app_globals table as a Kickstart item; change:
rocks set var service=Kickstart component=PrivateNetwork value=10.10.2.0
rocks set var service=Kickstart component=PrivateNetmask value=255.255.254.0
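After changing these, the generated config files presumably need to be pushed out again; a minimal sketch (rocks sync config is the same command referenced under DNS below; rocks list network is assumed to be available in this release for verification):
rocks sync config      # regenerate/push config derived from the database
rocks list network     # check that the private subnet/netmask now show the new values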
Remaining private net issues
- msurox eth0 - fixed netmask by hand
- in the database, the private network interfaces for all nodes show the old value - will this get fixed with a reinstall?
- what needs to be done to force a reinstall of a node? %ICON{"tip"}% Use /boot/kickstart/cluster-kickstart-pxe on the node. -SPM I changed the private IP address on dc2-102-2 and rebooted it, but it just booted normally without reinstalling. Seems that adding a public IP address is what forces a reinstall? (can do shoot-node 10.10.2.252; see the sketch below)
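For reference, the two ways of forcing a reinstall mentioned in this item, as a sketch:
/boot/kickstart/cluster-kickstart-pxe   # run on the node itself: flags a PXE reinstall and reboots
shoot-node 10.10.2.252                  # or run from the frontend against the node's address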
Add config for eth1
This is specific to ROCKS 4.3 using the rocks command. After the node has been installed once, the installer will have found eth1 and added it to the database with its MAC and module settings.
/opt/rocks/bin/rocks add host interface dc2-102-1 eth1
/opt/rocks/bin/rocks set host interface mac dc2-102-1 eth1 00:19:b9:ef:d0:b2
/opt/rocks/bin/rocks set host interface module dc2-102-1 eth1 bnx2
Add the subnet (refers to the subnets table in the DB), IP, and gateway values to the database manually:
rocks set host interface ip dc2-102-1 192.41.231.254
rocks set host interface subnet dc2-102-1 public
rocks set host interface gateway dc2-102-1 192.41.231.1
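To verify, listing the node's interfaces should now show eth0 on the private subnet and eth1 on public (assuming rocks list host interface is available in 4.3):
/opt/rocks/bin/rocks list host interface dc2-102-1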
Further Notes
DNS...
Have a node reinstalled with just the default name resolution setup (no customization). Here is /etc/resolv.conf:
[root@dc2-102-1 ~]# more /etc/resolv.conf
#
# Do NOT Edit (generated by dbreport)
#
# Private side resolver configuration file
#
nameserver 10.10.2.15
search local aglt2.org
Note that a name lookup for dc2-102-1 returns both the local and public IP addresses (as expected), but on sequential lookups sometimes the local comes first and sometimes the public comes first? Is that a problem? Also note that reverse lookup is not working.
It turns out that named.conf on the frontend is not being updated automatically with a rocks sync config. So forced it to be updated with:
[root@msurox install]# dbreport named > /etc/named.conf
[root@msurox install]# diff /etc/named.conf /etc/named.conf.nov7
62c62
< zone "2.10.10.in-addr.arpa" {
---
> zone "10.10.in-addr.arpa" {
10.10.3.x addresses couldn't be reverse looked up, and 10.10.2.x reverse lookups return multiple names if there is a corresponding 10.10.3.x address:
[root@dc2-102-1 ~]# nslookup 10.10.2.254
Server: 10.10.2.15
Address: 10.10.2.15#53
254.2.10.10.in-addr.arpa name = dc2-102-1.local.
254.2.10.10.in-addr.arpa name = pdu-101-1.local.
[root@dc2-102-1 ~]# nslookup 10.10.3.254
Server: 10.10.2.15
Address: 10.10.2.15#53
** server can't find 254.3.10.10.in-addr.arpa: NXDOMAIN
[root@dc2-102-1 ~]# host pdu-101-1
pdu-101-1.local has address 10.10.3.254
If I change the named.conf file so that the reverse zone is 10.10.in-addr.arpa, and change the individual host entries in the reverse zone file to be like 254.2 ..., then things work as desired. Not sure why dbreport isn't creating the config files this way.
+SPM Rocks bug?-SPM
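For reference, a sketch of the hand-edited result described above (the reverse zone file name is an assumption, keep whatever name dbreport generated; the PTR targets come from the lookups shown earlier):
zone "10.10.in-addr.arpa" {
        type master;
        notify no;
        file "reverse.rocks.domain";
};
with host entries in the reverse zone file like:
254.2    IN    PTR    dc2-102-1.local.
254.3    IN    PTR    pdu-101-1.local.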
Fix
A revision of the rocks-command is here: http://www.rocksclusters.org/ftp-site/pub/rocks/fixes/4.3/rocks-command-4.3-1.x86_64.rpm. It adds functionality to the dns.py report generator that can deal with our /23 private network.
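Applying it should just be an RPM upgrade on the frontend, then regenerating named.conf as above and restarting named (a sketch; assumes the standard rpm and service tools, not anything ROCKS-specific):
rpm -Uvh http://www.rocksclusters.org/ftp-site/pub/rocks/fixes/4.3/rocks-command-4.3-1.x86_64.rpm
dbreport named > /etc/named.conf   # regenerate named.conf with the fixed report generator
service named restart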
Having both the public and private IPs in DNS seems to be by design. There are some messages on the email list about that. I don't understand the full implications of or reason for this design.
We'll see if it causes us problems...
app_globals
Some of the values in app_globals will need to be changed when the network is up; currently the gateway points to 192.41.231.2, which is the VMware host...
Will need to change Kickstart_PublicGateway and Kickstart_PublicDNS.
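Presumably these follow the same pattern as the Private* values above; a sketch with values to be confirmed (the gateway matches the one set on eth1 above, the DNS server value is a placeholder):
rocks set var service=Kickstart component=PublicGateway value=192.41.231.1
rocks set var service=Kickstart component=PublicDNS value=<dns-server-ip>   # placeholder, actual DNS server TBD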
--
TomRockwell - 14 Nov 2007