Adding New OSS to Lustre
Below are the steps needed to add a new OSS (Object Storage Server) to Lustre:
- Install or re-purpose an SL5.5 node
- Update all BIOS/Firmware/Drivers and run yum update
- Make sure the network is (re)configured to use bonding. Since Lustre uses the private subnet for OST access, the private VLAN should be the untagged one on the bonded interface (a sample bonding configuration follows the note below).
Note that we have scripts and RPMS in
/afs/atlas.umich.edu/hardware/Lustre. You should check there to see what is available.
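On SL5.5 the bonding is configured via the usual ifcfg files. Below is only a minimal sketch with illustrative values; the IP address, slave interfaces, and bonding mode are assumptions and must be adjusted to match the node and switch configuration:
# /etc/modprobe.conf needs the bonding alias on SL5:
alias bond0 bonding
# /etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative values)
DEVICE=bond0
BOOTPROTO=static
IPADDR=10.10.1.XXX        # address on the private (untagged) Lustre subnet
NETMASK=255.255.255.0
ONBOOT=yes
BONDING_OPTS="mode=802.3ad miimon=100"   # assumed LACP; use the site-standard mode
# /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for each slave NIC)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none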
Once the node is properly prepared you can begin to install and configure Lustre:
- Install the following RPMs for Lustre (the examples here are for 1.8.4 using the ext4 variant; adjust for the version being used):
- kernel-2.6.18-194.3.1.el5_lustre.1.8.4-ext4.x86_64.rpm
- kernel-devel-2.6.18-194.3.1.el5_lustre.1.8.4-ext4.x86_64.rpm
- kernel-headers-2.6.18-194.3.1.el5_lustre.1.8.4-ext4.x86_64.rpm (*NOTE: this one must be run with 'rpm -Uvh')
- e2fsprogs-1.41.10.sun2-0redhat.rhel5.x86_64.rpm
- lustre-1.8.4-2.6.18_194.3.1.el5_lustre.1.8.4-ext4.x86_64.rpm
- lustre-ldiskfs-3.1.3-2.6.18_194.3.1.el5_lustre.1.8.4-ext4.x86_64.rpm
- lustre-modules-1.8.4-2.6.18_194.3.1.el5_lustre.1.8.4-ext4.x86_64.rpm
- lustre-tests-1.8.4-2.6.18_194.3.1.el5_lustre.1.8.4-ext4.x86_64.rpm
Since this is an "install" rather than an upgrade you can use the
rpm -ivh form except for the kernel-headers RPM (use -Uvh there). There is documentation about setting up Lustre at
https://hep.pa.msu.edu/twiki/bin/view/AGLT2/LustreNew
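Assuming the RPMs from the AFS area above have been copied into the current directory, the installation looks roughly like this (a sketch only; match the file names to the version actually being installed):
rpm -ivh kernel-2.6.18-194.3.1.el5_lustre.1.8.4-ext4.x86_64.rpm \
         kernel-devel-2.6.18-194.3.1.el5_lustre.1.8.4-ext4.x86_64.rpm \
         lustre-1.8.4-2.6.18_194.3.1.el5_lustre.1.8.4-ext4.x86_64.rpm \
         lustre-ldiskfs-3.1.3-2.6.18_194.3.1.el5_lustre.1.8.4-ext4.x86_64.rpm \
         lustre-modules-1.8.4-2.6.18_194.3.1.el5_lustre.1.8.4-ext4.x86_64.rpm \
         lustre-tests-1.8.4-2.6.18_194.3.1.el5_lustre.1.8.4-ext4.x86_64.rpm
rpm -Uvh kernel-headers-2.6.18-194.3.1.el5_lustre.1.8.4-ext4.x86_64.rpm
rpm -ivh e2fsprogs-1.41.10.sun2-0redhat.rhel5.x86_64.rpm   # use -Uvh here instead if a stock e2fsprogs is already installed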
After installing the RPMs, reboot into the Lustre kernel. Our next step is to create the appropriate partitions for the OSTs. There are some scripts in AFS which can help with the OST creation at the hardware level. We utilize the Dell omconfig utility to create RAID-6 arrays on the disk shelves. In previous cases we chose to create 2 x RAID-6 per 15-disk shelf; this better matched the I/O capabilities for Lustre but is wasteful of disk space (4 out of 15 disks are parity). For UMDIST03 we will utilize 6 RAID-6 partitions, one per 15-disk MD1000 shelf, resulting in 6 "devices" from /dev/sdb to /dev/sdg. There is a script called setup_lustre_ost.sh in AFS which can help create RAID-6 partitions on Dell hardware; the example below illustrates the kind of omconfig command it wraps.
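This is a sketch only; the controller and physical-disk IDs are illustrative and should be checked with omreport on the actual node:
# list controllers and physical disks to get the right IDs
omreport storage controller
omreport storage pdisk controller=0
# create one RAID-6 virtual disk spanning a full 15-disk MD1000 shelf with a 512KB stripe element
omconfig storage controller controller=0 action=createvdisk raid=r6 \
    size=max stripesize=512kb \
    pdisk=0:0:0,0:0:1,0:0:2,0:0:3,0:0:4,0:0:5,0:0:6,0:0:7,0:0:8,0:0:9,0:0:10,0:0:11,0:0:12,0:0:13,0:0:14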
For UMDIST06 we are reusing the existing RAID-6 partitions from the prior NFS setup. One RAID-60 array over 2 shelves was rebuilt as two RAID-6 arrays. These arrays then require formatting for use within Lustre. There is a script called format_lustre.sh which can be used as a template to create the Lustre filesystem for the OSTs. The parameters that may require tuning are the stripe and inodes settings. We are using the following guidelines to set the values for these (a worked sketch follows the list):
- The stripe is the mount option for the number of stripe blocks. Each Lustre block is 4096 bytes. Our RAID-6 arrays in use on UMDIST03 are set up with 512KB stripe elements, so each stripe element (512KB) contains 128 Lustre blocks. For RAID-6 on one MD1000 shelf we have 13 data disks participating. Therefore we should set the stripe to 1664 (13 x 128).
- The inodes value is chosen assuming an average file size of 8-9MB. For a 9TB OST (UMDIST03) that works out to roughly 1,100,000 inodes.
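Putting those two guidelines together, the formatting step comes down to something like the following. This is only a sketch; the fsname (umt3) and MGS NID (10.10.1.140@tcp0) are inferred from the lctl output shown later on this page, and the exact options should be taken from format_lustre.sh itself:
# stripe: 512KB stripe element / 4KB Lustre block = 128 blocks, times 13 data disks = 1664
STRIPE=$(( 13 * (512 / 4) ))
# inodes: ~9TB OST divided by an 8MB average file size gives roughly 1.1 million
INODES=$(( 9 * 1024 * 1024 / 8 ))
mkfs.lustre --ost --fsname=umt3 --mgsnode=10.10.1.140@tcp0 \
    --mkfsoptions="-N ${INODES}" \
    --mountfsoptions="stripe=${STRIPE}" \
    /dev/sdb    # repeat for /dev/sdc ... /dev/sdg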
Once the formatting is complete we need to set up /etc/fstab to mount by UUID. Use the make_lustre_fstab.sh script (a sketch of what it does follows the resulting fstab):
[root@umdist03 ~]# ./make_lustre_fstab.sh
[root@umdist03 ~]# cat /etc/fstab
LABEL=/ / ext3 defaults 1 1
LABEL=/var /var ext3 defaults 1 2
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
LABEL=SWAP-sda3 swap swap defaults 0 0
head02.aglt2.org:/pnfs /pnfs nfs rw,hard,nfsvers=3 0 0
UUID=407204a4-e32f-4be3-9d70-e340ea9b6d68 /mnt/ost11 lustre _netdev 0 0
UUID=4e5322fc-221c-420d-8664-aefaae334d17 /mnt/ost12 lustre _netdev 0 0
UUID=deab207e-2ccd-4172-8fd2-156aebbf3d0f /mnt/ost21 lustre _netdev 0 0
UUID=3dcc417e-95b7-4ec5-9013-c31afdff56c5 /mnt/ost22 lustre _netdev 0 0
UUID=bbc8be6e-f047-4764-80ee-03bd0d2a70e5 /mnt/ost31 lustre _netdev 0 0
UUID=4c140eb3-e4ff-412c-af32-7cc809dcf273 /mnt/ost32 lustre _netdev 0 0
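The script itself lives in the AFS area noted above; the core of what it does can be sketched as follows (hypothetical; the device-to-mount-point mapping must match the actual layout):
# for each formatted OST device, look up its filesystem UUID with blkid and
# append a mount-by-UUID _netdev entry so the OST mounts only after the network is up
for entry in sdb:ost11 sdc:ost12 sdd:ost21 sde:ost22 sdf:ost31 sdg:ost32; do
    dev=/dev/${entry%%:*}
    mnt=/mnt/${entry##*:}
    uuid=$(blkid -s UUID -o value ${dev})
    echo "UUID=${uuid} ${mnt} lustre _netdev 0 0" >> /etc/fstab
done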
Next we need to set up /etc/modprobe.conf to correctly prepare the required Lustre lnet configuration. We only need to add a single line which puts the required "routing" in place to support the physics subnet:
options lnet networks=tcp0(bond0) routes="tcp2 10.10.1.[50-52]@tcp0"
Then run depmod -a. The last thing before starting is to create the mount points for the OSTs:
mkdir /mnt/ost11
...
mkdir /mnt/ost32
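Before mounting, it can be worth verifying that lnet picked up the options from /etc/modprobe.conf; a quick check looks like this (the exact NID shown will depend on the node's bond0 address):
modprobe lustre       # load lnet and the Lustre modules with the options above
lctl network up       # bring up the tcp0 network
lctl list_nids        # should show this node's NID on the bond0 address
lctl show_route       # should list the tcp2 routes via 10.10.1.[50-52]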
Now we are ready to start up Lustre. All we need to do is mount the OSTs:
[root@umdist03 ~]# mount -a -t lustre
[root@umdist03 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 33099928 9824332 21567088 32% /
/dev/sda2 19840924 890940 17925844 5% /var
tmpfs 16470428 0 16470428 0% /dev/shm
head02.aglt2.org:/pnfs
10995116277760 1006731885472 9988384392288 10% /pnfs
AFS 9000000 0 9000000 0% /afs
/dev/sdb 9513526012 498492 9037203392 1% /mnt/ost11
/dev/sdc 9513526012 451816 9037250068 1% /mnt/ost12
/dev/sdd 9513526012 498656 9037203228 1% /mnt/ost21
/dev/sde 9513526012 483604 9037218280 1% /mnt/ost22
/dev/sdf 9513526012 491812 9037210072 1% /mnt/ost31
/dev/sdg 9513526012 500160 9037201724 1% /mnt/ost32
Now you can check
dmesg to verify things are OK. Also look with 'lctl dl':
[root@umdist03 ~]# lctl dl
0 UP mgc MGC10.10.1.140@tcp 277aa67c-7391-13f2-8f84-91a9674c7765 5
1 UP ost OSS OSS_uuid 3
2 UP obdfilter umt3-OST001c umt3-OST001c_UUID 249
3 UP obdfilter umt3-OST001d umt3-OST001d_UUID 245
4 UP obdfilter umt3-OST001e umt3-OST001e_UUID 251
5 UP obdfilter umt3-OST001f umt3-OST001f_UUID 253
6 UP obdfilter umt3-OST0020 umt3-OST0020_UUID 250
7 UP obdfilter umt3-OST0021 umt3-OST0021_UUID 240
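As a final check from any Lustre client that already mounts the filesystem, the new OSTs should show up in lfs df (the mount point here is illustrative):
lfs df -h /lustre/umt3 | grep -E 'OST001[c-f]|OST002[01]'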
That's it. The new OSS should be online and working within Lustre now.
--
ShawnMcKee - 02 Sep 2010