Lustre at AGLT2

Lustre Deployment

MDS (Metadata Server)

We have a failover pair of metadata servers, lmd01 and lmd02. Both servers can access the same device (/dev/Lustre/MDS, a logical volume on top of the iSCSI device). The logical volume currently has a size of 1.8TB and stores both the MDT (metadata target) and the MGS (management service).

The heartbeat services run on both lmd01 and lmd02, so each machine can detect whether the other is alive. lmd01 is the preferred machine to mount the MDS volume, which hosts the Lustre metadata. That is, if both lmd01 and lmd02 are alive, only lmd01 mounts the MDS volume; if lmd01 goes down, lmd02 detects this via heartbeat, takes over the MDS mount, and power cycles lmd01. Once lmd01 comes back from the power cycle, it takes the MDS mount back from lmd02, and so on until the next lmd01 downtime. (A sketch of the Heartbeat configuration follows the example output below.) During the takeover between lmd01 and lmd02, clients hang until one of the two has taken over the MDS mount and re-established communication with all OSTs and clients. In the meantime you can query the recovery status from /proc; when no "?" appears in the following file (i.e. the status reads COMPLETE), recovery has finished and the clients become responsive again.

root@lmd01 ~# cat /proc/fs/lustre/mds/lustre-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1229375218
recovery_duration: 68
completed_clients: 2/2
replayed_requests: 0
last_transno: 32
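For reference, the failover behavior described above is driven by Heartbeat's configuration. The snippet below is only a sketch with assumed values (timeouts, heartbeat interface), not a copy of our production files; it shows lmd01 as the preferred owner of the MDS mount, with auto_failback enabled so lmd01 reclaims the mount when it returns.

# /etc/ha.d/ha.cf (sketch, illustrative values)
keepalive 2                # heartbeat interval in seconds
deadtime 30                # declare the peer dead after 30s of silence
bcast eth0                 # interface carrying the heartbeat
auto_failback on           # lmd01 takes the mount back when it recovers
node lmd01
node lmd02

# /etc/ha.d/haresources (sketch)
# preferred node, then the resource: mount /dev/Lustre/MDS on /mnt/mds as type lustre
lmd01 Filesystem::/dev/Lustre/MDS::/mnt/mds::lustre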

OSS (Object Storage Server)

We have 4 OSS nodes (umfs05/umfs11/umfs12/umfs13), which are quite powerful: each has 16GB RAM, 8 CPU cores, 2 PERC 6/E RAID controllers, and 4 MD1000 storage enclosures.

Each OSS has 8 OSTs, because of the 8TB size limit of the ext3 file system.

On a typical OSS node, each PERC 6/E hosts 2 MD1000 enclosures. We build a RAID-6 array on each MD1000 shelf, which typically yields 4 devices (/dev/sdb, /dev/sdc, /dev/sdd, /dev/sde). We then create 2 partitions on each device and use each partition as an OST (Object Storage Target) device. To keep the clients highly available, we configure the OSTs in failout mode: if one OST fails, the client simply ignores it and stays available, although all data with chunks on that OST becomes inaccessible.
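As an illustration of the partitioning (the device name and the 50/50 split are examples, not a prescription), each RAID-6 device is split into two partitions, each used as one OST and each kept under the 8TB ext3 limit:

parted -s /dev/sdb mklabel gpt            # GPT label, needed for devices larger than 2TB
parted -s /dev/sdb mkpart primary 0% 50%  # first half -> first OST partition
parted -s /dev/sdb mkpart primary 50% 100%
# repeat for /dev/sdc, /dev/sdd and /dev/sde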

Clients

On the production system we try to avoid setting up clients and OSS on the same machine, which can deadlock under memory contention, even though our OSS machines all have 10GE NICs that would benefit client bandwidth. We built patchless clients against the Linux 2.6.20-20UL5smp kernel; we failed to build patchless clients against any newer kernel with the current Lustre 1.6.4 version. Right now, all nodes running the 2.6.20UL5 kernel can mount the Lustre filesystem with the following setup.

How To Install Lustre

MDS

1) Install the following RPMs on both lmd01 and lmd02

lustre-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64
kernel-lustre-smp-2.6.18-92.1.10.el5_lustre.1.6.6.x86_64
lustre-modules-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64
lustre-ldiskfs-3.0.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64

2) Apply the LNET (Lustre network) module parameters on both lmd01 and lmd02

vi /etc/modprobe.conf
options lnet networks=tcp0(bond0.4001),tcp1(bond0.4010)

bond0.4001 is the bonded network interface for the private IP and bond0.4010 is the bonded network interface for the public IP; without bonding, the interface would simply be eth0 or eth1.

Bonding is a way to ensure high availability of the network interface. By bonding eth0 and eth1 into a single network interface bond0, traffic can come and go through either the real eth0 or the real eth1. This is an active/standby model, so only one interface is used at a time: if eth0 is down, all traffic (private and public IP alike) goes to eth1, and vice versa.

Also, here we use both the public and the private interface for the MDS node; otherwise, you could use just one of them.
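As an illustration, a generic RHEL-style active-backup bond is defined roughly as below; the file contents are a sketch, not copied from our nodes:

# /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=active-backup miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth0 (ifcfg-eth1 is analogous)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes

# The VLAN sub-interfaces bond0.4001 and bond0.4010 (carrying the private and
# public IPs) are then defined in ifcfg-bond0.4001 / ifcfg-bond0.4010 with VLAN=yes.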

3) Load the LNET module and get the NIDs of the MDS

modprobe lnet  (load the lnet module)
lctl network up   (bring up the network)
lctl list_nids (list available NIDS of this machine)
lctl ping 192.41.230.48@tcp (check that the network works; 192.41.230.48@tcp is the NID of lmd01)

Step 3 is not necessary if you are sure about the NIDs; when the Lustre service starts, LNET is loaded and configured automatically.

4) On either lmd01 or lmd02 (as both can see the device /dev/Lustre/MDS), format the MDS

 mkfs.lustre  --fsname=lustre --mdt --mgs --failnode=192.41.230.49@tcp --reformat /dev/Lustre/MDS

192.41.230.49@tcp is the NID (network identifier) of lmd02; you can get it by running the command below on lmd02 after the LNET module has been loaded and configured. --failnode means lmd02 is the failover node for lmd01: whenever lmd01 is not accessible, OSSes and clients should try lmd02 for service instead.

lctl list_nids
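To double-check what was written to the device, tunefs.lustre can print the stored target configuration without modifying anything:

tunefs.lustre --print /dev/Lustre/MDS   (shows the fsname, the MDT/MGS flags and the failover parameters)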

5) Start/stop the MDS service

manual start
 mount -t lustre /dev/Lustre/MDS /mnt/mds

manual stop
umount  /mnt/mds

On our system you should never do this manually; let the heartbeat service handle it:

start MDS service
service heartbeat start 

stop MDS service
service heartbeat stop

Always make sure that when you start heartbeat, you start it on lmd01 first and then on lmd02, so lmd01 takes the mount first; and when you stop heartbeat, stop it on lmd02 first (so lmd02 won't take over the mount when lmd01 is stopped) and then on lmd01.
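Summarized as commands, the order is:

start:
root@lmd01 ~# service heartbeat start
root@lmd02 ~# service heartbeat start

stop:
root@lmd02 ~# service heartbeat stop
root@lmd01 ~# service heartbeat stop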

OSS

1) Install the following RPMs on the OSS machines

lustre-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64
kernel-lustre-smp-2.6.18-92.1.10.el5_lustre.1.6.6.x86_64
lustre-modules-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64
lustre-ldiskfs-3.0.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64

2) Apply the LNET (Lustre network) module parameters on the OSS machines
vi /etc/modprobe.conf
options lnet networks=tcp0(eth2)

Here eth2 is the public interface; on the OSS nodes we only use eth2, because eth2 is the 10GE NIC.

3) Format the OSTs
mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp  --param="failover_mode=failout"  /dev/sdb1
mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp  --param="failover_mode=failout"  /dev/sdb2
mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp  --param="failover_mode=failout"  /dev/sdc1
mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp  --param="failover_mode=failout"  /dev/sdc2
mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp  --param="failover_mode=failout"  /dev/sdd1
mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp  --param="failover_mode=failout"  /dev/sdd2
mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp  --param="failover_mode=failout"  /dev/sde1
mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp  --param="failover_mode=failout"  /dev/sde2

Once the devices are formatted with the ext3 file system (the underlying file system for Lustre), you can use the Linux tool e2label to see the device labels. Initially a device is tagged "lustre-OSTffff"; then, depending on the order in which you mount the OSTs, each OST is assigned a sequential, unique label starting from "lustre-OST0000". This label is assigned at the first mount of the OST and is fixed afterwards.
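For example (the label shown is illustrative; the actual value depends on the mount order described above):

root@umfs05 ~# e2label /dev/sdb1
lustre-OST0000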

4) Start the OSS services

mount -t lustre /dev/sdb1 /mnt/umfsXX_OST1
....

In our setup we care about the sequence of the OSTs, so we make sure to mount the first partition of the first device on the 1st OSS machine (/dev/sdb1) first, then the first partition of the 3rd device on the 1st OSS machine (/dev/sdd1), then the 1st partition of the first device on the 2nd OSS machine (/dev/sdb1), then the 1st partition of the 3rd device on the 2nd OSS machine (/dev/sdd1), and so on. This helps Lustre better utilize the OSTs.
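Spelled out (assuming umfs05 and umfs11 are the 1st and 2nd OSS machines, and with illustrative mount points), the pattern looks like this:

root@umfs05 ~# mount -t lustre /dev/sdb1 /mnt/umfs05_OST1   # 1st partition, 1st device, 1st OSS
root@umfs05 ~# mount -t lustre /dev/sdd1 /mnt/umfs05_OST3   # 1st partition, 3rd device, 1st OSS
root@umfs11 ~# mount -t lustre /dev/sdb1 /mnt/umfs11_OST1   # 1st partition, 1st device, 2nd OSS
root@umfs11 ~# mount -t lustre /dev/sdd1 /mnt/umfs11_OST3   # 1st partition, 3rd device, 2nd OSS
# ...continue across the remaining OSS machines and partitions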

Clients

Patchless clients (no need to run the Lustre-patched kernel; only a set of Lustre kernel modules built for the stock kernel is used)

1) Install the following RPM

rpm -ivh lustre-1.6.6-2.6.20_20UL5smp_200812111553.x86_64.rpm 

This RPM is built from lustre-source-1.6.6.rpm against the 2.6.20UL5 kernel.
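The build itself follows the usual patchless-client recipe; roughly as below (the source and kernel paths are illustrative, and --disable-server is the key option that produces a client-only build):

rpm -ivh lustre-source-1.6.6.rpm
cd /usr/src/lustre-1.6.6
./configure --disable-server --with-linux=/usr/src/kernels/2.6.20-20UL5smp
make rpms        # produces the client rpm installed above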

2) Apply the LNET (Lustre network) module parameters on the client machines

vi /etc/modprobe.conf
options lnet networks=tcp0(eth1),tcp1(eth0)

3) Mount the Lustre file system

mkdir /lustre
mount -t lustre 10.10.1.48@tcp1:10.10.1.49@tcp1:/lustre /lustre

or mount it on the public interface

mount -t lustre 192.41.230.48@tcp:192.41.230.49@tcp:/lustre /lustre
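If the client should mount Lustre at boot, an /etc/fstab entry along these lines can be used (a sketch; the _netdev option just delays the mount until the network is up):

192.41.230.48@tcp:192.41.230.49@tcp:/lustre  /lustre  lustre  defaults,_netdev  0 0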

Right now, with all our OSS equipment, one client on a Dell 1950 compute node can achieve 112MB/s write and 110MB/s read. The client gets the same performance whether it stripes or not (no striping means all of a file's data goes to a single OST; striping means the data is chunked into pieces that go to different OSTs), because a single OST is powerful enough to match the 1GE network interface on the compute node. Given this, as a user you may want to keep your data on one OST, which is safer: with failout mode, a file striped across several OSTs becomes inaccessible if any one of them fails.
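If you do want to control striping yourself, the lfs tool sets and queries the stripe layout. The paths and stripe count below are just examples, and depending on the Lustre version lfs setstripe accepts either positional arguments or the option form shown here:

lfs setstripe -c 1 /lustre/mydir        # new files under mydir go to a single OST
lfs getstripe /lustre/mydir/somefile    # show which OST(s) hold the file's objects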

-- WenjingWu - 15 Dec 2008