Lustre at AGLT2
Lustre Deployment
MDS (Metadata Server)
We have a failover pair of metadata servers, lmd01 and lmd02. Both servers can access the same device (/dev/Lustre/MDS, a logical volume on top of an iSCSI device);
the logical volume currently has a size of 1.8TB and stores both the MDT (metadata target) and the MGS (management service).
The heartbeat service runs on both lmd01 and lmd02, so that each machine can detect whether the other is alive. lmd01 is the preferred machine
to mount the MDS volume which hosts the Lustre metadata: if both lmd01 and lmd02 are alive, only lmd01 mounts the MDS volume; if lmd01 goes down, lmd02 detects the down
status of lmd01 via heartbeat, takes over the MDS mount, and power-cycles lmd01. When lmd01 comes back from the power cycle, it takes the MDS mount back from lmd02 again, and this continues until the next
downtime of lmd01. During the takeover between lmd01 and lmd02, clients hang until either lmd01 or lmd02 has taken over the MDS mount and re-established communication with all OSTs and clients. In the meantime you can query the recovery status from /proc; when no "?" appears in the following file (i.e. the status shows COMPLETE), recovery has finished and the clients become responsive again:
root@lmd01 ~# cat /proc/fs/lustre/mds/lustre-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1229375218
recovery_duration: 68
completed_clients: 2/2
replayed_requests: 0
last_transno: 32
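The OSS side exposes a similar per-OST recovery_status file; the check below is only a sketch (the obdfilter path and the OST name are assumptions based on the naming scheme on this page, and may differ between Lustre versions):
# on an OSS node, check the recovery progress of one OST
cat /proc/fs/lustre/obdfilter/lustre-OST0000/recovery_status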
OSS (Object Storage Server)
We have 4 OSS nodes (umfs05/umfs11/umfs12/umfs13), which are very powerful: 16GB RAM, 8 CPU cores, 2 PERC 6/E RAID controllers, and 4 MD1000 storage enclosures each.
Each OSS has 8 OSTs, because of the 8TB limit of the ext3 file system.
On a typical OSS node, each PERC 6/E hosts 2 MD1000 enclosures; we build a RAID-6 array on each MD1000 shelf, which usually ends up as 4 devices (/dev/sdb, /dev/sdc, /dev/sdd, /dev/sde).
We make 2 partitions on each device and use each partition for an OST (Object Storage Target) device (a partitioning sketch follows below).
To keep the clients highly available, we configure the OSTs in failout mode, so if one OST fails, clients simply skip this OST and stay available; however, all data that has chunks on this OST becomes
inaccessible.
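As an illustration of the two-partitions-per-device layout described above, partitioning one RAID-6 device could be done roughly as follows (a sketch only; the device name and the 50/50 split are assumptions, not a record of our exact commands):
# split one RAID-6 device into two partitions, each to become an OST
parted -s /dev/sdb mklabel gpt
parted -s /dev/sdb mkpart primary 0% 50%
parted -s /dev/sdb mkpart primary 50% 100%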
Clients
On the production system we try to avoid setting up a client and an OSS on the same machine, which can cause deadlocks from memory contention, even though our OSS machines all have 10GbE NICs that would benefit client bandwidth.
We built the patchless clients against the Linux 2.6.20-20UL5smp kernel; we failed to build patchless clients against any kernel newer than this with the current Lustre 1.6.4 version. Right now, all nodes running the 2.6.20UL5 kernel can mount the Lustre filesystem with the following setup.
How To Install Lustre
MDS
1) install the following rpms on both lmd01 and lmd02
lustre-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64
kernel-lustre-smp-2.6.18-92.1.10.el5_lustre.1.6.6.x86_64
lustre-modules-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64
lustre-ldiskfs-3.0.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64
2) apply the lnet (lustre network) mod parameters on both lmd01 and lmd02
vi /etc/modprobe.conf
options lnet networks=tcp0(bond0.4001),tcp1(bond0.4010)
bond0.4001 is the bonded network interface for the private IP, and bond0.4010 is the bonded network interface for the public IP;
without bonding, the interface would simply be eth0 or eth1.
Bonding is a way to ensure high availability of the network interface: by bonding eth0 and eth1 into a single network interface bond0, traffic can come or go through either the real eth0 or the real eth1.
This is an active/standby model, so only one interface is used at a time; if eth0 goes down, all traffic (both private and public IP) moves to eth1, and vice versa. A sketch of such a bonding setup follows below.
Also, here we use both the public and the private interface for the MDS node; otherwise you could just use either of them.
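For reference, an active-backup bond on a RHEL5-style system is usually set up along these lines; this is a generic sketch (the interface names and the miimon value are assumptions, and the VLAN sub-interfaces such as bond0.4001 are omitted), not a copy of our actual configuration:
# /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=active-backup miimon=100
# /etc/sysconfig/network-scripts/ifcfg-eth0 (and similarly for eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none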
3) Load lnet module and get NIDs of MDS..
modprobe lnet (load the lnet module)
lctl network up (bring up the network)
lctl list_nids (list available NIDS of this machine)
lctl ping 192.41.230.48@tcp (see if the network is working where 192.41.230.48@tcp is the NID of lmd01)
Step 3 is not necessary if you are sure about the NIDs; when the Lustre service starts, lnet is loaded and configured automatically.
4) Format the MDS on either lmd01 or lmd02 (they can both see the device /dev/Lustre/MDS):
mkfs.lustre --fsname=lustre --mdt --mgs --failnode=192.41.230.49@tcp --reformat /dev/Lustre/MDS
192.41.230.49@tcp is the NID (network ID) of lmd02; you can get it by running the following command on lmd02 after the lnet module is loaded and configured:
lctl list_nids
--failnode means lmd02 is the failover node for lmd01, so whenever lmd01 is not accessible, the OSSes and clients should try lmd02 for the MDS service instead.
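For illustration, with the lnet options above, lctl list_nids on lmd02 would report one NID per configured network, along the lines of (addresses taken from elsewhere on this page, shown only as an example):
192.41.230.49@tcp
10.10.1.49@tcp1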
5) start/stop MDS service
manual start
mount -t lustre /dev/Lustre/MDS /mnt/mds
manual stop
umount /mnt/mds
On our system you should never do this manually; let the heartbeat service handle it:
start MDS service
service heartbeat start
stop MDS service
service heartbeat stop
Always make sure that when you start heartbeat, you start it on lmd01 first and then on lmd02, so lmd01 takes the mount first; and when you stop heartbeat, stop it on lmd02 first (so lmd02 won't take over the mount when lmd01 is stopped), then stop it on lmd01.
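The behaviour above is driven by the heartbeat resource configuration; a typical /etc/ha.d/haresources entry for this setup would look roughly like the line below (the node name, device, and mount point are taken from this page, the rest is a generic sketch, not a dump of our actual file):
# lmd01 is the preferred owner of the Lustre MDS filesystem resource
lmd01 Filesystem::/dev/Lustre/MDS::/mnt/mds::lustre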
OSS
1) install the following rpms on all OSS machines (umfs05/umfs11/umfs12/umfs13)
lustre-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64
kernel-lustre-smp-2.6.18-92.1.10.el5_lustre.1.6.6.x86_64
lustre-modules-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64
lustre-ldiskfs-3.0.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64
2) apply the lnet (lustre network) mod parameters on OSS machines
vi /etc/modprobe.conf
options lnet networks=tcp0(eth2)
Here eth2 is the public interface; on the OSS nodes we only use eth2, because eth2 is the 10GbE NIC.
3) Format the OSTs (one mkfs.lustre per partition)
mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp --param="failover_mode=failout" /dev/sdb1
mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp --param="failover_mode=failout" /dev/sdd1
mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp --param="failover_mode=failout" /dev/sdb2
mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp --param="failover_mode=failout" /dev/sdd2
mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp --param="failover_mode=failout" /dev/sdc1
mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp --param="failover_mode=failout" /dev/sde1
mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp --param="failover_mode=failout" /dev/sdc2
mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp --param="failover_mode=failout" /dev/sde2
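Since the same command is repeated for every partition, a small loop achieves the same result; this is only a sketch assuming the device layout described above (sdb-sde with two partitions each):
# format every OST partition on this OSS
for part in /dev/sd{b,c,d,e}{1,2}; do
    mkfs.lustre --ost --mgsnode=192.41.230.48@tcp --mgsnode=192.41.230.49@tcp --param="failover_mode=failout" $part
done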
When the devices are formatted with the ext3 file system (the underlying file system for Lustre), you can use the Linux tool e2label to see the labels of the devices.
Initially a device is tagged "lustre-OSTffff"; depending on the sequence in which you mount the OSTs, each OST is then assigned a sequential and unique label
starting from "lustre-OST0000". This label is assigned at the first mount of the OST and stays fixed afterwards.
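A quick check looks like this (the device and the label shown are only an illustration of the scheme described above):
# query the filesystem label of an OST partition
e2label /dev/sdb1
lustre-OST0000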
4) starting the OSS services
mount -t lustre /dev/sdb1 /mnt/umfsXX_OST1
....
In our situation we care about the sequence of the OSTs, so we make sure to mount the first partition of the first device on the 1st OSS machine (/dev/sdb1) first,
then the first partition of the 3rd device on the 1st OSS machine (/dev/sdd1), then the first partition of the first device on the 2nd OSS machine (/dev/sdb1), then the first partition of the 3rd device
on the 2nd OSS machine (/dev/sdd1), and so on. Interleaving the OSTs across machines in this way helps Lustre utilize them better; a sketch of the resulting mount order follows below.
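To make the interleaving concrete, the first few mounts would look roughly like this (assuming umfs05 is the 1st OSS and umfs11 the 2nd; the mount points ending in OST3 are hypothetical names following the umfsXX_OSTn pattern above):
# on umfs05: mount -t lustre /dev/sdb1 /mnt/umfs05_OST1
# on umfs05: mount -t lustre /dev/sdd1 /mnt/umfs05_OST3
# on umfs11: mount -t lustre /dev/sdb1 /mnt/umfs11_OST1
# on umfs11: mount -t lustre /dev/sdd1 /mnt/umfs11_OST3
# ...continue across umfs12 and umfs13, then the remaining partitions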
Clients
patchless clients (no need to run the Lustre-patched kernel; only a set of Lustre kernel modules built for the running kernel is needed)
1) install the following rpm
rpm -ivh lustre-1.6.6-2.6.20_20UL5smp_200812111553.x86_64.rpm
This rpm is built from lustre-source-1.6.6.rpm against the 2.6.20UL5 kernel; a sketch of the build procedure follows below.
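The usual way to produce such a client rpm is to rebuild the Lustre source against the target kernel; the following is only a rough sketch (the source location, the kernel source path, and the --disable-server switch are assumptions about a typical patchless-client build, not our exact build log):
# rebuild client-only Lustre rpms against the running, unpatched kernel
rpm -ivh lustre-source-1.6.6-*.rpm
cd /usr/src/lustre-1.6.6
./configure --disable-server --with-linux=/usr/src/kernels/2.6.20-20UL5smp-x86_64
make rpms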
2) apply the lnet (lustre network) mod parameters on the client machines
vi /etc/modprobe.conf
options lnet networks=tcp0(eth1),tcp1(eth0)
3) mount the lustre file system
mkdir /lustre
mount -t lustre 10.10.1.48@tcp1:10.10.1.49@tcp1:/lustre /lustre
or mount it on the public interface
mount -t lustre 192.41.230.48@tcp:192.41.230.49@tcp:/lustre /lustre
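To make the mount persistent across reboots, an /etc/fstab entry along these lines is the usual approach (a sketch; the _netdev option, which delays mounting until the network is up, is a common convention rather than something taken from our configuration):
10.10.1.48@tcp1:10.10.1.49@tcp1:/lustre  /lustre  lustre  defaults,_netdev  0 0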
Right now, with all our OSS equipment, one client on a Dell 1950 compute node can achieve 112MB/s write and 110MB/s read,
and a client gets the same performance whether or not it uses striping (no striping means all data of a file goes to a single OST; striping means the data is chunked into pieces that go to different OSTs),
since each OST is powerful enough to match the 1GE network interface on the compute node.
Because of this characteristic, as a user you may want to keep your data on a single OST (no striping), which is safer for your data; striping can be controlled with lfs setstripe, as sketched below.
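As an illustration of controlling striping (the directory name and values are only examples, and the option syntax can differ slightly between Lustre versions):
# keep every file created in this directory on a single OST (stripe count 1)
lfs setstripe -c 1 /lustre/mydata
# check the striping of an existing file or directory
lfs getstripe /lustre/mydata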
--
WenjingWu - 15 Dec 2008