Notes on setting up and configuring Lustre version 2.7
Index of Sections
Source rpms
We have chosen to use the kernel distributed with the rpms from the Lustre repos. This is version 2.6.32-504.8.1, patched for Lustre in the case of the file and metadata servers, stock for the clients. All available Lustre rpms can be downloaded from here.
Component setup
Combined mgs/mdt
Created the VMware host mdtmgs with:
- 4 cpus
- 4GB Ram
- 45GB boot disk
This was built via Cobbler, and then a second disk was created in VMWare, a 1TB RAID-10 volume of 15k disks for the combined DB. The combined mdtmgs is mounted at /mnt/mdtmgs (/dev/sdb).
Install the kernel rpms patched for Lustre. At the time of this build, those were located in /atlas/data08/ball/admin/LustreSL6/2.7/server/. Other required rpms have not been updated since Lustre 2.5, and were stored at /atlas/data08/ball/admin/LustreSL6/2.5/other/.
yum localinstall kernel-2.6.32-504.8.1.el6_lustre.x86_64.rpm kernel-devel-2.6.32-504.8.1.el6_lustre.x86_64.rpm \
kernel-firmware-2.6.32-504.8.1.el6_lustre.x86_64.rpm kernel-headers-2.6.32-504.8.1.el6_lustre.x86_64.rpm
yum localupdate e2fsprogs-1.42.12.wc1-7.el6.x86_64.rpm e2fsprogs-libs-1.42.12.wc1-7.el6.x86_64.rpm \
libcom_err-1.42.12.wc1-7.el6.x86_64.rpm libcom_err-devel-1.42.12.wc1-7.el6.x86_64.rpm \
libss-1.42.12.wc1-7.el6.x86_64.rpm
yum localinstall lustre-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm \
lustre-modules-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm \
lustre-osd-ldiskfs-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm \
lustre-osd-ldiskfs-mount-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm
Created the volume and mounted it via an fstab entry (a first-mount sketch follows the two lines below):
- mkfs.lustre --fsname=umt3B --mgs --mdt --index=0 /dev/sdb
- LABEL=umt3B:MDT0000 /mnt/mdtmgs lustre acl 0 0
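Before the fstab entry can be used, the mount point must exist. A minimal sketch of the first mount, relying on the label and options already in fstab:
mkdir -p /mnt/mdtmgs
mount /mnt/mdtmgs      # picks up the LABEL, filesystem type, and acl option from fstab
The mounted target then shows up in df: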
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 746G 8.8G 686G 2% /mnt/mdtmgs
Using the locally built rpms
When installing the locally built rpms, there is an extra step: the boot entry and initramfs must be created by hand with new-kernel-pkg (line 41 in the history below). The shell history from the installation is shown here.
34 cd /atlas/data08/ball/admin/LustreSL6/2.7.58/server
39 yum localinstall kernel-2.6.32.504.16.2.el6_lustre-1.x86_64.rpm
41 /sbin/new-kernel-pkg --package kernel --mkinitrd --dracut --depmod \
--install 2.6.32.504.16.2.el6_lustre
42 cd ../../2.5/other
51 yum localupdate e2fsprogs-1.42.12.wc1-7.el6.x86_64.rpm e2fsprogs-libs-1.42.12.wc1-7.el6.x86_64.rpm \
libcom_err-1.42.12.wc1-7.el6.x86_64.rpm libss-1.42.12.wc1-7.el6.x86_64.rpm
52 yum -y localinstall libcom_err-devel-1.42.12.wc1-7.el6.x86_64.rpm
53 reboot
63 cd /atlas/data08/ball/admin/LustreSL6/2.7.58/server
68 yum localinstall lustre-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-modules-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-osd-ldiskfs-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-osd-ldiskfs-mount-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm
70 mkfs.lustre --fsname=T3test --mgs --mdt --index=0 /dev/mapper/vg0-lv_home
Permanent disk data:
Target: T3test:MDT0000
Index: 0
Lustre FS: T3test
Mount type: ldiskfs
Flags: 0x65
(MDT MGS first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:
checking for existing Lustre data: not found
device size = 32768MB
formatting backing filesystem ldiskfs on /dev/mapper/vg0-lv_home
target name T3test:MDT0000
4k blocks 8388608
options -J size=1310 -I 512 -i 2048 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg \
-E lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L T3test:MDT0000 -J size=1310 -I 512 -i 2048 -q -O \
dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init \
-F /dev/mapper/vg0-lv_home 8388608
Writing CONFIGS/mountdata
The fstab entry is
/dev/mapper/vg0-lv_home /mnt/mgs lustre acl 1 2
(This volume was created specially during the build of this machine, and so retains the /dev/mapper entry)
This is the empty, startup state, as reported by df -h and df -i respectively:
/dev/mapper/vg0-lv_home
23G 1.3G 21G 6% /mnt/mgs
/dev/mapper/vg0-lv_home
1523712 49207 1474505 4% /mnt/mgs
See below about making a kmod-openafs rpm.
We could have added "--mountfsoptions=acl" when creating the mdt and mgs, but for this version of Lustre the mountfsoptions value over-writes the defaults rather than adding to them, so we would also have had to add back the defaults: errors=remount-ro,iopen_nopriv,user_xattr.
However, the online web page says the default is instead "errors=remount-ro,user_xattr", and that turns out to be correct. But who knew for certain?
Instead, we just add "-o acl" at mount time.
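For the record, the two alternatives look roughly like this (a sketch only; the first was not used, and note that the defaults must be restated by hand because --mountfsoptions replaces them):
# not used: bake acl in at format time, restating the defaults explicitly
mkfs.lustre --fsname=umt3B --mgs --mdt --index=0 \
  --mountfsoptions="errors=remount-ro,user_xattr,acl" /dev/sdb
# what was actually done: leave the format defaults alone, add acl at mount time
mount -t lustre -o acl LABEL=umt3B:MDT0000 /mnt/mdtmgs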
Adding Lustre rpms to a file server such as umdist09
Pre-install zfs if it is to be used
See here for directions on installing zfs.
NOTE: Lustre 2.7.0 DOES NOT WORK WITH ZFS 0.6.4; IT WORKS ONLY WITH 0.6.3. Various methods, described below, were tried to make the newer zfs work, and all ultimately failed. So, at this moment, zfs 0.6.3 rpms have been built from source and are located at
- /atlas/data08/ball/admin/zfs_rpms
Changed rpms
This is similar to what we did on mdtmgs. The difference is that zfs is in use for the volumes of umdist09, so we replace these two rpms
- lustre-osd-ldiskfs-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm
- lustre-osd-ldiskfs-mount-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm
With these two rpms
- lustre-osd-zfs-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64
- lustre-osd-zfs-mount-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64
The servers cannot simultaneously use both ldiskfs and zfs.
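The install sequence on such a file server is then the same as on mdtmgs (kernel, e2fsprogs helpers, then Lustre) apart from that substitution; roughly, for the Lustre rpms themselves (a sketch using the rpm names above):
yum localinstall lustre-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm \
  lustre-modules-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm \
  lustre-osd-zfs-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm \
  lustre-osd-zfs-mount-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm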
For testing, these were also installed
yum install lustre-tests-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm \
lustre-iokit-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm \
perf-2.6.32-504.8.1.el6_lustre.x86_64.rpm
Building new Lustre rpms
See this URL.
To fully summarize the steps from that URL, plus the local additions it does not cover (the requirements for the zfs and spl modules), the history of the needed commands is given below. Note that it is NOT necessary to downgrade the epel repo from our standard version.
The problem encountered following this recipe is that the resulting kernel does not boot, at least in our environment. Despite indications from the dracut directories, the lvm driver does not appear to load correctly into the initramfs; grub.conf is loaded from the sda partition, but when control transfers, the bootup just stops. So, these directions are included here only for future reference, and unless you are academically interested this section can be skipped. FYI, the assumption below is that the zfs rpms for 0.6.4.2 were installed.
50 yum -y groupinstall "Development Tools"
51 yum -y install xmlto asciidoc elfutils-libelf-devel zlib-devel binutils-devel newt-devel \
python-devel hmaccalc perl-ExtUtils-Embed bison elfutils-devel audit-libs-devel
60 yum -y install quilt libselinux-devel
109 yum install python-docutils
143 yum install zfs-devel
173 yum install libuuid-devel
234 cd /usr/src/spl-0.6.4.2/
243 ./configure --with-config=kernel
244 make all
245 cd ../zfs-0.6.4.2/
246 ./configure --with-config=kernel
247 make all
74 useradd -m build
75 su build
# Now, as user build
1 cd $HOME
3 git clone git://git.hpdd.intel.com/fs/lustre-release.git
(another variation is git clone git://git.hpdd.intel.com/fs/lustre-release.git -b 'v2_7_0_0')
4 cd lustre-release/
5 sh ./autogen.sh
6 cd $HOME
10 mkdir -p kernel/rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
12 cd kernel
15 echo '%_topdir %(echo $HOME)/kernel/rpmbuild' > ~/.rpmmacros
19 rpm -ivh http://ftp.redhat.com/pub/redhat/linux/enterprise/6Server/en/os/SRPMS/kernel-2.6.32-504.16.2.el6.src.rpm \
2>&1 | grep -v mockb
21 cd rpmbuild/
23 rpmbuild -bp --target=`uname -m` ./SPECS/kernel.spec
# Add a unique build id so we can be certain our kernel is booted. To do this,
# edit ~/kernel/rpmbuild/BUILD/kernel-2.6.32-504.16.2.el6/linux-2.6.32-504.16.2.el6.x86_64/Makefile
# and modify line 4 so that the EXTRAVERSION reads: EXTRAVERSION = .504.16.2.el6_lustre
26 cd BUILD/kernel-2.6.32-504.16.2.el6/linux-2.6.32-504.16.2.el6.x86_64/
30 cp ~/lustre-release/lustre/kernel_patches/kernel_configs/kernel-2.6.32-2.6-rhel6-x86_64.config ./.config
#
# Now, this step 30 failed to provide a bootable kernel; I (Bob) was unable to figure out why.
# Instead, the .config file from umdist08 was copied and used below with make oldconfig:
# /usr/src/kernels/2.6.32-504.16.2.el6.x86_64/.config
#
33 ln -s ~/lustre-release/lustre/kernel_patches/series/2.6-rhel6.series series
34 ln -s ~/lustre-release/lustre/kernel_patches/patches patches
35 quilt push -av
36 cd ~/kernel/rpmbuild/BUILD/kernel-2.6.32-504.16.2.el6/linux-2.6.32-504.16.2.el6.x86_64/
37 make oldconfig || make menuconfig
38 make include/asm
39 make include/linux/version.h
40 make SUBDIRS=scripts
41 make include/linux/utsrelease.h
42 make rpm
Now we can make the full set of rpms that include zfs support for the 0.6.4.2 distribution
WE HAVE LEARNED! DISABLE THE ZFS REPO FROM FURTHER UPDATES
127 cd lustre-release/
128 ./configure --with-linux=/home/build/kernel/rpmbuild/BUILD/kernel-2.6.32.504.16.2.el6_lustre/ \
--with-zfs=/usr/src/zfs-0.6.4.2 --with-spl=/usr/src/spl-0.6.4.2
129 make rpms
All of the rpms are in the directory ~build/kernel/rpmbuild/RPMS/x86_64/. Below is a complete list.
These were also placed in /atlas/data08/ball/admin/LustreSL6/2.7.52/server
310161168 May 14 13:18 kernel-2.6.32.504.16.2.el6_lustre-1.x86_64.rpm
536456 May 18 12:37 lustre-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
30254364 May 18 12:37 lustre-debuginfo-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
42708 May 18 12:37 lustre-iokit-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
3562124 May 18 12:37 lustre-modules-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
359748 May 18 12:37 lustre-osd-ldiskfs-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
13144 May 18 12:37 lustre-osd-ldiskfs-mount-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
91776 May 18 12:37 lustre-osd-zfs-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
7876 May 18 12:37 lustre-osd-zfs-mount-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
7692592 May 18 12:37 lustre-source-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
3810480 May 18 12:37 lustre-tests-2.7.52-2.6.32.504.16.2.el6_lustre_g96f566e.x86_64.rpm
Now, to install these rpms, copy them all somewhere safe, and do
yum localinstall kernel-2.6.32.504.16.2.el6_lustre-1.x86_64.rpm
/sbin/new-kernel-pkg --package kernel --mkinitrd --dracut --depmod --install 2.6.32.504.16.2.el6_lustre
Now reboot, install the helpers such as e2fsprogs, and then the needed lustre rpms.
As noted, the kernel would not boot, so, until some other pressing need brings us back here, these rpms are not useful.
Build openafs-kmod rpms
It's easiest to do this build on the same system where you built the lustre and kernel rpms in the previous steps. Get and install the latest openafs srpm from an SL mirror (I used root to build; the "build" user created earlier could probably also be used).
You will also need to be sure to install libcom_err-devel from the lustre repos in order to be able to install krb5-devel.
rpm -ivh http://mirror.lstn.net/scientific/6x/SRPMS/sl6/openafs.SLx-1.6.14-218.src.rpm
rpm -ivh https://downloads.hpdd.intel.com/public/e2fsprogs/1.42.12.wc1/el6/RPMS/x86_64/libcom_err-devel-1.42.12.wc1-7.el6.x86_64.rpm
We don't have the kernel-devel package that the spec requires, so edit /root/rpmbuild/SPECS/openafs.SLx.spec to remove that build requirement:
%package -n openafs%{?nsfx}-kmod
Summary: This is really just a dummy to get the build requirement
Group: Networking/Filesystems
# BuildRequires: %{kbuildreq}
Next we'll put a kernel tree in the standard place with a standard name. Some RPM macros later will expect it in /usr/src regardless of what we set ksrcdir to in the next step.
cp -rfa /home/build/kernel/rpmbuild/BUILD/kernel-2.6.32.504.16.2.el6_lustre /usr/src/kernels/2.6.32.504.16.2.el6_lustre
The lustre kernel rpm we built is a little non-standard because it doesn't include the .x86_64 in the /lib/modules directory name. Likewise for the source, we can't use the .x86_64 because some rpm build macros are going to be looking for the directory in /usr/src without the arch included. Let's hack around the automatic definitions by redefining some macros near the beginning of the spec. We'll also fix the version, since the "." in place of "-" in lustre kernel rpms confuses the definition of krelmajor in the script.
# be sure your changes come after the call to openafs-sl-defs.sh; that shell script is where the macros are defined and we want to over-ride those definitions.
%{expand:%(%{_sourcedir}/openafs-sl-defs.sh %{?kernel})}
%define ksrcdir /usr/src/kernels/%{kernel}
%define kmoddir /lib/modules/%{kernel}/extra/openafs
%define kmoddst /lib/modules/%{kernel}/kernel/fs/openafs
# make the version make sense - comes out to kmod-openafs-2 if left at the automatically generated value.
%define krelmajor 504-lustre
Finally, we need to modify a templating macro to use the stripped (no .x86_64) macro we used to define our destination. This macro sets up the %files section which needs to correctly reference the installed location of the kmod:
Line 476 change:
%{expand:%(%{kmodtool} rpmtemplate %{kmod_name} %{unamer} %{depmod} %{kvariants} 2>/dev/null)}
To replace unamer with kernel:
%{expand:%(%{kmodtool} rpmtemplate %{kmod_name} %{kernel} %{depmod} %{kvariants} 2>/dev/null)}
Now let's build. You may find you're missing some -devel packages, which it will immediately complain about; they can be installed as usual with yum from the SL repos. Take note of the requirement for libcom_err-devel: since we replaced the stock rpm with one from the lustre repos, our -devel will also need to be installed from there.
rpmbuild -bb SPECS/openafs.SLx.spec --define "build_kmod 1"
....
Wrote: /root/rpmbuild/RPMS/x86_64/kmod-openafs-2-1.6.14-218.sl6.2.6.32.504.16.2.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/kmod-openafs-2-debuginfo-1.6.14-218.sl6.2.6.32.504.16.2.x86_64.rpm
The output at the start of the build should indicate the kernel version and the correct path to the source, and show build_kmod defined as 1.
Pools on multipath storage
The zfs pools on umdist09 were created utilizing the mpathXY devices. It seems that these map to different dm-Z devices with every reboot, but they always map to the correct physical disk in the MD3060e chassis. For example:
zpool create ost-012 raidz2 mapper/mpathbt mapper/mpathbu mapper/mpathcf \
mapper/mpathcg mapper/mpathcr mapper/mpathcs mapper/mpathdd mapper/mpathde \
mapper/mpathdp mapper/mpathdq
Each of these zpools uses two disks from each drawer, so the failure of an entire drawer will be bad, but not totally destructive to any of the pools, assuming no other failures occur at the same time.
The pools should also be set to automatically re-add a disk after replacement of a failed disk:
zpool set autoreplace=on ost-012
The Lustre OSTs were then created with one OST per pool, eg:
mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=0 ost-001/ost0000
These were mounted at, eg, /mnt/ost-001, etc.
The pools themselves were set to a "legacy" mountpoint, so that zfs does not also try to mount them alongside the Lustre mounts. This is per the advice of Andreas Dilger.
zfs set mountpoint=legacy ost-001
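A quick sanity check (sketch) that the pool properties took effect:
zpool get autoreplace ost-012
zfs get mountpoint ost-001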
Pools on Storage Shelves such as the MD1000
zpools will be created from individual vdisks of 1 disk each, each set up as a single-disk RAID-0. The naming convention for the vdisks will be
cMdNOO
where M is the controller number, N is the shelf number on the controller, and OO is the disk within that shelf. For example
- omconfig storage controller action=createvdisk controller=1 size=max raid=r0 pdisk=0:0:1 name=c1d001
This is a full set of vdisk-creation commands for Controller 2, with two shelves of 15 disks each. In practice we did not do exactly this, as the shelves were split up.
for((i=0;i<10;i++)); do omconfig storage controller action=createvdisk controller=2 size=max raid=r0 pdisk=0:0:$i name=c2d00$i; done
for((i=10;i<15;i++)); do omconfig storage controller action=createvdisk controller=2 size=max raid=r0 pdisk=0:0:$i name=c2d0$i; done
for((i=0;i<10;i++)); do omconfig storage controller action=createvdisk controller=2 size=max raid=r0 pdisk=0:1:$i name=c2d10$i; done
for((i=10;i<15;i++)); do omconfig storage controller action=createvdisk controller=2 size=max raid=r0 pdisk=0:1:$i name=c2d1$i; done
Following the creation of the vdisks, the zpools can be created. The best option found is to do this from the "by-path" devices that will not change unless we reconfigure the hardware itself. The sd, dm and mpath devices (if multipath is installed) simply don't remain sufficiently static. For example, from umdist01:
zpool create -f -m legacy ost-003 raidz2 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:10:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:11:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:12:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:13:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:14:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:25:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:26:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:27:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:28:0 \
disk/by-path/pci-0000:08:00.0-scsi-0:2:29:0
Several scripts have been created to simplify this. All are in /root/tools. These are
- full_control.sh This shows the overall order for sub-script execution
- create_all_JBOD_vdisk.sh
- map_sd_to_by-path.sh
- create_all_zpools.sh
- dev_list.sh This is a utility script used by create_all_zpools.sh
Several utility files are created in /root/zpoolFiles as these scripts execute, and are used in successor scripts.
For example, to output the full list of disk/by-path devices for the ost above, use the following, where the input arguments are all of the vdisks that will make up this zpool:
/root/tools/dev_list.sh c1d010 c1d011 c1d012 c1d013 c1d014 c1d110 c1d111 c1d112 c1d113 c1d114
Conventions for creating the ost and mounting them
- zfs pools are numbered sequentially on each OSS, eg, ost-001, ost-002, etc
- Mount points for the OST are named identically in /mnt, eg, /mnt/ost-001, etc
- Each lustre file system is created on the zfs pool, with the decimal index as part of the name, for example
- for index 12 on umdist01, mkfs.lustre uses --index=12 ost-001/ost0012
- Each OSS has OST that are sequentially numbered via their mkfs.lustre index
- The mdtmgs node, and the WN, generally know these by the hexadecimal equivalent of the decimal index (see the one-liner after this list)
- for index 12, the official OST name is then umt3B-OST000c
- lustre_one_control.sh works with the official name after the dash, eg, OST000c
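The hexadecimal form can be generated directly in the shell, for example:
printf 'umt3B-OST%04x\n' 12     # prints umt3B-OST000c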
Example, create lustre file systems and mount them
This set of example commands is taken from the ost creation on umdist01
mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=12 ost-001/ost0012
mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=13 ost-002/ost0013
mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=14 ost-003/ost0014
mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=15 ost-004/ost0015
mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=16 ost-005/ost0016
mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=17 ost-006/ost0017
The corresponding /etc/fstab entries now are
ost-001/ost0012 /mnt/ost-001 lustre _netdev 0 0
ost-002/ost0013 /mnt/ost-002 lustre _netdev 0 0
ost-003/ost0014 /mnt/ost-003 lustre _netdev 0 0
ost-004/ost0015 /mnt/ost-004 lustre _netdev 0 0
ost-005/ost0016 /mnt/ost-005 lustre _netdev 0 0
ost-006/ost0017 /mnt/ost-006 lustre _netdev 0 0
Do not forget, make the mount points!
mkdir /mnt/ost-001
(etc)
NOTE: When re-creating an OST after it was destroyed for some reason, also add the parameter "--replace"
in addition to the index number.
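A sketch of such a re-creation, reusing the umdist01 index-12 example from above:
mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=12 --replace ost-001/ost0012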
Rebuilding a client that was using the older Lustre rpms
This action consists of stopping lustre, unloading all the modules, erasing the lustre client rpms, updating the kernel and related rpms, and installing the new lustre client rpms. The system can then be rebooted. On a Worker node, where the rpms are in the Rocks repos, the following sequence was employed. This also makes sure all of the various grub conf files are updated.
957 service lustre_mount_umt3 stop
958 /atlas/data08/ball/admin/unload_lustre.sh
960 yum erase lustre-client lustre-client-modules
964 yum update kernel kernel-devel kernel-doc kernel-firmware kernel-headers kmod-openafs*
965 cd /boot/grub
966 cp -p grub.conf grub-orig.conf
967 cat grub.conf [ pick out the new kernel entries, and add them after the "reinstall" part of rocks.conf ]
968 vi rocks.conf
971 yum install lustre-client lustre-client-modules
Rolling back zfs on a pool server
It became necessary to roll back the zfs version when an unexpected rpm update broke Lustre. Furthermore, zpools created under 0.6.4.1 used features that are unknown to, and incompatible with, zfs 0.6.3.1. Fortunately, no Lustre file systems had yet been created on these pools, so they were simply destroyed and re-created using the script above.
[root@umdist04 zfs_rpms]# zpool status ost-001
pool: ost-001
state: UNAVAIL
status: The pool cannot be accessed on this system because it uses the
following feature(s) not supported on this system:
com.delphix:hole_birth
com.delphix:embedded_data
action: Access the pool from a system that supports the required feature(s),
or restore the pool from backup.
scan: none requested
The procedure to perform this rollback, short of a rebuild, is as follows.
- service zfs stop
- dkms uninstall -m zfs -v 0.6.4.1 -k 2.6.32-504.8.1.el6_lustre.x86_64
- dkms uninstall -m spl -v 0.6.4.1 -k 2.6.32-504.8.1.el6_lustre.x86_64
- yum erase libuutil1 libnvpair1 libzpool2 spl-dkms zfs-dkms spl libzfs2 zfs lsscsi zfs-test zfs-dracut
- This has the side effect of uninstalling 3 lustre rpms, which must later be re-installed
- for i in /var/lib/dkms/*/[^k]*/source; do [ -e "$i" ] || echo "$i";done
- Delete the files found. This is because the zfs rpm removal is stupid
- rm /var/lib/dkms/spl/0.6.4.1/source
- rm /var/lib/dkms/zfs/0.6.4.1/source
- More stupid zfs cleanup
- cd /var/lib/dkms
- /bin/rm -rf zfs spl
- cd /lib/modules/2.6.32-504.16.2.el6.x86_64/weak-updates
- /bin/rm -rf avl nvpair spl splat unicode zcommon zfs zpios
- Make sure no more stupidities such as this are found
- /etc/cron.daily/mlocate.cron
- locate zfs
- Unload old zfs modules
- rmmod zfs zcommon znvpair spl zlib_deflate zavl zunicode
- Re-install zfs
- cd /atlas/data08/ball/admin/zfs_rpms
- yum localinstall libnvpair1-0.6.3-1.3.el6.x86_64.rpm libuutil1-0.6.3-1.3.el6.x86_64.rpm libzfs2-0.6.3-1.3.el6.x86_64.rpm libzpool2-0.6.3-1.3.el6.x86_64.rpm spl-0.6.3-1.3.el6.x86_64.rpm spl-dkms-0.6.3-1.3.el6.noarch.rpm zfs-0.6.3-1.3.el6.x86_64.rpm zfs-dkms-0.6.3-1.3.el6.noarch.rpm zfs-dracut-0.6.3-1.3.el6.x86_64.rpm
- Re-install lost lustre rpms
- cd ../LustreSL6/2.7/server
- yum localinstall lustre-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm lustre-modules-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm lustre-osd-zfs-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm lustre-osd-zfs-mount-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64.rpm
- service zfs start
Now, destroy and re-create the zpools, then make the Lustre file systems
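A sketch of that final step for one pool (pool name and OST index here are just examples; use the assignments from the OST-creation section above):
zpool destroy ost-001              # repeat for each affected pool
/root/tools/create_all_zpools.sh   # re-create the pools from the by-path device lists
mkfs.lustre --backfstype=zfs --fsname=umt3B --mgsnode=10.10.2.173@tcp0 --ost --index=12 ost-001/ost0012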
Updating the OSS to a new version of zfs and Lustre
Now, on the OSS, first save the zpool info for the sake of safety
1. cp /etc/zfs/zpool.cache /root
2. service zfs stop
3. chkconfig zfs off
4. yum erase lustre lustre-modules lustre-osd-zfs lustre-osd-zfs-mount
5. yum erase libnvpair1 libuutil1 libzfs2 libzpool2 spl spl-dkms zfs zfs-dkms zfs-dracut
6. cd /atlas/data08/ball/admin/LustreSL6/2.7.58/server
7. yum localinstall kernel-2.6.32.504.16.2.el6_lustre-1.x86_64.rpm
8. /sbin/new-kernel-pkg --package kernel --mkinitrd --dracut --depmod \
--install 2.6.32.504.16.2.el6_lustre
9. Remove these files
for i in /var/lib/dkms/*/[^k]*/source; do [ -e "$i" ] || echo "$i";done
/var/lib/dkms/spl/0.6.3/source
/var/lib/dkms/zfs/0.6.3/source
10. Reboot to new kernel
11. mkdir -p /home/build/kernel/rpmbuild/BUILD
12. cd /home/build/kernel/rpmbuild/BUILD
13. tar xzf /atlas/data08/ball/admin/LustreSL6/2.7.58/server/lustre_2.7.58_headers.tgz
14. cd /atlas/data08/ball/admin/zfs_0.6.4_rpms
15. yum localinstall libnvpair1-0.6.4.2-1.el6.x86_64.rpm libuutil1-0.6.4.2-1.el6.x86_64.rpm \
libzfs2-0.6.4.2-1.el6.x86_64.rpm libzpool2-0.6.4.2-1.el6.x86_64.rpm \
spl-0.6.4.2-1.el6.x86_64.rpm spl-dkms-0.6.4.2-1.el6.noarch.rpm \
zfs-0.6.4.2-1.el6.x86_64.rpm zfs-dkms-0.6.4.2-1.el6.noarch.rpm \
zfs-dracut-0.6.4.2-1.el6.x86_64.rpm
16. cd ../LustreSL6/2.7.58/server
17. yum localinstall lustre-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-modules-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-osd-zfs-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-osd-zfs-mount-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm
18. yum localinstall kmod-openafs-504-lustre-1.6.14-218.sl6.2.6.32.504.16.2.x86_64.rpm
19. depmod -a
20. Reboot
21. zpool upgrade -a
22. Uncomment the OST in /etc/fstab and mount them via "mount -av"
If/when a disk fails on the MD1000 or MD1200
I have not been able to get a disk to automatically re-import after a "pull the disk to simulate failure" event. The failure of the pdisk also results in the failure of the vdisk. If the machine is still online, and the disk is replaced, the vdisk can be re-created.
The failing vdisk leaves preserved cache on the controller that must be cleared. See "Troubleshooting and Managing Preserved Cache on a Dell PowerEdge RAID Controller (PERC)" for details, but the bottom line is to follow this procedure, which has worked:
- Pull the failed disk, eg, disk 0:0:1 on controller 2
- omconfig storage controller action=discardpreservedcache controller=2 force=enabled
- Insert the replacement disk
You can now proceed to re-create the vdisk and re-add it to the zpool.
- omconfig storage controller action=clearforeignconfig controller=2
- omconfig storage controller action=createvdisk controller=2 size=max raid=r0 pdisk=0:0:1 name=c2d001
When tests are performed with NO filesystem on the zpool, the zpool never "notices" the failed disk, and the replacement also does not seem to re-import into the pool on its own. The import can be forced, using the same disk in the same cabled location:
- zpool replace -f ost-004 pci-0000:0a:00.0-scsi-0:2:2:0
If there is an active file system on the pool, then as soon as a file is written to (probably also read from) the file system, the failed disk is noticed and marked failed.
- pci-0000:0a:00.0-scsi-0:2:3:0 UNAVAIL 3 49 0 corrupted data
"zpool replace" as above successfully handles this.
If there is a random disk failure, then after replacing the disk it is easy to re-create the vdisk using the controller, shelf, and disk number as above. /root/tools/storage_details.sh is useful in helping to find the missing vdisk name as well. However, you must then find the affected zpool. If there is only a single failure, you can look at the zpool status of each pool in turn (zpool status ost-001, zpool status ost-002, and so on) until you find the failed disk; a loop for this is sketched below. The failed disk will be obvious. The argument for the "zpool replace" command above can then be selected from the status output, just leaving off the extra characters in the listed device name. Those can be determined by comparison to the device names of other, good disks in the zpool. If for any reason the disk is NOT showing failed once all the zpool status commands are issued, then see the next paragraph.
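A minimal scan over every pool on the OSS (sketch):
for p in $(zpool list -H -o name); do
  echo "== $p =="
  zpool status "$p" | grep -E 'UNAVAIL|FAULTED|DEGRADED'
done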
The helper files in /root/zpoolFiles can be useful, particularly "sd_to_by-path.txt". A sample section, defined by the corresponding virtual disk ID, is shown below. In that situation (a disk that is not showing failed in any zpool status output), just find the zpool with this by-path identifier, and do the "zpool replace".
ID : 0
Name : c1d000
Device Name : /dev/sdb
by-pathName : disk/by-path/pci-0000:08:00.0-scsi-0:2:0:0
Finding file associated with a zpool error
It may happen that a zpool reports errors. It may look something like this
errors: Permanent errors have been detected in the following files:
ost-004/ost0027:<0x28697>
ost-004/ost0027:<0x257a1>
ost-004/ost0027:<0x286b1>
ost-004/ost0027:<0x444d0>
ost-004/ost0027:<0x285e5>
ost-004/ost0027:<0x288ef>
ost-004/ost0027:<0x4c2fb>
Let's take that first example. On the OSS either make a zfs snapshot of the volume (as below), or umount the OST and mount it as type zfs, and then use Linux find to look for the inode.
- zfs snapshot ost-004/ost0027@Sep28
- mount -t zfs ost-004/ost0027@Sep28 /mnt/tmp
- find /mnt/tmp/O -inum 165527
- This returned /mnt/tmp/O/0/d0/344992
The inode number 165527 is the decimal equivalent of the hexadecimal error pointer (to an inode) above, 0x28697.
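The conversion can be checked directly in the shell:
printf '%d\n' 0x28697     # prints 165527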
The Lustre OID of this file is 344992. Translate it to something akin to the Lustre FID
- ll_decode_filter_fid /mnt/tmp/O/0/d0/344992
- This returned /mnt/tmp/O/0/d0/344992: parent=[0x200004bf0:0x3984:0x0] stripe=0
The bit between the pair of [] is what we are looking for. Go to some Lustre client machine and do this.
- lfs fid2path /lustre/umt3 [0x200004bf0:0x3984:0x0]
- This returned the name of the affected file, ie, /lustre/umt3/user/daits/data12/NTUP_ONIAMUMU/data12_8TeV.periodI.physics_Bphysics.PhysCont.NTUP_ONIAMUMU.repro14_v01.r4065_p1278_p1424_nmy19_pUM999999/NTUP_ONIAMUMU.nmy19.00213695._000022.root.1
Once the bad file is found, delete it from Lustre, repeat for all the permanent errors on the OST, and do "zpool clear" to clear the bad file report. Also, umount and destroy the zpool snapshot (zfs destroy ost-004/ost0027@Sep28).
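A sketch of that cleanup for the example above:
umount /mnt/tmp
zfs destroy ost-004/ost0027@Sep28
zpool clear ost-004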
Test results
Running the "obdfilter-survey" test on some OSS.
So, it looks like we are getting the following performance on umdist09 with 2 threads per OST:
write 4781.18 [ 310.98, 412.99] rewrite 5881.79 SHORT read 3775.87 [ 268.99, 397.99]
Units are MB/s, with "aggregate [ min/OST, max/OST ]" detailed for various conditions.
The aggregate is faster than the network bandwidth out of the machine.
There is also a test with 2 objects per OST and one thread per object:
write 5915.34 SHORT rewrite 5844.71 SHORT read 3812.05 [ 284.99, 387.99]
The test is run as follows:
nobjhi=2 thrhi=2 size=1024 case=disk sh /usr/bin/obdfilter-survey
As expected, the results are not as good on umdist01. The full output is as follows; its second line is the same "2 threads per OST" configuration quoted above, and its third line corresponds to the second test quoted for umdist09. Results from umdist03, with 6 MD1000 shelves and 9 OSTs, are also shown.
umdist01 and umdist03 are both PE2950.
Mon Jun 1 10:00:31 EDT 2015 Obdfilter-survey for case=disk from umdist01.aglt2.org
ost 6 sz 6291456K rsz 1024K obj 6 thr 6 write 793.36 [ 65.99, 177.96] rewrite 1498.67 [ 119.99, 314.97] read 883.23 [ 73.99, 247.91]
ost 6 sz 6291456K rsz 1024K obj 6 thr 12 write 1417.80 [ 179.96, 301.93] rewrite 738.33 [ 41.99, 316.93] read 1088.38 [ 89.99, 375.92]
ost 6 sz 6291456K rsz 1024K obj 12 thr 12 write 1900.24 [ 322.78, 369.97] rewrite 438.91 [ 0.00, 567.84] read 1668.66 [ 228.99, 370.83]
Mon Jun 1 12:13:37 EDT 2015 Obdfilter-survey for case=disk from umdist03.aglt2.org
ost 9 sz 9437184K rsz 1024K obj 9 thr 9 write 595.45 [ 0.00, 297.97] rewrite 754.19 [ 0.00, 312.96] read 1710.89 [ 144.99, 280.98]
ost 9 sz 9437184K rsz 1024K obj 9 thr 18 write 611.77 [ 0.00, 285.95] rewrite 621.57 [ 0.00, 290.97] read 1962.69 [ 190.99, 270.96]
ost 9 sz 9437184K rsz 1024K obj 18 thr 18 write 535.08 [ 0.00, 280.97] rewrite 562.50 [ 0.00, 273.98] read 2020.20 [ 196.99, 287.99]
umdist02 is an R710
Mon Jun 1 12:24:34 EDT 2015 Obdfilter-survey for case=disk from umdist02.aglt2.org
ost 6 sz 6291456K rsz 1024K obj 6 thr 6 write 3409.24 SHORT rewrite 2489.33 [ 304.97, 389.96] read 3051.71 SHORT
ost 6 sz 6291456K rsz 1024K obj 6 thr 12 write 4378.68 SHORT rewrite 1456.43 [ 0.00, 602.95] read 3834.28 SHORT
ost 6 sz 6291456K rsz 1024K obj 12 thr 12 write 1782.82 SHORT rewrite 625.03 [ 0.00, 187.99] read 4303.14 SHORT
More test results, zfs vs ldiskfs
During the process of building Lustre rpms with zfs 0.6.4.2, it was decided to do several tests
- IO tests using cp from/to /tmp of dc2-10-23 with zfs
- IO tests using cp from/to /tmp of dc2-10-23 with ldiskfs
- IO tests using stock 2.7.0 from/to /tmp of dc2-10-23 with zfs 0.6.3
- Upgrade zfs and kernel to the 2.7.58 build with zfs 0.6.4.2
- Repeat of first test with zfs
Size formatted ldiskfs
/dev/sdb 3.7T 69M 3.5T 1% /mnt/ost-001
Size formatted zfs
ost-001/ost0000 3.6T 3.8M 3.6T 1% /mnt/ost-001
This is the log of test results performed.
Test results conclusions
- Best single-machine read rate is from ldiskfs at ~16MB/s, otherwise range is from 9-12MB/s
- Best single-machine write rate is to ldiskfs at 20MB/s, but zfs is statistically the same, not far behind at 19.7
- Writes to zfs are expensive, in an iostat sense, 40% vs 10% for ldiskfs on dual threads
- Reads from ldiskfs are expensive, in an iostat sense, 95% vs 60% for zfs on dual threads
- Single thread read iostat on umdist10 is 33-48% on zfs; single thread ldiskfs was 60%
- Single thread write iostat on umdist10 is 20-25% on zfs.
Some post-production update stats
- Typical iostat on umdist09 during writes is 15-20% of the OST capability
- Typical write rate on umdist09 is 10MB/s/OST
Table keys
- The Version is that of Lustre
- 0 = stock 2.7.0
- 58-1 = Initial install is 2.7.58 with zfs 0.6.4.2
- 58-2 = Install is 2.7.0 with zfs 0.6.3 upgraded to 2.7.58 with zfs 0.6.4.2
- 58-3 = "zpool upgrade -a" run on the 58-2 pool
- Final = Post-full-upgrade Production system
- LD is ldiskfs formatted disk
| Sequence | Host 1 | Host 2 | dist10 NIC | Version | LD or ZFS | Read | Write | iostat | load_one |
| 1 | dc2-10-23 | 1Gb | 58-1 | ZFS | 11.74 |
| 2 | dc2-10-23 | 1Gb | 58-1 | ZFS | 11.88 |
| 3 | dc2-10-23 | 1Gb | 58-1 | LD | 10.27 |
| 4 | dc2-10-23 | 1Gb | 58-1 | LD | 15.7 |
| 5 | dc2-10-23 | 10Gb | 58-1 | LD | 11.22 |
| 6 | dc2-10-23 | 10Gb | 58-1 | LD | 12.81 |
| 7 | dc40-16-25 | 10Gb | 58-1 | LD | 16.4 | 60 |
| 8 | dc40-16-25 | 10Gb | 58-1 | LD | 20.2 |
| 9 | dc2-10-23 | dc40-16-25 | 10Gb | 58-1 | LD | 21.06 |
| 10 | dc2-10-23 | dc40-16-25 | 10Gb | 58-1 | LD | 19.54 | 95 |
| 11 | dc2-10-23 | dc40-16-25 | 10Gb | 58-1 | LD | 34 | 10 |
| 12 | dc40-16-25 | 10Gb | 0 | ZFS | 19.79 | 23 |
| 13 | dc2-10-23 | 10Gb | 0 | ZFS | 10.21 | 14 |
| 14 | dc2-10-23 | dc40-16-25 | 10Gb | 0 | ZFS | 35 | 40 | 0.8 |
| 15 | dc40-16-25 | 10Gb | 0 | ZFS | 12.02 | 42 | 2.5 |
| 16 | dc2-10-23 | 10Gb | 0 | ZFS | 11.74 | 14 |
| 17 | dc2-10-23 | dc40-16-25 | 10Gb | 0 | ZFS | 18 | 62 | 1.5 |
| 18 | dc2-10-23 | 10Gb | 58-2 | ZFS | 13.92 | 16.38 | 0.2 |
| 19 | dc40-16-25 | 10Gb | 58-2 | ZFS | 19.93 | 25.64 |
| 20 | dc40-16-25 | 10Gb | 58-2 | ZFS | 9.41 | 43.69 | 0.9 |
| 21 | dc2-10-23 | 10Gb | 58-2 | ZFS | 9.44 | 35.55 | 0.9 |
| 22 | dc2-10-23 | dc40-16-25 | 10Gb | 58-2 | ZFS | 19 | 55.76 | 1.7 |
| 23 | dc2-10-23 | dc40-16-25 | 10Gb | 58-2 | ZFS | 34 | 42.15 | 0.4 |
| 24 | dc40-16-25 | 10Gb | 58-3 | ZFS | 19.73 | 23.57 | 0.4 |
| 25 | dc2-10-23 | 10Gb | 58-3 | ZFS | 13.39 | 20.26 | 0.25 |
| 26 | dc2-10-23 | dc40-16-25 | 10Gb | 58-3 | ZFS | 36 | 43.88 | 0.55 |
| 27 | dc40-16-25 | 10Gb | 58-3 | ZFS | 9.46 | 48.32 | 1.1 |
| 28 | dc2-10-23 | 10Gb | 58-3 | ZFS | 8.96 | 33.08 | 0.7 |
| 29 | dc2-10-23 | dc40-16-25 | 10Gb | 58-3 | ZFS | 14.65 | 60.55 | 1.8 |
| 30 | dc2-10-23 | 10Gb | Final | ZFS | 12.67 | 0.8 |
| 31 | dc40-16-25 | 10Gb | Final | ZFS | 20.20 | 1.0 |
| 32 | dc2-10-23 | 10Gb | Final | ZFS | 26.0 | 1.0 |
| 33 | dc40-16-25 | 10Gb | Final | ZFS | 13.96 | 1.0 |
Random notes gleaned while reading up on the topic
OSTs in Lustre 2.4 max out at 16TB, but with 2.5 can be up to 256TB? This assumes zfs beneath. For ldiskfs, the max is 128TB.
MDS should have 1-2% of the storage of the full system, so, for our 1PB, this would be 1-2TB.
49M inodes currently in use on the MDT, at 2kB each, is 98GB of space. Double that to 200GB. Current is 263GB.
For better performance, we recommend that you create RAID sets with 4 or 8 data disks plus one or two parity disks. Using larger RAID sets will negatively impact performance compared to having multiple independent RAID sets.
For RAID 5, RAID 6, or RAID 1+0 storage, specifying the following option to the --mkfsoptions parameter improves the layout of the file system metadata, ensuring that no single disk contains all of the allocation bitmaps: -E stride=chunk_blocks. The chunk_blocks variable is in units of 4096-byte blocks and represents the amount of contiguous data written to a single disk before moving to the next disk. This is alternately referred to as the RAID stripe size. This is applicable to both MDT and OST file systems.
For best results, use RAID 5 with 5 or 9 disks or RAID 6 with 6 or 10 disks, each on a different controller. The stripe width is the optimal minimum I/O size. Ideally, the RAID configuration should allow 1 MB Lustre RPCs to fit evenly on a single RAID stripe without an expensive read-modify-write cycle. Use this formula to determine the stripe_width, where number_of_data_disks does not include the RAID parity disks (1 for RAID 5 and 2 for RAID 6):
stripe_width_blocks = chunk_blocks * number_of_data_disks = 1 MB
If the RAID configuration does not allow chunk_blocks to fit evenly into 1 MB, select stripe_width_blocks such that it is close to 1 MB, but not larger. The stripe_width_blocks value must equal chunk_blocks * number_of_data_disks. Specifying the stripe_width_blocks parameter is only relevant for RAID 5 or RAID 6, and is not needed for RAID 1 plus 0. Run --reformat on the file system device (/dev/sdc), specifying the RAID geometry to the underlying ldiskfs file system, where:
--mkfsoptions "other_options -E stride=chunk_blocks, stripe_width=stripe_width_blocks"
A RAID 6 configuration with 6 disks has 4 data and 2 parity disks. The chunk_blocks <= 1024KB/4 = 256KB. Because the number of data disks is a power of 2, the stripe width is equal to 1 MB.
--mkfsoptions "other_options -E stride=chunk_blocks, stripe_width=stripe_width_blocks
For best performance, the journal should be placed somewhere other than on the OST itself. See page 34 of the manual. For example:
oss# mke2fs -b 4096 -O journal_dev /dev/sdb journal_size
The value of journal_size is specified in units of 4096-byte blocks; for example, 262144 for a 1 GB journal.
[oss#] mkfs.lustre --mgsnode=mds@osib --ost --index=0 --mkfsoptions="-J device=/dev/sdb1" /dev/sdc
Growing an OST
It is possible to replace all disks on an OST with larger disks, and when all disks are replaced, the OST will grow to the new size. The procedure outlined below replaces one disk at a time in the OST, with 2 replacements per day possible. This was successfully done on umdist01, and a newly purchased set of 20 1TB disks will again be employed to grow two more OSTs, freeing up 750GB disks as spares in the process.
- zpool set autoexpand=on ost-001
- zpool offline ost-001 pci-0000:08:00.0-scsi-0:2:16:0
- omconfig storage vdisk action=deletevdisk controller=1 vdisk=16
- Replace the disk
- omconfig storage controller action=clearforeignconfig controller=1
- omconfig storage controller action=createvdisk controller=1 size=max raid=r0 pdisk=0:1:1 name=c1d101
- zpool replace -f ost-001 pci-0000:08:00.0-scsi-0:2:16:0
- Wait for the resilver to complete (it can be followed as sketched below), and move to the next disk in the list
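Resilver progress, and the eventual expansion, can be checked from the pool itself, e.g. (a sketch):
zpool status ost-001 | grep -A2 'scan:'
zpool get autoexpand,size ost-001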
The list of devices in a zpool can be obtained from the "zpool status ost-001" command. This in turn can be matched up to the output of "/root/tools/map_sd_for_iostat.sh", thus obtaining a full list of physical disks and vdisk names that correspond to those devices.
After the last disk is replaced and resilvered, make sure the new, grown OST is properly ensconced in the zfs configuration.
- umount /mnt/ost-001
- zpool export ost-001
- zpool import ost-001
- mount ost-001/ost0012
--
BobBall - 07 Apr 2015