Test results comparing zfs to ldiskfs
The tests below run a test Lustre system (mgs + umdist10) through its paces, starting with a straight-up zfs 0.6.4.2 install with both ldiskfs and zfs partitions, then proceeding to a stock Lustre 2.7.0 install with zfs 0.6.3, then to an upgrade to our home-built 2.7.58 rpms.
[root@c-10-23 lustre]# cat /root/copy_script.sh
#!/bin/bash
cp "$1" .
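The find -print | xargs -n 1 pattern used throughout forks one cp per file and breaks on filenames containing whitespace. A hypothetical safer variant of the same loop (copy_tree is not a function used in these tests, just a sketch):

```shell
# Hypothetical variant of the copy loop: NUL-delimited names survive
# whitespace in paths, and cp -t copies a batch of files per invocation.
copy_tree() {
    local src=$1 dst=$2
    find "$src" -type f -print0 | xargs -0 -r cp -t "$dst"
}
```

e.g. copy_tree /lustre/umt3/datadisk/mc09_7TeV/event . in place of the find | xargs pipeline below.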
Copy ~20GB of files (19622MB per the du below) out of the old Lustre to /tmp/condor/lustre (keeps them from being deleted).
[root@c-10-23 lustre]# time find /lustre/umt3/datadisk/mc09_7TeV/event -type f -print \
| xargs -n 1 /root/copy_script.sh
real 54m15.407s
user 0m41.144s
sys 3m4.177s
/tmp utilization is running around 15%, but spiky rather than consistent.
iostat wkB/s is around 9000kB/s
[root@c-10-23 condor]# du -s -x -m lustre
19622 lustre
[root@c-10-23 ~]# ls -1 /tmp/condor/lustre|wc -l
71344
Average file size is 0.275MB
So, 6.03MB/s average copy from old lustre this way to local disk /tmp
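All of the MB/s figures in these notes are just the 19622MB data size divided by the wall-clock (real) time. A small hypothetical helper to reproduce them:

```shell
# Hypothetical helper: data size in MB divided by the real time (minutes,
# seconds) reported by `time`, matching the rates quoted in these notes.
rate_mb_s() {
    local mins=$1 secs=$2 mb=${3:-19622}    # 19622MB per the du above
    awk -v m="$mins" -v s="$secs" -v mb="$mb" \
        'BEGIN { printf "%.2fMB/s\n", mb / (m * 60 + s) }'
}
rate_mb_s 54 15.407    # the 54m15.407s copy above -> 6.03MB/s
```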
-------------------------
Now, write this to the new Lustre. /tmp utilization is 72% and is steady.
Bytes out about 11MB/s; iostat rkB/s around 10000kB/s
[root@c-10-23 copiedTo]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh
real 27m51.289s
user 0m43.058s
sys 3m0.993s
So, 11.74MB/s average local disk /tmp to new lustre
--------------------
Now, copy from new Lustre to new /tmp/condor/bkfromnew. /tmp util steady around 25-30%.
The load_one on umdist10 is higher now (0.6) than for the writing (mostly less than 0.25)
[root@c-10-23 lustre]# time find /lustre/T3test/copiedTo -type f -print | xargs -n 1 /root/copy_script.sh
real 27m31.534s
user 0m42.136s
sys 3m2.407s
So, 11.88MB/s average new lustre back to local disk /tmp
--------------------
umount zfs disk on umdist10, and ldiskfs /mnt/mgs on mgs.
Re-create mgs and mount it
mkfs.lustre --fsname=T3test --mgs --mdt --reformat --index=0 /dev/mapper/vg0-lv_home
Destroy zfs volume, delete RAID-0 vdisks, create RAID-6 volume, and initialize it (wait for completion)
667 omconfig storage vdisk action=deletevdisk controller=0 vdisk=5
668 omconfig storage vdisk action=deletevdisk controller=0 vdisk=4
669 omconfig storage vdisk action=deletevdisk controller=0 vdisk=3
670 omconfig storage vdisk action=deletevdisk controller=0 vdisk=2
671 omconfig storage vdisk action=deletevdisk controller=0 vdisk=1
672 omconfig storage vdisk action=deletevdisk controller=0 vdisk=0
674 omconfig storage controller action=createvdisk controller=0 size=max raid=r6 pdisk=0:0:0,0:0:1,0:0:2,0:0:3,0:0:4,0:0:5 \
stripesize=256kb readpolicy=ra writepolicy=wb name=ost0
676 omconfig storage vdisk action=initialize controller=0 vdisk=0
Wait for init to complete, about 15hrs
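The ~15hr init wait can be scripted. A hypothetical poll loop; with OMSA the status command would be something like omreport storage vdisk controller=0, whose exact output wording is an assumption here:

```shell
# Hypothetical wait loop: run a status command until it stops reporting
# "Background Initialization", checking every $interval seconds.
wait_for_init() {
    local interval=$1; shift
    while "$@" | grep -qi 'Background Initialization'; do
        sleep "$interval"
    done
}
# e.g. wait_for_init 600 omreport storage vdisk controller=0
```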
 699  yum erase lustre-osd-zfs lustre-osd-zfs-mount
700 cd /atlas/data08/ball/admin/LustreSL6/2.7.58/server
702 yum localinstall lustre-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-osd-ldiskfs-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-osd-ldiskfs-mount-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm
705 mkfs.lustre --ost --mgsnode=10.10.1.140@tcp0 --fsname=T3test --index=0 --mountfsoptions="stripe=256" /dev/sdb
------------------
Copy to ldiskfs formatted storage
Load_one on umdist10 is <~0.15
[root@c-10-23 copiedTo]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh
real 31m51.338s
user 0m43.523s
sys 3m2.792s
This is 10.27MB/s to Lustre.
----------------------
Copy from Lustre back to /tmp
Load_one on umdist10 is <~0.55
[root@c-10-23 bkfromnew2]# time find /lustre/T3test/copiedTo -type f -print | xargs -n 1 /root/copy_script.sh
real 20m49.689s
user 0m42.474s
sys 3m3.575s
This is 15.70MB/s
-----------------------
Add 10Gb Myricom on umdist10, create /lustre/T3test/copiedTo10G and write to it.
Modified /etc/modprobe.d/lustre.conf and /etc/ganglia/gmond.conf, and created ifcfg-eth2.
Load on umdist10 unchanged.
[root@c-10-23 copiedTo10G]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh
real 29m8.506s
user 0m42.990s
sys 3m2.682s
This is 11.22MB/s
--------------------------
Reverse direction back to /tmp in bkfromnew3
[root@c-10-23 bkfromnew3]# time find /lustre/T3test/copiedTo -type f -print | xargs -n 1 /root/copy_script.sh
real 25m31.611s
user 0m42.181s
sys 3m4.468s
This is 12.81MB/s
---------------------------------
Switch to the 10Gb host dc40-16-25 for some testing. Copy from Lustre to the node.
umdist10 IO rate is about 17MB/s
[root@c-16-25 lustre]# time find /lustre/T3test/copiedTo -type f -print | xargs -n 1 /root/copy_script.sh
real 19m56.640s
user 0m30.901s
sys 2m28.088s
This is 16.40MB/s
-----------------------------
Copy to Lustre from dc40-16-25
umdist10 IO rate is about 22MB/s.
[root@c-16-25 from_16-25]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh
real 16m11.373s
user 0m24.455s
sys 1m44.448s
This is 20.20MB/s
-----------------------------
Now, let's run both dc40-16-25 and dc2-10-23 simultaneously. Note that the source directories are kept distinct.
ost utilization on umdist10 is about 90-95%. Single stream was around 60% yesterday.
Bytes out from umdist10 around 22-24MB/s
[root@c-10-23 bkfromnew3]# time find /lustre/T3test/copiedTo -type f -print | xargs -n 1 /root/copy_script.sh
real 31m3.092s
user 0m41.830s
sys 3m0.513s
This is 10.53MB/s
[root@c-16-25 bkfromnew]# time find /lustre/T3test/copiedTo10G -type f -print | xargs -n 1 /root/copy_script.sh
real 27m50.670s
user 0m7.688s
sys 0m58.788s
This is 11.74MB/s
Aggregate over longest time is 21.06MB/s
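The aggregate figure is simply twice the 19622MB divided by the longer of the two real times:

```shell
# Aggregate rate for the simultaneous run: both streams moved 19622MB,
# bounded by the slower stream (31m3.092s = 1863.092s) -> 21.06MB/s
awk 'BEGIN { printf "%.2fMB/s\n", (2 * 19622) / (31 * 60 + 3.092) }'
```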
----------------------------
Repeat this, copying TO two distinct lustre directories.
ost utilization on umdist10 is about 10%
Bytes in to umdist10 around 34MB/s
[root@c-16-25 new_from_16-25]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh
real 16m10.288s
user 0m23.471s
sys 1m43.641s
This is 20.22MB/s, very close to the single-stream write rate recorded above for this machine.
[root@c-10-23 new_from_10-23]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh
real 32m0.445s
user 0m43.284s
sys 2m59.878s
This is 10.22MB/s, again very close to the single stream write rate above.
These two streams appear to be independent, and do not impact each other.
---------------------------------------------------
Repeat the "from lustre" copies one last time using both hosts.
Iostat, etc, all look about the same as before.
[root@c-16-25 bkfromnew4]# time find /lustre/T3test/copiedTo -type f -print | xargs -n 1 /root/copy_script.sh
real 24m35.714s
user 0m7.876s
sys 0m59.209s
This is 13.30MB/s
[root@c-10-23 bkfromnew4]# time find /lustre/T3test/copiedTo10G -type f -print | xargs -n 1 /root/copy_script.sh
real 33m28.274s
user 0m42.190s
sys 3m2.625s
This is 9.77MB/s
Aggregate over the longest time is 19.54MB/s
------------------------------------------------------------------------------------
------------------------- Now, the upgrade path ----------------------------
Rebuild both mgs and umdist10 with zfs 0.6.3 and the stock Lustre 2.7.0, remake the filesystems, and re-run the throughput tests.
After that, upgrade to our home-built rpms, and do it all again.
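The rebuild commands were not captured here. A minimal sketch of what the ZFS-backed OST re-creation would look like; the pool name ost-001 matches the zpool upgrade output later in these notes, but the raidz2 layout and device names are assumptions:

```shell
# Hypothetical re-creation of the ZFS OST. mkfs.lustre can create the
# pool itself when given a vdev specification; device names are assumed.
mkfs.lustre --ost --backfstype=zfs --mgsnode=10.10.1.140@tcp0 \
    --fsname=T3test --index=0 ost-001/ost0 \
    raidz2 sdb sdc sdd sde sdf sdg
```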
---------------------
Copy from dc40-16-25 to /lustre/T3test/copiedTo10G
[root@c-16-25 copiedTo10G]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh
real 16m31.722s
user 0m27.663s
sys 1m59.183s
This is 19.79MB/s
Repeat following reboot of dc40-16-25
iostat utilization on the 6 zfs sd devices of umdist10 averages about 23%
[root@c-16-25 copiedTo10G_b]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh
real 16m11.089s
user 0m24.273s
sys 1m45.823s
This is 20.21MB/s, nearly identical. Why is this faster than before? Is it a network bottleneck? Or is it my rpms?
-------------------------------
[root@c-10-23 copiedTo1G]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh
real 32m0.230s
user 0m43.092s
sys 3m1.415s
This is more typical, 10.21MB/s
iostat utilization on the 6 zfs sd devices of umdist10 averaged about 14% during this time.
------------------------------
Run both writes simultaneously now
iostat utilization on the 6 zfs sd devices of umdist10 averaged about 40% with both
running; this seems to be a simple sum of the individual rates.
Peak IO around 35MB/s
load_one around 0.8
[root@c-16-25 copiedTo10G_c]# time find /tmp/condor/lustre -type f -print | xargs -n 1 /root/copy_script.sh
real 15m55.832s
user 0m20.618s
sys 1m34.171s
[root@c-10-23 copiedTo1G_c]# time find /tmp/condor/bkfromnew -type f -print | xargs -n 1 /root/copy_script.sh
real 30m44.877s
user 0m43.150s
sys 3m1.965s
These runs are 20.53MB/s and 10.64MB/s respectively; aggregate over the longest time is 21.27MB/s.
-----------------------------------
Read from new Lustre
Start with dc40-16-25
Average iostat is 42-43%
load_one around 2.5
[root@c-16-25 T2back1]# time find /lustre/T3test/copiedTo1G -type f -print | xargs -n 1 /root/copy_script.sh
real 27m12.723s
user 0m8.614s
sys 1m4.618s
12.02MB/s
----------------------------------
Read on dc2-10-23
Average iostat is about 14%
[root@c-10-23 T2back1]# time find /lustre/T3test/copiedTo10G -type f -print | xargs -n 1 /root/copy_script.sh
real 27m51.404s
user 0m42.114s
sys 3m2.842s
11.74MB/s
-----------------------------
Now do simultaneous reads.
Average iostat of umdist10 is 60-63%
load_one around 1.5
IO rate around 18MB/s
[root@c-16-25 T2back2]# time find /lustre/T3test/copiedTo1G_c -type f -print | xargs -n 1 /root/copy_script.sh
date
real 33m56.085s
user 0m8.137s
sys 1m1.258s
[root@c-10-23 T2back2]# time find /lustre/T3test/copiedTo10G_b -type f -print | xargs -n 1 /root/copy_script.sh
date
real 37m39.361s
user 0m42.361s
sys 3m3.502s
Using aggregate time, get 17.37MB/s
-------------------------------------------------------------------------
Now, upgrade. Start with mgs, move to umdist10
1. umount on both machines, umdist10 first. Comment out the Lustre mounts in fstab.
2. yum remove lustre rpms
yum erase lustre lustre-osd-ldiskfs-mount lustre-modules lustre-osd-ldiskfs
yum erase kernel-firmware
3. upgrade kernel and reboot
4. install new Lustre rpms
5. reboot
6. mount mgs
Now, on umdist10
7. cp /etc/zfs/zpool.cache /root
8. service zfs stop
9. chkconfig zfs off
10. yum erase lustre lustre-modules lustre-osd-zfs lustre-osd-zfs-mount
11. yum erase libnvpair1 libuutil1 libzfs2 libzpool2 spl spl-dkms zfs zfs-dkms zfs-dracut
12. yum erase kernel-firmware (gets 573 version out of the way)
13. cd /atlas/data08/ball/admin/LustreSL6/2.7/server
14. yum localinstall kernel-firmware
15. upgrade kernel and reboot
16. Remove these dangling dkms source links (the loop prints any source link whose target no longer exists)
for i in /var/lib/dkms/*/[^k]*/source; do [ -e "$i" ] || echo "$i";done
/var/lib/dkms/spl/0.6.3/source
/var/lib/dkms/zfs/0.6.3/source
17. install zfs and lustre rpms.
See zfs install directions here: https://www.aglt2.org/wiki/bin/view/AGLT2/ZFsforAFS#Install_ZFS
yum localinstall lustre-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-modules-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-osd-zfs-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm \
lustre-osd-zfs-mount-2.7.58-2.6.32.504.16.2.el6_lustre_gdeb6fcd.x86_64.rpm
18. Reboot
19. Mount the ost
------------------------- All is working, run read/write tests ---------------
From dc2-10-23 to Lustre
The average iostat is 16.38%
IO rate started at 22MB/s, has dropped to 14MB/s
load_one of umdist10 is around 0.2
[root@c-10-23 copiedTo1G_e]# time find /tmp/condor/T2back2 -type f -print | xargs -n 1 /root/copy_script.sh
real 23m29.328s
user 0m42.869s
sys 3m1.129s
13.92MB/s
--------------------------
Ditto dc40-16-25
Average iostat 25.64%
IO rate pretty steady at ~23MB/s
[root@c-16-25 copiedTo10G_e]# time find /tmp/condor/T2back2 -type f -print | xargs -n 1 /root/copy_script.sh
real 16m24.367s
user 0m24.729s
sys 1m44.971s
19.93MB/s
--------------------------
Now, read back from Lustre
Read from Lustre dc40-16-25
Average iostat 43.69%
IO rate around 13MB/s
load_one around 0.9
[root@c-16-25 T2back3]# time find /lustre/T3test/copiedTo1G_c -type f -print | xargs -n 1 /root/copy_script.sh
real 34m44.445s
user 0m8.097s
sys 1m2.598s
9.41MB/s
-----------------------------
Read from Lustre dc2-10-23
Average iostat 35.55%
IO rate around 11MB/s
load_one around 0.9
[root@c-10-23 T2back3]# time find /lustre/T3test/copiedTo1G_c -type f -print | xargs -n 1 /root/copy_script.sh
real 34m37.604s
user 0m42.412s
sys 3m2.963s
9.44MB/s
---------------------------
Simultaneous reads
Average iostat 55.76%
IO rate around 17-19MB/s (10-11MB on 16-25 and 8-9MB on 10-23)
load_one around 1.7
[root@c-10-23 T2back4]# time find /lustre/T3test/copiedTo1G -type f -print | xargs -n 1 /root/copy_script.sh
real 42m57.068s
user 0m41.797s
sys 3m2.971s
[root@c-16-25 T3back4]# time find /lustre/T3test/copiedTo10G_b -type f -print | xargs -n 1 /root/copy_script.sh
real 34m38.139s
user 0m10.243s
sys 1m31.489s
Using the aggregate (longest time), this is 15.22MB/s
---------------------------
Simultaneous writes
Average iostat 42.15%
IO rate around 34MB/s (20-22MB on 16-25 and 14MB on 10-23)
load_one around 0.4
[root@c-16-25 copiedTo10G_f]# time find /tmp/condor/T2back1 -type f -print | xargs -n 1 /root/copy_script.sh
real 16m59.451s
user 0m26.940s
sys 1m49.630s
[root@c-10-23 copiedTo1G_f]# time find /tmp/condor/T2back1 -type f -print | xargs -n 1 /root/copy_script.sh
real 26m11.999s
user 0m43.154s
sys 3m3.011s
Aggregate over the longest time is 24.96MB/s.
------------------------- Last of all, upgrade the zpool features -----------------
umount all OST
[root@umdist10 ~]# zpool upgrade
This system supports ZFS pool feature flags.
All pools are formatted using feature flags.
Some supported features are not enabled on the following pools. Once a
feature is enabled the pool may become incompatible with software
that does not support the feature. See zpool-features(5) for details.
POOL FEATURE
---------------
ost-001
spacemap_histogram
enabled_txg
hole_birth
extensible_dataset
embedded_data
bookmarks
[root@umdist10 ~]# zpool upgrade -a
This system supports ZFS pool feature flags.
Enabled the following features on 'ost-001':
spacemap_histogram
enabled_txg
hole_birth
extensible_dataset
embedded_data
bookmarks
remount the OST
------------------------ Repeat last set of tests ----------------------
From dc40-16-25 to Lustre
Average iostat 23.57%
IO rate around 22MB/s
load_one around 0.4
[root@c-16-25 copiedTo10G_g]# time find /tmp/condor/T2back1 -type f -print | xargs -n 1 /root/copy_script.sh
real 16m34.413s
user 0m23.646s
sys 1m44.393s
This is 19.73MB/s
-------------------------------------------------
From dc2-10-23 to Lustre
Average iostat 20.26%
IO rate around 15MB/s
load_one around 0.25
[root@c-10-23 copiedTo1G_g]# time find /tmp/condor/T2back1 -type f -print | xargs -n 1 /root/copy_script.sh
real 24m24.959s
user 0m43.396s
sys 3m0.664s
13.39MB/s
-------------------------------------------------
Simultaneous writes
Average iostat 43.88%
IO rate around 36MB/s (Individually, these machines are unchanged from single write rates)
load_one around 0.55
[root@c-16-25 copiedTo10G_h]# time find /tmp/condor/T2back2 -type f -print | xargs -n 1 /root/copy_script.sh
real 16m30.593s
user 0m25.605s
sys 1m48.979s
[root@c-10-23 copiedTo1G_h]# time find /tmp/condor/T2back2 -type f -print | xargs -n 1 /root/copy_script.sh
real 24m23.808s
user 0m43.325s
sys 3m1.776s
Aggregate over the longest time is 26.81MB/s.
-------------------------------------------------------
Now, read tests
On dc40-16-25 first....
Average iostat 48.32%
IO rate around 8-10MB/s
load_one around 1.1
[root@c-16-25 T2back5]# time find /lustre/T3test/copiedTo1G -type f -print | xargs -n 1 /root/copy_script.sh
real 34m32.438s
user 0m8.226s
sys 1m3.338s
9.46MB/s
On dc2-10-23 read from Lustre
Average iostat 33.08%
IO rate around 9-10MB/s
load_one around 0.7
[root@c-10-23 T2back5]# time find /lustre/T3test/copiedTo10G_g -type f -print | xargs -n 1 /root/copy_script.sh
real 36m30.948s
user 0m42.654s
sys 3m3.977s
This is 8.96MB/s
-------------------------------------
Now, do the combined reads from Lustre
Average iostat 60.55%
IO rate around 15MB/s (dc40 and dc2 each around 8-9)
load_one around 1.8
[root@c-16-25 T2back6]# time find /lustre/T3test/copiedTo1G_c -type f -print | xargs -n 1 /root/copy_script.sh
date
real 41m4.598s
user 0m9.673s
sys 1m14.899s
[root@c-10-23 T2back6]# time find /lustre/T3test/copiedTo10G_b -type f -print | xargs -n 1 /root/copy_script.sh
date
real 44m39.748s
user 0m42.022s
sys 3m2.961s
Aggregate rate is 14.65MB/s
----------------------------------------
----------------------------------------
Following the upgrade, write from dc2-10-23 to /lustre/umt3
IO rate is steady and stable, with 10-23 load_one around 0.8
[root@c-10-23 copyTo1G]# time find /tmp/condor/T2back1 -type f -print | xargs -n 1 /root/copy_script.sh
real 25m48.471s
user 0m43.222s
sys 3m0.494s
12.67MB/s
---------------------
Now, write from dc40-16-25
IO rate is steady and stable, with 16-25 load_one around 1.0
[root@c-16-25 copyTo10G]# time find /tmp/condor/T3back4 -type f -print | xargs -n 1 /root/copy_script.sh
real 16m11.506s
user 0m28.363s
sys 2m3.528s
20.20MB/s
------------------------------
read on dc2-10-23
IO rate steady, load_one around 1.0
[root@c-10-23 final1]# time find /lustre/umt3/bobtest/copyTo10G -type f -print | xargs -n 1 /root/copy_script.sh
real 12m33.031s
user 0m43.726s
sys 3m9.278s
26.06MB/s
----------------------------------
read on dc40-16-25
IO rate steady, but also steadily decreasing, never exceeded about 20MB/s, load_one around 1.0
[root@c-16-25 T3back5]# time find /lustre/umt3/bobtest/copyTo1G -type f -print | xargs -n 1 /root/copy_script.sh
real 23m25.403s
user 0m10.222s
sys 1m21.805s
13.96MB/s
-----------------
Interestingly, although disparate, both read rates are higher than in the mgs/umdist10 tests.
The write rates are comparable to the mgs/umdist10 tests.
--
BobBall - 02 Sep 2015