Links
Physical Inspection
We have seen shipping damage on an MD1200. Check that the mounting tabs are intact and that drive removal is not obstructed.
RAID Initialization
The shelf should be fully initialized before being put into production.
A full initialization, which will blank out any existing data on the shelf, can be run with the "slow initialization" option in the BIOS setup tool. The shelf (vdisk) will be unavailable to the OS until this completes.
Background initialization (BGI) runs automatically on any new vdisk. Its rate can be adjusted to balance against other use of the controller; the default rate is 30% (?). If nothing else has priority, set it to 100%:
omconfig storage controller controller=0 action=setbgirate rate=100
The init seems to take about as long as reading or writing once through all the drives.
There is some confusion over what RAID init does. How are bits on the drives actually changed during a BGI? (June 6, 2010: Have asked Dell whether data is rewritten during background init; the tech will email an answer if he can find it...)
Note that the BGI does not need to complete before the array can be used, but user I/O on the array takes priority and will halt or slow the BGI. It is possible to write data to parts of the array that the BGI has not yet reached. Such writes of course involve an update (creation) of RAID parity blocks.
Fill filesystem
It isn't clear whether a background initialization rewrites the user data areas of the array. So, suggest filling the filesystem once, then reading the files back and verifying their checksums.
Ah, think I lost the script I used for writing, but basically I just used perl to fill the filesystem with 4GB files that were just 1s and 0s (binary). Each file should have the same md5sum (except the truncated final files).
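Since the original perl script is lost, here is a minimal Python sketch of the same idea, not a reconstruction of it. The byte pattern (0x55, alternating bits), the file naming, and the function names are all assumptions made for illustration. It writes fixed-pattern files until the filesystem reports that it is full, and computes the checksum every full-size file should have.

```python
import errno
import hashlib
import os

FILE_SIZE = 4 * 1024**3   # 4GB per file, as in the notes above
CHUNK_SIZE = 1024**2      # write 1 MiB at a time (assumed value)
PATTERN = b"\x55"         # single-byte pattern 01010101 (assumed value)


def pattern_chunks(file_size, chunk_size=CHUNK_SIZE, pattern=PATTERN):
    """Yield the chunks that make up one pattern file of file_size bytes."""
    chunk = pattern * chunk_size  # pattern is expected to be a single byte
    remaining = file_size
    while remaining > 0:
        n = min(chunk_size, remaining)
        yield chunk[:n]
        remaining -= n


def expected_md5(file_size=FILE_SIZE, chunk_size=CHUNK_SIZE, pattern=PATTERN):
    """MD5 of a full-size pattern file, computed without touching the disk."""
    digest = hashlib.md5()
    for chunk in pattern_chunks(file_size, chunk_size, pattern):
        digest.update(chunk)
    return digest.hexdigest()


def fill_with_pattern_files(directory, file_size=FILE_SIZE,
                            chunk_size=CHUNK_SIZE, pattern=PATTERN,
                            max_files=None):
    """Write fill_NNNNN.dat files until the filesystem is full.

    Returns the paths created. The last file is normally truncated because
    space runs out part-way through it. max_files limits the run for testing.
    """
    paths = []
    while max_files is None or len(paths) < max_files:
        path = os.path.join(directory, f"fill_{len(paths):05d}.dat")
        try:
            with open(path, "wb") as f:
                paths.append(path)
                for chunk in pattern_chunks(file_size, chunk_size, pattern):
                    f.write(chunk)
        except OSError as exc:
            if exc.errno != errno.ENOSPC:
                raise  # anything other than "no space left" is a real error
            break
    return paths
```

For example, fill_with_pattern_files("/mnt/msufs10_a") would fill the mount point used on this page, and expected_md5() gives the value that every complete file should match.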
find /mnt/msufs10_a -type f -print0 | xargs -0 md5sum > msufs10_a.md5sum &
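Once the checksum file exists, the check is that all full-size files share one digest. The following is a small sketch of that comparison, assuming the standard md5sum output format (digest, a space, a mode character, then the file name). The function names are made up for this example.

```python
from collections import Counter


def parse_md5sum_lines(lines):
    """Turn md5sum output lines into (digest, filename) pairs."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        digest, _, rest = line.partition(" ")
        # After the first space md5sum prints ' ' (text) or '*' (binary),
        # followed directly by the file name.
        name = rest[1:] if rest[:1] in (" ", "*") else rest
        entries.append((digest, name))
    return entries


def find_outliers(entries):
    """Return (most_common_digest, entries whose digest differs from it)."""
    counts = Counter(digest for digest, _ in entries)
    if not counts:
        return None, []
    common, _ = counts.most_common(1)[0]
    return common, [(d, n) for d, n in entries if d != common]
```

Reading msufs10_a.md5sum with open() and passing the file object to parse_md5sum_lines, then to find_outliers, should leave only the truncated final files in the outlier list. Any other file listed there deserves a closer look.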
Patrol Read
If there is time, it is worth completing one full patrol read pass. This may show media errors, or even result in a disk being ejected from an array.
Patrol reads can be initiated using omconfig:
omconfig storage controller controller=0 action=startpatrolread
See that patrol read is finished:
[root@msufs10 ~]# grep 99.99 /var/log/lsi_0605.log
06/04/10 21:10:29: EVT#04040-06/04/10 21:10:29: 94=Patrol Read progress on PD 22(e0x29/s7) is 99.99%(65429s)
06/04/10 21:17:43: EVT#04049-06/04/10 21:17:43: 94=Patrol Read progress on PD 24(e0x29/s9) is 99.99%(327s)
06/04/10 21:21:28: EVT#04055-06/04/10 21:21:28: 94=Patrol Read progress on PD 25(e0x29/s1) is 99.99%(552s)
06/04/10 21:21:46: EVT#04056-06/04/10 21:21:46: 94=Patrol Read progress on PD 28(e0x29/s11) is 99.99%(570s)
06/04/10 21:23:49: EVT#04061-06/04/10 21:23:49: 94=Patrol Read progress on PD 2a(e0x36/s2) is 99.99%(693s)
06/04/10 21:24:08: EVT#04062-06/04/10 21:24:08: 94=Patrol Read progress on PD 2c(e0x36/s4) is 99.99%(712s)
06/04/10 21:24:41: EVT#04063-06/04/10 21:24:41: 94=Patrol Read progress on PD 33(e0x36/s0) is 99.99%(745s)
06/04/10 21:25:00: EVT#04064-06/04/10 21:25:00: 94=Patrol Read progress on PD 23(e0x29/s6) is 99.99%(764s)
06/04/10 21:26:08: EVT#04065-06/04/10 21:26:08: 94=Patrol Read progress on PD 2b(e0x36/s5) is 99.99%(832s)
06/04/10 21:28:06: EVT#04066-06/04/10 21:28:06: 94=Patrol Read progress on PD 21(e0x29/s8) is 99.99%(950s)
06/04/10 21:28:27: EVT#04067-06/04/10 21:28:27: 94=Patrol Read progress on PD 1d(e0x29/s2) is 99.99%(971s)
06/04/10 21:28:45: EVT#04068-06/04/10 21:28:45: 94=Patrol Read progress on PD 2e(e0x36/s8) is 99.99%(989s)
06/04/10 21:29:51: EVT#04069-06/04/10 21:29:51: 94=Patrol Read progress on PD 20(e0x29/s3) is 99.99%(1055s)
06/04/10 21:30:11: EVT#04070-06/04/10 21:30:11: 94=Patrol Read progress on PD 1e(e0x29/s5) is 99.99%(1075s)
06/04/10 21:31:01: EVT#04071-06/04/10 21:31:01: 94=Patrol Read progress on PD 32(e0x36/s1) is 99.99%(1125s)
06/04/10 21:31:58: EVT#04073-06/04/10 21:31:58: 94=Patrol Read progress on PD 35(e0x36/s11) is 99.99%(1182s)
06/04/10 21:32:31: EVT#04074-06/04/10 21:32:31: 94=Patrol Read progress on PD 27(e0x29/s10) is 99.99%(1215s)
06/04/10 21:32:59: EVT#04075-06/04/10 21:32:59: 94=Patrol Read progress on PD 26(e0x29/s0) is 99.99%(1243s)
06/04/10 21:33:25: EVT#04076-06/04/10 21:33:25: 94=Patrol Read progress on PD 1f(e0x29/s4) is 99.99%(1269s)
06/04/10 21:34:08: EVT#04077-06/04/10 21:34:08: 94=Patrol Read progress on PD 34(e0x36/s10) is 99.99%(1312s)
06/04/10 21:34:26: EVT#04078-06/04/10 21:34:26: 94=Patrol Read progress on PD 31(e0x36/s9) is 99.99%(1330s)
06/04/10 21:35:53: EVT#04080-06/04/10 21:35:53: 94=Patrol Read progress on PD 30(e0x36/s6) is 99.99%(1417s)
06/04/10 21:36:51: EVT#04081-06/04/10 21:36:51: 94=Patrol Read progress on PD 2d(e0x36/s3) is 99.99%(1475s)
06/04/10 21:37:21: EVT#04082-06/04/10 21:37:21: 94=Patrol Read progress on PD 2f(e0x36/s7) is 99.99%(1505s)
06/04/10 22:35:55: EVT#04129-06/04/10 22:35:55: 94=Patrol Read progress on PD 03(e0x0f/s2) is 99.99%(5019s)
06/04/10 22:55:15: EVT#04130-06/04/10 22:55:15: 94=Patrol Read progress on PD 0d(e0x0f/s10) is 99.99%(6179s)
06/04/10 22:57:30: EVT#04131-06/04/10 22:57:30: 94=Patrol Read progress on PD 0a(e0x0f/s9) is 99.99%(6314s)
06/04/10 22:57:42: EVT#04132-06/04/10 22:57:42: 94=Patrol Read progress on PD 19(e0x1c/s0) is 99.99%(6326s)
06/04/10 22:59:42: EVT#04133-06/04/10 22:59:42: 94=Patrol Read progress on PD 05(e0x0f/s4) is 99.99%(6446s)
06/04/10 23:01:08: EVT#04134-06/04/10 23:01:08: 94=Patrol Read progress on PD 0b(e0x0f/s1) is 99.99%(6532s)
06/04/10 23:03:59: EVT#04135-06/04/10 23:03:59: 94=Patrol Read progress on PD 17(e0x1c/s9) is 99.99%(6703s)
06/04/10 23:04:11: EVT#04136-06/04/10 23:04:11: 94=Patrol Read progress on PD 0e(e0x0f/s11) is 99.99%(6715s)
06/04/10 23:04:34: EVT#04137-06/04/10 23:04:34: 94=Patrol Read progress on PD 14(e0x1c/s8) is 99.99%(6738s)
06/04/10 23:04:57: EVT#04138-06/04/10 23:04:57: 94=Patrol Read progress on PD 0c(e0x0f/s0) is 99.99%(6761s)
06/04/10 23:05:03: EVT#04139-06/04/10 23:05:03: 94=Patrol Read progress on PD 09(e0x0f/s6) is 99.99%(6767s)
06/04/10 23:06:19: EVT#04140-06/04/10 23:06:19: 94=Patrol Read progress on PD 16(e0x1c/s6) is 99.99%(6843s)
06/04/10 23:06:23: EVT#04141-06/04/10 23:06:23: 94=Patrol Read progress on PD 07(e0x0f/s8) is 99.99%(6847s)
06/04/10 23:06:50: EVT#04142-06/04/10 23:06:50: 94=Patrol Read progress on PD 08(e0x0f/s7) is 99.99%(6874s)
06/04/10 23:08:12: EVT#04143-06/04/10 23:08:12: 94=Patrol Read progress on PD 06(e0x0f/s3) is 99.99%(6956s)
06/04/10 23:08:34: EVT#04144-06/04/10 23:08:34: 94=Patrol Read progress on PD 04(e0x0f/s5) is 99.99%(6978s)
06/04/10 23:12:42: EVT#04145-06/04/10 23:12:42: 94=Patrol Read progress on PD 12(e0x1c/s4) is 99.99%(7226s)
06/04/10 23:13:26: EVT#04146-06/04/10 23:13:26: 94=Patrol Read progress on PD 15(e0x1c/s7) is 99.99%(7270s)
06/04/10 23:13:32: EVT#04147-06/04/10 23:13:32: 94=Patrol Read progress on PD 1b(e0x1c/s11) is 99.99%(7276s)
06/04/10 23:14:19: EVT#04148-06/04/10 23:14:19: 94=Patrol Read progress on PD 10(e0x1c/s2) is 99.99%(7323s)
06/04/10 23:15:07: EVT#04149-06/04/10 23:15:07: 94=Patrol Read progress on PD 13(e0x1c/s3) is 99.99%(7371s)
06/04/10 23:16:35: EVT#04150-06/04/10 23:16:35: 94=Patrol Read progress on PD 11(e0x1c/s5) is 99.99%(7459s)
06/04/10 23:18:08: EVT#04151-06/04/10 23:18:08: 94=Patrol Read progress on PD 18(e0x1c/s1) is 99.99%(7552s)
06/04/10 23:19:31: EVT#04152-06/04/10 23:19:31: 94=Patrol Read progress on PD 1a(e0x1c/s10) is 99.99%(7635s)
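With this many disks it is easy to miss one that never reached 99.99%. As a rough sketch, the progress lines can be parsed and the last reported percentage kept for each PD. The regular expression is derived only from the log lines shown above, and the function names are hypothetical.

```python
import re

# Matches e.g. "Patrol Read progress on PD 22(e0x29/s7) is 99.99%(65429s)"
PROGRESS_RE = re.compile(
    r"Patrol Read progress on PD ([0-9a-fA-F]+)"
    r"\(e0x([0-9a-fA-F]+)/s(\d+)\) is (\d+(?:\.\d+)?)%\((\d+)s\)"
)


def latest_progress(lines):
    """Return {pd_id: (percent, seconds)} using the last line seen per PD."""
    progress = {}
    for line in lines:
        match = PROGRESS_RE.search(line)
        if match:
            pd_id, _encl, _slot, percent, seconds = match.groups()
            progress[pd_id] = (float(percent), int(seconds))
    return progress


def unfinished(progress, threshold=99.99):
    """List the PD ids whose last reported progress is below threshold."""
    return sorted(pd for pd, (pct, _) in progress.items() if pct < threshold)
```

Feeding the exported log through latest_progress and then unfinished gives the PDs that still need attention. An empty list means every PD that reported progress got to 99.99%. A disk that never logged any progress will not appear at all, so the count of entries should also be compared against the number of disks.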
Dump controller logs and look for errors
Use omconfig to dump the controller's internal log:
omconfig storage controller controller=0 action=exportlog
Then look for a file /var/log/lsi_MMDD.log
Some strings to "grep -i" for: "medium", "error", "warning", "unexpected"
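The same case-insensitive search can be done in one pass over the exported log. This is only a sketch equivalent to running "grep -i" once per string. The keyword list comes from the line above, and the function name is an assumption.

```python
KEYWORDS = ("medium", "error", "warning", "unexpected")


def matching_lines(lines, keywords=KEYWORDS):
    """Yield (line_number, line) for lines containing any keyword.

    Matching is case-insensitive, like grep -i, and line numbers start at 1.
    """
    lowered = [k.lower() for k in keywords]
    for number, line in enumerate(lines, start=1):
        text = line.lower()
        if any(k in text for k in lowered):
            yield number, line.rstrip("\n")
```

Opening the /var/log/lsi_MMDD.log file and iterating over matching_lines(f) prints candidates together with their line numbers, which makes it easier to go back and read the surrounding context.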
Device IDs
The controller uses hexadecimal IDs for the enclosures (shelves) and physical disks. On the test setup with four 12-disk shelves, the log shows this:
T29: Total Device = 52
T29: PD Flags State Type Size S N Vendor Product Rev P C ID SAS Addr Port Phy DevH BFw BRev
T29: --- -------- ----- ---- -------- - - -------- ---------------- ---- - - -- ---------------- ---- --- ---- ---- ----
T29: 3 f1c0000f 00020 00 e8e088af 0 0 0 SEAGATE ST32000444SS KS65 0 0 0d 5000c500103604ba 00 18 0d NA NA
T29: 1 0 15 5000c500103604b9 01 18 15
T29: 4 f1c0000f 00020 00 e8e088af 0 0 0 SEAGATE ST32000444SS KS65 0 0 0e 5000c5001044d6c6 00 19 0e NA NA
T29: 1 0 16 5000c5001044d6c5 01 19 16
T29: 5 f1c0000f 00020 00 e8e088af 0 0 0 SEAGATE ST32000444SS KS65 0 0 0f 5000c5001044ff7e 00 1a 0f NA NA
T29: 1 0 17 5000c5001044ff7d 01 1a 17
.
.
.
T29: 36 01c0000f 00020 0d 0 0 0 0 DELL MD1200 1.01 0 0 6b 500c04f2a1a932bd 00 24 6b NA NA
T29: 1 0 78 500c04f2a1a9323d 01 24 78
T29: 100 00400005 00020 03 0 0 0 0 LSI SMP/SGPIO/SEP 0729 0 0 ffff 0 00 ff 00 NA NA
Lower in the logs, you can find a set of entries that include both the PD index in HEX and the Enclosure ID/Slot number:
T29: EVT#01168-T29: 91=Inserted: PD 03(e0x0f/s2)
T29: EVT#01169-T29: 247=Inserted: PD 03(e0x0f/s2) Info: enclPd=0f, scsiType=0, portMap=10, sasAddr=5000c500103604ba,5000c500103604b9
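To map a log entry to a physical location, the "PD xx(e0xYY/sN)" token can be decoded as follows. This sketch assumes that the PD index and enclosure ID are hexadecimal, as stated above, and that the slot number is decimal, which is inferred from values such as s10 and s11 in the logs. The class and function names are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class PdLocation(NamedTuple):
    pd_index: int      # controller's physical disk index
    enclosure_id: int  # enclosure (shelf) ID
    slot: int          # slot within the enclosure


# Matches e.g. "PD 03(e0x0f/s2)"
PD_RE = re.compile(r"PD ([0-9a-fA-F]+)\(e0x([0-9a-fA-F]+)/s(\d+)\)")


def parse_pd(text: str) -> Optional[PdLocation]:
    """Decode the first PD token in text, or return None if there is none."""
    match = PD_RE.search(text)
    if match is None:
        return None
    return PdLocation(
        pd_index=int(match.group(1), 16),      # hex
        enclosure_id=int(match.group(2), 16),  # hex
        slot=int(match.group(3)),              # assumed decimal
    )
```

For the first entry above, parse_pd returns PdLocation(pd_index=3, enclosure_id=15, slot=2), that is, disk index 0x03 in enclosure 0x0f, slot 2.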
Medium Errors
05/31/10 1:42:03: EVT#01493-05/31/10 1:42:03: 47=Background Initialization corrected medium error (VD 01/1 at a75eef08, PD 17(e0x1c/s9) at a75eef08)
05/31/10 1:42:03: EVT#01494-05/31/10 1:42:03: 47=Background Initialization corrected medium error (VD 01/1 at a75eef09, PD 17(e0x1c/s9) at a75eef09)
05/31/10 1:42:03: EVT#01495-05/31/10 1:42:03: 47=Background Initialization corrected medium error (VD 01/1 at a75eef47, PD 17(e0x1c/s9) at a75eef47)
05/31/10 1:42:03: EVT#01496-05/31/10 1:42:03: 47=Background Initialization corrected medium error (VD 01/1 at a75eef6d, PD 17(e0x1c/s9) at a75eef6d)
06/04/10 19:39:34: DEV_REC:Medium Error DevId[21] devHandle 48 RDM=807ab800 retires=0
06/04/10 19:39:34: prCallback: Medium Error on pd=21, StartLba=a7856f9b, ErrLba=a7856fda
06/04/10 19:39:34: EVT#03907-06/04/10 19:39:34: 110=Corrected medium error during recovery on PD 21(e0x29/s8) at a7856fda
06/04/10 19:39:34: EVT#03908-06/04/10 19:39:34: 93=Patrol Read corrected medium error on PD 21(e0x29/s8) at a7856fda
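A per-disk tally of these events helps spot a drive that is accumulating errors. The sketch below counts "corrected medium error" events per PD. The pattern is based only on the event lines shown above, and the function name is hypothetical. Note that it counts events, not distinct sectors: the same LBA can be logged by more than one event, as with events 110 and 93 above.

```python
import re
from collections import Counter

# Matches the EVT lines above, e.g.
#   "... corrected medium error (VD 01/1 at a75eef08, PD 17(e0x1c/s9) at ..."
#   "... Corrected medium error during recovery on PD 21(e0x29/s8) at ..."
MEDIUM_RE = re.compile(
    r"corrected medium error.*?"
    r"PD ([0-9a-fA-F]+)\(e0x([0-9a-fA-F]+)/s(\d+)\)",
    re.IGNORECASE,
)


def corrected_errors_per_pd(lines):
    """Count corrected-medium-error events, keyed like '17(e0x1c/s9)'."""
    counts = Counter()
    for line in lines:
        match = MEDIUM_RE.search(line)
        if match:
            pd_id, encl, slot = match.groups()
            counts[f"{pd_id}(e0x{encl}/s{slot})"] += 1
    return counts
```

Run against the eight lines above, this reports four events for PD 17(e0x1c/s9) and two for PD 21(e0x29/s8). The DEV_REC and prCallback lines are skipped because they do not contain the word "corrected".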
Update firmware
Update firmware to current levels.
Have contacted Dell about the corrected medium errors on new disks. There is no warranty replacement until (or unless) a drive is ejected from the array by the controller.
--
TomRockwell - 07 Jun 2010