Current Mood: Here bugs…
This is more for my reference than a real post, but I’m doing sod all with the blog and typing this out will help me think it through.
When a hard drive errors, the messages can be… cryptic, to say the least. So let's interpret a few of them.
Other error? How very helpful.
The first port of call is your raid card's status command; megasasctl is a good example. Sometimes it will be helpful and say “Bazzilion media errors!!1” or something with a similar meaning. Other times, not so much:
a0e252s1 558GiB a0d1 online errs: media:0 other:42
This tells us that the drive in slot 1 of adaptor 0 has suffered no media errors, but has suffered 42 “other” errors. How wonderfully helpful.
Media errors here are simple enough. Replace the drive and surface scan the faulty one. “Other” errors require more digging.
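As a side note, if you want a monitoring script to keep an eye on those counters, pulling them apart is easy enough. This is only a sketch: it assumes megasasctl keeps to the one-line-per-drive format shown above, and the field names in it are mine.

import re

# Pull the per-drive error counters out of a megasasctl status line so a
# monitoring script can flag anything with a non-zero count.
STATUS_RE = re.compile(
    r"a(?P<adapter>\d+)e(?P<enclosure>\d+)s(?P<slot>\d+)\s+.*"
    r"errs:\s*media:(?P<media>\d+)\s+other:(?P<other>\d+)"
)

def parse_status_line(line):
    """Return (adapter, enclosure, slot, media_errs, other_errs) or None."""
    m = STATUS_RE.search(line)
    if not m:
        return None
    return tuple(int(m.group(k)) for k in ("adapter", "enclosure", "slot", "media", "other"))

print(parse_status_line("a0e252s1 558GiB a0d1 online errs: media:0 other:42"))
# (0, 252, 1, 0, 42)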
Sense key messages
One common message type looks something like this:
04/18/12 15:01:58: EVT#01115-04/18/12 15:01:58: 113=Unexpected sense: PD 05(e0xfc/s1) Path 4433221101000000, CDB: 28 00 2c 9e c7 30 00 00 d0 00, Sense: b/00/00
04/18/12 15:01:58: Raw Sense for PD 5: 70 00 0b 00 00 00 00 0a 00 00 00 00 00 00 00 00 00 00
You can obtain similar messages from dmesg, but this one comes direct from the raid card’s logs. If you’re using a megaraid sas card such as one from the 9240/9260/9280 range, megacli -FwTermLog -Dsply -a0 is your friend. To explain the cryptic parts…
- PD 05(e0xfc/s1) = Physical Disk 05, enclosure 0xFC (that’s 252), slot 1. Since this chassis has a multilane connection, it groups drives into sets of 4… 0-3, 4-7, 8-11, 12-15 if it’s a 16 bay.
It’s telling us disk 5 for slot 1 because the cables aren’t connected in the same order on both ends. The enclosure ID does mean something: it refers to which physical box the drives are in, but in this case it’s actually redundant since they’re all on the same backplane.
- Path 4433221101000000 = this is the SAS Address of the drive
- CDB: 28 00 2c 9e c7 30 00 00 d0 00 = Command Descriptor Block. This is the request that the Sense Key relates to.
The first byte, 28, is the main command. Here we are dealing with the extended 10 byte read instruction, Read(10), which carries a 32 bit LBA to cope with drives up to 2TiB.
It’s a request to read 0x00d0 (208) blocks starting from LBA 2c 9e c7 30. Not super helpful on its own, but it may hint at a problem.
To make sense of the string, you find the first byte in the table at http://en.wikipedia.org/wiki/SCSI_command#List_of_SCSI_commands and check out what extra data there is.
- Sense: b/00/00 = This is the Sense Key, followed by the Additional Sense Code and Qualifier. It is horribly cryptic. The first number is the type of error, the second two narrow down the actual fault. See http://en.wikipedia.org/wiki/Key_Code_Qualifier for a pretty good list. In this case, b means an Aborted Command, and the 00/00 means it doesn’t want to be more specific.
- Raw Sense = This is largely redundant here, but the format is covered in detail at http://en.wikipedia.org/wiki/SCSI_Request_Sense_Command, which also expands on the meaning of the first part of the Sense Key.
The 70 is the response code: a current error reported in fixed format, with the top “valid” bit clear, meaning the information field (bytes 3 to 6) isn’t populated.
The 0b is byte 2: the flag bits in its top nibble are all zero, and the low nibble, b, is the sense key, the first part of the Sense Key above.
The 0a in byte 7 tells us how long the rest of the message is… 0xa, so 10 more bytes.
All the other fields are zeros … mostly indicating a lack of more information.
So we know that it’s an aborted read command with no special reasons.
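To save working through that by hand every time, here's a small Python sketch of the same decoding. The byte offsets come straight from the Read(10) and fixed-format sense layouts linked above; the function names and structure are just mine.

def decode_read10(cdb_hex):
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    assert cdb[0] == 0x28, "not a Read(10)"
    lba = int.from_bytes(cdb[2:6], "big")     # 32 bit LBA lives in bytes 2-5
    length = int.from_bytes(cdb[7:9], "big")  # transfer length in blocks, bytes 7-8
    return lba, length

def decode_fixed_sense(sense_hex):
    s = bytes.fromhex(sense_hex.replace(" ", ""))
    valid = bool(s[0] & 0x80)                 # top bit: information field is populated
    response_code = s[0] & 0x7f               # 0x70 = current error, fixed format
    sense_key = s[2] & 0x0f
    info = int.from_bytes(s[3:7], "big")      # usually the failing LBA when valid is set
    asc, ascq = s[12], s[13]
    return valid, response_code, sense_key, info, asc, ascq

print(decode_read10("28 00 2c 9e c7 30 00 00 d0 00"))
# (748603184, 208) -> LBA 0x2c9ec730, 208 blocks
print(decode_fixed_sense("70 00 0b 00 00 00 00 0a 00 00 00 00 00 00 00 00 00 00"))
# (False, 112, 11, 0, 0, 0) -> no info field, sense key 0xb (aborted command), ASC/ASCQ 00/00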
Physical Drive information
Just a quick side note actually.
You probably want to grab a copy of the drive information at this point. Good options for this include megacli -pdinfo -physdrv '[252:1]' -a0 for the MegaRAID cards mentioned above, hdparm -i /dev/sdc if it’s directly connected (it doesn’t work if the drive in question is part of a raid logical drive), or perhaps smartctl -a /dev/sdc as a fall back.
Mainly you want the drive serial number for verifying the correct drive is being replaced, but it may possibly tell you something you didn’t already know. It doesn’t this time, but ah well.
Oh, and since it’s worth mentioning… megacli doesn’t make the serial quite as obvious as one would like. The line you need is the one labelled “Inquiry Data”… this, depending on what the drive actually has set, will contain some combination of the drive maker, the model number, the serial number, and the hardware revision. Not the firmware revision though; that is listed a few lines above. This particular drive is a 600GB raptor, so it looks something like this (I’ve redacted it, just on the off chance there is something that could be done with it…)
Inquiry Data: WD-WXL1C10*****WDC WD6000HLHX-01J**** 04.05G04
The serial number for this drive is the bit starting at WXL and ending with the bit I’ve replaced with asterisks. “WD-” and “WDC” are vendor references (Western Digital Corporation, obviously) and WD6000HLHX is the model number. The bit starting 01J is a hardware revision. Not entirely sure what the 04.05G04 bit is; that string doesn’t show up anywhere on the label.
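If you're doing this often, it saves a bit of squinting to pull that line out programmatically. A quick sketch, assuming megacli is on the path and labels the field exactly as above; working out which chunk is the serial is still left to the eyeball since, as noted, the layout varies by drive.

import re
import subprocess

def inquiry_data(enclosure=252, slot=1, adapter=0):
    """Grab the Inquiry Data line for one physical drive from megacli."""
    out = subprocess.run(
        ["megacli", "-pdinfo", "-physdrv", "[%d:%d]" % (enclosure, slot), "-a%d" % adapter],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"Inquiry Data:\s*(.+)", out)
    return match.group(1).strip() if match else None

print(inquiry_data(252, 1, 0))
# e.g. WD-WXL1C10*****WDC WD6000HLHX-01J**** 04.05G04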
Linux Kernel errors
Usually drive errors are accompanied by errors from the storage layer… this is no exception:
[403034.114416] sd 4:2:1:0: [sdc] Unhandled error code
[403034.114422] sd 4:2:1:0: [sdc] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[403034.114427] sd 4:2:1:0: [sdc] CDB: Read(10): 28 00 2c 9e c7 30 00 00 f8 00
[403034.114437] end_request: I/O error, dev sdc, sector 748603184
This actually tells us nothing of use, but let’s break it down anyway.
- Event occurred 403034 seconds after boot up. Very close timestamps are a good indication of linked events, even if the text of the messages doesn’t say so.
- sd 4:2:1:0 = This is the kernel’s path to the scsi device, in host:channel:target:lun form. The only important thing about it here is the 4… that’s the scsi host number the raid card was given, so it lets you match these messages to the right adaptor.
- hostbyte=DID_ERROR = Defined in linux/include/scsi/scsi.h, the hostbyte is the response from the low level kernel driver. In this case it’s an internal error (there’s a small lookup sketch after this list).
Since this error has come from megaraid_sas we can make an educated guess that it came from one of the two references in that driver. It’s possible it came from elsewhere, but we’ll focus there for the moment.
Both occur when a read or write attempt is made to a logical drive: the driver will emit DID_ERROR when doing so during array init, when the scsi io results in a failure, or when the firmware produces a status it doesn’t understand. In this context I’m going to go with the IO failure option.
- driverbyte=DRIVER_OK = Also defined in linux/include/scsi/scsi.h, the driverbyte is the mid level kernel driver’s status, essentially one up from the hostbyte. Not much to see here.
- CDB means exactly the same thing as the one we got from the raid card errors above. The only difference is that this is being fed to the raid card. The message is still a 28 extended read command, still for the same location, but it requested slightly more data.
- end_request: I/O error = quite possibly the most unhelpful error message ever. Interpret it literally: the request ended because of an I/O error of some type. This could be anything from a hammer to the drive, to cutting the cables, all the way up to a bug in the kernel’s adaptor driver.
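Those two bytes are easier to read with a little lookup table. The values below are cribbed from linux/include/scsi/scsi.h (only the common ones), and the shifts match the host_byte/driver_byte macros there; the function wrapper itself is just for illustration.

# Common host and driver byte values from linux/include/scsi/scsi.h.
HOST_BYTE = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",       # internal error, the one we hit here
    0x08: "DID_RESET",
    0x0b: "DID_SOFT_ERROR",
}

DRIVER_BYTE = {
    0x00: "DRIVER_OK",
    0x04: "DRIVER_ERROR",
    0x06: "DRIVER_TIMEOUT",
    0x08: "DRIVER_SENSE",    # sense data is available from the device
}

def decode_result(result):
    """Split the combined scsi result word into its host and driver bytes."""
    host = (result >> 16) & 0xff
    driver = (result >> 24) & 0xff
    return HOST_BYTE.get(host, hex(host)), DRIVER_BYTE.get(driver, hex(driver))

print(decode_result(0x00070000))   # ('DID_ERROR', 'DRIVER_OK')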
So class, what have we learnt? That the drive stubbornly refused to read that one sector for no readily apparent reason. Isn’t that nice of it.
We also learnt that the errors which look to have clear meanings actually mean little to nothing, and the strings of unimportant-looking numbers tell you lots.
From these messages I’m willing to accept the raid card’s statement that it was the drive’s fault, but it clearly isn’t a hardware fault: there is quite a list of specifically defined sense codes covering hardware failure, and we didn’t see any of them.
Next stop will likely be hard drive firmware. Again. Urg.
I’m quite certain this will be a “To Be Continued” series.
A full drive read test failed with a slew of b/00/00 senses and finally a 3/11/00 for CDB 28 00 2c 9e c6 e0 00 01 00 00.
Some of the interesting messages to look at here:
- 04/18/12 23:40:56: EVT#01174-04/18/12 23:40:56: 113=Unexpected sense: PD 05(e0xfc/s1) Path 4433221101000000, CDB: 28 00 2c 9e c6 e0 00 01 00 00, Sense: 3/11/00
Per the page I linked above… that’s a Medium Error – unrecovered read error. You’ll notice that although the LBA is close, it is not the same: 2c 9e c6 e0 here and 2c 9e c7 30 before. Since the requests are different sizes, the faulty address is going to be somewhere in the overlap (there’s a quick worked check of this after these log excerpts).
- 04/18/12 23:40:56: Raw Sense for PD 5: f0 00 03 2c 9e c7 a5 0a 00 00 00 00 11 00 00 00 00 00
The f0 in the first byte deserves note. Looking back at the sense format, the top bit of the first byte is actually a “valid” flag that says whether the information field (bytes 3 to 6) holds meaningful data. The b/00/00 response had a first byte of 0111 0000, so no information, whereas this 3/11/00 has 1111 0000: the same response code (current error in fixed format) but with the information field filled in.
Otherwise the response follows the same pattern as before: 70 for the response code (ignoring the valid bit), the sense key in byte 2, the additional sense code 11/00 down in bytes 12 and 13, and the remaining length still 0x0a. Helpfully, the information field now tells you the actual location of the fault: 2c 9e c7 a5, just inside the previous request.
- 04/18/12 23:40:56: DEV_REC:Medium Error DevId devHandle c RDM=80875400 retires=5
Not sure what RDM is… could be interesting to know.
- 04/18/12 23:40:56: MedErr is for: cmdId=b, ld=1, src=1, cmd=1, lba=2c9ec6e0, cnt=100, rmwOp=0
Note the LBA here is the one requested…
- 04/18/12 23:40:56: ErrLBAOffset (c5) LBA(2c9ec6e0) BadLba=2c9ec7a5
04/18/12 23:40:56: Read MED ERR!!! ErrLBA(c5) LBA(2c9ec6e0) TotalBlks (100)
These lines give us both the request start and the actual bad address; they even helpfully provide the offset of the bad block within the request (2c9ec6e0 + c5 = 2c9ec7a5).
- 04/18/12 23:40:56: MPTCMD: Issue Read ME Recovery RDM=80875400 Cmd 1 Flags 60000 Rec State 17 Retry 1
04/18/12 23:40:56: DM_REC:Completing recovery SM DevId 5 pRdm 80875400 CQC 0 WQC 0
There’s that RDM again. I’ve not found anything about these messages. Please leave a comment if you have any leads. RDM, CQC and WQC must have specific meanings, but the only hit I’ve found is that RDM might be Resource Description Message.
- 04/18/12 23:40:56: EVT#01175-04/18/12 23:40:56: 111=Unrecoverable medium error during recovery on PD 05(e0xfc/s1) at 2c9ec7a5
I don’t like those words… Unrecoverable medium error. Time for a new disk, methinks.
- 04/18/12 23:40:56: ldbbmIssue: pd=5 pdLba=2c9ec7a5 ld=1 ldLba=2c9ec7a5
One thing you’ll find about hard drive errors… they’re very repetitive. Worth mentioning though: this line lists the bad LBA in terms of both the physical disk and the logical drive it is part of.
- 04/18/12 23:40:56: EVT#01176-04/18/12 23:40:56: 271=Uncorrectable medium error logged for VD 01/1 at 2c9ec7a5 (on PD 05(e0xfc/s1) at 2c9ec7a5)
04/18/12 23:40:56: GroupCmds: Fail cmd c=c0584260 ld=1 start_block=2c9ec6e0 num_blocks=100 cmd=1
After this there are a few more assorted variations of the second of these two lines as the card tries various read sizes around this area. Remember, all numbers are in hex.
- 04/18/12 23:52:02: ldbbmListClear: clearing medErrlba=2c9ec7a5 for ld=1
04/18/12 23:52:02: EVT#01177-04/18/12 23:52:02: 272=VD medium error corrected on VD 01/1 at 2c9ec7a5
04/18/12 23:52:02: EVT#01178-04/18/12 23:52:02: 335=VD bad block table on VD 01/1 is cleared
Hmmm, I’m not sure if I entirely believe it … we’ll see what the scan says.
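To tie the addresses in these excerpts together, here's the overlap check flagged a few lines up spelled out; it's just a worked sanity check using the values taken from the log lines above.

# All values are LBAs in hex, taken from the log excerpts above.
req1_start, req1_len = 0x2c9ec730, 0x0d0   # the first failing Read(10)
req2_start, req2_len = 0x2c9ec6e0, 0x100   # the read that returned 3/11/00
bad_lba = 0x2c9ec7a5                       # from the sense information field

overlap = (max(req1_start, req2_start),
           min(req1_start + req1_len, req2_start + req2_len))
print([hex(x) for x in overlap])           # ['0x2c9ec730', '0x2c9ec7e0']
print(overlap[0] <= bad_lba < overlap[1])  # True: the bad block sits in the overlap
print(hex(bad_lba - req2_start))           # 0xc5, matching the ErrLBAOffset above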
In dmesg there is the following, most of which you should now be able to understand:
[10408.863522] sd 4:2:1:0: [sdc] Unhandled error code
[10408.863526] sd 4:2:1:0: [sdc] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[10408.863531] sd 4:2:1:0: [sdc] CDB: Read(10): 28 00 2c 9e c6 e0 00 01 00 00
[10408.863540] end_request: I/O error, dev sdc, sector 748603104
[10408.863560] quiet_error: 48 callbacks suppressed
[10408.863562] Buffer I/O error on device sdc, logical block 93575388
[10408.863582] Buffer I/O error on device sdc, logical block 93575389
[10408.863599] Buffer I/O error on device sdc, logical block 93575390
[10408.863615] Buffer I/O error on device sdc, logical block 93575391
[10408.863631] Buffer I/O error on device sdc, logical block 93575392
[10408.863648] Buffer I/O error on device sdc, logical block 93575393
[10408.863664] Buffer I/O error on device sdc, logical block 93575394
[10408.863681] Buffer I/O error on device sdc, logical block 93575395
[10408.863697] Buffer I/O error on device sdc, logical block 93575396
[10408.863713] Buffer I/O error on device sdc, logical block 93575397
[10408.942534] sd 4:2:1:0: [sdc] Unhandled error code
[10408.942538] sd 4:2:1:0: [sdc] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[10408.942542] sd 4:2:1:0: [sdc] CDB: Read(10): 28 00 2c 9e c7 a0 00 00 08 00
[10408.942551] end_request: I/O error, dev sdc, sector 748603296
[10408.991669] sd 4:2:1:0: [sdc] Unhandled error code
[10408.991673] sd 4:2:1:0: [sdc] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[10408.991677] sd 4:2:1:0: [sdc] CDB: Read(10): 28 00 2c 9e c7 a0 00 00 08 00
[10408.991685] end_request: I/O error, dev sdc, sector 748603296
[10409.053380] sd 4:2:1:0: [sdc] Unhandled error code
[10409.053384] sd 4:2:1:0: [sdc] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[10409.053387] sd 4:2:1:0: [sdc] CDB: Read(10): 28 00 2c 9e c7 a0 00 00 08 00
[10409.053396] end_request: I/O error, dev sdc, sector 748603296
[11073.849734] sd 4:2:1:0: [sdc] Unhandled error code
[11073.849742] sd 4:2:1:0: [sdc] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[11073.849750] sd 4:2:1:0: [sdc] CDB: Read(10): 28 00 2c 9e c7 10 00 00 d0 00
[11073.849759] end_request: I/O error, dev sdc, sector 748603152
[11073.935696] raid5_end_read_request: 25 callbacks suppressed
[11073.935701] md/raid:md0: read error corrected (8 sectors at 748603152 on sdc)
[11073.935707] md/raid:md0: read error corrected (8 sectors at 748603160 on sdc)
[11073.935711] md/raid:md0: read error corrected (8 sectors at 748603168 on sdc)
[11073.935714] md/raid:md0: read error corrected (8 sectors at 748603176 on sdc)
[11073.935717] md/raid:md0: read error corrected (8 sectors at 748603184 on sdc)
[11073.935720] md/raid:md0: read error corrected (8 sectors at 748603192 on sdc)
[11073.935724] md/raid:md0: read error corrected (8 sectors at 748603200 on sdc)
[11073.935727] md/raid:md0: read error corrected (8 sectors at 748603208 on sdc)
[11073.935730] md/raid:md0: read error corrected (8 sectors at 748603216 on sdc)
[11073.935733] md/raid:md0: read error corrected (8 sectors at 748603224 on sdc)
The “Unhandled error code” groups (with their Result, CDB and end_request lines) are variations on what has been covered in the earlier sections of this post. The other two runs, the Buffer I/O errors and the md/raid corrections, suggest a reasonable scope for the unreadable area of the disk.
I’m not sure why they are listed as buffer areas, but you can clearly see the first run covers blocks 93575388 to 93575397.
The second run is from soft raid trying to repair the dead blocks, 1 block being 8 sectors. 93,575,397 x 8 = 748,603,176 as it happens.
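If you want to double-check that mapping yourself, the conversion is trivial; a sketch, assuming the usual 4KiB filesystem blocks sitting on 512 byte sectors.

SECTORS_PER_BLOCK = 8   # 4096 byte blocks / 512 byte sectors

def block_to_sector(block):
    return block * SECTORS_PER_BLOCK

print(block_to_sector(93575388))   # 748603104, the first end_request sector above
print(block_to_sector(93575397))   # 748603176, as quoted above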
The two “callbacks suppressed” lines nicely demonstrate the kernel’s ability to rate limit its logging. Sadly in the process we’ve lost useful information, but we can surmise that those 25 suppressed callbacks from raid5 are corrections for 25 contiguous 8 sector blocks. Can’t be sure what all 48 callbacks suppressed by quiet_error are, though some will be further Buffer I/O errors.
The final tally from megasasctl is 1 media error and 35 other errors. That drive is definitely not happy. Time to replace and break out the drive diagnostics.