249936 – Kernel 2.6.26 doesn't report a bad block correctly for some usb-scsi devices

Status:	RESOLVED INVALID

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	All Linux

Importance:	High major
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

Reported:	2008-12-05 14:43 UTC by Peter Fox
Modified:	2008-12-07 23:36 UTC
CC List:	2 users

Description Peter Fox 2008-12-05 14:43:16 UTC

This is for tracking item 2 of comment 19 of bug 248698.

I have a disk with a known bad sector. When I attempt to access it through one type of USB-IDE adaptor, the kernel correctly reports the fault:
Dec  4 20:01:52 cool fox: About to do dd of a known bad sector
Dec  4 20:01:54 cool sd 16:0:0:0: [sdf] Result: hostbyte=0x00 driverbyte=0x08
Dec  4 20:01:54 cool sd 16:0:0:0: [sdf] Sense Key : 0x3 [current] 
Dec  4 20:01:54 cool sd 16:0:0:0: [sdf] ASC=0x11 ASCQ=0x0
Dec  4 20:01:54 cool end_request: I/O error, dev sdf, sector 3805344
Dec  4 20:01:54 cool Buffer I/O error on device sdf, logical block 475668
Dec  4 20:01:55 cool sd 16:0:0:0: [sdf] Result: hostbyte=0x00 driverbyte=0x08
Dec  4 20:01:55 cool sd 16:0:0:0: [sdf] Sense Key : 0x3 [current] 
Dec  4 20:01:55 cool sd 16:0:0:0: [sdf] ASC=0x11 ASCQ=0x0
Dec  4 20:01:55 cool end_request: I/O error, dev sdf, sector 3805344
Dec  4 20:01:55 cool Buffer I/O error on device sdf, logical block 475668
Dec  4 20:01:55 cool fox: dd done

A different type of USB disk adaptor doesn't report an error, yet still takes more than 2 seconds
to provide some data (which is rubbish, of course):
Dec  4 20:03:51 cool fox: About to do dd of a known bad sector
Dec  4 20:03:53 cool sd 17:0:0:0: [sdf] Sense Key : 0x0 [current] 
Dec  4 20:03:53 cool sd 17:0:0:0: [sdf] ASC=0x0 ASCQ=0x0
Dec  4 20:03:53 cool fox: dd done
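
As a rough cross-check of the first log, the sector number and the logical block number it reports are consistent with each other. The Python sketch below assumes 512-byte sectors and a 4096-byte buffer block size; both values are inferred from the 8:1 ratio between the two numbers and are not stated anywhere in the log. The function names are made up for illustration.

```python
SECTOR_SIZE = 512  # assumption: the kernel counts "sector" in 512-byte units


def sector_to_logical_block(sector: int, block_size: int = 4096) -> int:
    """Map a 512-byte sector number to the buffer-cache block that contains it."""
    return sector * SECTOR_SIZE // block_size


def sector_to_byte_offset(sector: int) -> int:
    """Byte offset of the start of the sector on the device."""
    return sector * SECTOR_SIZE


if __name__ == "__main__":
    bad_sector = 3805344  # from "end_request: I/O error, dev sdf, sector 3805344"
    print("logical block:", sector_to_logical_block(bad_sector))
    print("byte offset:  ", sector_to_byte_offset(bad_sector))
```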

Comment 1 archibald haddock 2008-12-07 19:54:07 UTC

Did you check the behaviour of the unknown adapter with other disks that have bad blocks? Maybe it is the adapter that is doing strange things with bad blocks.

Comment 2 Peter Fox 2008-12-07 22:16:28 UTC

That's a good question, and it might be, but unfortunately I only have the one disk with known bad blocks, so cannot answer it.

I was hoping someone skilled in the art of analysing USB messages could look at the messages in bug 248698 attachment 174263 to see which side is at fault. If the usb-scsi adaptor reported an error that was misinterpreted, giving the ASC=0x0 ASCQ=0x0 messages, then the kernel is at fault; if the adaptor didn't report an error at all, then the adaptor is at fault.

Comment 3 DEMAINE Benoît-Pierre, aka DoubleHP 2008-12-07 22:56:03 UTC

In either case, this does not seem Gentoo/distro specific. Try to reproduce with a vanilla kernel; if you can, then this is a generic Linux problem => upstream.

If it's a "kernel" problem, go upstream.

If the problem is due to a patch applied by gentoo-sources, then state this clearly in the subject.

It's the first time I have seen a bug here that complains about a feature without any precise package or ebuild name and number. We are missing:
- exact kernel version
- emerge --info
- where you took the kernel from (gentoo-sources, vanilla, or manually downloaded? and, if applicable, the ebuild number)
- .config
- lsusb
- lsusb -nt

What proves to you that this is a bad sector? I had a case where a disk did not have bad blocks, but a USB protocol mistake was interpreted as one ... The only way to check whether there is a bad block is to take the physical disk out of the box and plug it directly into a "real" (S)ATA adapter that is compliant with SMART and possibly DMA. Then check the disk with appropriate tools.

Maybe the disk is safe, and the error comes from the device claiming the disk is faulty.

... unless you can prove I am wrong :)

Of course, if the disk is recent, and if you test it with SMART enabled, the firmware is likely to correct the problem by itself (in which case the SMART logs should show a non-zero reallocation counter).

Comment 4 Daniel Drake (RETIRED) gentoo-dev 2008-12-07 23:19:30 UTC

DoubleHP, you're right, we should send this upstream. But actually I'm quite involved in Linux USB and wanted to look at this myself. Based on the discussion here and in the other bug, I trust that the disk does actually have a bad sector.

Here we go:

Starting with the Cypress device to refresh my memory, here are my annotations to the relevant part of the logs:

READ(10) command, sector 3805344, len 4096 (8 sectors)
f5b39dc0 17.425097 S Bo:1:012:2 - 31 = 55534243 2d000000 00100000 80000a28 00003a10 a0000008 00000000 000000
f5b39dc0 17.425244 C Bo:1:012:2 0 31 >
f5b635c0 17.425251 S Bi:1:012:6 - 4096 <

short read with error EREMOTEIO
f5b635c0 19.323464 C Bi:1:012:6 -121 512 = 81260000 82260000 83260000 84260000 85260000 86260000 87260000 88260000
f5b39dc0 19.323475 S Bi:1:012:6 - 13 <

EPIPE (stalled endpoint)
f5b39dc0 19.324211 C Bi:1:012:6 -32 0

clear halt
f5b39dc0 19.324215 S Co:1:012:0 s 02 01 0000 0086 0000 0
f5b39dc0 19.325211 C Co:1:012:0 0 0

read CSW
f5b39dc0 19.325213 S Bi:1:012:6 - 13 <
residue and command FAILURE
f5b39dc0 19.326210 C Bi:1:012:6 0 13 = 55534253 2d000000 00100000 01

request sense
f5b39dc0 19.326216 S Bo:1:012:2 - 31 = 55534243 2e000000 12000000 80000603 00000012 00000000 00000000 000000
f5b39dc0 19.326336 C Bo:1:012:2 0 31 >
f5b635c0 19.326341 S Bi:1:012:6 - 18 <

sense data: response 0x70, sense key 3, additional sense length 0xa, ASC 0x11,
ASCQ 0 = unrecovered read error
f5b635c0 19.327210 C Bi:1:012:6 0 18 = 70000300 0000000a 00000000 11000000 0000
f5b39dc0 19.327214 S Bi:1:012:6 - 13 <
f5b39dc0 19.327335 C Bi:1:012:6 0 13 = 55534253 2e000000 00000000 00


It then repeats the read and fails with the same unrecovered read error.
Now to check the other one.
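
The first annotation above can be reproduced by decoding the 31-byte Command Block Wrapper (CBW) from the usbmon line. The Python sketch below assumes the standard USB mass-storage Bulk-Only Transport layout (little-endian header fields, then the SCSI CDB) and the READ(10) CDB layout (big-endian LBA and block count). The function name `parse_cbw` and the dictionary keys are made up for illustration.

```python
import struct


def parse_cbw(hex_words: str) -> dict:
    """Decode a Bulk-Only Transport CBW that carries a READ(10) CDB."""
    raw = bytes.fromhex("".join(hex_words.split()))
    if len(raw) != 31:
        raise ValueError("a CBW is exactly 31 bytes, got %d" % len(raw))

    # Header: signature, tag, data transfer length, flags, LUN, CDB length.
    signature, tag, data_len, flags, lun, cdb_len = struct.unpack_from("<4sIIBBB", raw, 0)
    if signature != b"USBC":
        raise ValueError("bad CBW signature: %r" % signature)

    cdb = raw[15:15 + cdb_len]
    if cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB (opcode 0x%02x)" % cdb[0])

    # READ(10): big-endian LBA in bytes 2..5, block count in bytes 7..8.
    (lba,) = struct.unpack_from(">I", cdb, 2)
    (blocks,) = struct.unpack_from(">H", cdb, 7)
    return {
        "tag": tag,
        "data_len": data_len,
        "device_to_host": bool(flags & 0x80),
        "lun": lun,
        "opcode": cdb[0],
        "lba": lba,
        "blocks": blocks,
    }


if __name__ == "__main__":
    # CBW sent to the Cypress device, copied from the trace above.
    cypress_cbw = ("55534243 2d000000 00100000 80000a28 "
                   "00003a10 a0000008 00000000 000000")
    for key, value in parse_cbw(cypress_cbw).items():
        print(key, "=", value)
```

The decoded LBA matches the sector reported by the kernel, and the data transfer length matches the 4096-byte bulk-in read that the host submits next.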

Comment 5 Daniel Drake (RETIRED) gentoo-dev 2008-12-07 23:36:22 UTC

For the "unknown" device:

READ(10) command, sector 3805344, len 4096 (8 sectors)
f48c13c0 28.169737 S Bo:1:013:2 - 31 = 55534243 21000000 00100000 80000a28 00003a10 a0000008 00000000 000000
f48c13c0 28.169771 C Bo:1:013:2 0 31 >
f4927d40 28.169804 S Bi:1:013:1 - 4096 <

short read with error -EREMOTEIO
f4927d40 30.687942 C Bi:1:013:1 -121 1026 = 48008126 00008226 00008326 00008426 00008526 00008626 00008726 00008826

get CSW
f48c13c0 30.687953 S Bi:1:013:1 - 13 <
residue and failure
f48c13c0 30.688439 C Bi:1:013:1 0 13 = 55534253 21000000 fe0b0000 01

request sense
f48c13c0 30.688444 S Bo:1:013:2 - 31 = 55534243 22000000 12000000 80000603 00000012 00000000 00000000 000000
f48c13c0 30.689439 C Bo:1:013:2 0 31 >
f4927d40 30.689444 S Bi:1:013:1 - 18 <

sense data: response 0x70, sense key 0, asc=0 ascq=0
f4927d40 30.690439 C Bi:1:013:1 0 18 = 70000000 0000000a 00000000 00000000 0000
f48c13c0 30.690443 S Bi:1:013:1 - 13 <
f48c13c0 30.690564 C Bi:1:013:1 0 13 = 55534253 22000000 00000000 00
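
To compare the two adaptors, the 13-byte Command Status Wrappers (CSW) from both traces can be decoded side by side. The Python sketch below assumes the standard Bulk-Only Transport CSW layout (signature, tag, data residue, status, all little-endian); `parse_csw` and `CSW_STATUS` are illustrative names, not kernel identifiers.

```python
import struct

# Status byte values defined for the Bulk-Only Transport CSW.
CSW_STATUS = {0: "command passed", 1: "command failed", 2: "phase error"}


def parse_csw(hex_words: str) -> dict:
    """Decode a 13-byte Bulk-Only Transport CSW from usbmon hex words."""
    raw = bytes.fromhex("".join(hex_words.split()))
    if len(raw) != 13:
        raise ValueError("a CSW is exactly 13 bytes, got %d" % len(raw))
    signature, tag, residue, status = struct.unpack("<4sIIB", raw)
    if signature != b"USBS":
        raise ValueError("bad CSW signature: %r" % signature)
    return {
        "tag": tag,
        "residue": residue,
        "status": status,
        "status_text": CSW_STATUS.get(status, "reserved"),
    }


if __name__ == "__main__":
    requested = 4096  # data transfer length from the CBW
    traces = {
        "cypress": "55534253 2d000000 00100000 01",
        "unknown": "55534253 21000000 fe0b0000 01",
    }
    for name, words in traces.items():
        csw = parse_csw(words)
        print(name, csw["status_text"],
              "residue", csw["residue"],
              "implied transfer", requested - csw["residue"])
```

Both devices report "command failed". The Cypress CSW carries a residue equal to the whole request, while the unknown device's residue is the request minus the 1026 bytes it actually sent.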


Conclusion: the device transferred some data (1026 of the 4096 bytes requested), then sent a CSW with a residue of 0x0bfe (3070 bytes, the part it did not deliver) and a "command failed" status. When the kernel tries to retrieve the error information (sense data), it gets sense key 0:
"NO SENSE: Indicates that there is no specific sense key information to be reported. This may occur for a successful command"

I don't know what to make of this. This is the behaviour you would expect when the device did not encounter a problem. But the spec only says "may"; it doesn't imply that sense key 0 always means success.
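
The difference between the two REQUEST SENSE replies can be made explicit by decoding both 18-byte buffers. The Python sketch below assumes SCSI fixed-format sense data (response code 0x70 or 0x71, sense key in the low nibble of byte 2, ASC in byte 12, ASCQ in byte 13). The lookup table covers only the two sense keys seen in these traces, and `parse_fixed_sense` is an illustrative name.

```python
# Only the sense keys that appear in the two traces; not a complete table.
SENSE_KEYS = {0x0: "NO SENSE", 0x3: "MEDIUM ERROR"}


def parse_fixed_sense(hex_words: str) -> dict:
    """Decode fixed-format SCSI sense data from usbmon hex words."""
    raw = bytes.fromhex("".join(hex_words.split()))
    if len(raw) < 14:
        raise ValueError("need at least 14 bytes to reach ASC/ASCQ")
    response_code = raw[0] & 0x7F
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data: 0x%02x" % response_code)
    sense_key = raw[2] & 0x0F
    return {
        "response_code": response_code,
        "sense_key": sense_key,
        "sense_key_text": SENSE_KEYS.get(sense_key, "other"),
        "additional_length": raw[7],
        "asc": raw[12],
        "ascq": raw[13],
    }


if __name__ == "__main__":
    traces = {
        "cypress": "70000300 0000000a 00000000 11000000 0000",
        "unknown": "70000000 0000000a 00000000 00000000 0000",
    }
    for name, words in traces.items():
        sense = parse_fixed_sense(words)
        print("%s: key=%s asc=0x%02x ascq=0x%02x" % (
            name, sense["sense_key_text"], sense["asc"], sense["ascq"]))
```

Sense key 3 with ASC 0x11 and ASCQ 0 is the unrecovered read error from the Cypress device; the unknown device returns all zeros despite its CSW reporting a failed command.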

I think this is potentially a bug, in that there was some evidence of error (in the CSW), and dd should have failed the same way both times (which I'm presuming it didn't?).

However, I also think your device is being a little less than compliant with the SCSI specs.

If you want to take this further, you should write to the linux-scsi@vger.kernel.org mailing list (no subscription required, send email in plain text). Include the annotations I made above. They will have more of a clue than me.