248698 – [2.6.27 regression] long/infinite loop when bad sectors are encountered

Summary: [2.6.27 regression] long/infinite loop when bad sectors are encountered

Status:	RESOLVED OBSOLETE

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	All Linux

Importance:	High major
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

URL:	http://bugzilla.kernel.org/show_bug.c...
Whiteboard:	linux-2.6.27-regression watch-linux-b...
Keywords:

Depends on:
Blocks:

Reported:	2008-11-24 22:36 UTC by Peter Fox
Modified:	2011-06-28 10:18 UTC
CC List:	1 user

See Also:
Package list:
Runtime testing required:	---

Attachments
.config for 2.6.26-gentoo-r3 (kernel-2.6.26.config,49.70 KB, text/plain) 2008-11-27 06:54 UTC, Peter Fox
.config for 2.6.27-gentoo-r2 (kernel-2.6.27.config,51.36 KB, text/plain) 2008-11-27 06:55 UTC, Peter Fox
lsusb output (lsusb.txt,186 bytes, text/plain) 2008-11-27 06:59 UTC, Peter Fox
usb fix (patch,3.20 KB, patch) 2008-11-27 21:50 UTC, Sergey Ovcharenko
Results of using dd to copy the second track of the dodgy disk 2 kernels, 2 usb-ide converters (usb-scsi.tgz,66.14 KB, application/octet-stream) 2008-12-01 21:28 UTC, Peter Fox
Test results for 2 disks in 2 usb adaptors with 2 kernels on a different computer (usb-scsi.tgz,201.18 KB, application/octet-stream) 2008-12-03 22:09 UTC, Peter Fox
A simple test reading a known bad block on 2 usb adaptors (usbdisk2.tgz,15.42 KB, application/octet-stream) 2008-12-04 20:35 UTC, Peter Fox
Simple test reading known bad block using kernel 2.6.27-r3 (usbdisk27.tgz,42.58 KB, application/octet-stream) 2008-12-04 21:46 UTC, Peter Fox
A couple of tries reading the bad sector with kernel 2.6.28-rc8 (usbdisk28.tgz,29.30 KB, application/octet-stream) 2008-12-12 15:41 UTC, Peter Fox

Description Peter Fox 2008-11-24 22:36:26 UTC

Attempting to dd a usb disk image takes forever. I've been trying to copy a 2G disk over usb-2.0, and it has taken >24 hours to copy 750M.
dmesg is full of messages like:
sd 9:0:0:0: [sdf] Sense Key : 0x0 [current] 
sd 9:0:0:0: [sdf] ASC=0x0 ASCQ=0x0
sd 9:0:0:0: [sdf] Sense Key : 0x0 [current] 
sd 9:0:0:0: [sdf] ASC=0x0 ASCQ=0x0
sd 9:0:0:0: [sdf] Sense Key : 0x0 [current] 
sd 9:0:0:0: [sdf] ASC=0x0 ASCQ=0x0


Reproducible: Always

Steps to Reproduce:
1. Insert usb disk
2. dd from the device (I'm using ddrescue)

Actual Results:  
Copy is really really slow, though sometimes goes faster.

Expected Results:  
Copy at close to USB 2.0 bandwidth.

The kernel has a bug fix which should be included in gentoo-sources-2.6.27-rx

Comment 1 Sergey Ovcharenko 2008-11-26 21:17:26 UTC

Did you try any older kernels?

Post your kernel .config and lsusb -v please

Comment 2 Peter Fox 2008-11-27 06:53:18 UTC

With kernel 2.6.26-r3 it read the whole disk in less than 24 hours, with 'no errors'. I've now put the disk in a different usb controller and am getting read errors. Interestingly, the first controller had all the data offset by 2 bytes (with either kernel). The new controller (only tried with 2.6.26) seems to return different data from the old one, even allowing for the offset. I should mention that the disk is suspected faulty, as it came from a windows PC that wouldn't boot and had been scandisked.

Comment 3 Peter Fox 2008-11-27 06:54:24 UTC

Created attachment 173559 [details]
.config for 2.6.26-gentoo-r3

Comment 4 Peter Fox 2008-11-27 06:55:32 UTC

Created attachment 173560 [details]
.config for 2.6.27-gentoo-r2

Comment 5 Peter Fox 2008-11-27 06:59:54 UTC

Created attachment 173561 [details]
lsusb output

The cypress device (device 3) is the one which actually gives read errors with the duff disk, and has only been tried with kernel 2.6.26 with the faulty disk. Device 4 is the one which seems to offset the data by 2 bytes and never give an error, tried with both kernels, but much slower with kernel 2.6.27.

Comment 6 Sergey Ovcharenko 2008-11-27 21:50:26 UTC

Created attachment 173613 [details, diff]
usb fix

This patch should fix the problem.
Please apply it and tell us if it worked.

Comment 7 Sergey Ovcharenko 2008-11-27 22:29:34 UTC

Sorry, but never mind my comment #6 and the patch.
It would be great if you could post some output captured with usbmon.
Instructions are in the kernel source file
Documentation/usb/usbmon.txt.

Comment 8 Peter Fox 2008-12-01 21:28:10 UTC

Created attachment 174002 [details]
Results of using dd to copy the second track of the dodgy disk 2 kernels, 2 usb-ide converters

This tarball includes the script used to generate the various outputs.
I used dd to copy the second track from my dodgy disk using two different kernels (2.6.26-gentoo-r3 and 2.6.27-gentoo-r3), two different usb-ide converters (Venus (cypress device) and Newlink (unknown device)), and two different block sizes (a sector at a time, a track at a time).
The cypress chipset seems to give consistent answers in all cases, the unknown chipset gives different answers even with different block sizes.

Comment 9 Harrison Metzger 2008-12-02 08:02:34 UTC

Did you say this happens with all USB storage devices (both harddrives and flash drives) or just that one harddrive in particular?

Comment 10 Peter Fox 2008-12-02 17:43:29 UTC

It's one hard disk in two different adaptors. The cypress chipset based adaptor takes IDE disks only. The other takes IDE or SATA. The disk I'm seeing this trouble with is a 10 year old WD Caviar 22000 (CHS 3876/16/63); I haven't tried any others for this problem. But I don't think the disk has any problem reading data from the second track (the bad blocks are much further in: sector 662528 is the first bad one in the list using ddrescue after a few attempts).

I could repeat the tests on my laptop, which is USB1.1 only, or on another PC, or try another hard disk, if you like.
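As a rough sanity check on the geometry quoted above, the arithmetic can be sketched as follows. This assumes 512-byte sectors and that the ddrescue sector number is a zero-based count of such sectors; neither is stated explicitly in the thread, and the function names are illustrative only.

```python
SECTOR_BYTES = 512  # assumption: the 512 x 63 track size used in the tests implies this


def chs_capacity(cylinders, heads, sectors_per_track):
    """Total number of addressable sectors for a CHS geometry."""
    return cylinders * heads * sectors_per_track


def lba_to_chs(lba, heads, sectors_per_track):
    """Convert a zero-based sector number to (cylinder, head, sector).

    The sector component is 1-based, following the usual CHS convention.
    """
    cylinder, rest = divmod(lba, heads * sectors_per_track)
    head, sector0 = divmod(rest, sectors_per_track)
    return cylinder, head, sector0 + 1


total = chs_capacity(3876, 16, 63)
print(total, total * SECTOR_BYTES)   # sectors and bytes for CHS 3876/16/63
print(lba_to_chs(662528, 16, 63))    # first bad sector in the ddrescue list
```

Under those assumptions the disk holds 3907008 sectors (about 2.0 GB, matching the "2G disk" in the description), and sector 662528 lands in cylinder 657, far from the second track (sectors 63 to 125) used in the dd tests.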

Comment 11 Markos Chandras (RETIRED) gentoo-dev 2008-12-03 12:14:42 UTC

This bug is closed upstream as FIXED. The patch will be on kernel mainline before 2.6.28-rc6. 

I think we should close it for now

Comment 12 Daniel Drake (RETIRED) gentoo-dev 2008-12-03 13:04:11 UTC

Markos, the above kernel bug references a different device, the patch will not help (it only touches upon different USB IDs) and we are also not even sure if the upstream bug is related.

Next step is to find someone upstream to go through the usbmon logs and determine if it's the same problem on another device. I'll do that though, as I have involvement and knowledge in this area.

Comment 13 Daniel Drake (RETIRED) gentoo-dev 2008-12-03 13:20:24 UTC

Peter,

We need to see the usbmon output recorded during the time that you plug the device in (or load the usb-storage module) in order to determine if this bug is related to the upstream one that you link to.

However, I'm suspecting that it's not. And yes, it would be helpful if you could test with different computers and different disks. This may just be some weird intermittent hardware problem with the specific disk that is not worth looking into.

Also, testing under the latest development kernel (currently v2.6.28-rc7) would be useful.

Comment 14 Peter Fox 2008-12-03 22:09:14 UTC

Created attachment 174198 [details]
Test results for 2 disks in 2 usb adaptors with 2 kernels on a different computer

These results were made on a different PC, and also added a different IDE disk into the mix. The compare scripts in the results show that there are 3 different ideas of the data on the wd disk:
1. All kernels and dd blocksizes give the same data for the cypress chipset.
2. Both kernels give the same data for the unknown chipset with per sector blocks (512 x 63).
3. Both kernels give the same data for the unknown chipset with a single track block (32256 x 1).
With the other disk, identical results were obtained for all kernels, usb adaptors and block sizes.
Conclusions - kernels 2.6.26 and 2.6.27 are the same. Something weird happens with the western digital disk in the unknown adaptor, in that the data received is always wrong, and how it is wrong depends on the dd block size. Is the adaptor initialised incorrectly, or is it simply broken?
The files diskplug.txt record the usbmon output on disk insertion. In the case of the cypress chipset, the adaptor was already powered, and the cable simply inserted; in the case of the unknown chipset, the cable was connected, then the adaptor powered up.
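The comparison described above (the same track read once per sector and once as a single block) can be sketched roughly like this. It is not the script from the attachment: `read_track` and `compare_block_sizes` are hypothetical names, and the device path is whatever node the disk appears as. Reads through the page cache may be merged or served from memory by the kernel, so this only approximates what dd with different `bs` values sends to the adaptor.

```python
import hashlib

SECTOR_BYTES = 512
SECTORS_PER_TRACK = 63
TRACK_BYTES = SECTOR_BYTES * SECTORS_PER_TRACK  # 32256


def read_track(path, track_index, block_size):
    """Read one track from `path` using requests of `block_size` bytes."""
    chunks = []
    # buffering=0 makes every read() a separate system call
    with open(path, "rb", buffering=0) as dev:
        dev.seek(track_index * TRACK_BYTES)
        remaining = TRACK_BYTES
        while remaining > 0:
            chunk = dev.read(min(block_size, remaining))
            if not chunk:  # end of device, or image shorter than one track
                break
            chunks.append(chunk)
            remaining -= len(chunk)
    return b"".join(chunks)


def compare_block_sizes(path, track_index=1):
    """Return SHA-256 digests of one track read per sector and per track."""
    per_sector = read_track(path, track_index, SECTOR_BYTES)
    per_track = read_track(path, track_index, TRACK_BYTES)
    return (hashlib.sha256(per_sector).hexdigest(),
            hashlib.sha256(per_track).hexdigest())
```

On a healthy disk and adaptor the two digests should be identical; a mismatch corresponds to cases 2 and 3 in the list above. `track_index=1` selects the second track, as in the dd tests.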

Comment 15 Daniel Drake (RETIRED) gentoo-dev 2008-12-04 14:47:54 UTC

(In reply to comment #14)
> Conclusions - kernels 2.6.26 and 2.6.27 are the same. Something weird happens
> with the western digital disk in the unknown adaptor, in that the data received
> is always wrong

Wait, now this sounds like something different from comment #0. This bug is about slow access to your USB hard disk, whereas it was fast in 2.6.26. Please file other bugs for other issues.

Are you still experiencing that 2.6.27 is significantly slower than 2.6.26 at reading your disk?

In comment #0 you show some error messages. This is likely related to the slowness. Do these errors appear when plugging the device in, or doing your simple dd read tests?
Because in order to diagnose them we need to have usbmon output for the points in time when the error messages appear.

Comment 16 Peter Fox 2008-12-04 20:35:41 UTC

Created attachment 174263 [details]
A simple test reading a known bad block on 2 usb adaptors

I'm not sure what the problem is any more. These test results capture dmesg and usbmon output while the bad disk is inserted in each of the two disk controllers and a known bad block (sector 3805344) is then read.

The cypress controller correctly reports the bad block, and dd fails without copying the sector:
Dec  4 20:01:52 cool fox: About to do dd of a known bad sector
Dec  4 20:01:54 cool sd 16:0:0:0: [sdf] Result: hostbyte=0x00 driverbyte=0x08
Dec  4 20:01:54 cool sd 16:0:0:0: [sdf] Sense Key : 0x3 [current] 
Dec  4 20:01:54 cool sd 16:0:0:0: [sdf] ASC=0x11 ASCQ=0x0
Dec  4 20:01:54 cool end_request: I/O error, dev sdf, sector 3805344
Dec  4 20:01:54 cool Buffer I/O error on device sdf, logical block 475668
Dec  4 20:01:55 cool sd 16:0:0:0: [sdf] Result: hostbyte=0x00 driverbyte=0x08
Dec  4 20:01:55 cool sd 16:0:0:0: [sdf] Sense Key : 0x3 [current] 
Dec  4 20:01:55 cool sd 16:0:0:0: [sdf] ASC=0x11 ASCQ=0x0
Dec  4 20:01:55 cool end_request: I/O error, dev sdf, sector 3805344
Dec  4 20:01:55 cool Buffer I/O error on device sdf, logical block 475668
Dec  4 20:01:55 cool fox: dd done
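The log above names the same failure in two units: sector 3805344 and logical block 475668. The ratio is exactly 8, which is consistent with 512-byte sectors and 4096-byte buffer blocks. The 4096 figure is inferred from the numbers in the log, not stated anywhere in this report, so treat this sketch as an assumption-laden cross-check.

```python
SECTOR_BYTES = 512
BUFFER_BLOCK_BYTES = 4096  # assumption inferred from the 8:1 ratio in the log


def sector_to_logical_block(sector):
    """Map a 512-byte sector number to the buffer block that contains it."""
    return sector * SECTOR_BYTES // BUFFER_BLOCK_BYTES


def logical_block_to_sectors(block):
    """Return the range of sector numbers covered by one buffer block."""
    per_block = BUFFER_BLOCK_BYTES // SECTOR_BYTES
    first = block * per_block
    return range(first, first + per_block)


print(sector_to_logical_block(3805344))
print(list(logical_block_to_sectors(475668)))
```

With these assumptions, block 475668 covers sectors 3805344 to 3805351, so the bad sector is the first one in its block.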

The unknown controller doesn't report an error, yet still takes >2 seconds to provide some data (which is rubbish of course):
Dec  4 20:03:51 cool fox: About to do dd of a known bad sector
Dec  4 20:03:53 cool sd 17:0:0:0: [sdf] Sense Key : 0x0 [current] 
Dec  4 20:03:53 cool sd 17:0:0:0: [sdf] ASC=0x0 ASCQ=0x0
Dec  4 20:03:53 cool fox: dd done
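The difference between the two excerpts is easier to see once the sense data is decoded. The sketch below covers only the values that appear in this report, using the standard SCSI names for them (sense key 0x3 is MEDIUM ERROR, ASC/ASCQ 0x11/0x00 is UNRECOVERED READ ERROR, and 0x0 with 0x00/0x00 means no sense information at all). `decode_sense` is a hypothetical helper, not a kernel or library function.

```python
import re

# Only the codes seen in this report; the full tables are much larger.
SENSE_KEYS = {0x0: "NO SENSE", 0x3: "MEDIUM ERROR"}
ADDITIONAL = {
    (0x00, 0x00): "NO ADDITIONAL SENSE INFORMATION",
    (0x11, 0x00): "UNRECOVERED READ ERROR",
}

KEY_RE = re.compile(r"Sense Key : (0x[0-9a-fA-F]+)")
ASC_RE = re.compile(r"ASC=(0x[0-9a-fA-F]+) ASCQ=(0x[0-9a-fA-F]+)")


def decode_sense(log_text):
    """Pair up 'Sense Key' and 'ASC/ASCQ' lines and return readable names."""
    keys = [int(m.group(1), 16) for m in KEY_RE.finditer(log_text)]
    ascs = [(int(m.group(1), 16), int(m.group(2), 16))
            for m in ASC_RE.finditer(log_text)]
    return [
        (SENSE_KEYS.get(k, f"unknown key {k:#x}"),
         ADDITIONAL.get(a, f"unknown ASC/ASCQ {a[0]:#04x}/{a[1]:#04x}"))
        for k, a in zip(keys, ascs)
    ]


cypress = ("sd 16:0:0:0: [sdf] Sense Key : 0x3 [current] \n"
           "sd 16:0:0:0: [sdf] ASC=0x11 ASCQ=0x0\n")
unknown = ("sd 17:0:0:0: [sdf] Sense Key : 0x0 [current] \n"
           "sd 17:0:0:0: [sdf] ASC=0x0 ASCQ=0x0\n")
print(decode_sense(cypress))
print(decode_sense(unknown))
```

Decoded this way, the cypress adaptor reports a medium error for the sector, while the unknown adaptor returns sense data that carries no error information.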

The reason I thought the transfers were very slow is because it was taking seconds per sector with no error reported. A google search on ASC=0x0 ASCQ=0x0 is what led me to the original URL with the kernel bug I thought I had.

It may well be that in the presence of faulty disks with some usb disk controllers kernel 2.6.27 will keep trying to re-read the sector, but I'm more interested in why this other disk controller doesn't correctly report the bad block. Not reporting a bad block can lead to massive filesystem corruption after a short time. 

Is there anything in the usbmon traces that shows the disk controller reported the faulty block, or did it pretend it was all ok? In other words, is it the kernel or my hardware at fault?

Please feel free to change the subject again to better reflect these ongoing developments (it's hard to get the subject right when entering the bug).

Comment 17 Peter Fox 2008-12-04 21:36:04 UTC

I've just tried my latest script (reading a known bad block) on a machine running 2.6.27-r3, and the unknown controller is stuck in a loop, repeating the pair of lines that appeared only once under 2.6.26-r3:
sd 5:0:0:0: [sdb] Sense Key : 0x0 [current] 
sd 5:0:0:0: [sdb] ASC=0x0 ASCQ=0x0
sd 5:0:0:0: [sdb] Sense Key : 0x0 [current] 
sd 5:0:0:0: [sdb] ASC=0x0 ASCQ=0x0
sd 5:0:0:0: [sdb] Sense Key : 0x0 [current] 
sd 5:0:0:0: [sdb] ASC=0x0 ASCQ=0x0
sd 5:0:0:0: [sdb] Sense Key : 0x0 [current] 
....
If it doesn't finish in the next few minutes I'll kill it.
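To put a number on the looping, the repeated pairs can be counted from a saved dmesg capture. This is a minimal sketch that assumes the message format shown above; `count_no_sense_pairs` is a hypothetical helper and the input would be the text of a dmesg or syslog dump.

```python
import re
from collections import Counter

# One "Sense Key : 0x0" line followed by the matching "ASC=0x0 ASCQ=0x0"
# line for the same SCSI address and disk name.
PAIR_RE = re.compile(
    r"sd (\S+): \[(\w+)\] Sense Key : 0x0 \[current\]\s*\n"
    r".*sd \1: \[\2\] ASC=0x0 ASCQ=0x0"
)


def count_no_sense_pairs(dmesg_text):
    """Count complete 'no sense' pairs per disk name (for example 'sdb')."""
    return Counter(m.group(2) for m in PAIR_RE.finditer(dmesg_text))


sample = (
    "sd 5:0:0:0: [sdb] Sense Key : 0x0 [current] \n"
    "sd 5:0:0:0: [sdb] ASC=0x0 ASCQ=0x0\n"
) * 3 + "sd 5:0:0:0: [sdb] Sense Key : 0x0 [current] \n"
print(count_no_sense_pairs(sample))
```

A count of one per read attempt matches the 2.6.26 behaviour described earlier; a count that keeps growing for a single read is the loop seen here.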

Comment 18 Peter Fox 2008-12-04 21:46:04 UTC

Created attachment 174273 [details]
Simple test reading known bad block using kernel 2.6.27-r3

This shows the continuous kernel looping when reading a bad block on the unknown usb disk adaptor with kernel gentoo-sources-2.6.27-r3. It eventually finished; I don't know whether it finally read the block correctly.

Comment 19 Peter Fox 2008-12-04 22:01:21 UTC

To summarise the problems as I see them:

1. Kernel 2.6.27 loops indefinitely reading a bad block using the unknown manuf/prodid usb-ide disk controller (identified in the original url).

2. Kernel 2.6.26 doesn't report a bad block when reading a bad block using the unknown controller.

3. Both of the two kernels tested return incorrect data when reading the wd disk in the unknown controller, even on known good blocks. Yet it reads data correctly from a different disk (maxtor).

Items 2 and 3 are a recipe for corrupting filesystems.

Comment 20 Daniel Drake (RETIRED) gentoo-dev 2008-12-04 22:11:17 UTC

OK, to avoid further pollution of this bug, please open separate bugs for 2 and 3. This one is for the scsi problem where bad sectors are not handled very well in 2.6.27.

Comment 21 Peter Fox 2008-12-05 14:49:09 UTC

Raised bugs 249936 and 249938.

Comment 22 Peter Fox 2008-12-05 14:50:33 UTC

Magic hyperlinking didn't work.
Comment 19 item 2 is in bug 249936.
Comment 19 item 3 is in bug 249938.

Comment 23 Daniel Drake (RETIRED) gentoo-dev 2008-12-12 12:09:13 UTC

We're now only talking about the "unknown" controller as the cypress issues have been split off to another bug.

Alan Stern kindly examined the issues here:
http://marc.info/?l=linux-scsi&m=122903295704381&w=2

To summarise:
1. Your device is acting strangely, it is not reporting bad sectors as it should do, but Linux should do better. There's enough information to know that something is wrong.

2. We don't have an available solution for 2.6.27, unfortunately.

3. 2.6.28 should be better than ..27 but still with room for improvement, please could you test that? Latest release is 2.6.28-rc8

4. If 2.6.28 does indeed act as we think (instead of looping it will take up to 6 minutes to give up on the bad sector), and you're interested in taking this further, then we can raise the issue upstream.

Comment 24 Peter Fox 2008-12-12 15:41:47 UTC

Created attachment 175109 [details]
A couple of tries reading the bad sector with kernel 2.6.28-rc8

Although this doesn't loop forever, it still takes a long time, and fails to report an error.
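The combination described here (a read that takes a long time yet returns without an error) can be flagged from user space by timing a single-sector read. The sketch below is only an illustration: `timed_sector_read` and the 1-second threshold are hypothetical, and a read served from the page cache will look fast regardless of what the disk does.

```python
import os
import time

SECTOR_BYTES = 512
SLOW_SECONDS = 1.0  # hypothetical threshold, chosen for illustration only


def timed_sector_read(path, sector, slow_seconds=SLOW_SECONDS):
    """Read one sector and return (status, elapsed_seconds, errno).

    status is 'ok', 'slow', 'short' (fewer bytes than a sector), or 'error'.
    """
    start = time.monotonic()
    try:
        fd = os.open(path, os.O_RDONLY)
        try:
            data = os.pread(fd, SECTOR_BYTES, sector * SECTOR_BYTES)
        finally:
            os.close(fd)
    except OSError as exc:
        # The read was reported as failed, as with the cypress adaptor.
        return "error", time.monotonic() - start, exc.errno
    elapsed = time.monotonic() - start
    if len(data) < SECTOR_BYTES:
        return "short", elapsed, None
    # Data came back without an error; flag it if it took suspiciously long.
    status = "slow" if elapsed > slow_seconds else "ok"
    return status, elapsed, None
```

A 'slow' result on a sector that ddrescue lists as bad is the pattern this bug is about: the data returned cannot be trusted even though no I/O error was raised.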

Comment 25 Daniel Drake (RETIRED) gentoo-dev 2008-12-16 13:06:20 UTC

OK, thanks for testing. If you would like to follow up on this, here are the next steps:

Open a SCSI bug report at http://bugzilla.kernel.org. It's important to be concise but informative and not attach too many files (no tarballs). Here are the key points that I would emphasize:
 - Your setup is a disk (with known bad sectors) in a USB-IDE converter
 - The USB-IDE converter doesn't report errors from reading bad sectors in the "proper" way, but does give plenty of evidence that something is wrong and that the sector cannot be read
 - Alan Stern gave a nice summary (link to it) here: http://marc.info/?l=linux-scsi&m=122903295704381&w=2
 - 2.6.26 read the whole disk without reporting errors
 - 2.6.27 gets in a very long (or possibly infinite) loop
 - 2.6.28 gets a bit stuck on the bad sectors, taking several minutes to timeout, and still does not report error
 - The purpose of your bug report is to provide data and testing to aid improvements in the SCSI layer, so that the error can be detected.

And I would attach (as 2 separate files, no archives or compression please):
 1. dmesg from 2.6.28 after it gets "stuck" on one of the sectors
 2. corresponding usbmon output for (1)

If you do go ahead, please post the new bug URL here when done. Thanks!

Comment 26 Peter Fox 2008-12-16 21:16:44 UTC

http://bugzilla.kernel.org/show_bug.cgi?id=12240

Comment 27 Axel Dyks 2008-12-16 21:46:32 UTC

Thanks, we'll keep an eye on the upstream bug.