Summary: | [2.6.27 regression] long/infinite loop when bad sectors are encountered | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Peter Fox <gentoo> |
Component: | [OLD] Core system | Assignee: | Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel> |
Status: | RESOLVED OBSOLETE | ||
Severity: | major | CC: | hwoarang |
Priority: | High | ||
Version: | 2008.0 | ||
Hardware: | All | ||
OS: | Linux | ||
URL: | http://bugzilla.kernel.org/show_bug.cgi?id=12240 | ||
Whiteboard: | linux-2.6.27-regression watch-linux-bugzilla | ||
Package list: | Runtime testing required: | --- | |
Attachments: |
.config for 2.6.26-gentoo-r3
.config for 2.6.27-gentoo-r2
lsusb output
usb fix
Results of using dd to copy the second track of the dodgy disk 2 kernels, 2 usb-ide converters
Test results for 2 disks in 2 usb adaptors with 2 kernels on a different computer
A simple test reading a known bad block on 2 usb adaptors
Simple test reading known bad block using kernel 2.6.27-r3
A couple of tries reading the bad sector with kernel 2.6.28-rc8 |
Description
Peter Fox
2008-11-24 22:36:26 UTC
Did you try any older kernels? Post your kernel .config and lsusb -v please.
With kernel 2.6.26-r3 it read the whole disk in less than 24 hours, with 'no errors'. I've now put the disk in a different usb controller and am getting read errors. Interestingly the first controller had all the data offset by 2 bytes (with either kernel). The new controller (only tried with 2.6.26) seems to have different data from the old one, even allowing for the offset.
I should mention that the disk is suspected faulty, as it came from a Windows PC that wouldn't boot, and had been scandisked.
Created attachment 173559 [details]
.config for 2.6.26-gentoo-r3
Created attachment 173560 [details]
.config for 2.6.27-gentoo-r2
Created attachment 173561 [details]
lsusb output
The cypress device (device 3) is the one which actually gives read errors with the duff disk, and has only been tried with kernel 2.6.26 with the faulty disk. Device 4 is the one which seems to offset the data by 2 bytes and never give an error, tried with both kernels, but much slower with kernel 2.6.27.
Created attachment 173613 [details, diff]
usb fix
This patch should fix the problem.
Please apply it and tell us if it worked.
Sorry, but never mind my comment #6 and the patch. It would be great if you posted some info using usbmon. Instructions are in the kernel source file Documentation/usb/usbmon.txt.
Created attachment 174002 [details]
Results of using dd to copy the second track of the dodgy disk 2 kernels, 2 usb-ide converters
This tarball includes the script used to generate the various outputs.
I used dd to copy the second track from my dodgy disk using two different kernels (2.6.26-gentoo-r3 and 2.6.27-gentoo-r3), two different usb-ide converters (Venus (cypress device) and Newlink (unknown device)), and two different block sizes (a sector at a time, a track at a time).
The cypress chipset seems to give consistent answers in all cases; the unknown chipset gives different answers for different block sizes.
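As a rough illustration of the two read patterns described above, the sketch below builds the dd argument lists for one track read a sector at a time and as a single track-sized block. It assumes 63 sectors of 512 bytes per track (the geometry quoted elsewhere in this report) and zero-based track numbering; the device path, output names and the `dd_command` helper are hypothetical and are not taken from the attached script.

```python
SECTOR_BYTES = 512       # bytes per sector
SECTORS_PER_TRACK = 63   # from the CHS geometry quoted in this report


def dd_command(device, out_path, track, per_sector):
    """Build a dd argument list that copies one whole track.

    track is zero-based, so the "second track" is track=1.
    per_sector=True reads 63 blocks of 512 bytes,
    per_sector=False reads 1 block of 32256 bytes.
    """
    if per_sector:
        bs = SECTOR_BYTES
        count = SECTORS_PER_TRACK
        skip = track * SECTORS_PER_TRACK   # skip is counted in bs-sized blocks
    else:
        bs = SECTOR_BYTES * SECTORS_PER_TRACK
        count = 1
        skip = track
    return ["dd", f"if={device}", f"of={out_path}",
            f"bs={bs}", f"count={count}", f"skip={skip}"]


if __name__ == "__main__":
    # /dev/sdX is a placeholder, not a device named in the report.
    print(" ".join(dd_command("/dev/sdX", "track1_sectors.bin", 1, True)))
    print(" ".join(dd_command("/dev/sdX", "track1_whole.bin", 1, False)))
```

Both commands cover the same 32256 bytes of the disk, so any difference between the two output files comes from how the read was issued rather than from which data was requested.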
Did you say this happens with all USB storage devices (both harddrives and flash drives) or just that one harddrive in particular?
It's one hard disk in two different adaptors. The cypress chipset based adaptor takes IDE disks only. The other takes IDE or SATA. The disk I'm seeing this trouble with is a 10 year old WD Caviar 22000 (CHS 3876/16/63); I haven't tried any others for this problem. But I don't think the disk has any problem reading data from the second track (bad blocks are much further in: sector 662528 is the first bad one in the list using ddrescue after a few attempts). I could repeat the tests on my laptop, which is USB1.1 only, or on another PC, or try another hard disk, if you like.
This bug is closed upstream as FIXED. The patch will be in kernel mainline before 2.6.28-rc6. I think we should close it for now.
Markos, the above kernel bug references a different device, the patch will not help (it only touches upon different USB IDs) and we are also not even sure if the upstream bug is related. Next step is to find someone upstream to go through the usbmon logs and determine if it's the same problem on another device. I'll do that though, as I have involvement and knowledge in this area.
Peter, we need to see the usbmon output recorded during the time that you plug the device in (or load the usb-storage module) in order to determine if this bug is related to the upstream one that you link to. However, I'm suspecting that it's not. And yes, it would be helpful if you could test with different computers and different disks. This may just be some weird intermittent hardware problem with the specific disk that is not worth looking into. Also, testing under the latest development kernel (currently v2.6.28-rc7) would be useful.
Created attachment 174198 [details]
Test results for 2 disks in 2 usb adaptors with 2 kernels on a different computer
These results were made on a different PC, and also added a different IDE disk into the mix. The compare scripts in the results show that there are 3 different ideas of the data on the wd disk:
1. All kernels and dd blocksizes give the same data for the cypress chipset.
2. Both kernels give the same data for the unknown chipset with per sector blocks (512 x 63).
3. Both kernels give the same data for the unknown chipset with a single track block (32256 x 1).
With the other disk, identical results were obtained for all kernels, usb adaptors and block sizes.
Conclusions - kernels 2.6.26 and 2.6.27 are the same. Something weird happens with the western digital disk in the unknown adaptor, in that the data received is always wrong, and how it is wrong depends on the dd block size. Is the adaptor initialised incorrectly, or is it simply broken?
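One way to arrive at a count such as "3 different ideas of the data" is to hash every dump and group the byte-identical ones. The following is only a sketch of that idea, not the compare script from the attachment; the `group_identical` helper and the dump labels are hypothetical, and the byte strings stand in for real dump files.

```python
import hashlib
from collections import defaultdict


def group_identical(dumps):
    """Return lists of dump names whose contents are byte-identical.

    dumps maps a label (hypothetical, e.g. "cypress-2.6.26-sector")
    to the raw bytes read from the disk.
    """
    groups = defaultdict(list)
    for name, data in dumps.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return sorted(sorted(names) for names in groups.values())


if __name__ == "__main__":
    # Stand-in data: two identical dumps and two that differ.
    dumps = {
        "cypress-2.6.26": b"abc",
        "cypress-2.6.27": b"abc",
        "unknown-sector": b"xbc",
        "unknown-track": b"ybc",
    }
    for group in group_identical(dumps):
        print(group)
```

Each printed group is one distinct version of the data, so the number of groups is the number of disagreeing readings.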
The files diskplug.txt record the usbmon output on disk insertion. In the case of the cypress chipset, the adaptor was already powered, and the cable simply inserted; in the case of the unknown chipset, the cable was connected, then the adaptor powered up.
(In reply to comment #14)
> Conclusions - kernels 2.6.26 and 2.6.27 are the same. Something weird happens
> with the western digital disk in the unknown adaptor, in that the data received
> is always wrong
Wait, now this sounds like something different from comment #0. This bug is about slow access to your USB hard disk, whereas it was fast in 2.6.26. Please file other bugs for other issues.
Are you still experiencing that 2.6.27 is significantly slower than 2.6.26 at reading your disk?
In comment #0 you show some error messages. This is likely related to the slowness. Do these errors appear when plugging the device in, or doing your simple dd read tests? Because in order to diagnose them we need to have usbmon output for the points in time when the error messages appear.
Created attachment 174263 [details]
A simple test reading a known bad block on 2 usb adaptors
I'm not sure what the problem is any more. These test results capture dmesg and usbmon while inserting the bad disk in each of the two disk controllers, then reading a known bad block (sector 3805344).
The cypress controller correctly reports the bad block, and dd fails without copying the sector:
Dec 4 20:01:52 cool fox: About to do dd of a known bad sector
Dec 4 20:01:54 cool sd 16:0:0:0: [sdf] Result: hostbyte=0x00 driverbyte=0x08
Dec 4 20:01:54 cool sd 16:0:0:0: [sdf] Sense Key : 0x3 [current]
Dec 4 20:01:54 cool sd 16:0:0:0: [sdf] ASC=0x11 ASCQ=0x0
Dec 4 20:01:54 cool end_request: I/O error, dev sdf, sector 3805344
Dec 4 20:01:54 cool Buffer I/O error on device sdf, logical block 475668
Dec 4 20:01:55 cool sd 16:0:0:0: [sdf] Result: hostbyte=0x00 driverbyte=0x08
Dec 4 20:01:55 cool sd 16:0:0:0: [sdf] Sense Key : 0x3 [current]
Dec 4 20:01:55 cool sd 16:0:0:0: [sdf] ASC=0x11 ASCQ=0x0
Dec 4 20:01:55 cool end_request: I/O error, dev sdf, sector 3805344
Dec 4 20:01:55 cool Buffer I/O error on device sdf, logical block 475668
Dec 4 20:01:55 cool fox: dd done
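The sector number and the "logical block" number in the log above refer to the same place on the disk in different units. The arithmetic below is a sketch that assumes the logical block is counted in 4096-byte units (an assumption; it is simply the size that makes the two logged numbers agree with 512-byte sectors).

```python
SECTOR_BYTES = 512   # sector size used throughout this report
BLOCK_BYTES = 4096   # assumption: block size implied by the two logged numbers


def sector_to_block(sector):
    """Convert a 512-byte sector number to a 4096-byte logical block number."""
    return sector * SECTOR_BYTES // BLOCK_BYTES


if __name__ == "__main__":
    print(sector_to_block(3805344))  # 475668, the "logical block" in the log
```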
The unknown controller doesn't report an error, yet still takes >2 seconds to provide some data (which is rubbish of course):
Dec 4 20:03:51 cool fox: About to do dd of a known bad sector
Dec 4 20:03:53 cool sd 17:0:0:0: [sdf] Sense Key : 0x0 [current]
Dec 4 20:03:53 cool sd 17:0:0:0: [sdf] ASC=0x0 ASCQ=0x0
Dec 4 20:03:53 cool fox: dd done
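For readers unfamiliar with the hex codes, the two logs differ as follows: sense key 0x3 with ASC 0x11 / ASCQ 0x0 is the standard SCSI "Medium Error, unrecovered read error", while sense key 0x0 with ASC 0x0 / ASCQ 0x0 means "No Sense, no additional sense information". The sketch below pulls those pairs out of log lines shaped like the ones above; the lookup tables cover only the codes seen in this report, and the `decode_sense` helper is hypothetical.

```python
import re

# Only the codes that appear in this report; the SCSI standard defines many more.
SENSE_KEYS = {0x0: "No Sense", 0x3: "Medium Error"}
ADDITIONAL = {
    (0x00, 0x00): "No additional sense information",
    (0x11, 0x00): "Unrecovered read error",
}

KEY_RE = re.compile(r"Sense Key : (0x[0-9a-fA-F]+)")
ASC_RE = re.compile(r"ASC=(0x[0-9a-fA-F]+) ASCQ=(0x[0-9a-fA-F]+)")


def decode_sense(lines):
    """Return (sense key text, ASC/ASCQ text) for each Sense Key + ASC pair."""
    decoded = []
    key = None
    for line in lines:
        match = KEY_RE.search(line)
        if match:
            key = int(match.group(1), 16)
            continue
        match = ASC_RE.search(line)
        if match and key is not None:
            asc = int(match.group(1), 16)
            ascq = int(match.group(2), 16)
            decoded.append((SENSE_KEYS.get(key, "unknown"),
                            ADDITIONAL.get((asc, ascq), "unknown")))
            key = None
    return decoded


if __name__ == "__main__":
    log = [
        "sd 16:0:0:0: [sdf] Sense Key : 0x3 [current]",
        "sd 16:0:0:0: [sdf] ASC=0x11 ASCQ=0x0",
        "sd 17:0:0:0: [sdf] Sense Key : 0x0 [current]",
        "sd 17:0:0:0: [sdf] ASC=0x0 ASCQ=0x0",
    ]
    for pair in decode_sense(log):
        print(pair)
```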
The reason I thought the transfers were very slow is because it was taking seconds per sector with no error reported. A google search on ASC=0x0 ASCQ=0x0 is what led me to the original URL with the kernel bug I thought I had.
It may well be that in the presence of faulty disks with some usb disk controllers kernel 2.6.27 will keep trying to re-read the sector, but I'm more interested in why this other disk controller doesn't correctly report the bad block. Not reporting a bad block can lead to massive filesystem corruption after a short time.
Is there anything in the usbmon traces that shows the disk controller reported the faulty block, or did it pretend it was all ok? In other words, is it the kernel or my hardware at fault?
Please feel free to change the subject again to better reflect these ongoing developments (it's hard to get the subject right when entering the bug).
I've just tried my latest script (reading a known bad block) in a machine running 2.6.27-r3, and the unknown controller is stuck in a loop putting out those pairs of lines of which we only got one pair in 2.6.26-r3:
sd 5:0:0:0: [sdb] Sense Key : 0x0 [current]
sd 5:0:0:0: [sdb] ASC=0x0 ASCQ=0x0
sd 5:0:0:0: [sdb] Sense Key : 0x0 [current]
sd 5:0:0:0: [sdb] ASC=0x0 ASCQ=0x0
sd 5:0:0:0: [sdb] Sense Key : 0x0 [current]
sd 5:0:0:0: [sdb] ASC=0x0 ASCQ=0x0
sd 5:0:0:0: [sdb] Sense Key : 0x0 [current]
....
If it doesn't finish in the next few minutes I'll kill it.
Created attachment 174273 [details]
Simple test reading known bad block using kernel 2.6.27-r3
This shows the continuous kernel looping when reading a bad block on the unknown usb disk adaptor with kernel gentoo-sources-2.6.27-r3. It eventually finished, but I don't know whether it finally read the block correctly.
To summarise the problems as I see them:
1. Kernel 2.6.27 loops indefinitely reading a bad block using the unknown manuf/prodid usb-ide disk controller (identified in the original url).
2. Kernel 2.6.26 doesn't report a bad block when reading a bad block using the unknown controller.
3. Both of the two kernels tested return incorrect data when reading the wd disk in the unknown controller, even on known good blocks. Yet it reads data correctly from a different disk (maxtor).
Items 2 and 3 are a recipe for corrupting filesystems.
OK, to avoid further pollution of this bug, please open separate bugs for 2 and 3. This one is for the scsi problem where bad sectors are not handled very well in 2.6.27.
Raised bugs 249936 and 249938.
Magic hyperlinking didn't work. Comment 19 item 2 is in bug 249936. Comment 19 item 3 is in bug 249938.
We're now only talking about the "unknown" controller as the cypress issues have been split off to another bug. Alan Stern kindly examined the issues here: http://marc.info/?l=linux-scsi&m=122903295704381&w=2
To summarise:
1. Your device is acting strangely, it is not reporting bad sectors as it should do, but Linux should do better. There's enough information to know that something is wrong.
2. We don't have an available solution for 2.6.27, unfortunately.
3. 2.6.28 should be better than ..27 but still with room for improvement, please could you test that? Latest release is 2.6.28-rc8
4. If 2.6.28 does indeed act as we think (instead of looping it will take up to 6 minutes to give up on the bad sector), and you're interested in taking this further, then we can raise the issue upstream.
Created attachment 175109 [details]
A couple of tries reading the bad sector with kernel 2.6.28-rc8
Although this doesn't loop forever, it still takes a long time, and fails to report an error.
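To put a number on "a long time", a single-sector read can be timed directly. This is a minimal sketch assuming a Unix-like system and 512-byte sectors; the `timed_sector_read` helper and the device path are hypothetical. Two caveats: a repeat read may be served from the page cache rather than the disk, and, as this report shows, a read that returns without error from the unknown adaptor may still contain wrong data.

```python
import os
import time


def timed_sector_read(path, sector, sector_bytes=512):
    """Read one sector and return (seconds_taken, data_or_None).

    data is None when the kernel reports an I/O error for the read.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.monotonic()
        try:
            data = os.pread(fd, sector_bytes, sector * sector_bytes)
        except OSError:
            data = None
        return time.monotonic() - start, data
    finally:
        os.close(fd)


if __name__ == "__main__":
    # /dev/sdX is a placeholder; 3805344 is the known bad sector from this report.
    elapsed, data = timed_sector_read("/dev/sdX", 3805344)
    status = "I/O error" if data is None else f"{len(data)} bytes returned"
    print(f"{elapsed:.1f} s, {status}")
```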
OK, thanks for testing. If you would like to follow up on this, here are the next steps:
Open a SCSI bug report at http://bugzilla.kernel.org. It's important to be concise but informative and not attach too many files (no tarballs).
Here are the key points that I would emphasize:
- Your setup is a disk (with known bad sectors) in a USB-IDE converter
- The USB-IDE converter doesn't report errors from reading bad sectors in the "proper" way, but does give plenty of evidence that something is wrong and that the sector cannot be read
- Alan Stern gave a nice summary (link to it) here: http://marc.info/?l=linux-scsi&m=122903295704381&w=2
- 2.6.26 read the whole disk without reporting errors
- 2.6.27 gets in a very long (or possibly infinite) loop
- 2.6.28 gets a bit stuck on the bad sectors, taking several minutes to time out, and still does not report an error
- The purpose of your bug report is to provide data and testing to aid improvements in the SCSI layer, so that the error can be detected.
And I would attach (as 2 separate files, no archives or compression please):
1. dmesg from 2.6.28 after it gets "stuck" on one of the sectors
2. corresponding usbmon output for (1)
If you do go ahead, please post the new bug URL here when done. Thanks!
Thanks, we'll keep an eye on the upstream bug. |