Created attachment 288333 [details] emerge --info, part of /etc/fstab, lspci, mdstat

Hi all! I have run into the following problem. I have a RAID5 array assembled with mdadm (/dev/md127), divided into two partitions (md127p1 and md127p2), both formatted as reiserfs. The second partition is exported via NFS. Everything works; the array is intact and fully synchronized, and SMART reports the drives as healthy. Copying generally works too, but when I copy a file of about 700 MB the whole system freezes solid (even the mouse) at around 80 percent, and only a reset helps. The logs show nothing at all. After the reset fsck runs, of course, and then the array resynchronizes. I have not touched hdparm, so everything is at its defaults. Has anyone encountered this problem? In which direction should I dig? At gentoo.ru they say it may be reiserfs. Some information about the system is attached. Thank you.
Next: I unmounted the partition and ran a check, which detected errors. Why does the fsck at boot after a reset not correct this?

# reiserfsck --check /dev/md127p2
------------------------------------------------------------------------------
reiserfsck 3.6.21 (2009 www.namesys.com)
...
Will read-only check consistency of the filesystem on /dev/md127p2
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Fri Sep 30 21:04:04 2011
###########
Replaying journal: Done.
Reiserfs journal '/dev/md127p2' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Checking Semantic tree:
... 11 - FLAC)/63 - Various_-_Ibiza_2011_(continuous_DJ_mix_3_Underground).flacvpf-10670: The file [277582 277645] has the wrong size in the StatData (0), should be (4096)
vpf-10680: The file [277582 277645] has the wrong block count in the StatData (8), should be (0)
finished
4 found corruptions can be fixed when running with --fix-fixable
###########
reiserfsck finished at Fri Sep 30 21:12:10 2011
###########
------------------------------------------------------------------------------

# reiserfsck --fix-fixable /dev/md127p2
------------------------------------------------------------------------------
reiserfsck 3.6.21 (2009 www.namesys.com)
...
Will check consistency of the filesystem on /dev/md127p2
and will fix what can be fixed without --rebuild-tree
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --fix-fixable started at Fri Sep 30 21:14:00 2011
###########
Replaying journal: Done.
Reiserfs journal '/dev/md127p2' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..vpf-10630: The on-disk and the correct bitmaps differs. Will be fixed later.
Checking Semantic tree:
... 11 - FLAC)/63 - Various_-_Ibiza_2011_(continuous_DJ_mix_3_Underground).flacvpf-10670: The file [277582 277645] has the wrong size in the StatData (0) - corrected to (4096)
vpf-10680: The file [277582 277645] has the wrong block count in the StatData (8) - corrected to (0)
finished
No corruptions found
There are on the filesystem:
Leaves 196965
Internal nodes 1248
Directories 8781
Other files 250705
Data block pointers 187358213 (34266825 of them are zero)
Safe links 0
###########
reiserfsck finished at Fri Sep 30 21:23:10 2011
###########
------------------------------------------------------------------------------

After the check I tried copying again: exactly the same story. I decided to check for bad sectors. A scan of the drives with badblocks -nvs /dev/sdX is now running from runlevel 1; 20 hours remain. In the Russian community (gentoo.ru) someone said that the problem may also be due to the NFS export, but I have not yet tried copying without the NFS daemon running. I will try once the scan is finished.
The first partition (/dev/md127p1) contains no errors.
Next: the surface scan with 'badblocks -nvs' revealed no bad sectors. I then switched to runlevel 3 and checked the reiserfs partition again, this time with --rebuild-tree. I also found that copying 10-20 files of 5-10 MB each (mp3) works fine, but if I queue up 100 such files, the copy hangs after a while and never completes. NFS was stopped. What's next? As I understand it, I should debug the kernel and enable verbose output?

# reiserfsck --rebuild-tree /dev/md0p2
------------------------------------------------------------------------------
reiserfsck 3.6.21 (2009 www.namesys.com)
...
Will rebuild the filesystem (/dev/md0p2) tree
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
Replaying journal: Done.
Reiserfs journal '/dev/md0p2' in blocks [18..8211]: 0 transactions replayed
###########
reiserfsck --rebuild-tree started at Sat Oct 1 12:23:52 2011
###########

Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 153308617 blocks marked used
Skipping 19016 blocks (super block, journal, bitmaps) 153289601 blocks will be read
0%block 6736204: The number of items (4096) is incorrect, should be (1) - corrected
block 6736204: The free space (51968) is incorrect, should be (4048) - corrected
pass0: vpf-10110: block 6736204, item (0): Unknown item type found [217129472 3204448256 0x3 ??? (15)] - deleted
....20%..block 118354322: The number of items (4505) is incorrect, should be (1) - corrected
block 118354322: The free space (39168) is incorrect, should be (4048) - corrected
pass0: vpf-10110: block 118354322, item (0): Unknown item type found [0 2566914448 0x99001100 ??? (9)] - deleted
..40%....60%....80%....100% left 0, 47151 /sec
259485 directory entries were hashed with "r5" hash.
"r5" hash is selected
Flushing..finished
Read blocks (but not data blocks) 153289601
Leaves among those 196968
- leaves all contents of which could not be saved and deleted 3
Objectids found 260611

Pass 1 (will try to insert 196965 leaves):
####### Pass 1 #######
Looking for allocable blocks .. finished
0%....20%....40%....60%....80%....100% left 0, 327 /sec
Flushing..finished
196965 leaves read
196860 inserted
105 not inserted
####### Pass 2 #######

Pass 2:
0%....20%....40%....60%....80%....100% left 0, 210 /sec
Flushing..finished
Leaves inserted item by item 105
Pass 3 (semantic):
####### Pass 3 #########
Flushing..finished
Files found: 250129
Directories found: 8782
Symlinks found: 576
Pass 3a (looking for lost dir/files):
####### Pass 3a (lost+found pass) #########
Looking for lost directories:
Flushing..finished, 2283 /sec
Pass 4 - finishedone 135475, 6451 /sec
Flushing..finished
Syncing..finished
###########
reiserfsck finished at Sat Oct 1 13:34:50 2011
###########
------------------------------------------------------------------------------
I forgot to mention that the check with --rebuild-tree did not help.
1) What version of the kernel are you using? That info should be in your attachment, but I can't read it: what character encoding is it using?

2) If you aren't using the latest vanilla kernel (3.0.4 at time of writing), please try that and confirm the problem still exists.

Anyway, you're right, next step is recompiling the kernel with debugging enabled. You'll want at least the following "Kernel Hacking" options enabled:

  Magic SysRq key
  Kernel debugging
  Detect Hung Tasks
  RT Mutex debugging, deadlock detection
  Spinlock and rw-lock debugging: basic checks
  Mutex debugging: basic checks
  Lock debugging: detect incorrect freeing of live locks
  Lock debugging: prove locking correctness
  Compile the kernel with debug info

Make sure you're running the latest vanilla kernel with those options, reproduce the lockup *from the console, not in X*, and see if any useful messages are dumped.
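The menu entries listed above can be checked against a kernel's .config before rebuilding. What follows is a minimal sketch and not something posted in this thread: the function name is hypothetical, and the CONFIG_* symbols are an assumed mapping of the menu labels to their 3.0-era option names, so it is worth confirming each one in menuconfig before trusting the result.

```shell
#!/bin/bash
# Hypothetical helper: print every debugging option that is NOT set to "y"
# in the given kernel config file.
# ASSUMPTION: the CONFIG_* names below are a best-effort mapping of the
# "Kernel Hacking" menu labels from the comment above; they are not quoted
# from the thread.
missing_debug_options() {
    local config_file=$1
    local opt
    for opt in \
        CONFIG_MAGIC_SYSRQ \
        CONFIG_DEBUG_KERNEL \
        CONFIG_DETECT_HUNG_TASK \
        CONFIG_DEBUG_RT_MUTEXES \
        CONFIG_DEBUG_SPINLOCK \
        CONFIG_DEBUG_MUTEXES \
        CONFIG_DEBUG_LOCK_ALLOC \
        CONFIG_PROVE_LOCKING \
        CONFIG_DEBUG_INFO
    do
        # An enabled built-in option appears as a line "CONFIG_FOO=y".
        grep -q "^${opt}=y\$" "$config_file" || echo "$opt"
    done
}
```

Called as `missing_debug_options /usr/src/linux/.config` (the path is an assumption), it prints one line per option that is not enabled; no output means all nine are set.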
Created attachment 288625 [details] emerge --info
Created attachment 288627 [details] /etc/fstab part
Created attachment 288629 [details] lspci
Created attachment 288631 [details] mdstat
Comment on attachment 288333 [details] emerge --info, part of /etc/fstab, lspci, mdstat

This attachment is wrong; see the ones attached next.
I had packed the attachment as a tar.bz2. Now it is posted properly. I use kernel 3.0.4 from gentoo-sources, the default from world. I will build a kernel with debugging this evening and tell you the result.
I am also running gentoo-sources-3.0.4, and my kernel also gets stuck when I copy a large directory from an ext3 partition to an ext4 partition on a different disk. It happens when I do "mv <src> <dst>" or "rsync -a <src> <dst>", every time.

$ uname -a
Linux freehit 3.0.4-gentoo-r2 #1 SMP Mon Oct 3 20:19:55 IST 2011 x86_64 AMD Phenom(tm) II X4 940 Processor AuthenticAMD GNU/Linux

I will attach my system config as well. The behavior seems very similar to the one reported here. If you think this is a different issue, I can open a separate bug.
Created attachment 288695 [details] tgz file with dmesg, lscpi, emerge --info, fstab

When I mv or rsync a big directory from /dev/sda4 to /dev/sdb1, the kernel gets stuck.
Can you reproduce this with gentoo-sources-3.0.3? There are some reports at the LKML about block-related lockups with 3.0.4 (not resolved yet). If you can't reproduce this with 3.0.3, maybe you could try git bisect and find the commit that's causing this regression. Thanks.
So first, I updated world. Now I have kernel 3.0.4-gentoo-r1. Then I recompiled the kernel with debugging, and the problem disappeared. I recompiled the kernel again without debugging, and it still works. Then I noticed that additional options were being passed to the compiler through CFLAGS_KERNEL. This is my optimized build script:
---------------------------------------------------------------------------------
#!/bin/bash
cd ./linux
export CFLAGS_KERNEL="-march=amdfam10 -O3 -pipe"
genkernel --no-mrproper --menuconfig --clean --splash=livecd-2007.0 --symlink --splash-res=1024x768 --install --disklabel --mdadm all
---------------------------------------------------------------------------------
After changing CFLAGS_KERNEL to "-march=amdfam10 -O2" everything works fine. It is worth noting that -O3 had been set for a long time and I had forgotten about it, because everything worked fine with it. The problems began after moving to software RAID. But maybe it works now because of the world update, including the kernel update to 3.0.4-gentoo-r1. Nevertheless, thanks for your support. If necessary, I can roll back to 3.0.4 and try once more, for the sake of a clean experiment. Tomorrow I should test on a production server with the same configuration, in case I missed something. Thanks.
(In reply to comment #14)
> Can you reproduce this with gentoo-sources-3.0.3? There are some reports at the
> LKML about block-related lockups with 3.0.4 (not resolved yet).
>
> If you can't reproduce this with 3.0.3, maybe you could try git bisect and find
> the commit that's causing this regression.
>
> Thanks.

Please give me a link; I want to reproduce the situation.
Well, I was referring to your issue. Can you reproduce the lockup/hang with gentoo-sources-3.0.3?
Oops. It hung overnight. A torrent client on a remote machine was saving to my RAID :( Could it be 3.0.4?
I was slow on the uptake and did not understand at first. Yes, I can. In the evening after work (from 12 to 18 UTC).
Paramonov and Sushant, can you confirm this is a kernel regression (i.e. it was working before then broke after a kernel upgrade), and if so what was the last kernel that worked? Sushant, when you say "the kernel gets stuck" are you talking about a hard-lockup, where the system completely ceases responding? If so then please ensure the debugging options I mentioned are enabled in your kernel config, reproduce the lockup from the console and let us know whether there are any interesting messages produced. Although you and Paramonov are seeing the same symptoms you have very different setups with basically nothing in common above the block layer. At this point we can't be sure whether it is the same bug or not. Let's see how we go.
Now I am testing 3.0.3 with debugging enabled.
3.0.3 hangs too. How do I use the debug options? How do I use the magic SysRq key? /var/log/messages does not contain any additional information :(
Now I am compiling 3.0.1, then 3.0.0, and so on.
3.0.0 hangs too. Could it be caused by a hardware problem? I have a Gigabyte 870A-UD3. The first processor was defective: its memory controller did not work with 16 GB of RAM. The store replaced it with a new one. Could the chipset be overheating? Please tell me how to use the debugging options and the magic SysRq key. Now I am checking with and without the NFS export. I also use tmpfs:
tmpfs /tmp tmpfs size=2g,defaults,auto 0 0
tmpfs /var/tmp tmpfs size=14g,defaults,auto 0 0
Without NFS it does not hang.
It worked twice, then hung 8(
So, I managed to get some debugging information (kernel-3.0.4-gentoo-r1):

/var/log/messages
------------------------------------------------------------------------------
Oct 3 17:58:07 localhost kernel: ata4.00: exception Emask 0x0 SAct 0x20 SErr 0x0 action 0x6 frozen
Oct 3 17:58:07 localhost kernel: ata4.00: failed command: READ FPDMA QUEUED
Oct 3 17:58:07 localhost kernel: ata4.00: cmd 60/08:28:37:38:a5/00:00:02:00:00/40 tag 5 ncq 4096 in
Oct 3 17:58:07 localhost kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 3 17:58:07 localhost kernel: ata4.00: status: { DRDY }
Oct 3 17:58:07 localhost kernel: ata4: hard resetting link
Oct 3 17:58:07 localhost kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct 3 17:58:12 localhost kernel: ata4.00: qc timeout (cmd 0xec)
Oct 3 17:58:12 localhost kernel: ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct 3 17:58:12 localhost kernel: ata4.00: revalidation failed (errno=-5)
Oct 3 17:58:12 localhost kernel: ata4: hard resetting link
Oct 3 17:58:13 localhost kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct 3 17:58:23 localhost kernel: ata4.00: qc timeout (cmd 0xec)
Oct 3 17:58:23 localhost kernel: ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct 3 17:58:23 localhost kernel: ata4.00: revalidation failed (errno=-5)
Oct 3 17:58:23 localhost kernel: ata4: limiting SATA link speed to 3.0 Gbps
Oct 3 17:58:23 localhost kernel: ata4: hard resetting link
Oct 3 17:58:23 localhost kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Oct 3 17:58:39 localhost kernel: INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0 1 2} (detected by 3, t=60003 jiffies)
Oct 3 17:58:52 localhost kernel: INFO: rcu_bh_state detected stalls on CPUs/tasks: { 0 1 2} (detected by 3, t=60002 jiffies)
Oct 3 17:58:53 localhost kernel: ata4.00: qc timeout (cmd 0xec)
Oct 3 17:58:53 localhost kernel: ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct 3 17:58:53 localhost kernel: ata4.00: revalidation failed (errno=-5)
Oct 3 17:58:53 localhost kernel: ata4.00: disabled
Oct 3 17:58:53 localhost kernel: ata4.00: device reported invalid CHS sector 0
Oct 3 17:58:53 localhost kernel: ata4: hard resetting link
Oct 3 17:58:53 localhost kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Oct 3 17:58:53 localhost kernel: ata4: EH complete
Oct 3 17:58:53 localhost kernel: sd 3:0:0:0: [sdd] Unhandled error code
Oct 3 17:58:53 localhost kernel: sd 3:0:0:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00
Oct 3 17:58:53 localhost kernel: sd 3:0:0:0: [sdd] CDB: cdb[0]=0x28: 28 00 02 a5 38 37 00 00 08 00
Oct 3 17:58:53 localhost kernel: end_request: I/O error, dev sdd, sector 44382263
------------------------------------------------------------------------------
Created attachment 288777 [details] part of /var/log/messages about disk errors
Maybe it is caused by an aging SATA cable on /dev/sdd?
I agree that it looks like a hardware issue at this point. If you can reproduce this a few times it would be very useful to see if it is always the same drive that is failing. If so then either that cable or the drive itself is probably at fault. If more than one drive fails then the most likely hardware culprit is the power supply. Poor quality and/or underpowered power supplies can cause all sorts of strange issues.
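To see whether it is always the same drive, the libata failure lines can be tallied per device across several reproductions. This is a rough sketch, assuming syslog lines in the format posted earlier in this report; the function name is hypothetical, and the three message patterns are only the ones that appear in that log, not a complete list of libata errors.

```shell
#!/bin/bash
# Hypothetical helper: read syslog text on stdin and print
# "<ata device> <count>" for each device with failure messages.
# ASSUMPTION: only the three message types seen in the log earlier in
# this thread are counted.
ata_error_summary() {
    grep -E 'kernel: ata[0-9]+(\.[0-9]+)?: (failed command|qc timeout|failed to IDENTIFY)' \
        | sed -E 's/.*kernel: (ata[0-9]+(\.[0-9]+)?): .*/\1/' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}
```

For example, `ata_error_summary < /var/log/messages` after each hang; if the same ataN.NN keeps leading the list, that port's cable or drive is the first suspect.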
I have a brand new computer. No cheap parts are installed. Motherboard Gigabyte 870-UD3, power supply FSP 700W, memory 16 GB Kingston, CPU Phenom II X6 1090T. It is strange that SMART says the disks are healthy. A full surface scan with badblocks -nvs finds no errors. It looks more like an overheating hard drive than a problem with the cable. Tonight I will replace the cable, add a fan for the drive and check again. See you.
The drives in the array are Western Digital. The root filesystem is on a solid-state drive.
I replaced the cable and tested again. Now I get a "BUG: spinlock lockup":
Nov 26 13:34:46 localhost kernel: BUG: spinlock lockup on CPU#2, mc/7609, ffff880419c37200
Oct 4 15:55:50 localhost kernel: BUG: spinlock lockup on CPU#3, flush-9:127/2391, ffff880419c37200
See the attachment for details.
Created attachment 288861 [details] part of /var/log/messages about BUG: spinlock lockup on CPU #N
OK, so it looks like you have found a genuine kernel bug after all -- although it looks like it was already hanging before the spinlock lockup occurred, so maybe there is more than one thing going on. Anyway, next step is to report it upstream. Normally we encourage people to report bugs via the kernel bugzilla, however that is still down following the recent kernel.org outage. Instead you should report it to the relevant mailing list(s) and developer(s) directly. First thing is to figure out who to report it to. I'm not sure exactly where the bug is, so I'd suggest to start by sending it to LKML (linux-kernel@vger.kernel.org) and to CC the md and block layer folks (specifically: Neil Brown <neilb@suse.de>, linux-raid@vger.kernel.org, Jens Axboe <axboe@kernel.dk>). Also CC me (duaneg@dghda.com). Use a concise descriptive subject line such as "BUG: spinlock lockup while performing FS operations". Give a brief description of your configuration (i.e. running reiserfs on RAID5) and that you've been seeing hangs during heavy FS operations. Also mention it is a new machine so you don't have a previously working configuration. Include the *full* dmesg output from after a failure (inline within the email, not as an attachment). Also give a link to this bugzilla entry. Please let me know if anything is unclear.
Ok. Thanks.
Created attachment 289037 [details] kernel configuration file .config even without CONFIG_LOCKDEP
Since bugzilla.kernel.org still does not work, I will answer the many emails here. I have not changed anything in /var/log/messages. The clock really is jumping around; the rows are in their original order. I was surprised myself. The file was truncated with the command echo "0" > /var/log/messages before copying. Alexander Beregalov <a.beregalov@gmail.com> says that I need to compile a kernel with CONFIG_LOCKDEP. Tonight (from 12 to 18 UTC) I will do this and post a complete log with CONFIG_LOCKDEP switched on. I have a new computer, and this is my first RAID array. For now the old config, without CONFIG_LOCKDEP, is attached. See you.
Created attachment 289067 [details] "full" /var/log/messages

Hi! I need help. How do I get the kernel to write this additional information to /var/log/messages? It is displayed on the screen, but after the restart the logs are silent. I use the keyboard shortcuts Alt + SysRq + <key>, but the output only appears on the screen. This is the last output before everything hangs:
..
Oct 6 08:01:07 localhost kernel: SysRq : Changing Loglevel
Oct 6 08:01:07 localhost kernel: Loglevel set to 9
Oct 6 08:03:16 localhost kernel: INFO: rcu_bh_state detected stalls on CPUs/tasks: { 0 1 2 4} (detected by 5, t=60002 jiffies)
Oct 6 08:03:32 localhost kernel: INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0 1 2 4} (detected by 5, t=60002 jiffies)
...
Then a new syslog begins.
Just quickly: you should reply to Neil and Dan via email. Kernel developers often dislike reading bugzilla entries ;-) Do a reply-to-all on one of their messages (make sure to add Dan to the CC list if replying to Neil's). Next time please try to keep full logs, don't truncate them. BTW, for kernel logs I recommend using the "dmesg" command instead of the system log. Lastly, regarding how to capture the logs: if the SysRq output isn't being written to disk then the short answer is netconsole. However try doing a "sync" -> "emergency unmount" -> "sync" -> "reboot" SysRq sequence first. You never know.
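On the netconsole suggestion: the module takes one parameter string that names the sending and the receiving side. The sketch below only assembles that string. The helper name, the addresses and the port numbers are placeholders invented for illustration, and the field layout follows my reading of the kernel's netconsole documentation, so please check it against Documentation/networking/netconsole.txt for the kernel in use.

```shell
#!/bin/bash
# Hypothetical helper: build the netconsole module parameter.
# Layout assumed from the kernel documentation:
#   netconsole=<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
# ASSUMPTION: ports 6665 (source) and 6666 (target) are arbitrary
# example values chosen here.
netconsole_param() {
    local src_ip=$1    # address of the machine that hangs
    local dev=$2       # its network interface, e.g. eth0
    local tgt_ip=$3    # address of the machine collecting the log
    local tgt_mac=$4   # MAC address of that machine (or of the gateway)
    echo "netconsole=6665@${src_ip}/${dev},6666@${tgt_ip}/${tgt_mac}"
}
```

On the machine that hangs, the result would be passed to the module, e.g. `modprobe netconsole "$(netconsole_param 192.168.0.10 eth0 192.168.0.20 00:11:22:33:44:55)"`, while the second machine listens for UDP on port 6666 (for instance with netcat; the exact flags differ between netcat implementations).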
But you may be able to help too. I wrote the following message to the developers and duplicate it here:

"Hi all. Next, I wanted to make a backup. I disconnected one drive of the RAID because I did not have a free power connector. The RAID continued to work fine. Then I connected another drive, which was detected as /dev/sdd. I created an XFS filesystem on it, mounted it and tried to back up my array. I got this output in /var/log/messages:
---
Oct 6 08:03:16 localhost kernel: INFO: rcu_bh_state detected stalls on CPUs/tasks: { 0 1 2 4} (detected by 5, t=60002 jiffies)
Oct 6 08:03:32 localhost kernel: INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0 1 2 4} (detected by 5, t=60002 jiffies)
---
Everything on this console was stuck, but the other consoles (Alt + Fx) still worked. I could enter my login, but not the password. The magic SysRq keys still worked for some time, but nothing more was written to /var/log/messages. Duane Griffin (bugs.gentoo.org) says that I need to try "sync" -> "emergency unmount" -> "sync" -> "reboot". But that is a separate matter.

Next, I decided to take the dump directly with
# dd if=/dev/md127 of=/dev/sdd
and so copy both partitions. Again, everything hung after a short time (about 1-2 minutes). I now conclude that the problem is not in the file system, and not even in the hardware. Here is why.

I then do a reset, but often the computer does not restart and I have to press and hold the power button to shut it down, then switch it on again. That is strange, but moving on. I reconnected the third disk, but the RAID did not take it back. So I ran:
# mdadm --zero-superblock /dev/sdd1
# mdadm --manage /dev/md0 --add /dev/sdd1
All is OK. ATTENTION! The array starts synchronizing, and it all completes without any problems.
# cat /proc/mdstat
---
Personalities : [raid6] [raid5] [raid4] [multipath] [faulty]
md127 : active raid5 sdd1[3] sdb1[0] sdc1[1]
      1465146368 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
      [===================>.] recovery = 99.5% (729613632/732573184) finish=0.9min speed=51623K/sec

unused devices: <none>
---
# cat /proc/mdstat
---
Personalities : [raid6] [raid5] [raid4] [multipath] [faulty]
md127 : active raid5 sdd1[3] sdb1[0] sdc1[1]
      1465146368 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>
---
Second, SMART reports that the array disks are healthy. It is very strange! So I concluded that the problem is not in the hardware. I would like to hear your opinion. I still have a few ideas:
1. Disconnect the remaining disks of the array as well and try to sync again, to rule out a problem with the drives.
2. Try copying between disks outside the array. But apparently that is the same case as the dd command.
3. I have an old IDE disk that is mounted with the following lines:
# IDE disk 160Gb
/dev/sde1 /var reiserfs defaults,auto,noatime,nodiratime,notail 0 0
/dev/sde2 /usr/portage reiserfs defaults,auto,noatime,nodiratime,notail 0 0
/dev/sde3 /usr/src reiserfs defaults,auto,noatime,nodiratime,notail 0 0
/dev/sde4 none swap sw 0 0
This is because I have a solid-state drive, /dev/sda, mounted as the root partition. This IDE drive has non-critical SMART errors, listed at the end of this message as the output of smartctl --all /dev/sde. It is unclear how this could affect the dd command.

Next time I will do this, and try sync and emergency unmount to save the information in the log. If it is not saved, I will have to copy the screen by hand or photograph it. Then I will post the logs and screenshots. Sorry for my bad English; Google Translate is helping me. I want to help and I need your help. Thanks.
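The difference between the two /proc/mdstat listings in this message ([3/2] [UU_] while rebuilding, [3/3] [UUU] afterwards) can be checked mechanically. A minimal sketch, assuming the usual mdstat layout where the array name starts a line and the [total/active] counter follows on a later line; the function name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read /proc/mdstat text on stdin and print
# "<array> <active>/<total>" for every array with fewer active members
# than configured members.
degraded_arrays() {
    awk '
        # Remember the most recent array name, e.g. "md127".
        /^md[0-9]+ :/ { name = $1 }
        # Find the "[total/active]" counter, e.g. "[3/2]".
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print name, n[2] "/" n[1]
        }
    '
}
```

Run as `degraded_arrays < /proc/mdstat`; empty output means every array has all of its members active.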
# smartctl --all /dev/sde
--
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0.4-gentoo-r1] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus
Device Model:     ST3160023A
Serial Number:    4JS0JGZ4
Firmware Version: 8.01
User Capacity:    160 040 803 840 bytes [160 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Sat Oct 8 12:42:29 2011 NOVT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 430) seconds.
Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 111) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 054   048   006    Pre-fail Always  -           120037243
  3 Spin_Up_Time            0x0003 097   096   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always  -           106
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000f 086   060   030    Pre-fail Always  -           410368363
  9 Power_On_Hours          0x0032 069   069   000    Old_age  Always  -           27769
 10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 098   098   020    Old_age  Always  -           2760
194 Temperature_Celsius     0x0022 048   061   000    Old_age  Always  -           48
195 Hardware_ECC_Recovered  0x001a 054   047   000    Old_age  Always  -           120037243
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 200   192   000    Old_age  Always  -           95
200 Multi_Zone_Error_Rate   0x0000 100   253   000    Old_age  Offline -           0
202 Data_Address_Mark_Errs  0x0032 100   253   000    Old_age  Always  -           0

SMART Error Log Version: 1
ATA Error Count: 6 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 6 occurred at disk power-on lifetime: 612 hours (25 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 f6 5f 39 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00395ff6 = 3760118

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ---------------- --------------------
  25 00 80 77 5f 39 e0 00  00:57:36.606     READ DMA EXT
  25 00 80 77 5f 39 e0 00  00:57:36.596     READ DMA EXT
  25 00 80 f7 5e 39 e0 00  00:57:36.588     READ DMA EXT
  25 00 80 77 5e 39 e0 00  00:57:36.573     READ DMA EXT
  25 00 58 3f 77 39 e0 00  00:57:36.572     READ DMA EXT

Error 5 occurred at disk power-on lifetime: 612 hours (25 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 f6 5f 39 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00395ff6 = 3760118

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ---------------- --------------------
  25 00 80 77 5f 39 e0 00  00:57:36.606     READ DMA EXT
  25 00 80 f7 5e 39 e0 00  00:57:36.596     READ DMA EXT
  25 00 80 77 5e 39 e0 00  00:57:36.588     READ DMA EXT
  25 00 58 3f 77 39 e0 00  00:57:36.573     READ DMA EXT
  25 00 80 f7 5d 39 e0 00  00:57:36.572     READ DMA EXT

Error 4 occurred at disk power-on lifetime: 612 hours (25 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 76 5e 39 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00395e76 = 3759734

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ---------------- --------------------
  25 00 80 f7 5d 39 e0 00  00:57:34.469     READ DMA EXT
  25 00 80 f7 5d 39 e0 00  00:57:34.454     READ DMA EXT
  25 00 80 77 5d 39 e0 00  00:57:34.445     READ DMA EXT
  25 00 80 f7 5c 39 e0 00  00:57:34.444     READ DMA EXT
  25 00 80 f7 5c 39 e0 00  00:57:34.440     READ DMA EXT

Error 3 occurred at disk power-on lifetime: 612 hours (25 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 76 5e 39 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00395e76 = 3759734

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ---------------- --------------------
  25 00 80 f7 5d 39 e0 00  00:57:34.469     READ DMA EXT
  25 00 80 77 5d 39 e0 00  00:57:34.454     READ DMA EXT
  25 00 80 f7 5c 39 e0 00  00:57:34.445     READ DMA EXT
  25 00 80 f7 5c 39 e0 00  00:57:34.444     READ DMA EXT
  25 00 80 bf 76 39 e0 00  00:57:34.440     READ DMA EXT

Error 2 occurred at disk power-on lifetime: 612 hours (25 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 76 5d 39 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00395d76 = 3759478

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ---------------- --------------------
  25 00 80 f7 5c 39 e0 00  00:57:34.469     READ DMA EXT
  25 00 80 bf 76 39 e0 00  00:57:34.454     READ DMA EXT
  25 00 80 77 5c 39 e0 00  00:57:34.445     READ DMA EXT
  25 00 80 5f c1 38 e0 00  00:57:34.444     READ DMA EXT
  25 00 28 4f 5b 39 e0 00  00:57:34.440     READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%        27642            -
# 2  Short offline     Completed without error  00%        27345            -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
--
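The smartctl output above can also be watched over time for one attribute, for instance UDMA_CRC_Error_Count, which stands at 95 here alongside the ICRC entries in the error log. As far as I know, interface CRC errors usually point at the cable or connector rather than the platters, but treat that as a rule of thumb. A minimal sketch with a hypothetical function name, assuming the ten-column attribute table layout shown above:

```shell
#!/bin/bash
# Hypothetical helper: read a smartctl attribute table on stdin and print
# the RAW_VALUE of the named attribute.
# ASSUMPTION: the table has the ten whitespace-separated columns shown
# above, with ATTRIBUTE_NAME in column 2 and RAW_VALUE in column 10.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}
```

For example, `smartctl -A /dev/sde | smart_raw_value UDMA_CRC_Error_Count`, run before and after a large copy, shows whether the counter is still growing.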
Created attachment 289217 [details] /var/log/messages with debugging options set when the kernel gets stuck

It happens simply when I do du -sh <large-directory>, where the large directory is on an ext3 filesystem.
"du -sh" also fails on an older kernel, linux-2.6.38-gentoo-r7. I opened the box and checked the SATA cables. They were attached properly. It looks like this is a different bug from Paramonov's. Should I file a separate bug?
(In reply to comment #43)
> "du -sh" also fails on an older kernel linux-2.6.38-gentoo-r7. I opened the box
> and checked the SATA cable. They were attached properly. Looks like it is a
> different bug than Paramonov. Should I file a separate bug?

I think this is the same bug. I have detected a stall on the CPU too, and I have renamed the topic.
I noticed that copying a large directory from the large partition (/dev/md127p2, 1.4 TB) to the small one (/dev/md127p1, 50 GB) succeeds. Copying from small to small works too. But if I copy to the large partition, it hangs. Yesterday I wanted to make a backup, so I created a 1 TB XFS partition, and that hung too. So it happens regardless of the file system. This is the key!

Sushant Sinha, you should also write to the kernel developers to confirm the bug; otherwise they will treat a report from a single person as a hardware issue, and that would be basically reasonable. I failed to get the output into /var/log/messages because it does not have time to be written to disk before the lockup. You can help solve the problem by sending a report to the developers. Because bugzilla.kernel.org is still down, I wrote to:

linux-kernel@vger.kernel.org
neilb@suse.de
linux-raid@vger.kernel.org
axboe@kernel.dk
duaneg@dghda.com

Do not forget to indicate in the subject that stalls were detected on CPUs. I used the same subject. Give them a link to this thread. Thanks.
Another thought: we have the same chipset driver. You have a GA-MA790GP motherboard and I have a GA-870-UD3, both Gigabyte boards on AMD.
Paramonov, Sent a mail to the kernel devs with the /var/log/messages. You are right that we have the same motherboard and the same CPU. Also I was trying to copy roughly 150GB from the smaller partition (200GB) to larger one (ext4, 2TB).
(In reply to comment #47)
> Paramonov,
>
> Sent a mail to the kernel devs with the /var/log/messages. You are right that
> we have the same motherboard and the same CPU. Also I was trying to copy
> roughly 150GB from the smaller partition (200GB) to larger one (ext4, 2TB).

Thanks. I sent your /var/log/messages too :) How did you get the output into /var/log/messages? I only get output on the screen, and SysRq sync does not help me. Maybe I did not enable all the debug options in the kernel configuration.
I am copying a folder with various files (ttf, mp3, avi of about 1.5 GB). The folder size is 21 GB.
(In reply to comment #48)
> Thanks. I sent your /var/log/messages too :) How you are got output to
> /var/log/messages? I get output to screen.. And SysReq-sync is not helps me.
> Maybe I did not include all the debug options in kernel configuration.

I set loglevel=9 on the kernel line at boot, like this:

kernel /boot/kernel-3.0.4-gentoo-debug root=/dev/sda3 loglevel=9

I tried it with my old kernel 2.6.33 and it hangs there too. So it is one of:

1. buggy hardware
2. a hardware-specific bug
3. an old hang bug in the Linux kernel

I doubt that option (3) is a possibility. I think option (2) is more likely. I am going to check whether a firmware upgrade is available for my h/w.
I think the problem is the driver for the AMD chipset (SATA part). "AMD Big FS Write BUG!"
(In reply to comment #51)
> I think the problem is the driver for the chipset AMD (SATA part). "AMD Big FS
> Write BUG!"

That is a possibility. But it seems strange that such a bug would have existed for more than 2 years without being fixed.
Maybe it is a bug specific to AMD + software RAID. Let's wait and see what the developers say.
OK, so first thing: Sushant, it doesn't look like you are using md, right? So, assuming this is the same bug (and let's remember that hasn't been fully established yet), we can scratch that possibility.

There are reports of problems with the ATI SB700/SB800 SATA controller, but they were from long ago and seem to have been fixed. The ATA maintainer is Jeff Garzik; I'll reply to your latest email and CC him and the ATA list.

Paramonov, to get a full log from your machine the easiest option is probably to set up netconsole. Do you have another Linux machine available you can use for this?
(In reply to comment #54)
> OK, so first thing: Sushant, it doesn't look like you are using md, right? So,
> assuming this is the same bug (and let's remember that hasn't been fully
> established yet), we can scratch that possibility.
>
> There are reports of problems with the ATI SB700/SB800 SATA controller, but
> they were from long ago and seem to have been fixed. The ATA maintainer is Jeff
> Garzik, I'll reply to your latest email and CC him and the ATA list.
>

Yes. I was mistaken when I wrote the post above. More correctly: amd_sata + big_fs = bug.

> Paramonov, to get a full log from your machine the easiest option is probably
> to setup netconsole. Do you have another linux machine available you can use
> for this?

Yes, I have one more computer. I am reading up on how to use netconsole now and will post the logs soon. Thank you.
I am not using md at all. I enabled "S.M.A.R.T." for the hard disk in the BIOS and used smartctl to look for warnings. It shows everything is fine. I also ran "fsck -f" on the partition to see if there is any fs inconsistency, but that shows everything is fine too. fsck on the partition runs fine and does not stall the kernel.
Created attachment 289295 [details] The console output through netcat #nc -u -l -p 6969

OK. I set up a network console and got the output with:

# nc -u -l -p 6969
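If netcat is not available on the receiving machine, the same capture can be sketched in a few lines of Python. This is only an illustration of a UDP listener like the nc command above, assuming port 6969 as in that command; the function names and the netconsole.log file name are made up, and nothing here configures the sending side.

```python
import socket


def open_listener(host="0.0.0.0", port=6969):
    """Bind a UDP socket, like `nc -u -l -p 6969` does."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def capture(sock, out, max_datagrams=None):
    """Write every received datagram to `out` and flush immediately.

    Flushing after each datagram matters here: the last lines before
    a hang are the interesting ones. Runs until interrupted unless
    max_datagrams is given. Returns the number of datagrams written.
    """
    received = 0
    while max_datagrams is None or received < max_datagrams:
        data, _addr = sock.recvfrom(65535)
        out.write(data.decode("utf-8", errors="replace"))
        out.flush()
        received += 1
    return received
```

To log to a file, something like `capture(open_listener(), open("netconsole.log", "a"))` would run until interrupted with Ctrl-C.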
(In reply to comment #54)
> OK, so first thing: Sushant, it doesn't look like you are using md, right? So,
> assuming this is the same bug (and let's remember that hasn't been fully
> established yet), we can scratch that possibility.
>
> There are reports of problems with the ATI SB700/SB800 SATA controller, but
> they were from long ago and seem to have been fixed. The ATA maintainer is Jeff
> Garzik, I'll reply to your latest email and CC him and the ATA list.
>
> Paramonov, to get a full log from your machine the easiest option is probably
> to setup netconsole. Do you have another linux machine available you can use
> for this?

Could you send my log too? I do not know Jeff's email address. Thx.
Good work getting netconsole set up! If you can, next time it would be good to capture the full kernel log from bootup to hang. However, hold off sending more info for now; we don't want to spam people with too much detail. Let's wait and see what they say.
OK, I will wait. It's just that my work is at a standstill. Are there any statistics? How long does it usually take to fix a critical bug? Why is the status UNCONFIRMED? This is simply the first time I have been unable to use the computer for so long. I need to work :)
I've got no idea how long bugs stay unresolved in general. I doubt the severity makes any difference at all. If you play around with the "reports" link above you might be able to find out. Regardless, I don't think it will be useful for estimating how long this specific bug will take to fix.

I know it is frustrating, but please keep in mind that Gentoo is a volunteer-run distro. And while the kernel maintainers are generally paid to do their jobs, you don't have a support contract with them or the companies they work for, so it works similarly. Some linux people are very good about responding to users, some are not so good. Even the best will not react immediately to every report they get.

In the meantime there is something else you can do: test with some earlier kernels. Don't bother trying every minor release, just try each latest stable release, e.g. start by trying 2.6.39.4, 2.6.38.8 and then 2.6.37.6. If you find it was working in an earlier kernel then it is a regression and it makes it much easier to find the bug (and also provides much more incentive for the maintainers to look at it). Report the results here and I'll advise you on whether and how to send them upstream.

Finally, I'm not sure what the policy on updating the status field is. I wouldn't worry about it though; I don't think it will affect the quality of service you get ;-)
It hangs with kernel 2.6.33 for me.
Created attachment 289571 [details] Current working config

Last night I ran the world update. In the morning I updated the kernel to version 3.0.6. It works! But I also changed the kernel options for the SATA / PATA drivers. Maybe something else changed too, but I do not remember exactly. I am attaching the current working config just in case. If anything changes, I will report it here. Thank you all.
So, as I said, copying now works fine in both directions (tried 3 times). I copied the same 21 GB folder. Moreover, torrents were running in parallel and saving over NFS (3 downloading and 5 seeding). At night I removed the restriction on seeding and set the number of simultaneous downloads to 5; it got stuck again, with only the torrents running. Tonight I will try another directory of 100-200 GB.
(In reply to comment #64)
> So, as I said, fine copy now goes in two directions (tried 3 times) .. I copied
> the same folder on the 21Gb. Moreover, in parallel with downloading torrents
> with preservation of the NFS (3 downloads and 5 seeding).

That is great! It would be really helpful if you could identify the config option (if that is what it was) that made the change. Perhaps you still have a .config.old file with the old settings lying around?

To confirm it was a config option, not the upgrade, you could try recompiling 3.0.4 with the same config you are using now.
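If a .config.old is still around, the comparison suggested above could be sketched like this. It is a rough illustration that assumes the usual kernel .config line forms (`CONFIG_FOO=value` and `# CONFIG_FOO is not set`); the function names are made up for this example.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map option name -> value; explicitly unset options map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(old_text, new_text):
    """Return sorted (name, old_value, new_value) for every option that differs.

    A value of None means the option does not appear in that file at all.
    """
    old, new = parse_config(old_text), parse_config(new_text)
    changes = []
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes.append((name, before, after))
    return changes
```

It could be called as `diff_configs(open(".config.old").read(), open(".config").read())`; any changed entries among the SATA / PATA options would be the first candidates to test one at a time.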
It still hangs for me with kernel 3.0.6.
Last night I left the torrents running: 3 leeching and 5 seeding. It did not hang. But I think the root of the problem is not solved. It seems that it hangs when there are multiple threads. I have not had time to test that, so I will check with a large folder today. I will also check the current configuration on the old kernel.
Still hangs :(
I did not see any updates from the devs. Has anyone? I guess this problem may be affecting only a very small group of people.
I think we need to write to the maintainers of the SATA drivers, the AMD chipset and the JMicron controllers. The mailing list we wrote to first was the wrong one.
Please take this upstream at https://bugzilla.kernel.org/ and post the url back here.
One thing is for sure! The error occurs only when data from the torrent client on another computer is stored or read through the NFS server. In other words, the server is hung by the other machine.