385047 – BUG: spinlock lockup on CPU #N and INFO: detected stalls on CPUs / tasks

Status:	RESOLVED UPSTREAM

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	AMD64 Linux

Importance:	Normal critical
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2011-09-30 08:30 UTC by Paramonov Valeriy
Modified:	2012-06-22 18:53 UTC
CC List:	2 users

See Also:
Package list:
Runtime testing required:	---

Attachments
emerge --info, part of /etc/fstab, lspci, mdstat (attachment.tar.bz2,4.83 KB, text/plain) 2011-09-30 08:30 UTC, Paramonov Valeriy
emerge --info (emerge.info,7.03 KB, text/plain) 2011-10-03 02:01 UTC, Paramonov Valeriy
/etc/fstab part (fstab,258 bytes, text/plain) 2011-10-03 02:01 UTC, Paramonov Valeriy
lspci (lspci,2.79 KB, text/plain) 2011-10-03 02:02 UTC, Paramonov Valeriy
mdstat (mdstat,218 bytes, text/plain) 2011-10-03 02:02 UTC, Paramonov Valeriy
tgz file with dmesg, lscpi, emerge --info, fstab (kernel3.0.4.stuck.tgz,19.93 KB, application/octet-stream) 2011-10-03 16:58 UTC, Sushant Sinha
part of /var/log/messages about disk errors (messages1,2.52 KB, text/plain) 2011-10-04 15:14 UTC, Paramonov Valeriy
part of /var/log/messages about BUG: spinlock lockup on CPU #N (messages,22.28 KB, text/plain) 2011-10-05 13:32 UTC, Paramonov Valeriy
kernel configuration file .config even without CONFIG_LOCKDEP (.config,71.18 KB, text/plain) 2011-10-07 02:35 UTC, Paramonov Valeriy
"full" /var/log/messages (messages,69.07 KB, text/plain) 2011-10-07 12:34 UTC, Paramonov Valeriy
/var/log/messages with debugging options set when the kernel gets stuck Happens simply when I do du -sh <large-directory> where the large-directory is on a ext3 filesystem (messages,112.38 KB, text/plain) 2011-10-08 13:00 UTC, Sushant Sinha
The console output through netcat #nc -u -l -p 6969 (nc.out,14.80 KB, text/plain) 2011-10-09 05:25 UTC, Paramonov Valeriy
Current working config (.config,69.16 KB, text/plain) 2011-10-11 15:04 UTC, Paramonov Valeriy

Description Paramonov Valeriy 2011-09-30 08:30:21 UTC

Created attachment 288333 [details]
emerge --info, part of /etc/fstab, lspci, mdstat

Hi!

I have run into the following problem. I have a RAID5 array assembled with mdadm (/dev/md127), divided into 2 partitions (md127p1 and md127p2), both formatted with reiserfs. The second partition is exported via NFS. Everything works: the array is intact and fully synchronized, and SMART says the drives are healthy. Copying generally works too, but when I copy a file of about 700 MB, the whole system freezes solid at around 80 percent, even the mouse, and only a reset helps. The logs are completely silent at that point. After the reset, fsck runs and then the array resynchronizes. I have not touched hdparm, so everything is at the defaults. Has anyone encountered this problem? In what direction should I dig?

People at gentoo.ru say that it may be reiserfs. Some information about the system is attached.

Thank you.

Comment 1 Paramonov Valeriy 2011-10-01 03:24:48 UTC

Next.

I unmounted the partition and ran a check, which detected errors. Why does the fsck at boot after a reset not correct them?

# reiserfsck --check /dev/md127p2
------------------------------------------------------------------------------
reiserfsck 3.6.21 (2009 www.namesys.com)
...
Will read-only check consistency of the filesystem on /dev/md127p2
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Fri Sep 30 21:04:04 2011
###########
Replaying journal: Done.
Reiserfs journal '/dev/md127p2' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished                                
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Checking Semantic tree:
... 11 - FLAC)/63 - Various_-_Ibiza_2011_(continuous_DJ_mix_3_Underground).flacvpf-10670: The file [277582 277645] has the wrong size in the StatData (0), should be (4096)
vpf-10680: The file [277582 277645] has the wrong block count in the StatData (8), should be (0)
finished                                                                       
4 found corruptions can be fixed when running with --fix-fixable
###########
reiserfsck finished at Fri Sep 30 21:12:10 2011
###########
------------------------------------------------------------------------------


#reiserfsck --fix-fixable /dev/md127p2
------------------------------------------------------------------------------
reiserfsck 3.6.21 (2009 www.namesys.com)
...
Will check consistency of the filesystem on /dev/md127p2
and will fix what can be fixed without --rebuild-tree
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --fix-fixable started at Fri Sep 30 21:14:00 2011
###########
Replaying journal: Done.
Reiserfs journal '/dev/md127p2' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished                                
Comparing bitmaps..vpf-10630: The on-disk and the correct bitmaps differs. Will be fixed later.
Checking Semantic tree:
... 11 - FLAC)/63 - Various_-_Ibiza_2011_(continuous_DJ_mix_3_Underground).flacvpf-10670: The file [277582 277645] has the wrong size in the StatData (0) - corrected to (4096)
vpf-10680: The file [277582 277645] has the wrong block count in the StatData (8) - corrected to (0)
finished                                                                       
No corruptions found
There are on the filesystem:
        Leaves 196965
        Internal nodes 1248
        Directories 8781
        Other files 250705
        Data block pointers 187358213 (34266825 of them are zero)
        Safe links 0
###########
reiserfsck finished at Fri Sep 30 21:23:10 2011
###########
------------------------------------------------------------------------------

After the check I tried copying again, with exactly the same result. I decided to check for bad sectors. The drives are now being tested in runlevel 1 with badblocks -nvs /dev/sdX. About 20 hours remain.

People in the Russian community (gentoo.ru) said that the problem may also be caused by the NFS export, but I have not yet tried copying without the NFS daemon running. I will try that after the scan finishes.

Comment 2 Paramonov Valeriy 2011-10-01 03:26:46 UTC

The first partition (/dev/md127p1) contains no errors.

Comment 3 Paramonov Valeriy 2011-10-02 09:08:01 UTC

Next:

Surface scan with 'badblocks -nvs' revealed no bad sectors.

Then I switched to runlevel 3 and checked the reiserfs partition again, this time with --rebuild-tree.

I also found that copying 10-20 files of 5-10 MB each (mp3) works fine, but if I queue up 100 such files, the system hangs after a while and the copy never completes. NFS was stopped during this test.
What's next? As I understand it, I should build a debug kernel and enable verbose output?

# reiserfsck --rebuild-tree /dev/md0p2
------------------------------------------------------------------------------
reiserfsck 3.6.21 (2009 www.namesys.com)...                        
Will rebuild the filesystem (/dev/md0p2) tree
Will put log info to'stdout'                                                                 
Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
Replaying journal: Done.
Reiserfs journal '/dev/md0p2' in blocks [18..8211]: 0 transactions replayed
###########
reiserfsck --rebuild-tree started at Sat Oct  1 12:23:52 2011
###########                                                                                    
Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 153308617 blocks marked used
Skipping 19016 blocks (super block, journal, bitmaps) 153289601 blocks will be read
0%block 6736204: The number of items (4096) is incorrect, should be (1) - corrected
block 6736204: The free space (51968) is incorrect, should be (4048) - corrected
pass0: vpf-10110: block 6736204, item (0): Unknown item type found [217129472 3204448256 0x3 ??? (15)] - deleted
....20%..block 118354322: The number of items (4505) is incorrect, should be (1) - corrected
block 118354322: The free space (39168) is incorrect, should be (4048) - corrected
pass0: vpf-10110: block 118354322, item (0): Unknown item type found [0 2566914448 0x99001100 ??? (9)] - deleted
..40%....60%....80%....100%                       left 0, 47151 /sec
259485 directory entries were hashed with "r5" hash.
"r5" hash is selected
Flushing..finished
Read blocks (but not data blocks) 153289601
Leaves among those 196968
- leaves all contents of which could not be saved and deleted 3
Objectids found260611                                                         
Pass 1 (will try to insert 196965 leaves):
####### Pass 1 #######
Looking for allocable blocks .. finished
0%....20%....40%....60%....80%....100%                         left 0, 327 /sec
Flushing..finished
196965 leaves read
196860 inserted
105 not inserted
####### Pass 2 #######                                                                      
Pass 2:
0%....20%....40%....60%....80%....100%                         left 0, 210 /sec
Flushing..finished
Leaves inserted item by item 105
Pass 3 (semantic):
####### Pass 3 #########
Flushing..finished
Files found: 250129
Directories found: 8782
Symlinks found: 576
Pass 3a (looking for lost dir/files):
####### Pass 3a (lost+found pass) #########
Looking for lost directories:
Flushing..finished, 2283 /sec
Pass 4 - finishedone 135475, 6451 /sec
Flushing..finished
Syncing..finished
###########
reiserfsck finished at Sat Oct  1 13:34:50 2011
###########
------------------------------------------------------------------------------

Comment 4 Paramonov Valeriy 2011-10-02 09:10:29 UTC

I forgot to mention that the check with --rebuild-tree did not help.

Comment 5 Duane Griffin 2011-10-02 21:12:59 UTC

1) What version of the kernel are you using? That info should be in your attachment, but I can't read it: what character encoding is it using?

2) If you aren't using the latest vanilla kernel (3.0.4 at time of writing), please try that and confirm the problem still exists.

Anyway, you're right, next step is recompiling the kernel with debugging enabled. You'll want at least the following "Kernel Hacking" options enabled:

Magic SysRq key
Kernel debugging
Detect Hung Tasks
RT Mutex debugging, deadlock detection
Spinlock and rw-lock debugging: basic checks
Mutex debugging: basic checks
Lock debugging: detect incorrect freeing of live locks
Lock debugging: prove locking correctness
Compile the kernel with debug info

Make sure you're running the latest vanilla kernel with those options, reproduce the lockup *from the console, not in X*, and see if any useful messages are dumped.
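
Before rebooting into the new kernel, it may help to confirm that the options above actually ended up in the config. The following is a minimal sketch, not an official tool: the function name check_debug_opts is hypothetical, the default config path is an assumption, and the CONFIG_* symbols are my assumed mapping of the menu entries listed above, so please verify them against the help text in "make menuconfig" for your kernel version.

```shell
#!/bin/sh
# Report which of the suggested debugging options are NOT set to "y"
# in a kernel config file. The CONFIG_* names are an assumed mapping of
# the "Kernel Hacking" menu entries; check them for your kernel version.
# Usage: check_debug_opts [path-to-.config]
check_debug_opts() {
    config="${1:-/usr/src/linux/.config}"   # assumed default location
    missing=0
    for opt in MAGIC_SYSRQ DEBUG_KERNEL DETECT_HUNG_TASK DEBUG_RT_MUTEXES \
               DEBUG_SPINLOCK DEBUG_MUTEXES DEBUG_LOCK_ALLOC PROVE_LOCKING \
               DEBUG_INFO; do
        # Match the exact symbol followed by "=y" at the start of a line.
        if ! grep -q "^CONFIG_${opt}=y" "$config"; then
            echo "missing: CONFIG_${opt}"
            missing=1
        fi
    done
    # Exit status 0 if everything is enabled, 1 otherwise.
    return "$missing"
}
```

For example, check_debug_opts /usr/src/linux/.config prints one "missing:" line per option that is not enabled and returns a non-zero status if any are missing.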

Comment 6 Paramonov Valeriy 2011-10-03 02:01:14 UTC

Created attachment 288625 [details]
emerge --info

Comment 7 Paramonov Valeriy 2011-10-03 02:01:52 UTC

Created attachment 288627 [details]
/etc/fstab part

Comment 8 Paramonov Valeriy 2011-10-03 02:02:17 UTC

Created attachment 288629 [details]
lspci

Comment 9 Paramonov Valeriy 2011-10-03 02:02:59 UTC

Created attachment 288631 [details]
mdstat

Comment 10 Paramonov Valeriy 2011-10-03 02:03:54 UTC

Comment on attachment 288333 [details]
emerge --info, part of /etc/fstab, lspci, mdstat

Please see the following attachments instead; this one is wrong.

Comment 11 Paramonov Valeriy 2011-10-03 02:09:24 UTC

I had packed the first attachment as a tar.bz2. I have now uploaded the files properly. I use kernel 3.0.4 from gentoo-sources, the default version after a world update. I will build a kernel with debugging this evening and tell you the result.

Comment 12 Sushant Sinha 2011-10-03 16:50:40 UTC

I am also running gentoo-sources-3.0.4 and my kernel is also getting stuck when I copy a large sized directory from an ext3 partition to an ext4 partition on a different disk. It happens when I do "mv <src> <dst>" or when I do "rsync -a <src> <dst>". Happens everytime I do it.

$ uname -a
Linux freehit 3.0.4-gentoo-r2 #1 SMP Mon Oct 3 20:19:55 IST 2011 x86_64 AMD Phenom(tm) II X4 940 Processor AuthenticAMD GNU/Linux


I will attach my system config as well. But the behavior seems very similar to the one reported here. If you think this is a different one, I can open a separate bz.

Comment 13 Sushant Sinha 2011-10-03 16:58:45 UTC

Created attachment 288695 [details]
tgz file with dmesg, lscpi, emerge --info, fstab

When I mv or rsync a large directory from /dev/sda4 to /dev/sdb1, the kernel gets stuck.

Comment 14 Stratos Psomadakis (RETIRED) gentoo-dev 2011-10-03 17:51:40 UTC

Can you reproduce this with gentoo-sources-3.0.3? There are some reports at the LKML about block-related lockups with 3.0.4 (not resolved yet). 

If you can't reproduce this with 3.0.3, maybe you could try git bisect and find the commit that's causing this regression.
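
A rough sketch of how such a bisection could be driven from the shell is shown below. It is only an illustration under stated assumptions: the helper names start_bisect and mark_result are hypothetical, and the revisions you pass in (for example the v3.0.3 and v3.0.4 tags of a vanilla stable tree) are an assumption, since gentoo-sources carries extra patches on top of the vanilla kernel.

```shell
#!/bin/sh
# Start a bisection between a known-good and a known-bad revision.
# git checks out a revision roughly halfway between the two.
# Usage: start_bisect <repo-dir> <good-rev> <bad-rev>
start_bisect() {
    repo="$1"; good="$2"; bad="$3"
    git -C "$repo" bisect start "$bad" "$good"
}

# Record the outcome of testing the currently checked-out revision.
# git then checks out the next candidate, or names the first bad commit.
# Usage: mark_result <repo-dir> good|bad|skip
mark_result() {
    case "$2" in
        good|bad|skip) git -C "$1" bisect "$2" ;;
        *) echo "usage: mark_result <repo-dir> good|bad|skip" >&2; return 1 ;;
    esac
}
```

Between the two calls you would build and boot the checked-out kernel and try to reproduce the hang; "git bisect reset" returns the tree to where it started once the culprit is found.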

Thanks.

Comment 15 Paramonov Valeriy 2011-10-03 18:04:08 UTC

First, I updated world. I now have kernel-3.0.4-gentoo-r1.

Then I recompiled the kernel with debugging, and the problem disappeared. I recompiled the kernel again without debugging, and it still worked. Then I noticed that my build script passes additional options to the compiler via CFLAGS_KERNEL.

This is my optimize-build script:
---------------------------------------------------------------------------------
#!/bin/bash
cd ./linux
export CFLAGS_KERNEL="-march=amdfam10 -O3 -pipe"
genkernel --no-mrproper --menuconfig --clean --splash=livecd-2007.0 --symlink --splash-res=1024x768 --install --disklabel --mdadm all
---------------------------------------------------------------------------------

After changing CFLAGS_KERNEL to "-march=amdfam10 -O2" everything works fine.

It is worth noting that -O3 had been set for a long time and I had forgotten about it, because everything worked fine. The problems began after I moved to software RAID.

But maybe it works now because of the world update, which brought the kernel to version 3.0.4-gentoo-r1. Either way, thanks for your support. If necessary, I can roll back to 3.0.4 and try once more for a clean experiment.

Tomorrow I will test on a production server with the same configuration, in case I missed something.

THX.

Comment 16 Paramonov Valeriy 2011-10-03 18:08:34 UTC

(In reply to comment #14)
> Can you reproduce this with gentoo-sources-3.0.3? There are some reports at the
> LKML about block-related lockups with 3.0.4 (not resolved yet). 
> 
> If you can't reproduce this with 3.0.3, maybe you could try git bisect and find
> the commit that's causing this regression.
> 
> Thanks.

Please give me a link; I want to reproduce the situation.

Comment 17 Stratos Psomadakis (RETIRED) gentoo-dev 2011-10-03 21:07:31 UTC

Well, I was referring to your issue. Can you reproduce the lockup/hang with gentoo-sources-3.0.3?

Comment 18 Paramonov Valeriy 2011-10-04 00:54:03 UTC

Oops.. It hung during the night. A torrent client on a remote machine was saving to my RAID :(

Maybe it is 3.0.4?

Comment 19 Paramonov Valeriy 2011-10-04 02:17:20 UTC

It took me a while to understand what you meant. Yes, I can, in the evening after work (from 12 to 18 UTC).

Comment 20 Duane Griffin 2011-10-04 10:06:09 UTC

Paramonov and Sushant, can you confirm this is a kernel regression (i.e. it was working before then broke after a kernel upgrade), and if so what was the last kernel that worked?

Sushant, when you say "the kernel gets stuck" are you talking about a hard-lockup, where the system completely ceases responding?

If so then please ensure the debugging options I mentioned are enabled in your kernel config, reproduce the lockup from the console and let us know whether there are any interesting messages produced.

Although you and Paramonov are seeing the same symptoms you have very different setups with basically nothing in common above the block layer. At this point we can't be sure whether it is the same bug or not. Let's see how we go.

Comment 21 Paramonov Valeriy 2011-10-04 11:49:04 UTC

I am now testing 3.0.3 with debugging.

Comment 22 Paramonov Valeriy 2011-10-04 12:32:00 UTC

3.0.3 hangs too. How do I use the debug options? How do I use the magic SysRq key? /var/log/messages does not contain any additional information :(

Comment 23 Paramonov Valeriy 2011-10-04 12:32:48 UTC

Now I am compiling 3.0.1, then 3.0.0, and so on.

Comment 24 Paramonov Valeriy 2011-10-04 13:24:28 UTC

3.0.0 hangs too. Could the cause be a hardware problem? I have a Gigabyte 870A-UD3. The first processor was faulty: its memory controller did not work with 16 GB of RAM, and the store replaced it with a new one. Could the chipset be overheating? Can you tell me how to use the debugging options and the magic SysRq key?

I am now testing with and without the NFS export.

I also use tmpfs:
tmpfs /tmp tmpfs size=2g,defaults,auto 0 0
tmpfs /var/tmp tmpfs size=14g,defaults,auto 0 0

Comment 25 Paramonov Valeriy 2011-10-04 14:15:16 UTC

Without NFS it does not hang.

Comment 26 Paramonov Valeriy 2011-10-04 14:24:31 UTC

It worked 2 times, then hung 8(

Comment 27 Paramonov Valeriy 2011-10-04 15:11:45 UTC

So, I managed to get some debugging information (kernel-3.0.4-gentoo-r1):

/var/log/messages
------------------------------------------------------------------------------
Oct  3 17:58:07 localhost kernel: ata4.00: exception Emask 0x0 SAct 0x20 SErr 0x0 action 0x6 frozen
Oct  3 17:58:07 localhost kernel: ata4.00: failed command: READ FPDMA QUEUED
Oct  3 17:58:07 localhost kernel: ata4.00: cmd 60/08:28:37:38:a5/00:00:02:00:00/40 tag 5 ncq 4096 in
Oct  3 17:58:07 localhost kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct  3 17:58:07 localhost kernel: ata4.00: status: { DRDY }
Oct  3 17:58:07 localhost kernel: ata4: hard resetting link
Oct  3 17:58:07 localhost kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct  3 17:58:12 localhost kernel: ata4.00: qc timeout (cmd 0xec)
Oct  3 17:58:12 localhost kernel: ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct  3 17:58:12 localhost kernel: ata4.00: revalidation failed (errno=-5)
Oct  3 17:58:12 localhost kernel: ata4: hard resetting link
Oct  3 17:58:13 localhost kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct  3 17:58:23 localhost kernel: ata4.00: qc timeout (cmd 0xec)
Oct  3 17:58:23 localhost kernel: ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct  3 17:58:23 localhost kernel: ata4.00: revalidation failed (errno=-5)
Oct  3 17:58:23 localhost kernel: ata4: limiting SATA link speed to 3.0 Gbps
Oct  3 17:58:23 localhost kernel: ata4: hard resetting link
Oct  3 17:58:23 localhost kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Oct  3 17:58:39 localhost kernel: INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0 1 2} (detected by 3, t=60003 jiffies)
Oct  3 17:58:52 localhost kernel: INFO: rcu_bh_state detected stalls on CPUs/tasks: { 0 1 2} (detected by 3, t=60002 jiffies)
Oct  3 17:58:53 localhost kernel: ata4.00: qc timeout (cmd 0xec)
Oct  3 17:58:53 localhost kernel: ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct  3 17:58:53 localhost kernel: ata4.00: revalidation failed (errno=-5)
Oct  3 17:58:53 localhost kernel: ata4.00: disabled
Oct  3 17:58:53 localhost kernel: ata4.00: device reported invalid CHS sector 0
Oct  3 17:58:53 localhost kernel: ata4: hard resetting link
Oct  3 17:58:53 localhost kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Oct  3 17:58:53 localhost kernel: ata4: EH complete
Oct  3 17:58:53 localhost kernel: sd 3:0:0:0: [sdd] Unhandled error code
Oct  3 17:58:53 localhost kernel: sd 3:0:0:0: [sdd]  Result: hostbyte=0x04 driverbyte=0x00
Oct  3 17:58:53 localhost kernel: sd 3:0:0:0: [sdd] CDB: cdb[0]=0x28: 28 00 02 a5 38 37 00 00 08 00
Oct  3 17:58:53 localhost kernel: end_request: I/O error, dev sdd, sector 44382263
------------------------------------------------------------------------------

Comment 28 Paramonov Valeriy 2011-10-04 15:14:00 UTC

Created attachment 288777 [details]
part of /var/log/messages about disk errors

Comment 29 Paramonov Valeriy 2011-10-04 15:52:29 UTC

Could the cause be an aging SATA cable on /dev/sdd?

Comment 30 Duane Griffin 2011-10-04 21:27:23 UTC

I agree that it looks like a hardware issue at this point.

If you can reproduce this a few times it would be very useful to see if it is always the same drive that is failing. If so then either that cable or the drive itself is probably at fault. If more than one drive fails then the most likely hardware culprit is the power supply. Poor quality and/or underpowered power supplies can cause all sorts of strange issues.
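
One way to see whether the failures always hit the same drive is to tally the error lines per device across several runs. The sketch below is only an illustration: the function name io_errors_per_device is hypothetical, and it assumes log lines shaped like the ones in comment 27 ("ataN.NN: failed command" and "end_request: I/O error, dev sdX"); other kernels may word these messages differently.

```shell
#!/bin/sh
# Count libata "failed command" lines per ATA port and block-layer
# "I/O error" lines per disk in a syslog-style file.
# Usage: io_errors_per_device /var/log/messages
io_errors_per_device() {
    awk '
        /end_request: I\/O error, dev / {
            # The device name follows the word "dev", e.g. "dev sdd,"
            for (i = 1; i < NF; i++)
                if ($i == "dev") { d = $(i + 1); sub(/,$/, "", d); count[d]++ }
        }
        /ata[0-9]+\.[0-9]+: failed command/ {
            # The port is the field shaped like "ata4.00:"
            for (i = 1; i <= NF; i++)
                if ($i ~ /^ata[0-9]+\.[0-9]+:$/) { p = $i; sub(/:$/, "", p); count[p]++ }
        }
        END { for (k in count) print k, count[k] }
    ' "$1" | sort
}
```

If the same port and disk dominate the counts after several reproductions, the cable or drive is the more likely suspect; errors spread over several drives point elsewhere.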

Comment 31 Paramonov Valeriy 2011-10-05 08:31:47 UTC

I have a brand new computer, built without cheap components: Gigabyte 870-UD3 motherboard, FSP 700W power supply, 16 GB of Kingston memory, Phenom II X6 1090T CPU.

It is strange that SMART says the disks are healthy. A full surface scan with badblocks -nvs finds no errors. It looks more like an overheating hard drive than a problem with the cable. Tonight I will replace the cable, add a fan for the drive, and check again.

See you.

Comment 32 Paramonov Valeriy 2011-10-05 08:43:07 UTC

The array uses Western Digital drives. The root filesystem is on a solid-state drive.

Comment 33 Paramonov Valeriy 2011-10-05 13:30:34 UTC

I replaced the cable and tested again. Now I get a "BUG: spinlock lockup".

Nov 26 13:34:46 localhost kernel: BUG: spinlock lockup on CPU#2, mc/7609, ffff880419c37200
Oct  4 15:55:50 localhost kernel: BUG: spinlock lockup on CPU#3, flush-9:127/2391, ffff880419c37200

See the details in the attachment.

Comment 34 Paramonov Valeriy 2011-10-05 13:32:09 UTC

Created attachment 288861 [details]
part of /var/log/messages about BUG: spinlock lockup on CPU #N

Comment 35 Duane Griffin 2011-10-05 23:29:22 UTC

OK, so it looks like you have found a genuine kernel bug after all -- although it looks like it was already hanging before the spinlock lockup occurred, so maybe there is more than one thing going on. Anyway, next step is to report it upstream.

Normally we encourage people to report bugs via the kernel bugzilla, however that is still down following the recent kernel.org outage. Instead you should report it to the relevant mailing list(s) and developer(s) directly.

First thing is to figure out who to report it to. I'm not sure exactly where the bug is, so I'd suggest to start by sending it to LKML (linux-kernel@vger.kernel.org) and to CC the md and block layer folks (specifically: Neil Brown <neilb@suse.de>, linux-raid@vger.kernel.org, Jens Axboe <axboe@kernel.dk>). Also CC me (duaneg@dghda.com).

Use a concise descriptive subject line such as "BUG: spinlock lockup while performing FS operations". Give a brief description of your configuration (i.e. running reiserfs on RAID5) and that you've been seeing hangs during heavy FS operations. Also mention it is a new machine so you don't have a previously working configuration.

Include the *full* dmesg output from after a failure (inline within the email, not as an attachment). Also give a link to this bugzilla entry.

Please let me know if anything is unclear.

Comment 36 Paramonov Valeriy 2011-10-06 12:20:37 UTC

Ok. Thanks.

Comment 37 Paramonov Valeriy 2011-10-07 02:35:34 UTC

Created attachment 289037 [details]
kernel configuration file .config even without CONFIG_LOCKDEP

Comment 38 Paramonov Valeriy 2011-10-07 02:42:02 UTC

Since bugzilla.kernel.org is still down, I am answering several emails here.

I have not changed anything in /var/log/messages. The clock really is jumping around. The lines are in their original order. I was surprised myself.
I truncated the file with the command echo "0" > /var/log/messages before copying.

Alexander Beregalov <a.beregalov@gmail.com> says that I need to compile the kernel with CONFIG_LOCKDEP.

Tonight (from 12 to 18 UTC) I will do this and post a complete log with CONFIG_LOCKDEP switched on.

I have a new computer. It's my first RAID array.

For now I am attaching the old config, without CONFIG_LOCKDEP.

See you.

Comment 39 Paramonov Valeriy 2011-10-07 12:34:51 UTC

Created attachment 289067 [details]
"full" /var/log/messages

Hi!

I need help. How can I get the kernel to write this additional information to /var/log/messages? It is displayed on the screen, but after the restart there is nothing in the logs.

I use the Alt + SysRq + <key> keyboard shortcuts, but the output is only printed on the screen.

This is the output I get before everything hangs:

..
Oct  6 08:01:07 localhost kernel: SysRq : Changing Loglevel
Oct  6 08:01:07 localhost kernel: Loglevel set to 9
Oct  6 08:03:16 localhost kernel: INFO: rcu_bh_state detected stalls on CPUs/tasks: { 0 1 2 4} (detected by 5, t=60002 jiffies)
Oct  6 08:03:32 localhost kernel: INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0 1 2 4} (detected by 5, t=60002 jiffies)
...
Then a new syslog begins.

Comment 40 Duane Griffin 2011-10-07 20:28:45 UTC

Just quickly: you should reply to Neil and Dan via email. Kernel developers often dislike reading bugzilla entries ;-) Do a reply-to-all on one of their messages (make sure to add Dan to the CC list if replying to Neil's).

Next time please try to keep full logs, don't truncate them. BTW, for kernel logs I recommend using the "dmesg" command instead of the system log.

Lastly, regarding how to capture the logs: if the SysRq output isn't being written to disk then the short answer is netconsole. However try doing a "sync" -> "emergency unmount" -> "sync" -> "reboot" SysRq sequence first. You never know.
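
If the keyboard is inconvenient, the same SysRq sequence can also be sent from a root shell by writing key letters to /proc/sysrq-trigger. The following is a hedged sketch rather than a definitive recipe: the function names sysrq_key and sysrq_send and the TRIGGER and SYSRQ_DELAY variables are hypothetical, introduced only so the sequence can be dry-run against a scratch file. Note that the "u" key is what the kernel calls an emergency remount read-only, which is what "emergency unmount" refers to here.

```shell
#!/bin/sh
# Map an action name to its SysRq key letter.
sysrq_key() {
    case "$1" in
        sync)    echo s ;;   # emergency sync
        remount) echo u ;;   # emergency remount read-only
        reboot)  echo b ;;   # immediate reboot, no further syncing
        tasks)   echo t ;;   # dump task states to the console
        blocked) echo w ;;   # dump blocked (uninterruptible) tasks
        *)       return 1 ;;
    esac
}

# Send a sequence of actions, pausing between them.
# TRIGGER and SYSRQ_DELAY are hypothetical overrides for dry runs;
# on a real system this needs root and SysRq enabled.
# Usage: sysrq_send sync remount sync reboot
sysrq_send() {
    trigger="${TRIGGER:-/proc/sysrq-trigger}"
    for action in "$@"; do
        key=$(sysrq_key "$action") || {
            echo "unknown action: $action" >&2
            return 1
        }
        printf '%s\n' "$key" > "$trigger"
        sleep "${SYSRQ_DELAY:-1}"
    done
}
```

Pointing TRIGGER at a scratch file first is a cheap way to confirm the sequence before letting it reboot a real machine.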

Comment 41 Paramonov Valeriy 2011-10-08 06:09:13 UTC

But maybe you can help too. I sent the developers the following message and am duplicating it here:

"Hi.

Next, I wanted to make a backup. I disconnected one drive from the RAID because I did not have a free power connector; the RAID continued to work fine. Then I connected another drive, which was detected as /dev/sdd. I formatted it with XFS, mounted it, and tried to back up my array. I received this output in /var/log/messages:

---
Oct  6 08:03:16 localhost kernel: INFO: rcu_bh_state detected stalls on CPUs/tasks: { 0 1 2 4} (detected by 5, t=60002 jiffies)
Oct  6 08:03:32 localhost kernel: INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0 1 2 4} (detected by 5, t=60002 jiffies)
---

Everything was stuck on this console, but the other consoles (Alt + Fx) still worked. I could enter my login, but not the password. The magic SysRq keys kept working for some time, but nothing more was written to /var/log/messages. Duane Griffin (bugs.gentoo.org) says that I need to try "sync"->"emergency unmount"->"sync"->"reboot". But that is a separate matter.

Next, I decided to take a raw dump directly with


# dd if=/dev/md127 of=/dev/sdd


and so copy both partitions at once. Again, everything hung after a short time (about 1-2 minutes).

From this I concluded that the problem is not in the file system, and not even in the hardware. Here is why:

I then do a reset, but often the computer does not restart and I have to press and hold the power button to shut it down, then power it on again. That is strange, but let us move on.

I reconnected the third disk, but the RAID did not take it back. So I ran:


# mdadm --zero-superblock /dev/sdd1
# mdadm --manage /dev/md0 --add /dev/sdd1


All is OK. ATTENTION! The array starts synchronizing, and the synchronization completes without any problems.

# cat /proc/mdstat
---
Personalities : [raid6] [raid5] [raid4] [multipath] [faulty]
md127 : active raid5 sdd1[3] sdb1[0] sdc1[1]
      1465146368 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
      [===================>.]  recovery = 99.5% (729613632/732573184) finish=0.9min speed=51623K/sec

unused devices: <none>
---

# cat /proc/mdstat
---
Personalities : [raid6] [raid5] [raid4] [multipath] [faulty]
md127 : active raid5 sdd1[3] sdb1[0] sdc1[1]
      1465146368 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>
---
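
For reference, the difference between the two outputs above ([3/2] [UU_] versus [3/3] [UUU]) can be checked mechanically. This is a minimal sketch under the assumption that the status line ends with the [U_] map as it does in the output shown; the function name md_degraded is hypothetical.

```shell
#!/bin/sh
# Print every md array whose member map contains "_" (a missing or
# not-yet-rebuilt member), e.g. "md127 [UU_]". Prints nothing when all
# arrays are fully populated.
# Usage: md_degraded [file]   (defaults to /proc/mdstat)
md_degraded() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the array name
        /\[[U_]+\]/ && name != "" {
            status = $NF                       # last field, e.g. [UU_]
            if (status ~ /_/) print name, status
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

While the array is still recovering this reports the array as incomplete; once the rebuild finishes and the map reads [UUU] it prints nothing.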

Second, SMART reports that the array disks are healthy. It is very strange! So I concluded that the problem is not in the hardware. I would like to hear your opinion.

I still have a few ideas:

1. Disconnect each of the remaining disks in the array in turn and try to sync again, to rule out a problem with an individual drive.
2. Try copying between disks outside the array. But that is apparently the same case as the dd command.
3. I have an old IDE disk that is mounted with the following lines:

# IDE disk 160Gb
/dev/sde1 /var reiserfs defaults,auto,noatime,nodiratime,notail    0 0
/dev/sde2 /usr/portage reiserfs defaults,auto,noatime,nodiratime,notail    0 0
/dev/sde3 /usr/src reiserfs defaults,auto,noatime,nodiratime,notail    0 0
/dev/sde4 none swap sw 0 0

This is because my root partition is on a solid-state drive, /dev/sda.

This IDE drive has non-critical SMART errors, listed at the end of this message in the output of smartctl --all /dev/sde. It is unclear how this could affect the dd command.
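
To pick the relevant rows out of a long smartctl report, something like the sketch below could be used. It is an illustration only: the function name smart_nonzero_attrs is hypothetical, the choice of attribute IDs (5, 197, 198, 199) is my assumption about which counters are commonly watched, and it assumes the ten-column attribute table layout that smartctl prints.

```shell
#!/bin/sh
# From saved smartctl output, print the name and raw value of selected
# attributes whose raw counter is above zero. The ID list is an
# assumption: reallocated, pending and uncorrectable sectors, and
# UDMA CRC errors.
# Usage: smartctl -A /dev/sdX > smart.txt; smart_nonzero_attrs smart.txt
smart_nonzero_attrs() {
    awk '
        # Attribute rows have at least 10 columns and a hex FLAG in column 3.
        NF >= 10 && $3 ~ /^0x/ && $1 ~ /^(5|197|198|199)$/ && $10 + 0 > 0 {
            print $2, $10
        }
    ' "$1"
}
```

A non-zero UDMA_CRC_Error_Count is usually associated with errors on the link between drive and controller rather than with the platters, so it is worth tracking whether the value keeps growing.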


I will do this next time, and try sync and emergency unmount to get the information saved in the log. If it is not saved, I will have to copy the screen by hand or photograph it. Then I will post the logs and screenshots.

Sorry for my bad English; Google Translate is helping me.
I want to help and I need your help. Thanks.

# smartctl --all /dev/sde
--smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0.4-gentoo-r1] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus
Device Model:     ST3160023A
Serial Number:    4JS0JGZ4
Firmware Version: 8.01
User Capacity:    160 040 803 840 bytes [160 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Sat Oct  8 12:42:29 2011 NOVT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  430) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 111) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   054   048   006    Pre-fail  Always       -       120037243
  3 Spin_Up_Time            0x0003   097   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       106
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   086   060   030    Pre-fail  Always       -       410368363
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       27769
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   098   098   020    Old_age   Always       -       2760
194 Temperature_Celsius     0x0022   048   061   000    Old_age   Always       -       48
195 Hardware_ECC_Recovered  0x001a   054   047   000    Old_age   Always       -       120037243
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   192   000    Old_age   Always       -       95
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 6 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 6 occurred at disk power-on lifetime: 612 hours (25 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 f6 5f 39 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00395ff6 = 3760118

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 80 77 5f 39 e0 00      00:57:36.606  READ DMA EXT
  25 00 80 77 5f 39 e0 00      00:57:36.596  READ DMA EXT
  25 00 80 f7 5e 39 e0 00      00:57:36.588  READ DMA EXT
  25 00 80 77 5e 39 e0 00      00:57:36.573  READ DMA EXT
  25 00 58 3f 77 39 e0 00      00:57:36.572  READ DMA EXT

Error 5 occurred at disk power-on lifetime: 612 hours (25 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 f6 5f 39 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00395ff6 = 3760118

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 80 77 5f 39 e0 00      00:57:36.606  READ DMA EXT
  25 00 80 f7 5e 39 e0 00      00:57:36.596  READ DMA EXT
  25 00 80 77 5e 39 e0 00      00:57:36.588  READ DMA EXT
  25 00 58 3f 77 39 e0 00      00:57:36.573  READ DMA EXT
  25 00 80 f7 5d 39 e0 00      00:57:36.572  READ DMA EXT

Error 4 occurred at disk power-on lifetime: 612 hours (25 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 76 5e 39 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00395e76 = 3759734

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 80 f7 5d 39 e0 00      00:57:34.469  READ DMA EXT
  25 00 80 f7 5d 39 e0 00      00:57:34.454  READ DMA EXT
  25 00 80 77 5d 39 e0 00      00:57:34.445  READ DMA EXT
  25 00 80 f7 5c 39 e0 00      00:57:34.444  READ DMA EXT
  25 00 80 f7 5c 39 e0 00      00:57:34.440  READ DMA EXT

Error 3 occurred at disk power-on lifetime: 612 hours (25 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 76 5e 39 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00395e76 = 3759734

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 80 f7 5d 39 e0 00      00:57:34.469  READ DMA EXT
  25 00 80 77 5d 39 e0 00      00:57:34.454  READ DMA EXT
  25 00 80 f7 5c 39 e0 00      00:57:34.445  READ DMA EXT
  25 00 80 f7 5c 39 e0 00      00:57:34.444  READ DMA EXT
  25 00 80 bf 76 39 e0 00      00:57:34.440  READ DMA EXT

Error 2 occurred at disk power-on lifetime: 612 hours (25 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 76 5d 39 e0  Error: ICRC, ABRT 1 sectors at LBA = 0x00395d76 = 3759478

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 80 f7 5c 39 e0 00      00:57:34.469  READ DMA EXT
  25 00 80 bf 76 39 e0 00      00:57:34.454  READ DMA EXT
  25 00 80 77 5c 39 e0 00      00:57:34.445  READ DMA EXT
  25 00 80 5f c1 38 e0 00      00:57:34.444  READ DMA EXT
  25 00 28 4f 5b 39 e0 00      00:57:34.440  READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     27642      -
# 2  Short offline       Completed without error       00%     27345      -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
--
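
For anyone decoding the error entries above by hand, here is a minimal Python sketch (not part of the original report). It assumes the classic ATA error-register bit layout and the 28-bit LBA packing that matches the numbers smartctl printed; the helper names are made up for illustration.

```python
# Sketch: decode the ER / SN / CL / CH / DH bytes that smartctl prints
# for one entry of the SMART error log. Helper names are illustrative.

# Assumed bit meanings of the ATA error register (ER).
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error during an Ultra DMA transfer
    0x40: "UNC",   # uncorrectable data error
    0x10: "IDNF",  # requested sector ID not found
    0x04: "ABRT",  # command aborted
}


def decode_error(er):
    """Return the names of the error bits set in the ER byte."""
    return [name for bit, name in ERROR_BITS.items() if er & bit]


def lba28(sn, cl, ch, dh):
    """Rebuild the 28-bit LBA from the task-file registers.

    SN holds bits 0-7, CL bits 8-15, CH bits 16-23 and the low
    nibble of DH holds bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def power_on(hours):
    """Split a power-on lifetime in hours into (days, hours)."""
    return divmod(hours, 24)
```

For Error 6, decode_error(0x84) gives ['ICRC', 'ABRT'] and lba28(0xF6, 0x5F, 0x39, 0xE0) gives 3760118, which matches the log; power_on(612) gives (25, 12). ICRC is an interface CRC error, so it tends to point at the link (cable, connectors, controller) rather than at the platters, which fits the non-zero UDMA_CRC_Error_Count above.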

Comment 42 Sushant Sinha 2011-10-08 13:00:17 UTC

Created attachment 289217 [details]
/var/log/messages with debugging options set when the kernel gets stuck

Happens simply when I do 
du -sh <large-directory>
where the large directory is on an ext3 filesystem

Comment 43 Sushant Sinha 2011-10-08 13:04:28 UTC

"du -sh" also fails on an older kernel, linux-2.6.38-gentoo-r7. I opened the box and checked the SATA cables. They were attached properly. It looks like this is a different bug than Paramonov's. Should I file a separate bug?

Comment 44 Paramonov Valeriy 2011-10-08 13:33:06 UTC

(In reply to comment #43)
> "du -sh" also fails on an older kernel, linux-2.6.38-gentoo-r7. I opened the
> box and checked the SATA cables. They were attached properly. It looks like
> this is a different bug than Paramonov's. Should I file a separate bug?

I think this is the same bug. I also get "detected stalls on CPUs", so I have renamed the bug.

Comment 45 Paramonov Valeriy 2011-10-08 15:14:29 UTC

I have noticed that copying a large directory from the large partition (/dev/md127p2, 1.4 TB) to the small one (/dev/md127p1, 50 GB) succeeds. Copying from small to small works too. But if I copy to the large partition, it hangs. Yesterday I wanted to make a backup and created a 1 TB XFS partition; that hung as well. So it happens regardless of the file system.

This is the key!

Sushant Sinha, you should also write to the kernel developers to confirm the bug; otherwise they will treat it as one person's hardware issue, which would be a fair assumption.

I failed to get the output into /var/log/messages because it does not get written to disk before the lockup. You can help by sending a report to the developers. Because bugzilla.kernel.org is still down, I wrote to:

linux-kernel@vger.kernel.org
neilb@suse.de
linux-raid@vger.kernel.org
axboe@kernel.dk
duaneg@dghda.com

Do not forget to mention "detected stalls on CPUs" in the subject; I used the same subject. Give them a link to this thread.

Thanks.

Comment 46 Paramonov Valeriy 2011-10-08 15:30:12 UTC

Another thought: we use the same chipset driver. You have a GA-MA790GP motherboard and I have a GA-870-UD3; both are Gigabyte boards on AMD.

Comment 47 Sushant Sinha 2011-10-08 15:42:44 UTC

Paramonov,

I sent a mail to the kernel devs with the /var/log/messages. You are right that we have the same motherboard and the same CPU. Also, I was trying to copy roughly 150GB from the smaller partition (200GB) to the larger one (ext4, 2TB).

Comment 48 Paramonov Valeriy 2011-10-08 16:46:22 UTC

(In reply to comment #47)
> Paramonov,
> 
> Sent a mail to the kernel devs with the /var/log/messages. You are right that
> we have the same motherboard and the same CPU. Also I was trying to copy
> roughly 150GB from the smaller partition (200GB) to the larger one (ext4, 2TB).

Thanks. I sent your /var/log/messages too :) How did you get the output into /var/log/messages? I only get output on the screen, and SysRq sync does not help me. Maybe I did not enable all the debug options in the kernel configuration.

Comment 49 Paramonov Valeriy 2011-10-08 16:50:28 UTC

I am copying a folder with a mix of files (ttf, mp3, avi of about 1.5 GB). The folder size is 21 GB.

Comment 50 Sushant Sinha 2011-10-08 16:56:37 UTC

(In reply to comment #48)

> Thanks. I sent your /var/log/messages too :) How did you get the output into
> /var/log/messages? I only get output on the screen, and SysRq sync does not
> help me. Maybe I did not enable all the debug options in the kernel
> configuration.

I set loglevel=9 on the kernel line at bootup:

kernel /boot/kernel-3.0.4-gentoo-debug root=/dev/sda3 loglevel=9

I tried it with my old kernel 2.6.33 and it hangs there as well. So it is either:
1. buggy hardware
2. a hardware-specific bug
3. an old hang bug in the Linux kernel

I doubt that option (3) is a possibility. I think option (2) is more likely. I am going to check whether a firmware upgrade is available for my h/w.

Comment 51 Paramonov Valeriy 2011-10-08 17:03:19 UTC

I think the problem is in the driver for the AMD chipset (SATA part). "AMD Big FS Write BUG!"

Comment 52 Sushant Sinha 2011-10-08 17:12:48 UTC

(In reply to comment #51)
> I think the problem is in the driver for the AMD chipset (SATA part). "AMD Big
> FS Write BUG!"

That is a possibility. But it seems strange that such a bug has existed for more than 2 years and has not been fixed yet.

Comment 53 Paramonov Valeriy 2011-10-08 17:34:52 UTC

Maybe it is an amd+swraid-specific bug. Let's see what the developers say.

Comment 54 Duane Griffin 2011-10-08 23:34:30 UTC

OK, so first thing: Sushant, it doesn't look like you are using md, right? So, assuming this is the same bug (and let's remember that hasn't been fully established yet), we can scratch that possibility.

There are reports of problems with the ATI SB700/SB800 SATA controller, but they were from long ago and seem to have been fixed. The ATA maintainer is Jeff Garzik, I'll reply to your latest email and CC him and the ATA list.

Paramonov, to get a full log from your machine the easiest option is probably to setup netconsole. Do you have another linux machine available you can use for this?

Comment 55 Paramonov Valeriy 2011-10-09 04:05:56 UTC

(In reply to comment #54)
> OK, so first thing: Sushant, it doesn't look like you are using md, right? So,
> assuming this is the same bug (and let's remember that hasn't been fully
> established yet), we can scratch that possibility.
> 
> There are reports of problems with the ATI SB700/SB800 SATA controller, but
> they were from long ago and seem to have been fixed. The ATA maintainer is Jeff
> Garzik, I'll reply to your latest email and CC him and the ATA list.
> 

Yes, I was mistaken when I wrote the post above. The correct formula is amd_sata + big_fs = bug.

> Paramonov, to get a full log from your machine the easiest option is probably
> to setup netconsole. Do you have another linux machine available you can use
> for this?

Yes, I have one more computer. I am reading up on how to use netconsole now and will post the logs soon.

Thank you.

Comment 56 Sushant Sinha 2011-10-09 04:40:01 UTC

I am not using any md. I enabled S.M.A.R.T. for the hard disk in the BIOS and used smartctl to look for warnings. It shows everything is fine.

I also ran "fsck -f" on the partition to see if there is any fs inconsistency. But that shows everything is fine. fsck on the partition runs fine and does not stall the kernel.

Comment 57 Paramonov Valeriy 2011-10-09 05:25:55 UTC

Created attachment 289295 [details]
The console output through netcat #nc -u -l -p 6969

OK. I set up a network console and captured the output with:

# nc -u -l -p 6969
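
If netcat is not available on the receiving machine, a rough Python equivalent of the listener above might look like the sketch below. It is only an illustration: the function names are made up, UDP port 6969 is taken from the nc command, and the timestamp prefix is an addition that nc does not provide.

```python
import socket
import time


def open_listener(port=6969, host="0.0.0.0"):
    """Bind a UDP socket, like `nc -u -l -p 6969` does."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def capture(sock, max_datagrams, timeout=5.0):
    """Receive up to max_datagrams messages and return them as text lines.

    Each line is prefixed with the local receive time. The function
    stops early if nothing arrives within `timeout` seconds.
    """
    sock.settimeout(timeout)
    lines = []
    for _ in range(max_datagrams):
        try:
            data, _addr = sock.recvfrom(65535)
        except socket.timeout:
            break
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        text = data.decode("utf-8", errors="replace").rstrip()
        lines.append(f"{stamp} {text}")
    return lines
```

To use it, call open_listener() on the receiving machine, then call capture() in a loop and append the returned lines to a file, so that the log survives on the second machine when the first one locks up.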

Comment 58 Paramonov Valeriy 2011-10-09 05:34:51 UTC

(In reply to comment #54)
> OK, so first thing: Sushant, it doesn't look like you are using md, right? So,
> assuming this is the same bug (and let's remember that hasn't been fully
> established yet), we can scratch that possibility.
> 
> There are reports of problems with the ATI SB700/SB800 SATA controller, but
> they were from long ago and seem to have been fixed. The ATA maintainer is Jeff
> Garzik, I'll reply to your latest email and CC him and the ATA list.
> 
> Paramonov, to get a full log from your machine the easiest option is probably
> to setup netconsole. Do you have another linux machine available you can use
> for this?

Could you send my log too? I do not know Jeff's email address.

Thx.

Comment 59 Duane Griffin 2011-10-09 20:37:13 UTC

Good work getting netconsole set up!

If you can, next time it would be good if you could capture the full kernel log from bootup to hang, however hold off sending more info for now. We don't want to spam people with too much detail. Let's wait and see what they say.

Comment 60 Paramonov Valeriy 2011-10-10 01:10:47 UTC

OK, I will wait. It is just that my work is at a standstill.

What are the statistics? How long does it usually take to fix a critical bug? Why is the status UNCONFIRMED?

It is just that this is the first time I have been unable to use the computer for so long. I need to work :)

Comment 61 Duane Griffin 2011-10-11 00:14:43 UTC

I've got no idea how long bugs stay unresolved in general. I doubt the severity makes any difference at all. If you play around with the "reports" link above you might be able to find out. Regardless, I don't think it will be useful for estimating how long this specific bug will take to fix.

I know it is frustrating, but please keep in mind that Gentoo is a volunteer-run distro. And while the kernel maintainers are generally paid to do their jobs, you don't have a support contract with them or the companies they work for, so it works similarly. Some linux people are very good about responding to users, some are not so good. Even the best will not react immediately to every report they get.

In the meantime there is something else you can do: test with some earlier kernels. Don't bother trying every minor release, just try each latest stable release, e.g. start by trying 2.6.39.4, 2.6.38.8 and then 2.6.37.6. If you find it was working in an earlier kernel then it is a regression and it makes it much easier to find the bug (and also provides much more incentive for the maintainers to look at it). Report the results here and I'll advise you on whether and how to send them upstream.
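
As an illustration of that selection rule only, here is a small Python sketch that picks the newest point release of each kernel series from a list of version strings. The function names are made up, and the grouping rule (first three numbers for 2.6.x, first two for 3.x) is an assumption about how the series are numbered.

```python
def parse(version):
    """Turn '2.6.39.4' into (2, 6, 39, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def series(version):
    """Return the series a release belongs to.

    Assumption: 2.6.x kernels are grouped by their first three numbers
    (2.6.39), later kernels by their first two (3.0).
    """
    parts = parse(version)
    return parts[:3] if parts[:2] == (2, 6) else parts[:2]


def latest_per_series(versions):
    """Keep only the newest release of each series, newest series first."""
    best = {}
    for version in versions:
        key = series(version)
        if key not in best or parse(version) > parse(best[key]):
            best[key] = version
    return [best[key] for key in sorted(best, reverse=True)]
```

Fed with the versions mentioned in this bug (3.0.4, 3.0.6, 2.6.39.4, 2.6.38.8, 2.6.37.6), it returns 3.0.6, 2.6.39.4, 2.6.38.8, 2.6.37.6, which is the order to test in when walking backwards.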

Finally, I'm not sure what the policy on updating the status field is. I wouldn't worry about it though; I don't think it will affect the quality of service you get ;-)

Comment 62 Sushant Sinha 2011-10-11 12:35:48 UTC

It hangs with kernel 2.6.33 for me.

Comment 63 Paramonov Valeriy 2011-10-11 15:04:05 UTC

Created attachment 289571 [details]
Current working config

Last night I ran a world update. In the morning I updated the kernel to version 3.0.6. It works! But I also changed the kernel options for the SATA/PATA drivers. Maybe I changed something else too, but I do not remember exactly. I am attaching the current working config just in case.

If anything changes, I will report it.

Thank you all.

Comment 64 Paramonov Valeriy 2011-10-12 03:40:53 UTC

So, as I said, copying now works fine in both directions (tried 3 times). I copied the same 21 GB folder, in parallel with torrents downloading and saving over NFS (3 downloading and 5 seeding).

During the night I removed the limit on seeding and set the number of simultaneous downloads to 5, and it got stuck again, with only the torrents running. Tonight I will try another location with 100-200 GB of data.

Comment 65 Duane Griffin 2011-10-12 14:08:05 UTC

(In reply to comment #64)
> So, as I said, copying now works fine in both directions (tried 3 times). I
> copied the same 21 GB folder, in parallel with torrents downloading and
> saving over NFS (3 downloading and 5 seeding).

That is great! It would be really helpful if you could identify the config option (if that is what it was) that made the change. Perhaps you still have a .config.old file with the old settings lying around?

To confirm it was a config option, not the upgrade, you could try recompiling 3.0.4 with the same config you are using now.
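
One way to find the changed options is to compare the two files mechanically. The Python sketch below is an illustration under simple assumptions: it only understands the two usual .config line forms ("CONFIG_X=value" and "# CONFIG_X is not set"), it treats an option that is missing from a file as "n", and the function names are made up.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map option names to values; disabled options get the value 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(old_text, new_text):
    """Return {option: (old_value, new_value)} for every option that differs.

    Simplification: an option absent from one file counts as 'n' there.
    """
    old, new = parse_config(old_text), parse_config(new_text)
    changes = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name, "n"), new.get(name, "n")
        if before != after:
            changes[name] = (before, after)
    return changes
```

Reading .config.old and .config into two strings and passing them to diff_configs() lists only the options that differ, which keeps the set of candidates to retest on 3.0.4 small.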

Comment 66 Sushant Sinha 2011-10-12 17:20:10 UTC

It still hangs for me with kernel 3.0.6.

Comment 67 Paramonov Valeriy 2011-10-13 02:47:59 UTC

Last night I left the torrents running: 3 leeching and 5 seeding. It did not hang. But I think the root of the problem is not solved. It seems that it hangs when there are multiple threads. I have not had time to test yet, so I will check with a large folder today. I will also check the current configuration on the old kernel.

Comment 68 Paramonov Valeriy 2011-10-16 14:17:21 UTC

Still hangs :(

Comment 69 Sushant Sinha 2011-10-16 18:20:14 UTC

I did not see any updates from the devs. Has anyone? I guess this problem may be affecting only a very small group of people.

Comment 70 Paramonov Valeriy 2011-10-17 06:28:06 UTC

I think we need to write to the maintainers of the SATA drivers, the AMD chipset and the JMicron controllers. The mailing list we wrote to first was the wrong one.

Comment 71 Mike Pagano gentoo-dev 2012-03-04 21:04:27 UTC

Please take this upstream at https://bugzilla.kernel.org/ and post the url back here.

Comment 72 Paramonov Valeriy 2012-03-07 14:32:07 UTC

One thing is for sure! The error occurs only when data from a torrent client on another computer is stored or read through the NFS server. In other words, the server is hung by the other machine.