Bug 439502 - Progressive ext4 data corruption regression introduced in Linux 3.4
Summary: Progressive ext4 data corruption regression introduced in Linux 3.4
Status: RESOLVED UPSTREAM
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: All Linux
Importance: Normal critical
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL: https://lkml.org/lkml/2012/10/28/309
Whiteboard: [linux >=3.4 <3.4.18] [linux >=3.5 <=...
Keywords: REGRESSION
Depends on:
Blocks:
 
Reported: 2012-10-24 12:00 UTC by Richard Yao
Modified: 2013-10-06 15:30 UTC
CC List: 34 users

See Also:
Package list:
Runtime testing required: ---


Attachments
Upstream patch (jbd_patch,987 bytes, patch)
2012-10-25 03:30 UTC, Dmitry Suloev
linux-2.6.git-ffb5387.patch (linux-2.6.git-ffb5387.patch,3.03 KB, patch)
2012-11-02 12:41 UTC, Kerin Millar

Description Richard Yao gentoo-dev 2012-10-24 12:00:47 UTC
A serious data corruption issue was introduced in the "post-3.6.1 ext4 patches" that were backported to 3.5. As far as I can tell, the following kernels are affected:

=sys-kernel/gentoo-sources-3.5.7
=sys-kernel/gentoo-sources-3.6.2
=sys-kernel/hardened-sources-3.6.2
=sys-kernel/vanilla-sources-3.5.7
=sys-kernel/vanilla-sources-3.6.2
=sys-kernel/vanilla-sources-3.6.3

I am masking all of them to try to keep our users from upgrading to affected versions.
Comment 1 Richard Yao gentoo-dev 2012-10-24 12:33:24 UTC
It looks like Linux 3.4.14 and 3.4.15 are also affected. I have masked sys-kernel/vanilla-sources-3.4.14 as well. That is the only 3.4.x ebuild affected by this that we have in the tree.

It does not appear that the regression was backported to branches older than 3.4.x.
Comment 2 Richard Yao gentoo-dev 2012-10-24 12:49:10 UTC
I checked the lesser-used kernel source packages in portage. The following are also affected:

=sys-kernel/mips-sources-3.5.7
=sys-kernel/mips-sources-3.6.2
=sys-kernel/pf-sources-3.6.3
=sys-kernel/pf-sources-3.6.4

I have extended the mask to include them. All affected kernel source packages in the tree are now masked.
Comment 3 Markos Chandras (RETIRED) gentoo-dev 2012-10-24 13:05:05 UTC
Please DO NOT mask unsupported kernels. pf-sources are not supported by Gentoo so please lift the mask ASAP. Did you get permission from the kernel team to mask all these packages?
Comment 4 Richard Yao gentoo-dev 2012-10-24 14:08:16 UTC
(In reply to comment #3)
> Please DO NOT mask unsupported kernels. pf-sources are not supported by
> Gentoo so please lift the mask ASAP. Did you get permission from the kernel
> team to mask all these packages?

None of the kernel team members were available in IRC. I have talked to mpagano in the past about helping. However, the reality of being a student has restricted my ability to do much. In this case, we have a critical issue that merited immediate action. I decided to mask affected kernel source packages after discussing it with others in #gentoo-dev on freenode when the kernel team could not be reached.

I was under the impression that the security team did not support various kernel source packages, but their maintainers did. If the package maintainers are unwilling to support them, then they should either be given to those that are or be masked for removal.

With that said, you are one of the sys-kernel/pf-sources maintainers. I cannot dictate to you that affected sys-kernel/pf-sources versions must be masked on the basis of this bug. I have lifted the mask on sys-kernel/pf-sources, but I will not lift the mask on packages that you do not maintain until I am in contact with their maintainers. In the future, I will leave sys-kernel/pf-sources alone should a similar situation occur.
Comment 5 Markos Chandras (RETIRED) gentoo-dev 2012-10-24 14:16:54 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > Please DO NOT mask unsupported kernels. pf-sources are not supported by
> > Gentoo so please lift the mask ASAP. Did you get permission from the kernel
> > team to mask all these packages?
> 
> None of the kernel team members were available in IRC. I have talked to
> mpagano in the past about helping. However, the reality of being a student
> has restricted my ability to do much. In this case, we have a critical issue
> that merited immediate action. I decided to mask affected kernel source
> packages after discussing it with others in #gentoo-dev on freenode when the
> kernel team could not be reached.
> 
> I was under the impression that the security team did not support various
> kernel source packages, but their maintainers did. If the package
> maintainers are unwilling to support them, then they should either be given
> to those that are or be masked for removal.
> 
> With that said, you are one of the sys-kernel/pf-sources maintainers. I
> cannot dictate to you that affected sys-kernel/pf-sources versions must be
> masked on the basis of this bug. I have lifted the mask on
> sys-kernel/pf-sources, but I will not lift the mask on packages that you do
> not maintain until I am in contact with their maintainers. In the future, I
> will leave sys-kernel/pf-sources alone should a similar situation occur.

I am just saying that the pf-sources is not marked stable for a reason. We commit what the upstream maintainer releases and there is a big fat warning that this kernel patchset is unsupported in the sense that we don't apply custom or security patches on top of it unless really necessary. I'd rather wait for an upstream patch instead of forcing everyone to downgrade their kernel. Thanks for unmasking it.
Comment 6 Richard Yao gentoo-dev 2012-10-24 14:26:40 UTC
(In reply to comment #5)
> I am just saying that the pf-sources is not marked stable for a reason. We
> commit what the upstream maintainer releases and there is a big fat warning that
> this kernel patchset is unsupported in the sense that we don't apply custom
> or security patches on top of it unless really necessary. I'd rather wait
> for an upstream patch instead of forcing everyone to downgrade their kernel.
> Thanks for unmasking it.

The LKML post by Ted Ts'o that I referenced in the URL field has a patch for this. I expect gregkh to tag a new stable kernel after it has been reviewed by other upstream developers.
Comment 7 Small_Penguin 2012-10-24 17:36:37 UTC
Besides, doesn't that corruption happen only when the journal starts at block 0? This would imply one had to use special options when creating the filesystem, which the typical user usually doesn't do?
Comment 8 Joshua Kinard gentoo-dev 2012-10-24 18:24:26 UTC
It looks like the trigger conditions for this are a bit remote.  While that doesn't reduce the severity of a corruption issue in one of the most commonly-used filesystems out there, I think suddenly masking everything is a bit of a jump.  This should be brought up on the -dev mailing list.  Some kernel people may not be on IRC, but they'll see the discussion there.
Comment 9 Bruce Hill 2012-10-24 18:34:17 UTC
https://lkml.org/lkml/2012/10/23/741
Comment 10 Anthony Basile gentoo-dev 2012-10-24 20:24:31 UTC
  24 Oct 2012; Anthony G. Basile <blueness@gentoo.org>
  -hardened-sources-3.6.2.ebuild:
  Removed because of bug #439502
Comment 11 David Flogeras 2012-10-24 20:27:34 UTC
Please consider https://bugs.gentoo.org/show_bug.cgi?id=439546 . The 3.4.9 kernel has a different ext4 related bug which I have hit.
Comment 12 James Bowlin 2012-10-24 20:36:43 UTC
(In reply to comment #7)
> Besides, doesn't that corruption happen only when the journal starts at
> block 0? This would imply one had to use special options when creating the
> filesystem, which the typical user usually doesn't do?

The problem occurs when you umount before the journal buffer has wrapped.  For example a /home partition will be more susceptible to this bug than a root partition.  You will also bump into it when you unmount soon after mounting, for example when you are debugging something or if you are chrooting into a system to fix it.  IMO it makes perfect sense to mask any kernel that has this bug until that kernel is fixed.

A data corruption bug that does not happen all the time is far more dangerous than one that happens all the time.
Comment 13 Small_Penguin 2012-10-24 21:35:58 UTC
Sorry, please ignore my previous comment #7 about the journal starting at block 0. I've been terribly misled by a very awful translation of the original text.

I'm trying to imagine how the journal buffer 'wraps' itself. Weird idea. I won't be affected by that very soon, though, because I usually only use suspend & resume, and this way there's not so much mounting/unmounting involved. I've been using 3.6.2 happily now for a few days, and fscking half an hour ago didn't turn up anything bad. Just to be sure, I applied Ts'o's patch, in the hope it doesn't make anything worse. Let's hope they fix that up pretty soon. In the worst case... Backups are always a good idea.

But yes, masking it is probably very sane, with the most important aspect being notification of the users about the issue.
Comment 14 jannis 2012-10-24 21:54:05 UTC
(In reply to comment #13)
> In the worst case... Backups are always a good idea.

Yea, cool that my NAS (where I do store my backups) uses ext4 and 3.5.7 :/
Comment 15 Alex Alexander (RETIRED) gentoo-dev 2012-10-24 22:25:38 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > (In reply to comment #3)
> > > Please DO NOT mask unsupported kernels. pf-sources are not supported by
> > > Gentoo so please lift the mask ASAP. Did you get permission from the kernel
> > > team to mask all these packages?
> > 
> > With that said, you are one of the sys-kernel/pf-sources maintainers. I
> > cannot dictate to you that affected sys-kernel/pf-sources versions must be
> > masked on the basis of this bug. I have lifted the mask on
> > sys-kernel/pf-sources, but I will not lift the mask on packages that you do
> > not maintain until I am in contact with their maintainers. In the future, I
> > will leave sys-kernel/pf-sources alone should a similar situation occur.
> 
> I am just saying that the pf-sources is not marked stable for a reason. We
> commit what the upstream maintainer releases and there is a big fat warning that
> this kernel patchset is unsupported in the sense that we don't apply custom
> or security patches on top of it unless really necessary. I'd rather wait
> for an upstream patch instead of forcing everyone to downgrade their kernel.
> Thanks for unmasking it.

The ext4 bug is critical enough to warrant a downgrade. Heck, I'd even consider a News item. Data loss is a nightmare we must protect our users from at all costs.

I'm re-masking the affected versions of pf.
Richard, thanks for the quick first mask :)
Comment 16 Markos Chandras (RETIRED) gentoo-dev 2012-10-24 22:46:55 UTC
(In reply to comment #15)
> (In reply to comment #5)
> > (In reply to comment #4)
> > > (In reply to comment #3)
> > > > Please DO NOT mask unsupported kernels. pf-sources are not supported by
> > > > Gentoo so please lift the mask ASAP. Did you get permission from the kernel
> > > > team to mask all these packages?
> > > 
> > > With that said, you are one of the sys-kernel/pf-sources maintainers. I
> > > cannot dictate to you that affected sys-kernel/pf-sources versions must be
> > > masked on the basis of this bug. I have lifted the mask on
> > > sys-kernel/pf-sources, but I will not lift the mask on packages that you do
> > > not maintain until I am in contact with their maintainers. In the future, I
> > > will leave sys-kernel/pf-sources alone should a similar situation occur.
> > 
> > I am just saying that the pf-sources is not marked stable for a reason. We
> > commit what the upstream maintainer releases and there is a big fat warning that
> > this kernel patchset is unsupported in the sense that we don't apply custom
> > or security patches on top of it unless really necessary. I'd rather wait
> > for an upstream patch instead of forcing everyone to downgrade their kernel.
> > Thanks for unmasking it.
> 
> The ext4 bug is critical enough to warrant a downgrade. Heck, I'd even
> consider a News item. Data loss is a nightmare we must protect our users
> from at all costs.
> 
> I'm re-masking the affected versions of pf.
> Richard, thanks for the quick first mask :)

whatever. we never masked pf before. just waited for a proper upstream commit. anyway do what you want
Comment 17 Mike Pagano gentoo-dev 2012-10-25 00:01:21 UTC
As my daytime access is now extremely limited, I thank Richard for handling the issue. It's no secret I've been reaching out for some assistance and he did the right thing.

Thanks again, Richard
Comment 18 Dmitry Suloev 2012-10-25 03:30:05 UTC
Created attachment 327372
Upstream patch

https://lkml.org/lkml/2012/10/23/690
Comment 19 Graham Murray 2012-10-25 07:23:18 UTC
Later messages (https://lkml.org/lkml/2012/10/24/535) on lkml indicate that this problem is not as serious as initially thought and that people should not panic.
Comment 20 Richard Yao gentoo-dev 2012-10-25 15:13:47 UTC
(In reply to comment #19)
> Later messages (https://lkml.org/lkml/2012/10/24/535) on lkml indicate that
> this problem is not as serious as initially thought and that people should
> not panic.

Additional comments by those debugging this suggest that it became more profound in the 3.6.2 patches.

http://phoronix.com/forums/showthread.php?74697-EXT4-Data-Corruption-Bug-Hits-Stable-Linux-Kernels&p=293446#post293446

Anyway, I made a judgment call based on information Ted Ts'o provided, which he is now saying was 100% right. In lieu of a full retraction of the notion that this became more serious in recent kernels, I plan to wait for Ted to figure out exactly what is wrong.
Comment 21 Richard Yao gentoo-dev 2012-10-25 15:17:03 UTC
(In reply to comment #20)
> Anyway, I made a judgment call based on information Ted Ts'o provided, which
> he is now saying was 100% right. In lieu of a full retraction of the notion
> that this became more serious in recent kernels, I plan to wait for Ted to
> figure out exactly what is wrong.

That is a typo. To be clear, I meant to say that Ted is now saying that the information he provided on the LKML was not 100% right.
Comment 22 Markos Chandras (RETIRED) gentoo-dev 2012-10-25 16:44:37 UTC
pf-sources fixed in 3.6.5. Nothing else for us to do here. old versions removed
Comment 23 Kyle Sanderson 2012-10-25 16:50:57 UTC
In regards to Comment 22.

Awesome! Could you please share the patch so we can use it upstream? Ted, as far as I'm aware, is still trying to diagnose the problem.
Comment 24 Pacho Ramos gentoo-dev 2012-10-25 19:02:52 UTC
From my point of view, data corruption is a problem major enough to deserve hardmasking of all kernels in the tree... but, as I can't talk for kernels not maintained by me, I can only say that you are more than welcome to touch tuxonice-sources (I think they are not affected :/)
Comment 25 Ulenrich 2012-10-25 20:39:39 UTC
@Kyle, (Comment#23) 
Ted just released the revert patch of linux-3.6.2 
jbd2-don-t-write-superblock-when-if-its-empty.patch
called 
"ext4 revert: jbd2-don-t-write-superblock-when-if-its-empty.patch"
as fix!
Comment 26 Richard Yao gentoo-dev 2012-10-25 20:58:25 UTC
(In reply to comment #25)
> @Kyle, (Comment#23) 
> Ted just released the revert patch of linux-3.6.2 
> jbd2-don-t-write-superblock-when-if-its-empty.patch
> called 
> "ext4 revert: jbd2-don-t-write-superblock-when-if-its-empty.patch"
> as fix!

Thanks for the update. That is actually one of two patches:

http://www.spinics.net/lists/linux-ext4/msg34669.html
http://www.spinics.net/lists/linux-ext4/msg34670.html

I am going to recommend that we wait for the next stable release from gregkh, which I expect to see shortly. That will prevent the situation where he does his tag soon after we issue ebuild revisions.
Comment 27 Ulenrich 2012-10-25 21:23:18 UTC
@Richard, yes, the second patch addresses the issue that the original (now reverted) jbd2 patch was supposed to fix. But this second patch now mainly emits a
"printk(KERN_ERR ...."
warning. In essence, the kernel maintainers are still searching for the bug ...

Recently I have seen many reverts: nohz, cgroups, hugeblocks.
Has something become too complex in Linux?
Comment 28 Ulenrich 2012-10-25 21:27:42 UTC
I just wanted to say:
If you are comfortable running linux-3.6.1 you also could move to linux-3.6.3 with this jbd2-revert-patch ...
Comment 29 Kyle Sanderson 2012-10-25 21:29:22 UTC
(In reply to comment #28)
While Ted is sure this is the problem, he's waiting for a response from the two original reporters before shipping it off. I'd honestly wait to confirm it's fixed, but it's up to you.
Comment 30 Richard Yao gentoo-dev 2012-10-25 22:19:00 UTC
The dust has not had time to settle and I am not convinced that I have full facts. I have emailed gregkh asking him for his plans. I am willing to wait a few days to be certain that Ted is certain and to avoid reintroducing these source package versions only to have gregkh do a tag within a few days. Doing frequent kernel source package updates annoys users. I do not want our efforts to ensure the integrity of user data to cause that.

I cannot stop maintainers from doing otherwise, but I strongly recommend that people wait until all of the facts are known. In particular, we should wait for Ted to verify that he is certain. We should also wait to verify that gregkh is not going to release a new tag shortly after we make changes to the tree. I emailed gregkh to inquire about his plans. I expect a reply soon.
Comment 31 Richard Yao gentoo-dev 2012-10-25 22:24:54 UTC
Ignore what I just said. gregkh replied to my email sooner than I expected. His response was less than conclusive (i.e. he doesn't know when he will do another stable release, but he would like to get started on it soon). Ted's patches will not be accepted into stable until they are in Linus' tree.

With that in mind, I still recommend waiting to make sure that Ted is certain before we do anything else. A few days should not be a problem.
Comment 32 Sergei Trofimovich gentoo-dev 2012-10-26 08:12:27 UTC
Might not be as scary in the end. Ted's post about the matter:

    https://plus.google.com/117091380454742934025/posts/Wcc5tMiCgq7
Comment 33 Stefan Behte (RETIRED) gentoo-dev Security 2012-10-26 09:57:59 UTC
Can be unmasked again, issue is very likely only happening on the guy's box who reported that bug...
Comment 34 Chí-Thanh Christopher Nguyễn gentoo-dev 2012-10-26 11:24:51 UTC
+1 for unmasking

It apparently only happens on unclean shutdown in combination with the nobarrier mount option. And users who use nobarrier should expect filesystem corruption after unclean shutdown anyway.
Comment 35 Richard Yao gentoo-dev 2012-10-26 13:26:07 UTC
After sleeping on it, I am going to concur with others. This is far less serious than Ted's original email suggested and it does not make sense to keep the masks. In addition, gregkh informed me that he has not started working on the next set of stable kernel tags yet, so there should be minimal risk of having him tag new kernels within days of the masks being lifted.

With both things in mind, I have lifted the masks.
Comment 36 Kerin Millar 2012-11-02 12:41:29 UTC
Created attachment 328068
linux-2.6.git-ffb5387.patch

"ext4: fix unjournaled inode bitmap modification" patch from Eric Sandeen.
Comment 37 Kerin Millar 2012-11-02 13:28:49 UTC
(In reply to comment #34)
> It apparently only happens on unclean shutdown in combination with the
> nobarrier mount option.

It transpires that using nobarrier does not trigger the bug. Instead, using journal_checksum or journal_async_commit is sufficient to put one at risk. The parent of the post referenced in the URL explains further.

> And users who use nobarrier should expect filesystem
> corruption after unclean shutdown anyway.

Even those of us using a battery-backed write cache (including Nix)? Not to mention that ext3 users managed rather well up until the point at which barriers were enabled by default in 2011. The talk of "esoteric" mount options amounted to little more than a smoke screen.
Comment 38 Kerin Millar 2012-11-02 16:22:33 UTC
I'm putting the affected versions in the whiteboard. Note that the references to the as-yet unreleased 3.4.18 and 3.6.6 kernels are correct because the patch is in the stable queue for these branches. Fortunately, journal_checksum isn't a default option so few users should be affected.
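
For readers who want to check whether any of their ext4 filesystems use the mount options named in comment 37 (journal_checksum or journal_async_commit), the following is a minimal Python sketch, not an official tool. It assumes the options appear verbatim in the fourth field of /proc/mounts-style lines; the function name risky_ext4_mounts and the sample text are hypothetical and only for illustration.

```python
# Mount options that comment 37 identifies as putting a filesystem at risk.
RISKY_OPTIONS = {"journal_checksum", "journal_async_commit"}


def risky_ext4_mounts(mounts_text):
    """Return (mount_point, risky_options) pairs for ext4 mounts.

    mounts_text is expected to be in /proc/mounts format:
    device, mount point, filesystem type, comma-separated options, ...
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip malformed lines and anything that is not ext4.
        if len(fields) < 4 or fields[2] != "ext4":
            continue
        found = sorted(RISKY_OPTIONS.intersection(fields[3].split(",")))
        if found:
            hits.append((fields[1], found))
    return hits


# Hypothetical sample input, made up for this example.
SAMPLE = (
    "/dev/sda1 / ext4 rw,relatime,journal_checksum 0 0\n"
    "/dev/sda2 /home ext4 rw,relatime 0 0\n"
    "/dev/sdb1 /data ext4 rw,journal_async_commit,journal_checksum 0 0\n"
    "/dev/sdc1 /mnt ext3 rw,journal_checksum 0 0\n"
)

for mount_point, options in risky_ext4_mounts(SAMPLE):
    print(mount_point + ": " + ", ".join(options))
```

On a live system, the contents of /proc/mounts could be passed in place of SAMPLE. An empty result only means none of the listed options are in use on currently mounted ext4 filesystems; it says nothing about whether the running kernel carries the fix.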