Bug 423521 - kernel 3.4.3 + e2fsprogs 1.42 + hdparm-9.39 : Raid-1 : complete data loss
Summary: kernel 3.4.3 + e2fsprogs 1.42 + hdparm-9.39 : Raid-1 : complete data loss
Status: RESOLVED NEEDINFO
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: AMD64 Linux
Importance: Normal normal
Assignee: Gentoo Linux bug wranglers
URL: https://bugzilla.kernel.org/show_bug....
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-06-25 19:11 UTC by Manfred Knick
Modified: 2012-06-27 01:50 UTC
CC List: 0 users

See Also:
Package list:
Runtime testing required: ---


Attachments

Description Manfred Knick 2012-06-25 19:11:04 UTC
--------------
Short version:
--------------

Needing more space on my (quicker) HW-Raid-10,
I wanted to transfer ~850 GiB to a (slower) SW-Raid1.

Creation succeeded, Transfer succeeded,
Compare (diff) succeeded,
even Reboot succeeded;
but after Power Off / Boot: all data lost,
the filesystem only offering an empty "lost+found".

----- cite ----- [PM]
... there is no evidence of a problem with RAID.  The filesystem has
lost its contents.  So it *looks* like an error with "rm" or "mkfs".  It
possibly isn't that simple but it doesn't look at all like a RAID problem.
NeilBrown
----- /cite -----
[@Neil: Thank you for looking at it and for your comment]

I am continuing with additional experiments,
but unfortunately they take very long ...

I am prepared to provide additional information as requested;
proposals on how to debug this strange coincidence are welcome.
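
One idea, for example (not yet tried; device name as used in the long version below): the ext4 superblock timestamps might show whether the filesystem was silently re-created, and debugfs can list the root directory independently of the mount:
.  dumpe2fs -h /dev/md/ST-21p1 | grep -iE 'created|mount'
.  debugfs -R "ls -l /" /dev/md/ST-21p1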

Reproducible: Couldn't Reproduce
Comment 1 Manfred Knick 2012-06-25 19:16:02 UTC
-------------
Long version:
-------------

Hardware involved:

AMD Phenom(tm) 9950 Quad-Core
8 GiB RAM
ASUS M2N-SLI Deluxe

Source: HW-Raid-10:
# lspci -s 02:00.0 -v
02:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)
        Subsystem: Adaptec ASR-2405

Destination: SW-Raid-1:
hdparm -i /dev/sdb
.   Model=ST31500341AS, FwRev=CC1H, ...
hdparm -i /dev/sdc
.   Model=ST31500341AS, FwRev=CC1H, ...

These two are mounted upon an Adaptec 1220SA:
# lspci -s 03:00.0 -v
03:00.0 RAID bus controller: Silicon Image, Inc. Device 0242 (rev 01)                                                                     
        Subsystem: Adaptec Device 0242

Kernel:
Running on 3.2.16, and having noticed that
. - the problem with the radix-tree iterators had been fixed in 3.4.2 and
. - Neil Brown's RAID fix had arrived in [3.4, 3.3.4, or 3.2.17],
I upgraded the kernel to 3.4.3 first.

To be cautious, I deleted the old Raid-1:
.  ddrescue -f /dev/zero /dev/sdb -b 4096
.  ddrescue -f /dev/zero /dev/sdc -b 4096
.  < after Reboot: no md any more >
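
(Just to be sure, whether any old metadata survived the wipe could be checked read-only, assuming util-linux' wipefs is available:)
.  wipefs /dev/sdb /dev/sdc            # without -a this only lists remaining signatures
.  mdadm --examine /dev/sdb /dev/sdc   # should print "No md superblock detected"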

confirmed TLER settings:
.  smartctl -l scterc,70,70 /dev/sdb
.  smartctl -l scterc,70,70 /dev/sdc
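
(The values in effect can be read back with, e.g.:)
.  smartctl -l scterc /dev/sdb         # should report Read/Write ERC = 70 (7.0 seconds)
.  smartctl -l scterc /dev/sdc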

and built it anew:
mdadm --create --verbose --metadata=1.2 /dev/md/ST-21 --level=mirror --raid-devices=2 /dev/sdb /dev/sdc

$ equery belongs mdadm
. sys-fs/mdadm-3.1.5 (/sbin/mdadm)
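
(The resync progress can be followed with, for example:)
.  cat /proc/mdstat                    # shows "resync = ...%" and an ETA while syncing
.  mdadm --detail /dev/md/ST-21        # array state plus resync status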

On purpose, I gave md some hours to complete syncing from /dev/sdb to /dev/sdc,
before even starting partitioning:
.  parted  -a optimal  /dev/md/ST-21
.        mklabel msdos
.        mkpart primary ext2 4096 -1

and creating the filesystem:
.  mkfs.ext4  -L ST-21-P1  -E lazy_itable_init=0,lazy_journal_init=0  /dev/md/ST-21p1
( -E : to be sure that no lazy initialization was left pending)

$ equery belongs mkfs.ext4
. sys-fs/e2fsprogs-1.42 (/sbin/mkfs.ext4)
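
(For a repeat of the experiment it might be worth recording the fresh superblock right after mkfs, so it can be compared later; file name arbitrary, e.g.:)
.  tune2fs -l /dev/md/ST-21p1 > /root/ST-21-P1.superblock.txt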

Nota bene:
. "E2fsprogs 1.42 (November 29, 2011)
.  This release of e2fsprogs has support for file systems > 16TB."
and:
. "E2fsprogs 1.42.4 (June 12, 2012)
.  Fixed more 64-bit block number bugs (which could end up corrupting file systems!) in e2fsck, debugfs, and libext2fs."

/etc/fstab:
.  LABEL=ST-21-P1  /Mammut/ST-21-P1 ext4   defaults,noatime 1 2
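
(Since mounting goes via the label, one could also double-check which device the label actually resolves to, e.g.:)
.  findfs LABEL=ST-21-P1
.  blkid -t LABEL=ST-21-P1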

df -h :
...
/dev/md127p1    1,4T     21G  1,3T    2% /Mammut/ST-21-P1
...


Because this was data I did not need permanent access to,
the Seagate drives were configured to spin down after 10 minutes without access:

equery list hdparm:
[IP-] [  ] sys-apps/hdparm-9.39:0

/etc/conf.d/hdparm:
...
sdb_args="-S120"
sdc_args="-S120"
...
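
(The resulting power state can be verified by hand, e.g.:)
.  hdparm -C /dev/sdb /dev/sdc         # reports "active/idle" or "standby"
.  hdparm -y /dev/sdb                  # force immediate standby, for testing only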

Now I copied the respective directory tree T:
.  cp -a  /<Raid-10-mountpoint>/T  /Mammut/ST-21-P1/

and checked the result with
.  diff -R  /<Raid-10-mountpoint>/T  /Mammut/ST-21-P1/T
which reported no differences.
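
(As an additional cross-check, a checksum comparison could be run, roughly like this; the checksum file path is arbitrary:)
.  ( cd /<Raid-10-mountpoint>/T && find . -type f -exec md5sum {} + ) > /tmp/T.md5
.  ( cd /Mammut/ST-21-P1/T && md5sum -c --quiet /tmp/T.md5 )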

I'm sorry that I have to become a little imprecise now:
as far as I remember,
there was a reboot first, with the copy still readable,
then an automatic spin-down.
After another reboot at some stage,
the copy was not visible while the disks were in standby;
after spinning the two disks up, it was visible again.

Anyway:
complete Power Off during the night -
Power On the next morning:
the copied T was _gone_     !!!,
but an (empty) "lost+found" was there ???



What I get now is the following:

# mdadm -Evvvvs
mdadm: No md superblock detected on /dev/md/mammut:ST-21p1.
mdadm: No md superblock detected on /dev/md/mammut:ST-21.
...
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 16bd66f7:96a400f6:eb91f3c0:f5e58122
           Name : mammut:ST-21  (local to host mammut)
  Creation Time : Wed Jun 20 19:51:50 2012
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 2930275120 (1397.26 GiB 1500.30 GB)
     Array Size : 2930274848 (1397.26 GiB 1500.30 GB)
  Used Dev Size : 2930274848 (1397.26 GiB 1500.30 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : a4cef825:a19980d2:285560d9:0c6da2af
    Update Time : Fri Jun 22 07:30:30 2012
       Checksum : e9b7551c - correct
         Events : 19
   Device Role : Active device 1
   Array State : AA ('A' == active, '.' == missing)
...
/dev/sdb:                                                                                                                                                                                                              
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 16bd66f7:96a400f6:eb91f3c0:f5e58122
           Name : mammut:ST-21  (local to host mammut)
  Creation Time : Wed Jun 20 19:51:50 2012
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 2930275120 (1397.26 GiB 1500.30 GB)
     Array Size : 2930274848 (1397.26 GiB 1500.30 GB)
  Used Dev Size : 2930274848 (1397.26 GiB 1500.30 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : baa75e0c:424e949e:b15d863d:a5e31ef8

    Update Time : Fri Jun 22 07:30:30 2012
       Checksum : 48d361f4 - correct
         Events : 19
   Device Role : Active device 0
   Array State : AA ('A' == active, '.' == missing)
...

# ls -algR /Mammut/ST-21-P1/

/Mammut/ST-21-P1/:
total 24
drwxr-xr-x 3 root  4096 21. Jun 19:06 .
drwxr-xr-x 5 root  4096 21. May 21:37 ..
drwx------ 2 root 16384 20. Jun 19:58 lost+found

/Mammut/ST-21-P1/lost+found:
total 20
drwx------ 2 root 16384 20. Jun 19:58 .
drwxr-xr-x 3 root  4096 21. Jun 19:06 ..

!----------------------------------!
! No  /Mammut/ST-21-P1/T  any more !
!----------------------------------!
Comment 2 Manfred Knick 2012-06-25 20:15:35 UTC
Might be of interest / perhaps related:

https://bugs.gentoo.org/show_bug.cgi?id=416353
Comment 3 Manfred Knick 2012-06-25 20:35:39 UTC
Information forwarded to

https://bugzilla.kernel.org/show_bug.cgi?id=43791

and   linux-raid@vger.kernel.org
Comment 4 Jeroen Roovers (RETIRED) gentoo-dev 2012-06-27 01:50:06 UTC
Please reopen this bug report when you have found a bug to report.