Bug 282564 - sys-kernel/xen-sources-2.6.18-r12 - DATA CORRUPTION - A bio that has two or more vector entries, size less or equal than page size and that crosses stripe boundary is accepted by device mapper but not by the underlying raid device.
Summary: sys-kernel/xen-sources-2.6.18-r12 - DATA CORRUPTION - A bio that has two or more vector entries, size less or equal than page size and that crosses stripe boundary is accepted by device mapper but not by the underlying raid device.
Status: RESOLVED WORKSFORME
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: All Linux
Importance: High critical
Assignee: Gentoo Xen Devs
URL: https://bugzilla.redhat.com/attachmen...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-08-24 13:11 UTC by Max Hacking
Modified: 2011-03-26 11:40 UTC
CC List: 1 user

See Also:
Package list:
Runtime testing required: ---


Attachments

Description Max Hacking 2009-08-24 13:11:47 UTC
In the Linux bio architecture, it is the responsibility of the caller
not to create a bio that is too large for the appropriate block device
driver.

There are several ways in which bio size can be limited:
- There is q->max_hw_sectors, which is the upper limit on the total
  number of sectors.
- There are q->max_phys_segments and q->max_hw_segments, which limit the
  number of consecutive segments (before and after IOMMU merging).
- There are q->max_segment_size and q->seg_boundary_mask, which determine
  how much data fits in a segment and at which points segment boundaries
  are enforced (because some hardware has limitations on the entries in
  its scatter-gather table).
- There is q->hardsect_size, which determines the hardware sector size;
  all sector numbers and lengths must be aligned on this boundary.
- And there is q->merge_bvec_fn --- the process that constructs the bio
  can use this function to ask the device driver whether the next vector
  entry will fit into the bio.

Additionally, by definition, it is always allowed to create a bio that
spans one page or less and has just one bio vector entry.
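
For illustration, here is a minimal sketch (not actual kernel code) of
the checks a bio builder performs before appending another vector entry,
loosely modelled on __bio_add_page() in 2.6.18. The helper name
bio_entry_fits() is made up for this example; "bv" is the prospective
entry, which in the real code points at bio->bi_io_vec[bio->bi_vcnt].

#include <linux/bio.h>
#include <linux/blkdev.h>

static int bio_entry_fits(request_queue_t *q, struct bio *bio,
			  struct bio_vec *bv)
{
	/* The total size must stay within the queue's sector limit. */
	if (((bio->bi_size + bv->bv_len) >> 9) > q->max_hw_sectors)
		return 0;

	/* Segment counts are bounded before and after IOMMU merging. */
	if (bio->bi_phys_segments >= q->max_phys_segments ||
	    bio->bi_hw_segments >= q->max_hw_segments)
		return 0;

	/*
	 * Finally ask the driver itself: merge_bvec_fn returns how many
	 * bytes of this entry it is willing to accept.
	 */
	if (q->merge_bvec_fn && q->merge_bvec_fn(q, bio, bv) < bv->bv_len)
		return 0;

	return 1;
}

By the rule above, a bio with a single vector entry contained in one
page must be accepted on any queue, which is why a one-page fallback
limit is always safe.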

All of the above restrictions except q->merge_bvec_fn can be merged.
That is, if you have several devices with different limitations and you
run device mapper on top of them, it is possible to combine the
limitations by taking the lowest of the values (except for
q->hardsect_size, where we take the highest value). It can then be
assumed that a bio submitted to device mapper (which satisfies the
combined limitations) will satisfy the limitations of every underlying
device.
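
As an illustration of that merging rule, here is a minimal sketch in the
spirit of combine_restrictions_low() in 2.6.18's drivers/md/dm-table.c;
the struct and function names below are invented for the example and the
field set is reduced.

#include <linux/kernel.h>	/* min(), max() */

/* dm-table.c defines a similar helper macro locally. */
#define min_not_zero(l, r) ((l) == 0 ? (r) : ((r) == 0 ? (l) : min(l, r)))

struct stacked_limits {
	unsigned int	max_sectors;
	unsigned short	max_phys_segments;
	unsigned short	max_hw_segments;
	unsigned int	max_segment_size;
	unsigned long	seg_boundary_mask;
	unsigned short	hardsect_size;
};

static void stack_limits(struct stacked_limits *top,
			 struct stacked_limits *bottom)
{
	/* For each limit, the most restrictive (smallest non-zero) value wins. */
	top->max_sectors       = min_not_zero(top->max_sectors, bottom->max_sectors);
	top->max_phys_segments = min_not_zero(top->max_phys_segments, bottom->max_phys_segments);
	top->max_hw_segments   = min_not_zero(top->max_hw_segments, bottom->max_hw_segments);
	top->max_segment_size  = min_not_zero(top->max_segment_size, bottom->max_segment_size);
	top->seg_boundary_mask = min_not_zero(top->seg_boundary_mask, bottom->seg_boundary_mask);

	/* The hardware sector size is the one limit where the largest value wins. */
	top->hardsect_size     = max(top->hardsect_size, bottom->hardsect_size);
}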

The problem is with q->merge_bvec_fn. If any of the underlying devices
in a device mapper device sets its q->merge_bvec_fn, device mapper has
no way to propagate it to its own limits. So in this case, device mapper
sets its maximum request size to one page (because bios contained within
a page are always allowed). Such small bios degrade performance, but at
least it works.
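
A sketch of that one-page clamp, assuming it is applied while device
mapper accumulates the combined limits for each underlying device (the
helper name clamp_for_merge_bvec() is hypothetical; struct
io_restrictions holds dm's per-table limits, visible as t->limits in the
patch below, and its header location is assumed here):

#include <linux/blkdev.h>
#include <linux/device-mapper.h>	/* assumed home of struct io_restrictions */

static void clamp_for_merge_bvec(struct io_restrictions *rs,
				 struct block_device *bdev)
{
	request_queue_t *q = bdev_get_queue(bdev);

	/*
	 * The underlying driver has a merge_bvec_fn we cannot propagate,
	 * so restrict our own requests to at most one page: such bios
	 * must be accepted by any driver.
	 */
	if (q->merge_bvec_fn &&
	    (rs->max_sectors == 0 || rs->max_sectors > (PAGE_SIZE >> 9)))
		rs->max_sectors = PAGE_SIZE >> 9;
}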

And here comes the bug: raid0, raid1, raid10 and raid5 set
q->merge_bvec_fn in such a way that they reject bios crossing a stripe
boundary. They accept a bio with one vector entry crossing a stripe
(they must) and split that bio themselves --- but they don't accept any
other bios crossing a stripe.

A bio that has two or more vector entries, is less than or equal to a
page in size, and crosses a stripe boundary is accepted by device mapper
(it conforms to all of its limits) but not by the underlying raid
device.
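
To make that rejection concrete, here is a simplified sketch of a
raid0-style merge_bvec_fn, loosely modelled on raid0_mergeable_bvec() in
2.6.18 (the chunk size is hard-coded for the example instead of being
read from the array configuration): it returns how many bytes of the
proposed entry still fit before the chunk boundary, and only a
still-empty bio may take an entry that crosses it.

#include <linux/bio.h>
#include <linux/blkdev.h>

#define CHUNK_SECTORS	128	/* example: 64 KiB chunks */

static int stripe_mergeable_bvec(request_queue_t *q, struct bio *bio,
				 struct bio_vec *biovec)
{
	unsigned int offset = bio->bi_sector & (CHUNK_SECTORS - 1);
	unsigned int bio_sectors = bio->bi_size >> 9;
	int max_sectors = CHUNK_SECTORS - (int)(offset + bio_sectors);

	if (max_sectors < 0)
		max_sectors = 0;

	/*
	 * A bio that is still empty may take one entry even if it crosses
	 * the chunk boundary: the raid driver must accept single-entry
	 * bios and split them itself.
	 */
	if (bio_sectors == 0 && (max_sectors << 9) <= biovec->bv_len)
		return biovec->bv_len;

	/* Otherwise only the bytes up to the chunk boundary are accepted. */
	return max_sectors << 9;
}

The bios that device mapper accepts under its one-page limit are built
against dm's own queue, so a function like this is never consulted for
them; when such a multi-entry bio crosses a chunk boundary, the raid
make_request code cannot split it, which produces the "can't convert
block across chunks" errors shown below.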

The fix: if device mapper sets a one-page maximum request size, it also
needs to set its own q->merge_bvec_fn that rejects any bio with multiple
vector entries spanning more than one page.


Reproducible: Always

Steps to Reproduce:
1.  Create RAID 0,1,5,10 set
2.  Create LVM PV on raid set
3.  Export LVM LV to Xen VM
4.  Watch the Xen VM trash its filesystem!

Actual Results:  
Messages on dom0 show a lot of lines like this:
Jan 23 01:17:01 myhost kernel: raid10_make_request bug: can't convert block
across chunks or bigger than 64k 998425343 3
Jan 23 01:17:01 myhost kernel: raid10_make_request bug: can't convert block
across chunks or bigger than 64k 998425467 4

On the DomU it shows lines like:
<3>Buffer I/O error on device xvda1, logical block 4903
<4>lost page write due to I/O error on xvda1
and then:
<4>end_request: I/O error, dev xvda, sector 9465
<4>end_request: I/O error, dev xvda, sector 9579
...
Later:
<3>Aborting journal on device xvda1.
<4>__journal_remove_journal_head: freeing b_committed_data
<4>__journal_remove_journal_head: freeing b_committed_data
<4>__journal_remove_journal_head: freeing b_committed_data
<4>__journal_remove_journal_head: freeing b_committed_data
<2>ext3_abort called.
<2>EXT3-fs error (device xvda1): ext3_journal_start_sb: Detected aborted journal
<2>Remounting filesystem read-only

Expected Results:  
Normal operation without filesystem corruption.

The URL above points to a patch submitted by Red Hat (which I have verified as working); the patch is also attached here for convenience.
Comment 1 Max Hacking 2009-08-24 13:13:28 UTC
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

diff -u -r -p linux-2.6.18.x86_64.p21/drivers/md/dm-table.c linux-2.6.18.x86_64/drivers/md/dm-table.c
--- linux-2.6.18.x86_64.p21/drivers/md/dm-table.c	2009-04-22 14:09:12.000000000 +0200
+++ linux-2.6.18.x86_64/drivers/md/dm-table.c	2009-05-06 13:31:17.000000000 +0200
@@ -880,6 +880,25 @@ struct dm_target *dm_table_find_target(s
 	return &t->targets[(KEYS_PER_NODE * n) + k];
 }
 
+/*
+ * Some of our underlying devices provided a merge_bvec_fn.
+ *
+ * We can't call the device's merge_bvec_fn, so we must be conservative
+ * and not allow creating a bio larger than one page.
+ */
+static int dm_max_one_bvec_entry(request_queue_t *q, struct bio *bio, struct bio_vec *biovec)
+{
+	/* If there was nothing in the bio, allow full page */
+	if (!bio->bi_vcnt)
+		return biovec->bv_len;
+
+	/* If there is just one page and we are appending to it, allow it */
+	if (bio->bi_vcnt == 1 && biovec == &bio->bi_io_vec[0])
+		return biovec->bv_len;
+
+	return 0;
+}
+
 void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q)
 {
 	/*
@@ -887,6 +906,8 @@ void dm_table_set_restrictions(struct dm
 	 * restrictions.
 	 */
 	blk_queue_max_sectors(q, t->limits.max_sectors);
+	if (t->limits.max_sectors <= PAGE_SIZE >> 9)
+		blk_queue_merge_bvec(q, dm_max_one_bvec_entry);
 	q->max_phys_segments = t->limits.max_phys_segments;
 	q->max_hw_segments = t->limits.max_hw_segments;
 	q->hardsect_size = t->limits.hardsect_size;
Comment 2 Max Hacking 2010-02-10 17:18:33 UTC
This is still an issue on 2.6.18-xen-r12, which has since been marked stable!

This issue is resolved in 2.6.29-xen-r4, however, which is still ~x86 ~amd64.
Comment 3 DEMAINE Benoît-Pierre, aka DoubleHP 2010-02-27 23:49:57 UTC
Isn't 2.6.31-r10 stable now?
Comment 4 Max Hacking 2010-02-28 01:24:26 UTC
(In reply to comment #3)
> Isn't 2.6.31-r10 stable now?
> 

Both 2.6.31-r10 and 2.6.31-r11 are still ~x86 and ~amd64 as far as I can see, unless they were updated in the last few hours.  
Comment 5 DEMAINE Benoît-Pierre, aka DoubleHP 2010-02-28 01:30:47 UTC
Then make this bug depend on #307129 and then you can close it :)
Comment 6 DEMAINE Benoît-Pierre, aka DoubleHP 2010-02-28 01:33:19 UTC
My mistake: sys-kernel/xen-sources-2.6.18-r12 is ~ anyway; so, to hit this bug at all you must be using ~, and when using ~ you get 2.6.31 anyway. => bug is deprecated!

uranus ~ # eix xen-source
[I] sys-kernel/xen-sources
     Available versions:
        (2.6.18-r11)    (~)2.6.18-r11!b!s
        (2.6.18-r12)    (~)2.6.18-r12!b!s
        (2.6.29-r4)     (~)2.6.29-r4!b!s
        (2.6.31-r10)    {M}(~)2.6.31-r10!b!s
        (2.6.31-r11)    [M]~2.6.31-r11!b!s

Comment 7 Max Hacking 2010-02-28 09:23:10 UTC
(In reply to comment #6)
> My mistake: sys-kernel/xen-sources-2.6.18-r12 is ~ anyway; so, to hit this
> bug at all you must be using ~, and when using ~ you get 2.6.31 anyway. =>
> bug is deprecated!
> 
> uranus ~ # eix xen-source
> [I] sys-kernel/xen-sources
>      Available versions:
>         (2.6.18-r11)    (~)2.6.18-r11!b!s
>         (2.6.18-r12)    (~)2.6.18-r12!b!s
>         (2.6.29-r4)     (~)2.6.29-r4!b!s
>         (2.6.31-r10)    {M}(~)2.6.31-r10!b!s
>         (2.6.31-r11)    [M]~2.6.31-r11!b!s
> 

Actually I think you will find that 2.6.18-r12 is stable on x86...

# cat sys-kernel/xen-sources/xen-sources-2.6.18-r12.ebuild | grep KEYWORDS

KEYWORDS="~amd64 x86"
Comment 8 Alexey Shvetsov archtester gentoo-dev 2011-03-26 11:40:57 UTC
Xen 4.1 is in the tree. Please test with it and reopen if it doesn't work.