summaryrefslogtreecommitdiff
path: root/drivers/md/raid1.c
AgeCommit message (Collapse)Author
2013-08-04md/raid1: fix bio handling problems in process_checks()NeilBrown
commit 30bc9b53878a9921b02e3b5bc4283ac1c6de102a upstream. Recent change to use bio_copy_data() in raid1 when repairing an array is faulty. The underlying may have changed the bio in various ways using bio_advance and these need to be undone not just for the 'sbio' which is being copied to, but also the 'pbio' (primary) which is being copied from. So perform the reset on all bios that were read from and do it early. This also ensure that the sbio->bi_io_vec[j].bv_len passed to memcmp is correct. This fixes a crash during a 'check' of a RAID1 array. The crash was introduced in 3.10 so this is suitable for 3.10-stable. Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-13Merge tag 'md-3.10-fixes' of git://neil.brown.name/mdLinus Torvalds
Pull md bugfixes from Neil Brown: "A few bugfixes for md Some tagged for -stable" * tag 'md-3.10-fixes' of git://neil.brown.name/md: md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place md/raid1,raid10: use freeze_array in place of raise_barrier in various places. md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it. md: md_stop_writes() should always freeze recovery.
2013-06-13md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in placeH. Peter Anvin
There are cases where the kernel will believe that the WRITE SAME command is supported by a block device which does not, in fact, support WRITE SAME. This currently happens for SATA drivers behind a SAS controller, but there are probably a hundred other ways that can happen, including drive firmware bugs. After receiving an error for WRITE SAME the block layer will retry the request as a plain write of zeroes, but mdraid will consider the failure as fatal and consider the drive failed. This has the effect that all the mirrors containing a specific set of data are each offlined in very rapid succession resulting in data loss. However, just bouncing the request back up to the block layer isn't ideal either, because the whole initial request-retry sequence should be inside the write bitmap fence, which probably means that md needs to do its own conversion of WRITE SAME to write zero. Until the failure scenario has been sorted out, disable WRITE SAME for raid1, raid5, and raid10. [neilb: added raid5] This patch is appropriate for any -stable since 3.7 when write_same support was added. Cc: stable@vger.kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-13md/raid1,raid10: use freeze_array in place of raise_barrier in various places.NeilBrown
Various places in raid1 and raid10 are calling raise_barrier when they really should call freeze_array. The former is only intended to be called from "make_request". The later has extra checks for 'nr_queued' and makes a call to flush_pending_writes(), so it is safe to call it from within the management thread. Using raise_barrier will sometimes deadlock. Using freeze_array should not. As 'freeze_array' currently expects one request to be pending (in handle_read_error - the only previous caller), we need to pass it the number of pending requests (extra) to ignore. The deadlock was made particularly noticeable by commits 050b66152f87c7 (raid10) and 6b740b8d79252f13 (raid1) which appeared in 3.4, so the fix is appropriate for any -stable kernel since then. This patch probably won't apply directly to some early kernels and will need to be applied by hand. Cc: stable@vger.kernel.org Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-13md/raid1: consider WRITE as successful only if at least one non-Faulty and ↵Alex Lyakas
non-rebuilding drive completed it. Without that fix, the following scenario could happen: - RAID1 with drives A and B; drive B was freshly-added and is rebuilding - Drive A fails - WRITE request arrives to the array. It is failed by drive A, so r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B succeeds in writing it, so the same r1_bio is marked as R1BIO_Uptodate. - r1_bio arrives to handle_write_finished, badblocks are disabled, md_error()->error() does nothing because we don't fail the last drive of raid1 - raid_end_bio_io() calls call_bio_endio() - As a result, in call_bio_endio(): if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) clear_bit(BIO_UPTODATE, &bio->bi_flags); this code doesn't clear the BIO_UPTODATE flag, and the whole master WRITE succeeds, back to the upper layer. So we returned success to the upper layer, even though we had written the data onto the rebuilding drive only. But when we want to read the data back, we would not read from the rebuilding drive, so this data is lost. [neilb - applied identical change to raid10 as well] This bug can result in lost data, so it is suitable for any -stable kernel. Cc: stable@vger.kernel.org Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-05-08Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block core updates from Jens Axboe: - Major bit is Kents prep work for immutable bio vecs. - Stable candidate fix for a scheduling-while-atomic in the queue bypass operation. - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging discard bios. - Tejuns changes to convert the writeback thread pool to the generic workqueue mechanism. - Runtime PM framework, SCSI patches exists on top of these in James' tree. - A few random fixes. * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits) relay: move remove_buf_file inside relay_close_buf partitions/efi.c: replace useless kzalloc's by kmalloc's fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read() block: fix max discard sectors limit blkcg: fix "scheduling while atomic" in blk_queue_bypass_start Documentation: cfq-iosched: update documentation help for cfq tunables writeback: expose the bdi_wq workqueue writeback: replace custom worker pool implementation with unbound workqueue writeback: remove unused bdi_pending_list aoe: Fix unitialized var usage bio-integrity: Add explicit field for owner of bip_buf block: Add an explicit bio flag for bios that own their bvec block: Add bio_alloc_pages() block: Convert some code to bio_for_each_segment_all() block: Add bio_for_each_segment_all() bounce: Refactor __blk_queue_bounce to not use bi_io_vec raid1: use bio_copy_data() pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage pktcdvd: use bio_copy_data() block: Add bio_copy_data() ...
2013-04-30MD: ignore discard request for hard disks of hybid raid1/raid10 arrayShaohua Li
In SSD/hard disk hybid storage, discard request should be ignored for hard disk. We used to be doing this way, but the unplug path forgets it. This is suitable for stable tree since v3.6. Cc: stable@vger.kernel.org Reported-and-tested-by: Markus <M4rkusXXL@web.de> Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-30md: raid1/raid10 md devices leak memory when stoppingHirokazu Takahashi
Hi. Raid1 and raid10 devices leak memory every time they stop. This is a patch for linux-3.9.0-rc7 to fix this problem. Thanks, Hirokazu Takahashi. Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp> Signed-off-by: NeilBrown <neilb@suse.de>
2013-03-23block: Add bio_alloc_pages()Kent Overstreet
More utility code to replace stuff that's getting open coded. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de>
2013-03-23block: Convert some code to bio_for_each_segment_all()Kent Overstreet
More prep work for immutable bvecs: A few places in the code were either open coding or using the wrong version - fix. After we introduce the bvec iter, it'll no longer be possible to modify the biovec through bio_for_each_segment_all() - it doesn't increment a pointer to the current bvec, you pass in a struct bio_vec (not a pointer) which is updated with what the current biovec would be (taking into account bi_bvec_done and bi_size). So because of that it's more worthwhile to be consistent about bio_for_each_segment()/bio_for_each_segment_all() usage. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Alexander Viro <viro@zeniv.linux.org.uk>
2013-03-23block: Add bio_for_each_segment_all()Kent Overstreet
__bio_for_each_segment() iterates bvecs from the specified index instead of bio->bv_idx. Currently, the only usage is to walk all the bvecs after the bio has been advanced by specifying 0 index. For immutable bvecs, we need to split these apart; bio_for_each_segment() is going to have a different implementation. This will also help document the intent of code that's using it - bio_for_each_segment_all() is only legal to use for code that owns the bio. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Neil Brown <neilb@suse.de> CC: Boaz Harrosh <bharrosh@panasas.com>
2013-03-23raid1: use bio_copy_data()Kent Overstreet
This doesn't really delete any code _yet_, but once immutable bvecs are done we can just delete the rest of the code in that loop. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de>
2013-03-23raid1: Refactor narrow_write_error() to not use bi_idxKent Overstreet
More bi_idx removal. This code was just open coding bio_clone(). This could probably be further improved by using bio_advance() instead of skipping over null pages, but that'd be a larger rework. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de>
2013-03-23raid1: use bio_reset()Kent Overstreet
Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de>
2013-03-23block: Add submit_bio_wait(), remove from mdKent Overstreet
Random cleanup - this code was duplicated and it's not really specific to md. Also added the ability to return the actual error code. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> Acked-by: Tejun Heo <tj@kernel.org>
2013-03-23block: Use bio_sectors() more consistentlyKent Overstreet
Bunch of places in the code weren't using it where they could be - this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: "Ed L. Cashin" <ecashin@coraid.com> CC: Nick Piggin <npiggin@kernel.dk> CC: Jiri Kosina <jkosina@suse.cz> CC: Jim Paris <jim@jtan.com> CC: Geoff Levand <geoff@infradead.org> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Neil Brown <neilb@suse.de> CC: Steven Rostedt <rostedt@goodmis.org> Acked-by: Ed Cashin <ecashin@coraid.com>
2013-03-23block: Add bio_end_sector()Kent Overstreet
Just a little convenience macro - main reason to add it now is preparing for immutable bio vecs, it'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Lars Ellenberg <drbd-dev@lists.linbit.com> CC: Jiri Kosina <jkosina@suse.cz> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Neil Brown <neilb@suse.de> CC: Martin Schwidefsky <schwidefsky@de.ibm.com> CC: Heiko Carstens <heiko.carstens@de.ibm.com> CC: linux-s390@vger.kernel.org CC: Chris Mason <chris.mason@fusionio.com> CC: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2013-02-26md/raid1,raid10: fix deadlock with freeze_array()NeilBrown
When raid1/raid10 needs to fix a read error, it first drains all pending requests by calling freeze_array(). This calls flush_pending_writes() if it needs to sleep, but some writes may be pending in a per-process plug rather than in the per-array request queue. When raid1{,0}_unplug() moves the request from the per-process plug to the per-array request queue (from which flush_pending_writes() can flush them), it needs to wake up freeze_array(), or freeze_array() will never flush them and so it will block forever. So add the requires wake_up() calls. This bug was introduced by commit f54a9d0e59c4bea3db733921ca9147612a6f292c for raid1 and a similar commit for RAID10, and so has been present since linux-3.6. As the bug causes a deadlock I believe this fix is suitable for -stable. Cc: stable@vger.kernel.org (3.6.y 3.7.y 3.8.y) Reported-by: Tregaron Bayly <tbayly@bluehost.com> Tested-by: Tregaron Bayly <tbayly@bluehost.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26md: raid1,10: Handle REQ_WRITE_SAME flag in write biosJoe Lawrence
Set mddev queue's max_write_same_sectors to its chunk_sector value (before disk_stack_limits merges the underlying disk limits.) With that in place, be sure to handle writes coming down from the block layer that have the REQ_WRITE_SAME flag set. That flag needs to be copied into any newly cloned write bio. Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-12-17Merge branch 'for-3.8/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver update from Jens Axboe: "Now that the core bits are in, here are the driver bits for 3.8. The branch contains: - A huge pile of drbd bits that were dumped from the 3.7 merge window. Following that, it was both made perfectly clear that there is going to be no more over-the-wall pulls and how the situation on individual pulls can be improved. - A few cleanups from Akinobu Mita for drbd and cciss. - Queue improvement for loop from Lukas. This grew into adding a generic interface for waiting/checking an even with a specific lock, allowing this to be pulled out of md and now loop and drbd is also using it. - A few fixes for xen back/front block driver from Roger Pau Monne. - Partition improvements from Stephen Warren, allowing partiion UUID to be used as an identifier." * 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits) drbd: update Kconfig to match current dependencies drbd: Fix drbdsetup wait-connect, wait-sync etc... commands drbd: close race between drbd_set_role and drbd_connect drbd: respect no-md-barriers setting also when changed online via disk-options drbd: Remove obsolete check drbd: fixup after wait_even_lock_irq() addition to generic code loop: Limit the number of requests in the bio list wait: add wait_event_lock_irq() interface xen-blkfront: free allocated page xen-blkback: move free persistent grants code block: partition: msdos: provide UUIDs for partitions init: reduce PARTUUID min length to 1 from 36 block: store partition_meta_info.uuid as a string cciss: use check_signature() cciss: cleanup bitops usage drbd: use copy_highpage drbd: if the replication link breaks during handshake, keep retrying drbd: check return of kmalloc in receive_uuids drbd: Broadcast sync progress no more often than once per second drbd: don't try to clear bits once the disk has failed ...
2012-11-30wait: add wait_event_lock_irq() interfaceLukas Czerner
New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit moves the private wait_event_lock_irq() macro from MD to regular wait includes, introduces new macro wait_event_lock_irq_cmd() instead of using the old method with omitting cmd parameter which is ugly and makes a use of new macros in the MD. It also introduces the _interruptible_ variant. The use of new interface is when one have a special lock to protect data structures used in the condition, or one also needs to invoke "cmd" before putting it to sleep. All new macros are expected to be called with the lock taken. The lock is released before sleep and is reacquired afterwards. We will leave the macro with the lock held. Note to DM: IMO this should also fix theoretical race on waitqueue while using simultaneously wait_event_lock_irq() and wait_event() because of lack of locking around current state setting and wait queue removal. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Neil Brown <neilb@suse.de> Cc: David Howells <dhowells@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-27md/raid1{,0}: fix deadlock in bitmap_unplug.NeilBrown
If the raid1 or raid10 unplug function gets called from a make_request function (which is very possible) when there are bios on the current->bio_list list, then it will not be able to successfully call bitmap_unplug() and it could need to submit more bios and wait for them to complete. But they won't complete while current->bio_list is non-empty. So detect that case and handle the unplugging off to another thread just like we already do when called from within the scheduler. RAID1 version of bug was introduced in 3.6, so that part of fix is suitable for 3.6.y. RAID10 part won't apply. Cc: stable@vger.kernel.org Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com> Reported-by: Peter Maloney <peter.maloney@brockmann-consult.de> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-31md/raid1: Fix assembling of arrays containing Replacements.NeilBrown
setup_conf in raid1.c uses conf->raid_disks before assigning a value. It is used when including 'Replacement' devices. The consequence is that assembling an array which contains a replacement will misbehave and either not include the replacement, or not include the device being replaced. Though this doesn't lead directly to data corruption, it could lead to reduced data safety. So use mddev->raid_disks, which is initialised, instead. Bug was introduced by commit c19d57980b38a5bb613a898937a1cf85f422fb9b md/raid1: recognise replacements when assembling arrays. in 3.3, so fix is suitable for 3.3.y thru 3.6.y. Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid racesJianpeng Ma
Now that multiple threads can handle stripes, it is safer to use an atomic64_t for resync_mismatches, to avoid update races. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11md/raid1: Don't release reference to device while handling read error.NeilBrown
When we get a read error, we arrange for raid1d to handle it. Currently we release the reference on the device. This can result in conf->mirrors[read_disk].rdev being NULL in fix_read_error, if the device happens to get removed before the read error is handled. So instead keep the reference until the read error has been fully handled. Reported-by: hank <pyu@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11MD: change the parameter of md threadShaohua Li
Change the thread parameter, so the thread can carry extra info. Next patch will use it. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11md: raid 1 supports TRIMShaohua Li
This makes md raid 1 support TRIM. If one disk supports discard and another not, or one has discard_zero_data and another not, there could be inconsistent between data from such disks. But this should not matter, discarded data is useless. This will add extra copy in rebuild though. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-08-02Merge tag 'md-3.6' of git://neil.brown.name/mdLinus Torvalds
Pull additional md update from NeilBrown: "This contains a few patches that depend on plugging changes in the block layer so needed to wait for those. It also contains a Kconfig fix for the new RAID10 support in dm-raid." * tag 'md-3.6' of git://neil.brown.name/md: md/dm-raid: DM_RAID should select MD_RAID10 md/raid1: submit IO from originating thread instead of md thread. raid5: raid5d handle stripe in batch way raid5: make_request use batch stripe release
2012-08-02md/raid1: submit IO from originating thread instead of md thread.NeilBrown
queuing writes to the md thread means that all requests go through the one processor which may not be able to keep up with very high request rates. So use the plugging infrastructure to submit all requests on unplug. If a 'schedule' is needed, we fall back on the old approach of handing the requests to the thread for it to handle. Signed-off-by: NeilBrown <neilb@suse.de>
2012-08-01Merge branch 'for-3.6/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver changes from Jens Axboe: - Making the plugging support for drivers a bit more sane from Neil. This supersedes the plugging change from Shaohua as well. - The usual round of drbd updates. - Using a tail add instead of a head add in the request completion for ndb, making us find the most completed request more quickly. - A few floppy changes, getting rid of a duplicated flag and also running the floppy init async (since it takes forever in boot terms) from Andi. * 'for-3.6/drivers' of git://git.kernel.dk/linux-block: floppy: remove duplicated flag FD_RAW_NEED_DISK blk: pass from_schedule to non-request unplug functions. block: stack unplug blk: centralize non-request unplug handling. md: remove plug_cnt feature of plugging. block/nbd: micro-optimization in nbd request completion drbd: announce FLUSH/FUA capability to upper layers drbd: fix max_bio_size to be unsigned drbd: flush drbd work queue before invalidate/invalidate remote drbd: fix potential access after free drbd: call local-io-error handler early drbd: do not reset rs_pending_cnt too early drbd: reset congestion information before reporting it in /proc/drbd drbd: report congestion if we are waiting for some userland callback drbd: differentiate between normal and forced detach drbd: cleanup, remove two unused global flags floppy: Run floppy initialization asynchronous
2012-07-31md: remove plug_cnt feature of plugging.NeilBrown
This seemed like a good idea at the time, but after further thought I cannot see it making a difference other than very occasionally and testing to try to exercise the case it is most likely to help did not show any performance difference by removing it. So remove the counting of active plugs and allow 'pending writes' to be activated at any time, not just when no plugs are active. This is only relevant when there is a write-intent bitmap, and the updating of the bitmap will likely introduce enough delay that the single-threading of bitmap updates will be enough to collect large numbers of updates together. Removing this will make it easier to centralise the unplug code, and will clear the other for other unplug enhancements which have a measurable effect. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31md/RAID1: Add missing case for attempting to repair known bad blocks.Alexander Lyakas
When doing resync or repair, attempt to correct bad blocks, according to WriteErrorSeen policy Signed-off-by: Alex Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-31md/raid1: don't abort a resync on the first badblock.NeilBrown
If a resync of a RAID1 array with 2 devices finds a known bad block one device it will neither read from, or write to, that device for this block offset. So there will be one read_target (The other device) and zero write targets. This condition causes md/raid1 to abort the resync assuming that it has finished - without known bad blocks this would be true. When there are no write targets because of the presence of bad blocks we should only skip over the area covered by the bad block. RAID10 already gets this right, raid1 doesn't. Or didn't. As this can cause a 'sync' to abort early and appear to have succeeded it could lead to some data corruption, so it suitable for -stable. Cc: stable@vger.kernel.org Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-31md/raid1: prevent merging too large requestShaohua Li
For SSD, if request size exceeds specific value (optimal io size), request size isn't important for bandwidth. In such condition, if making request size bigger will cause some disks idle, the total throughput will actually drop. A good example is doing a readahead in a two-disk raid1 setup. So when should we split big requests? We absolutly don't want to split big request to very small requests. Even in SSD, big request transfer is more efficient. This patch only considers request with size above optimal io size. If all disks are busy, is it worth doing a split? Say optimal io size is 16k, two requests 32k and two disks. We can let each disk run one 32k request, or split the requests to 4 16k requests and each disk runs two. It's hard to say which case is better, depending on hardware. So only consider case where there are idle disks. For readahead, split is always better in this case. And in my test, below patch can improve > 30% thoughput. Hmm, not 100%, because disk isn't 100% busy. Such case can happen not just in readahead, for example, in directio. But I suppose directio usually will have bigger IO depth and make all disks busy, so I ignored it. Note: if the raid uses any hard disk, we don't prevent merging. That will make performace worse. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-31md/raid1: read balance chooses idlest disk for SSDShaohua Li
SSD hasn't spindle, distance between requests means nothing. And the original distance based algorithm sometimes can cause severe performance issue for SSD raid. Considering two thread groups, one accesses file A, the other access file B. The first group will access one disk and the second will access the other disk, because requests are near from one group and far between groups. In this case, read balance might keep one disk very busy but the other relative idle. For SSD, we should try best to distribute requests to as many disks as possible. There isn't spindle move penality anyway. With below patch, I can see more than 50% throughput improvement sometimes depending on workloads. The only exception is small requests can be merged to a big request which typically can drive higher throughput for SSD too. Such small requests are sequential reads. Unlike hard disk, sequential read which can't be merged (for example direct IO, or read without readahead) can be ignored for SSD. Again there is no spindle move penality. readahead dispatches small requests and such requests can be merged. Last patch can help detect sequential read well, at least if concurrent read number isn't greater than raid disk number. In that case, distance based algorithm doesn't work well too. V2: For hard disk and SSD mixed raid, doesn't use distance based algorithm for random IO too. This makes the algorithm generic for raid with SSD. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-31md/raid1: make sequential read detection per disk basedShaohua Li
Currently the sequential read detection is global wide. It's natural to make it per disk based, which can improve the detection for concurrent multiple sequential reads. And next patch will make SSD read balance not use distance based algorithm, where this change help detect truly sequential read for SSD. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-31MD: Move macros from raid1*.h to raid1*.cJonathan Brassow
MD RAID1/RAID10: Move some macros from .h file to .c file There are three macros (IO_BLOCKED,IO_MADE_GOOD,BIO_SPECIAL) which are defined in both raid1.h and raid10.h. They are only used in there respective .c files. However, if we wish to make RAID10 accessible to the device-mapper RAID target (dm-raid.c), then we need to move these macros into the .c files where they are used so that they do not conflict with each other. The macros from the two files are identical and could be moved into md.h, but I chose to leave the duplication and have them remain in the personality files. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-31MD RAID1: rename mirror_info structureJonathan Brassow
MD RAID1: Rename the structure 'mirror_info' to 'raid1_info' The same structure name ('mirror_info') is used by raid10. Each of these structures are defined in there respective header files. If dm-raid is to support both RAID1 and RAID10, the header files will be included and the structure names must not collide. While only one of these structure names needs to change, this patch adds consistency to the naming of the structure. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-19md/raid1: close some possible races on write errors during resyncNeilBrown
commit 4367af556133723d0f443e14ca8170d9447317cb md/raid1: clear bad-block record when write succeeds. Added a 'reschedule_retry' call possibility at the end of end_sync_write, but didn't add matching code at the end of sync_request_write. So if the writes complete very quickly, or scheduling makes it seem that way, then we can miss rescheduling the request and the resync could hang. Also commit 73d5c38a9536142e062c35997b044e89166e063b md: avoid races when stopping resync. Fix a race condition in this same code in end_sync_write but didn't make the change in sync_request_write. This patch updates sync_request_write to fix both of those. Patch is suitable for 3.1 and later kernels. Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Original-version-by: Alexander Lyakas <alex.bolshoy@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-09md/raid1: fix use-after-free bug in RAID1 data-check code.NeilBrown
This bug has been present ever since data-check was introduce in 2.6.16. However it would only fire if a data-check were done on a degraded array, which was only possible if the array has 3 or more devices. This is certainly possible, but is quite uncommon. Since hot-replace was added in 3.3 it can happen more often as the same condition can arise if not all possible replacements are present. The problem is that as soon as we submit the last read request, the 'r1_bio' structure could be freed at any time, so we really should stop looking at it. If the last device is being read from we will stop looking at it. However if the last device is not due to be read from, we will still check the bio pointer in the r1_bio, but the r1_bio might already be free. So use the read_targets counter to make sure we stop looking for bios to submit as soon as we have submitted them all. This fix is suitable for any -stable kernel since 2.6.16. Cc: stable@vger.kernel.org Reported-by: Arnold Schulz <arnysch@gmx.net> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03md: fix up plugging (again).NeilBrown
The value returned by "mddev_check_plug" is only valid until the next 'schedule' as that will unplug things. This could happen at any call to mempool_alloc. So just calling mddev_check_plug at the start doesn't really make sense. So call it just before, or just after, queuing things for the thread. As the action that happens at unplug is to wake the thread, this makes lots of sense. If we cannot add a plug (which requires a small GFP_ATOMIC alloc) we wake thread immediately. RAID5 is a bit different. Requests are queued for the thread and the thread is woken by release_stripe. So we don't need to wake the thread on failure. However the thread doesn't perform certain actions when there is any active plug, so it is important to install a plug before waking the thread. So for RAID5 we install the plug *before* queuing the request and waking the thread. Without this patch it is possible for raid1 or raid10 to queue a request without then waking the thread, resulting in the array locking up. Also change raid10 to only flush_pending_write when there are not active plugs, just like raid1. This patch is suitable for 3.0 or later. I plan to submit it to -stable, but I'll like to let it spend a few weeks in mainline first to be sure it is completely safe. Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03md/raid1: fix bug in read_balance introduced by hot-replaceNeilBrown
When we added hot_replace we doubled the number of devices that could be in a RAID1 array. So we doubled how far read_balance would search. Unfortunately we didn't double the point at which it looped back to the beginning - so it effectively loops over all non-replacement disks twice. This doesn't cause bad behaviour, but it pointless and means we never read from replacement devices. Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03md: make 'name' arg to md_register_thread non-optional.NeilBrown
Having the 'name' arg optional and defaulting to the current personality name is no necessary and leads to errors, as when changing the level of an array we can end up using the name of the old level instead of the new one. So make it non-optional and always explicitly pass the name of the level that the array will be. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-31md: raid1/raid10: fix problem with merge_bvec_fnNeilBrown
The new merge_bvec_fn which calls the corresponding function in subsidiary devices requires that mddev->merge_check_needed be set if any child has a merge_bvec_fn. However were were only setting that when a device was hot-added, not when a device was present from the start. This bug was introduced in 3.4 so patch is suitable for 3.4.y kernels. However that are conflicts in raid10.c so a separate patch will be needed for 3.4.y. Cc: stable@vger.kernel.org Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22MD RAID1: Further conditionalize 'fullsync'Jonathan Brassow
A RAID1 device does not necessarily need a fullsync if the bitmap can be used instead. Similar to commit d6b212f4b19da5301e6b6eca562e5c7a2a6e8c8d in raid5.c, if a raid1 device can be brought back (i.e. from a transient failure) it shouldn't need a complete resync. Provided the bitmap is not to old, it will have recorded the areas of the disk that need recovery. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22md: allow array to be resized while bitmap is present.NeilBrown
Now that bitmaps can be resized, we can allow an array to be resized while the bitmap is present. This only covers resizing that involves changing the effective size of member devices, not resizing that changes the number of devices. Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22md/raid1: allow fix_read_error to read from recovering device.majianpeng
When attempting to fix a read error, it is acceptable to read from a device that is recovering, provided the recovery has got past the place we are reading from. This makes the test for "can we read from here" the same as the test in read_balance. Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21md: add possibility to change data-offset for devices.NeilBrown
When reshaping we can avoid costly intermediate backup by changing the 'start' address of the array on the device (if there is enough room). So as a first step, allow such a change to be requested through sysfs, and recorded in v1.x metadata. (As we didn't previous check that all 'pad' fields were zero, we need a new FEATURE flag for this. A (belatedly) check that all remaining 'pad' fields are zero to avoid a repeat of this) The new data offset must be requested separately for each device. This allows each to have a different change in the data offset. This is not likely to be used often but as data_offset can be set per-device, new_data_offset should be too. This patch also removes the 'acknowledged' arg to rdev_set_badblocks as it is never used and never will be. At the same time we add a new arg ('in_new') which is currently always zero but will be used more soon. When a reshape finishes we will need to update the data_offset and rdev->sectors. So provide an exported function to do that. Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-12md/raid1,raid10: Fix calculation of 'vcnt' when processing error recovery.majianpeng
If r1bio->sectors % 8 != 0,then the memcmp and a later memcpy will omit the last bio_vec. This is suitable for any stable kernel since 3.1 when bad-block management was introduced. Cc: stable@vger.kernel.org Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03md/raid1,raid10: don't compare excess byte during consistency check.NeilBrown
When comparing two pages read from different legs of a mirror, only compare the bytes that were read, not the whole page. In most cases we read a whole page, but in some cases with bad blocks or odd sizes devices we might read fewer than that. This bug has been present "forever" but at worst it might cause a report of two many mismatches and generate a little bit extra resync IO, so there is no need to back-port to -stable kernels. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>