path: root/drivers/md
2009-01-18  md: fix bitmap-on-external-file bug.  (NeilBrown)

commit 538452700d95480c16e7aa6b10ff77cd937d33f4 upstream.

commit a2ed9615e3222645007fc19991aedf30eed3ecfd fixed a bug with 'internal' bitmaps, but in the process broke 'in a file' bitmaps, so they are broken in 2.6.28. This fixes it, and needs to go in 2.6.28-stable.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18  dm raid1: fix error count  (Jonathan Brassow)

commit d460c65a6a9ec9e0d284864ec3a9a2d1b73f0e43 upstream.

Always increase the error count when I/O on a leg of a mirror fails.

The error count is used to decide whether to select an alternative mirror leg. If the target doesn't use the "handle_errors" feature, the error count is not updated and the bio can get requeued forever by the read callback.

Fix it by increasing error_count before the handle_errors feature check.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
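A minimal sketch of the ordering, with names loosely following drivers/md/dm-raid1.c; this is an illustration of the fix described above, not the literal patch:

    static void fail_mirror(struct mirror *m)
    {
            struct mirror_set *ms = m->ms;

            /* Count the error unconditionally, so the read callback can
             * see that this leg is bad and select another one... */
            atomic_inc(&m->error_count);

            /* ...and only then honour the "handle_errors" feature. */
            if (!errors_handled(ms))
                    return;

            /* error handling proper: mark the leg out-of-sync, etc. */
    }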
2009-01-18  dm log: fix dm_io_client leak on error paths  (Takahiro Yasui)

commit c7a2bd19b7c1e0bd2c7604c53d2583e91e536948 upstream.

In the create_log_context function, dm_io_client_destroy needs to be called when the memory allocation of disk_header, sync_bits or recovering_bits fails, but it is not called.

Signed-off-by: Takahiro Yasui <tyasui@redhat.com>
Acked-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
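A condensed sketch of the rule the patch enforces: once dm_io_client_create() has succeeded, every later allocation failure must unwind through dm_io_client_destroy(). The structure (and LOG_IO_PAGES) is illustrative, not the exact dm-log.c code:

    static int create_log_context_sketch(struct log_c *lc, size_t buf_size)
    {
            lc->io_req.client = dm_io_client_create(LOG_IO_PAGES);
            if (IS_ERR(lc->io_req.client))
                    return PTR_ERR(lc->io_req.client);

            lc->disk_header = vmalloc(buf_size);
            if (!lc->disk_header) {
                    /* the previously missing cleanup */
                    dm_io_client_destroy(lc->io_req.client);
                    return -ENOMEM;
            }
            return 0;
    }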
2009-01-14  md: Don't read past end of bitmap when reading bitmap.  (NeilBrown)

commit a2ed9615e3222645007fc19991aedf30eed3ecfd upstream.

When we read the write-intent bitmap off the device, we currently read a whole number of pages. When PAGE_SIZE is 4K, this works because of the alignment we enforce on the superblock and bitmap. When PAGE_SIZE is 64K, this reads past the end of the device, which causes an error.

When we write the superblock, we ensure that the last page is clipped to just the required size. Copy that code into the read path so that we read only the required number of sectors.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-11-20  dm raid1: flush workqueue before destruction  (Mikulas Patocka)

commit 18776c7316545482a02bfaa2629a2aa1afc48357 upstream.

We queue work on the keventd queue, so this queue must be flushed in the destructor. Otherwise, keventd could access the mirror_set after it has been freed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
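A minimal sketch of the destructor ordering; flush_scheduled_work() waits for everything queued on the shared keventd workqueue to complete (free_context is a hypothetical teardown helper):

    static void mirror_dtr(struct dm_target *ti)
    {
            struct mirror_set *ms = ti->private;

            /* no queued work may still be running when ms is freed */
            flush_scheduled_work();

            free_context(ms, ti, ms->nr_mirrors);   /* hypothetical teardown */
    }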
2008-11-13  md: fix bug in raid10 recovery.  (Neil Brown)

commit a53a6c85756339f82ff19e001e90cfba2d6299a8 upstream

Adding a spare to a raid10 doesn't cause recovery to start. This is due to a silly typo in commit 6c2fce2ef6b4821c21b5c42c7207cb9cf8c87eda and so is a bug in 2.6.27 and .28-rc.

Thanks to Thomas Backlund for bisecting to find this.

Cc: Thomas Backlund <tmb@mandriva.org>
Cc: George Spelvin <linux@horizon.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-11-13  md: linear: Fix a division by zero bug for very small arrays.  (Andre Noll)

commit f1cd14ae52985634d0389e934eba25b5ecf24565 upstream

We currently oops with a divide error on starting a linear software raid array consisting of at least two very small (< 500K) devices.

The bug is caused by the calculation of the hash table size, which tries to compute sector_div(sz, base) with "base" being zero due to the small size of the component devices of the array.

Fix this by requiring the hash spacing to be at least one, which implies that "base" is also non-zero.

This bug has existed since about 2.6.14.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
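A sketch of the guard, with names approximating drivers/md/linear.c; the point is simply that clamping the spacing to at least one keeps the later divisor non-zero:

    static sector_t hash_spacing_sketch(sector_t array_sectors, int nr_zones)
    {
            sector_t spacing = array_sectors;

            sector_div(spacing, nr_zones);  /* can round down to 0 for tiny devices */
            if (spacing < 1)
                    spacing = 1;            /* keeps the later sector_div(sz, base) safe */
            return spacing;
    }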
2008-10-25  dm snapshot: fix primary_pe race  (Mikulas Patocka)

commit 7c5f78b9d7f21937e46c26db82976df4b459c95c upstream

Fix a race condition with primary_pe ref_count handling.

put_pending_exception runs under dm_snapshot->lock. It does atomic_dec_and_test on primary_pe->ref_count, and later does atomic_read on primary_pe->ref_count.

__origin_write does atomic_dec_and_test on primary_pe->ref_count without holding dm_snapshot->lock.

This opens the following race condition. Assume two CPUs: CPU1 is executing put_pending_exception (and holding dm_snapshot->lock), CPU2 is executing __origin_write in parallel, and primary_pe->ref_count == 2.

CPU1:

    if (primary_pe && atomic_dec_and_test(&primary_pe->ref_count))
            origin_bios = bio_list_get(&primary_pe->origin_bios);

... decrements primary_pe->ref_count to 1; doesn't load origin_bios.

CPU2:

    if (first && atomic_dec_and_test(&primary_pe->ref_count)) {
            flush_bios(bio_list_get(&primary_pe->origin_bios));
            free_pending_exception(primary_pe);
            /* If we got here, pe_queue is necessarily empty. */
            return r;
    }

... decrements primary_pe->ref_count to 0, submits pending bios, frees primary_pe.

CPU1:

    if (!primary_pe || primary_pe != pe)
            free_pending_exception(pe);

... this has no effect.

    if (primary_pe && !atomic_read(&primary_pe->ref_count))
            free_pending_exception(primary_pe);

... sees ref_count == 0 (written by CPU2): double free!

This bug can happen only if someone is simultaneously writing to both the origin and the snapshot.

If someone is writing only to the origin, __origin_write will submit the kcopyd request after it decrements primary_pe->ref_count, so the finished copy cannot race with the decrement. If someone is writing only to the snapshot, __origin_write isn't invoked at all and the race can't happen.

The race happens when someone writes to the snapshot: this creates a pending_exception with primary_pe == NULL and starts copying. Then someone writes to the same chunk in the snapshot, and __origin_write races with the termination of the already submitted request in pending_complete (which calls put_pending_exception).

This race may be the reason for these bugs:
http://bugzilla.kernel.org/show_bug.cgi?id=11636
https://bugzilla.redhat.com/show_bug.cgi?id=465825

The patch fixes the code to make sure that:

1. If atomic_dec_and_test(&primary_pe->ref_count) returns false, the process must no longer dereference primary_pe (because someone else may free it under us).

2. If atomic_dec_and_test(&primary_pe->ref_count) returns true, the process is responsible for freeing primary_pe.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-10-25  dm kcopyd: avoid queue shuffle  (Kazuo Ito)

commit b673c3a8192e28f13e2050a4b82c1986be92cc15 upstream

Write throughput to an LVM snapshot origin volume is an order of magnitude slower than to an LV without snapshots or to a snapshot target volume, especially for sequential writes with O_SYNC on.

The following patch, originally written by Kevin Jamieson and Jan Blunck and slightly modified for the current RCs by myself, tries to improve the performance by modifying the behaviour of kcopyd so that it pushes back an I/O job to the head of the job queue instead of the tail, as process_jobs() currently does when it has to wait for free pages. This way, write requests aren't shuffled and don't cause extra seeks.

I tested the patch against 2.6.27-rc5 and got the following results. The test is a dd command writing to a snapshot origin, followed by fsync to the file just created/updated. A couple of filesystem benchmarks gave me similar results for sequential writes, while random writes didn't suffer much.

    dd if=/dev/zero of=<somewhere on snapshot origin> bs=4096 count=... \
            [conv=notrunc when updating]

1) linux 2.6.27-rc5 without the patch, write to snapshot origin, average throughput (MB/s):

                       10M      100M     1000M
    create,dd        511.46   610.72    11.81
    create,dd+fsync    7.10     6.77     8.13
    update,dd        431.63   917.41    12.75
    update,dd+fsync    7.79     7.43     8.12

Compared with write throughput to an LV without any snapshots, all dd+fsync runs and the 1000 MiB writes perform very poorly:

                       10M      100M     1000M
    create,dd        555.03   608.98   123.29
    create,dd+fsync  114.27    72.78    76.65
    update,dd        152.34  1267.27   124.04
    update,dd+fsync  130.56    77.81    77.84

2) linux 2.6.27-rc5 with the patch, write to snapshot origin, average throughput (MB/s):

                       10M      100M     1000M
    create,dd        537.06   589.44    46.21
    create,dd+fsync   31.63    29.19    29.23
    update,dd        487.59   897.65    37.76
    update,dd+fsync   34.12    30.07    26.85

Although still not on par with plain LV performance (which cannot be fully matched because it is copy-on-write anyway), this simple patch successfully improves the throughput of dd+fsync while not affecting the rest.

Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Kazuo Ito <ito.kazuo@oss.ntt.co.jp>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
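A minimal sketch of the behavioural change, with names approximating drivers/md/dm-kcopyd.c: when a job cannot get pages, it is requeued at the head of its list rather than the tail, so in-order jobs are not leapfrogged:

    static void push_head(struct list_head *jobs, struct kcopyd_job *job)
    {
            unsigned long flags;

            spin_lock_irqsave(&job->kc->job_lock, flags);
            /* list_add() puts the job back at the head; requeueing at
             * the tail is what shuffled the queue and caused seeks */
            list_add(&job->list, jobs);
            spin_unlock_irqrestore(&job->kc->job_lock, flags);
    }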
2008-10-22  md: Fix rdev_size_store with size == 0  (Chris Webb)

commit 7d3c6f8717ee6c2bf6cba5fa0bda3b28fbda6015 upstream

Fix rdev_size_store with size == 0. size == 0 means to use the largest size allowed by the underlying device, and is used when modifying an active array.

This fixes a regression introduced by commit d7027458d68b2f1752a28016dcf2ffd0a7e8f567.

Signed-off-by: Chris Webb <chris@arachsys.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-10-01  dm mpath: add missing path switching locking  (Chandra Seetharaman)

Moving the path activation to a workqueue along with the scsi_dh patches introduced a race. It is due to the fact that the current_pgpath (in the multipath data structure) can be modified if changes happen in any of the paths leading to the LUN. If the changes lead to current_pgpath being set to NULL, this leads to an invalid access which results in the panic below.

This patch fixes that by storing the pgpath to activate in the multipath data structure and properly protecting it.

Note that if activate_path is called twice in succession with different pgpaths, with the second call arriving before the first one is done, then activate_path will be called twice for the second pgpath, which is fine.

    Unable to handle kernel paging request for data at address 0x00000020
    Faulting instruction address: 0xd000000000aa1844
    cpu 0x1: Vector: 300 (Data Access) at [c00000006b987a80]
        pc: d000000000aa1844: .activate_path+0x30/0x218 [dm_multipath]
        lr: c000000000087a2c: .run_workqueue+0x114/0x204
        sp: c00000006b987d00
       msr: 8000000000009032
       dar: 20
     dsisr: 40000000
      current = 0xc0000000676bb3f0
      paca    = 0xc0000000006f3680
        pid   = 2528, comm = kmpath_handlerd
    enter ? for help
    [c00000006b987da0] c000000000087a2c .run_workqueue+0x114/0x204
    [c00000006b987e40] c000000000088b58 .worker_thread+0x120/0x144
    [c00000006b987f00] c00000000008ca70 .kthread+0x78/0xc4
    [c00000006b987f90] c000000000027cc8 .kernel_thread+0x4c/0x68

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-01  dm: cope with access beyond end of device in dm_merge_bvec  (Mikulas Patocka)

If for any reason dm_merge_bvec() is given an offset beyond the end of the device, avoid an oops and always allow one page to be added to an empty bio. We'll reject the I/O later, after the bio is submitted.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-01  dm: always allow one page in dm_merge_bvec  (Mikulas Patocka)

Some callers assume they can always add at least one page to an empty bio, so dm_merge_bvec should not return 0 in this case: we'll reject the I/O later, after the bio is submitted.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
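A sketch of the combined rule from these two dm_merge_bvec fixes, under assumed names (compute_max_size is a hypothetical helper): clamp a "beyond end of device" result to zero instead of oopsing, and never report less than one page for an empty bio:

    static int dm_merge_bvec_sketch(struct request_queue *q,
                                    struct bvec_merge_data *bvm,
                                    struct bio_vec *biovec)
    {
            int max_size = compute_max_size(q, bvm);  /* hypothetical limit */

            if (max_size < 0)
                    max_size = 0;   /* offset past end of device: no oops */

            /* an empty bio must always be able to take one page;
             * bad I/O is rejected later, once the bio is submitted */
            if (max_size <= biovec->bv_len && !bvm->bi_size)
                    return biovec->bv_len;

            return max_size;
    }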
2008-09-19  md: Don't wait UNINTERRUPTIBLE for other resync to finish  (NeilBrown)

When two md arrays share some block device (e.g. each uses different partitions on the one device), a resync of one array will wait for the resync on the other to finish. This can be a long time, and as it currently waits in TASK_UNINTERRUPTIBLE, the softlockup code notices and complains.

So use TASK_INTERRUPTIBLE instead, and make sure to flush signals before calling schedule.

Signed-off-by: NeilBrown <neilb@suse.de>
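A condensed sketch of the wait pattern described above; the predicate name is hypothetical and the real md.c loop has more conditions:

    DEFINE_WAIT(wq);

    prepare_to_wait(&resync_wait, &wq, TASK_INTERRUPTIBLE);
    if (other_array_resyncing(mddev)) {     /* hypothetical predicate */
            /* we take no action on signals; flushing them just avoids
             * an immediate-wakeup loop while keeping the sleep visible
             * to the scheduler as interruptible */
            flush_signals(current);
            schedule();
    }
    finish_wait(&resync_wait, &wq);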
2008-09-01  Fix problem with waiting while holding rcu read lock in md/bitmap.c  (NeilBrown)

A recent patch to protect the rdev list with RCU locking leaves us with a problem: we can sleep on memalloc while holding the RCU lock.

The RCU lock is only needed while walking the linked list, as uninteresting devices (failed or spares) can be removed at any time.

So only take the RCU lock while actually walking the linked list. Take a refcount on the rdev during the time when we drop the lock and do the memalloc to start the IO. When we return to the locked code, all the interesting devices on the list will not have moved, so we can simply use list_for_each_continue_rcu to pick up where we left off.

Signed-off-by: NeilBrown <neilb@suse.de>
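A minimal sketch of the pin-and-resume pattern, with names approximating md (submit_page_io and the filter are hypothetical):

    mdk_rdev_t *rdev;

    rcu_read_lock();
    list_for_each_entry_rcu(rdev, &mddev->disks, same_set) {
            if (test_bit(Faulty, &rdev->flags))
                    continue;               /* uninteresting devices may vanish */

            atomic_inc(&rdev->nr_pending);  /* pin rdev across the sleep */
            rcu_read_unlock();

            submit_page_io(rdev);           /* may allocate memory and sleep */

            rcu_read_lock();                /* rdev stayed valid; the commit
                                             * resumes the walk from here */
            rdev_dec_pending(rdev, mddev);
    }
    rcu_read_unlock();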
2008-09-01  Remove invalidate_partition call from do_md_stop.  (NeilBrown)

When stopping an md array, or just switching to read-only, we currently call invalidate_partition while holding the mddev lock. The main reason for this is probably to ensure all dirty buffers are flushed (invalidate_partition calls fsync_bdev).

However if any dirty buffers are found, it will almost certainly cause a deadlock, as starting writeout will require an update to the superblock, and performing that update requires taking the mddev lock - which is already held.

This deadlock can be demonstrated by running "reboot -f -n" with a root filesystem on md/raid and some dirty buffers in memory.

All other calls to stop an array should already happen after a flush. The normal sequence is to stop using the array (e.g. umount), which will cause __blkdev_put to call sync_blockdev, and then to open the array and issue the STOP_ARRAY ioctl while the buffers are all still clean.

So this invalidate_partition is normally a no-op, except for the one case where it will cause a deadlock. So remove it.

This patch possibly addresses the regressions recorded in
http://bugzilla.kernel.org/show_bug.cgi?id=11460 and
http://bugzilla.kernel.org/show_bug.cgi?id=11452
though it isn't yet clear how it ever worked.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-08-07  md: cancel check/repair requests when recovery is needed  (Dan Williams)

If a 'repair' is requested when an array is in a position to 'recover', raid1 will perform the repair while md believes a recovery is happening. Address this at both ends: cancel check/repair requests upon detecting a recover condition, and do not call ->spare_active after completing a check/repair.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-08-05  Allow raid10 resync to happen in larger chunks.  (NeilBrown)

The raid10 resync/recovery code currently limits the amount of in-flight resync IO to 2Meg. This was copied from raid1, where it seems quite adequate. However for raid10, some layouts require a bit of seeking to perform a resync, and allowing a larger buffer size means that the seeking can be significantly reduced.

There is probably no real need to limit the amount of in-flight IO at all. Any shortage of memory will naturally reduce the amount of buffer space available down to a set minimum, and any concurrent normal IO will quickly cause resync IO to back off.

The only problem would be that normal IO has to wait for all resync IO to finish, so a very large amount of resync IO could cause unpleasant latency when normal IO starts up.

So: increase RESYNC_DEPTH to allow 32Meg of buffer (if memory is available), which seems to be a good amount. Also reduce the amount of memory reserved, as there is no need to keep 2Meg just for resync if memory is tight.

Thanks to Keld for the suggestion.

Cc: Keld Jørn Simonsen <keld@dkuug.dk>
Signed-off-by: NeilBrown <neilb@suse.de>
2008-08-05  Allow faulty devices to be removed from a readonly array.  (NeilBrown)

Removing faulty devices from an array is a two stage process. First the device is moved from being a part of the active array to being similar to a spare device. Then it can be removed by a request from user space.

The first step is currently not performed for read-only arrays, so the second step can never succeed.

So allow readonly arrays to remove failed devices (which aren't blocked).

Signed-off-by: NeilBrown <neilb@suse.de>
2008-08-05  Don't let a blocked_rdev interfere with read request in raid5/6  (NeilBrown)

When we have externally managed metadata, we need to mark a failed device as 'Blocked' and not allow any writes until that device has been marked as faulty in the metadata and the Blocked flag has been removed.

However it is perfectly OK to allow read requests when there is a Blocked device, and with a readonly array there may not be any metadata handler watching for blocked devices.

So in raid5/raid6, only allow a Blocked device to interfere with write requests or resync; read requests go through untouched.

raid1 and raid10 already differentiate between read and write properly.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-08-05  Fail safely when trying to grow an array with a write-intent bitmap.  (NeilBrown)

We cannot currently change the size of a write-intent bitmap. So if we change the size of an array which has such a bitmap, it tries to set bits beyond the end of the bitmap.

For now, simply reject any request to change the size of an array which has a bitmap. mdadm can remove the bitmap and add a new one after the array has changed size.

Signed-off-by: NeilBrown <neilb@suse.de>
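A condensed sketch of the check; this is an illustrative resize fragment, not the literal md.c update path (do_resize is hypothetical):

    static int update_size_sketch(mddev_t *mddev, sector_t num_sectors)
    {
            if (mddev->bitmap)
                    /* the bitmap can't grow with the array; mdadm can
                     * remove it, grow the array, and add a new one */
                    return -EBUSY;

            return do_resize(mddev, num_sectors);   /* hypothetical */
    }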
2008-08-05  Restore force switch of md array to readonly at reboot time.  (NeilBrown)

A recent patch allowed do_md_stop to know whether it was being called via an ioctl or not, and thus when to allow for an extra open file descriptor while checking if the array is in use.

This broke the switch to readonly performed by the shutdown notifier, which needs to work even when the array is still (apparently) active (as md doesn't get told when the filesystem becomes readonly).

So restore this feature by pretending that there can be lots of file descriptors open, but that we still want do_md_stop to switch to readonly.

Signed-off-by: NeilBrown <neilb@suse.de>
2008-08-05  Make writes to md/safe_mode_delay immediately effective.  (NeilBrown)

If we reduce the 'safe_mode_delay', it could still wait for the old delay to completely expire before doing anything about safe_mode. Thus the effect of the change is delayed.

To make the effect more immediate, run the timeout function immediately if the delay was reduced. This may cause it to run slightly earlier than required, but that is the safer option.

Signed-off-by: NeilBrown <neilb@suse.de>
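A sketch of the sysfs store path described above; illustrative, not the literal md.c handler (parse_delay is hypothetical):

    static ssize_t safe_delay_store(mddev_t *mddev, const char *buf, size_t len)
    {
            unsigned long old = mddev->safemode_delay;

            mddev->safemode_delay = parse_delay(buf);   /* hypothetical parser */
            if (mddev->safemode_delay < old)
                    /* may fire slightly early: the safe direction */
                    mod_timer(&mddev->safemode_timer, jiffies + 1);
            return len;
    }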
2008-08-01  Merge branch 'for-linus' of git://neil.brown.name/md  (Linus Torvalds)

* 'for-linus' of git://neil.brown.name/md:
  md: raid10: wake up frozen array
  md: do not count blocked devices as spares
  md: do not progress the resync process if the stripe was blocked
  md: delay notification of 'active_idle' to the recovery thread
  md: fix merge error
  md: move async_tx_issue_pending_all outside spin_lock_irq
2008-08-01  Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block  (Linus Torvalds)

* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  md: the bitmap code needs to use blk_plug_device_unlocked()
  block: add a blk_plug_device_unlocked() that grabs the queue lock
2008-08-01  md: the bitmap code needs to use blk_plug_device_unlocked()  (Jens Axboe)

It doesn't hold the queue lock, so it is racy on the queue flags and thus spews a warning.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-08-01  [PATCH] switch mtd and dm-table to lookup_bdev()  (Al Viro)

No need to open-code it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-08-01  md: raid10: wake up frozen array  (Arthur Jones)

When rescheduling a bio in raid10, we wake up the md thread; but if the array is frozen, this will have no effect, and the array remains frozen for eternity. We add a wake_up to allow the array to unfreeze. This code is nearly identical to the raid1 code, which already has this fix.

Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
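A sketch of the raid10 path after the fix, with names approximating the raid10.c of that era; the wake_up on the barrier waitqueue is the added line:

    static void reschedule_retry(r10bio_t *r10_bio)
    {
            conf_t *conf = r10_bio->mddev->private;
            unsigned long flags;

            spin_lock_irqsave(&conf->device_lock, flags);
            list_add(&r10_bio->retry_list, &conf->retry_list);
            conf->nr_queued++;
            spin_unlock_irqrestore(&conf->device_lock, flags);

            /* the added wake-up: lets a frozen array thaw */
            wake_up(&conf->wait_barrier);

            md_wakeup_thread(r10_bio->mddev->thread);
    }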
2008-07-28  md: do not count blocked devices as spares  (Dan Williams)

remove_and_add_spares() assumes that failed devices have been hot-removed from the array. Removal is skipped in the 'blocked' case, so do not count a device in this state as a 'spare'.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-07-28  md: do not progress the resync process if the stripe was blocked  (Dan Williams)

handle_stripe will take no action on a stripe when waiting for userspace to unblock the array, so do not report completed sectors.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-07-26  [SCSI] scsi_dh: attach to hardware handler from dm-mpath  (Hannes Reinecke)

multipath keeps a separate device table which may be more current than the built-in one. So we should make sure to always call ->attach whenever a multipath map with a hardware handler is instantiated. And we should call ->detach on removal, too.

[sekharan: update as per comments from agk]

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-07-23  md: delay notification of 'active_idle' to the recovery thread  (Dan Williams)

sysfs_notify might sleep, so do not call it from md_safemode_timeout.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-07-23  md: fix merge error  (Dan Williams)

The original STRIPE_OP_IO removal patch had the following hunk:

    -	for (i = conf->raid_disks; i--; ) {
    +	for (i = conf->raid_disks; i--; )
    		set_bit(R5_Wantwrite, &sh->dev[i].flags);
    -		if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
    -			sh->ops.count++;
    -	}

However it appears the hunk became broken after merging:

    -	for (i = conf->raid_disks; i--; ) {
    +	for (i = conf->raid_disks; i--; )
    		set_bit(R5_Wantwrite, &sh->dev[i].flags);
    		set_bit(R5_LOCKED, &dev->flags);
    		s.locked++;
    -		if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
    -			sh->ops.count++;
    -	}

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-07-23  md: move async_tx_issue_pending_all outside spin_lock_irq  (Dan Williams)

Some dma drivers need to call spin_lock_bh in their device_issue_pending routines. This change avoids:

    WARNING: at kernel/softirq.c:136 local_bh_enable_ip+0x3a/0x85()

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2008-07-21  Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm  (Linus Torvalds)

* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
  dm crypt: add merge
  dm table: remove merge_bvec sector restriction
  dm: linear add merge
  dm: introduce merge_bvec_fn
  dm snapshot: use per device mempools
  dm snapshot: fix race during exception creation
  dm snapshot: track snapshot reads
  dm mpath: fix test for reinstate_path
  dm mpath: return parameter error
  dm io: remove struct padding
  dm log: make dm_dirty_log init and exit static
  dm mpath: free path selector on invalid args
2008-07-21  Merge branch 'for-linus' of git://neil.brown.name/md  (Linus Torvalds)

* 'for-linus' of git://neil.brown.name/md: (52 commits)
  md: Protect access to mddev->disks list using RCU
  md: only count actual openers as access which prevent a 'stop'
  md: linear: Make array_size sector-based and rename it to array_sectors.
  md: Make mddev->array_size sector-based.
  md: Make super_type->rdev_size_change() take sector-based sizes.
  md: Fix check for overlapping devices.
  md: Tidy up rdev_size_store a bit:
  md: Remove some unused macros.
  md: Turn rdev->sb_offset into a sector-based quantity.
  md: Make calc_dev_sboffset() return a sector count.
  md: Replace calc_dev_size() by calc_num_sectors().
  md: Make update_size() take the number of sectors.
  md: Better control of when do_md_stop is allowed to stop the array.
  md: get_disk_info(): Don't convert between signed and unsigned and back.
  md: Simplify restart_array().
  md: alloc_disk_sb(): Return proper error value.
  md: Simplify sb_equal().
  md: Simplify uuid_equal().
  md: sb_equal(): Fix misleading printk.
  md: Fix a typo in the comment to cmd_match().
  ...
2008-07-21  dm crypt: add merge  (Milan Broz)

This patch implements a biovec merge function for the crypt target.

If the underlying device has a merge function defined, call it. If not, keep the precomputed value.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21  dm table: remove merge_bvec sector restriction  (Milan Broz)

Remove the max_sector restriction - the merge function has replaced it.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21  dm: linear add merge  (Milan Broz)

This patch implements a biovec merge function for the linear target.

If the underlying device has a merge function defined, call it. If not, keep the precomputed value.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21  dm: introduce merge_bvec_fn  (Milan Broz)

Introduce a bvec merge function for device mapper devices for dynamic size restrictions.

This code ensures the requested biovec lies within a single target and then calls a target-specific function to check against any constraints imposed by underlying devices.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
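A sketch of the per-target delegation pattern the linear and crypt merge functions above share: remap the query to the underlying device, forward it to that device's merge_bvec_fn if one exists, and otherwise fall back to the precomputed limit. Names approximate dm-linear.c, but this is illustrative:

    static int linear_merge_sketch(struct dm_target *ti,
                                   struct bvec_merge_data *bvm,
                                   struct bio_vec *biovec, int max_size)
    {
            struct linear_c *lc = ti->private;
            struct request_queue *q = bdev_get_queue(lc->dev->bdev);

            if (!q->merge_bvec_fn)
                    return max_size;        /* keep the precomputed value */

            /* remap the query into the underlying device's coordinates */
            bvm->bi_bdev = lc->dev->bdev;
            bvm->bi_sector = linear_map_sector(ti, bvm->bi_sector);

            return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
    }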
2008-07-21  dm snapshot: use per device mempools  (Mikulas Patocka)

Change the snapshot per-module mempool to a per-device mempool. Per-module mempools could cause a deadlock if multiple snapshot devices are stacked above each other.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21  dm snapshot: fix race during exception creation  (Mikulas Patocka)

Fix a race condition that returns incorrect data when a write causes an exception to be allocated whilst a read is still in flight.

The race condition happens as follows:

* A read to a non-reallocated sector in the snapshot is submitted, so that the read is routed to the original device.
* A write to the original device is submitted. The write causes an exception that reallocates the block. The write proceeds.
* The original read is dequeued and reads the wrong data.

This race can be triggered with the CFQ scheduler and one thread writing and multiple threads reading simultaneously.

(This patch relies upon the earlier dm-kcopyd-per-device.patch to avoid a deadlock.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21  dm snapshot: track snapshot reads  (Mikulas Patocka)

Whenever a snapshot read gets mapped through to the origin, track it in a per-snapshot hash table indexed by chunk number, using memory allocated from a new per-snapshot mempool.

We need to track these reads to avoid race conditions which will be fixed by the patches that follow.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21  dm mpath: fix test for reinstate_path  (Alasdair G Kergon)

Test for the existence of the reinstate_path method before attempting to use it.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Julia Lawall <julia@diku.dk>
2008-07-21  dm mpath: return parameter error  (Mikulas Patocka)

Return a specific error message if there is an invalid number of multipath arguments.

This invalid command returns "Unknown error" because the ti->error field is not set:

    dmsetup create --table '0 2 multipath 0 0 1 1 round-robin 0 1 1 /dev/sdh' mpath0

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
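A minimal sketch of the fix inside a target constructor: set ti->error before returning, so userspace sees a meaningful message instead of "Unknown error". The argument count and message are illustrative:

    static int multipath_ctr_sketch(struct dm_target *ti,
                                    unsigned argc, char **argv)
    {
            if (argc < 2) {
                    ti->error = "not enough arguments";
                    return -EINVAL;
            }
            /* ... parse the remaining parameters ... */
            return 0;
    }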
2008-07-21  dm io: remove struct padding  (Richard Kennedy)

Rearrange struct dm_io. This shrinks its size from 40 to 32 bytes, allowing more objects per slab.

Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
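An illustration of the technique on a generic struct (not the actual dm_io layout): on a 64-bit kernel, grouping same-sized members avoids the padding holes the compiler inserts between a pointer and an int:

    struct padded {     /* 8 + (4 + 4 pad) + 8 + (4 + 4 pad) = 32 bytes */
            void *a;
            int   b;
            void *c;
            int   d;
    };

    struct packed {     /* 8 + 8 + 4 + 4 = 24 bytes, no holes */
            void *a;
            void *c;
            int   b;
            int   d;
    };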
2008-07-21  dm log: make dm_dirty_log init and exit static  (Adrian Bunk)

dm_dirty_log_{init,exit}() can now become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21  dm mpath: free path selector on invalid args  (Mikulas Patocka)

Free the path selector if the arguments are invalid.

This command (note that it is invalid) causes a reference leak on the module "dm_round_robin" and prevents the module from being removed:

    dmsetup create --table '0 2 multipath 0 0 1 1 round-robin /dev/sdh' mpath0

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-07-21  md: Protect access to mddev->disks list using RCU  (NeilBrown)

All modifications and most accesses to the mddev->disks list are made under the reconfig_mutex lock. However there are three places where the list is walked without any locking. If a reconfig happens at this time, havoc (and oops) can ensue.

So use RCU to protect these accesses:

- wrap them in rcu_read_{,un}lock()
- use list_for_each_entry_rcu
- add to the list with list_add_rcu
- delete from the list with list_del_rcu
- delay the 'free' with call_rcu rather than schedule_work

Note that export_rdev did a list_del_init on this list. In almost all cases the entry was not in the list anymore, so it was a no-op and thus safe. It is no longer safe, as after list_del_rcu we may not touch the list_head.

An audit shows that export_rdev is called:

- after unbind_rdev_from_array, in which case the delete has already been done,
- after bind_rdev_to_array fails, in which case the delete isn't needed,
- before the device has been put on a list at all (e.g. in add_new_disk, where reading the superblock fails),
- and in autorun_devices after a failure, when the device is on a different list.

So remove the list_del_init call from export_rdev, and add it back immediately before the call to export_rdev in that last case.

Note also that ->same_set is sometimes used for lists other than mddev->disks (e.g. candidates). In these cases RCU is not needed.

Signed-off-by: NeilBrown <neilb@suse.de>
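A minimal sketch of the read-side pattern the patch applies to the previously unlocked walkers (inspect_rdev is a hypothetical read-only use):

    mdk_rdev_t *rdev;

    rcu_read_lock();
    list_for_each_entry_rcu(rdev, &mddev->disks, same_set) {
            /* entries may be unlinked concurrently, but anything we can
             * still see here stays valid until rcu_read_unlock() */
            inspect_rdev(rdev);
    }
    rcu_read_unlock();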
2008-07-21  md: only count actual openers as access which prevent a 'stop'  (NeilBrown)

Open isn't the only thing that increments ->active; e.g. reading /proc/mdstat will increment it briefly. So to avoid false positives in the test for concurrent access, introduce a new counter that counts just the number of times the md device is open.

Signed-off-by: NeilBrown <neilb@suse.de>