summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2013-06-19md/raid1: consider WRITE as successful only if at least one non-Faulty and ↵Alex Lyakas
non-rebuilding drive completed it. commit 3056e3aec8d8ba61a0710fb78b2d562600aa2ea7 upstream. Without that fix, the following scenario could happen: - RAID1 with drives A and B; drive B was freshly-added and is rebuilding - Drive A fails - WRITE request arrives to the array. It is failed by drive A, so r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B succeeds in writing it, so the same r1_bio is marked as R1BIO_Uptodate. - r1_bio arrives to handle_write_finished, badblocks are disabled, md_error()->error() does nothing because we don't fail the last drive of raid1 - raid_end_bio_io() calls call_bio_endio() - As a result, in call_bio_endio(): if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) clear_bit(BIO_UPTODATE, &bio->bi_flags); this code doesn't clear the BIO_UPTODATE flag, and the whole master WRITE succeeds, back to the upper layer. So we returned success to the upper layer, even though we had written the data onto the rebuilding drive only. But when we want to read the data back, we would not read from the rebuilding drive, so this data is lost. [neilb - applied identical change to raid10 as well] This bug can result in lost data, so it is suitable for any -stable kernel. Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de> [bwh: Backported to 3.2: for raid10, s/rdev/conf->mirrors[dev].rdev/] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-05-30dm bufio: avoid a possible __vmalloc deadlockMikulas Patocka
commit 502624bdad3dba45dfaacaf36b7d83e39e74b2d2 upstream. This patch uses memalloc_noio_save to avoid a possible deadlock in dm-bufio. (it could happen only with large block size, at most PAGE_SIZE << MAX_ORDER (typically 8MiB). __vmalloc doesn't fully respect gfp flags. The specified gfp flags are used for allocation of requested pages, structures vmap_area, vmap_block and vm_struct and the radix tree nodes. However, the kernel pagetables are allocated always with GFP_KERNEL. Thus the allocation of pagetables can recurse back to the I/O layer and cause a deadlock. This patch uses the function memalloc_noio_save to set per-process PF_MEMALLOC_NOIO flag and the function memalloc_noio_restore to restore it. When this flag is set, all allocations in the process are done with implied GFP_NOIO flag, thus the deadlock can't happen. This should be backported to stable kernels, but they don't have the PF_MEMALLOC_NOIO flag and memalloc_noio_save/memalloc_noio_restore functions. So, PF_MEMALLOC should be set and restored instead. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> [bwh: Backported to 3.2 as recommended] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-05-30dm snapshot: fix error return code in snapshot_ctrWei Yongjun
commit 09e8b813897a0f85bb401435d009228644c81214 upstream. Return -ENOMEM instead of success if unable to allocate pending exception mempool in snapshot_ctr. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-05-13md: bad block list should default to disabled.NeilBrown
commit 486adf72ccc0c235754923d47a2270c5dcb0c98b upstream. Maintenance of a bad-block-list currently defaults to 'enabled' and is then disabled when it cannot be supported. This is backwards and causes problem for dm-raid which didn't know to disable it. So fix the defaults, and only enabled for v1.x metadata which explicitly has bad blocks enabled. The problem with dm-raid has been present since badblock support was added in v3.1, so this patch is suitable for any -stable from 3.1 onwards. Reported-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-03-27dm thin: fix discard corruptionJoe Thornber
commit f046f89a99ccfd9408b94c653374ff3065c7edb3 upstream. Fix a bug in dm_btree_remove that could leave leaf values with incorrect reference counts. The effect of this was that removal of a shared block could result in the space maps thinking the block was no longer used. More concretely, if you have a thin device and a snapshot of it, sending a discard to a shared region of the thin could corrupt the snapshot. Thinp uses a 2-level nested btree to store it's mappings. This first level is indexed by thin device, and the second level by logical block. Often when we're removing an entry in this mapping tree we need to rebalance nodes, which can involve shadowing them, possibly creating a copy if the block is shared. If we do create a copy then children of that node need to have their reference counts incremented. In this way reference counts percolate down the tree as shared trees diverge. The rebalance functions were incrementing the children at the appropriate time, but they were always assuming the children were internal nodes. This meant the leaf values (in our case packed block/flags entries) were not being incremented. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> [bwh: Backported to 3.2: bump target version numbers from 1.0.1 to 1.0.2] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-03-20dm snapshot: add missing module aliasesMikulas Patocka
commit 23cb21092eb9dcec9d3604b68d95192b79915890 upstream. Add module aliases so that autoloading works correctly if the user tries to activate "snapshot-origin" or "snapshot-merge" targets. Reference: https://bugzilla.redhat.com/889973 Reported-by: Chao Yang <chyang@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-03-20dm: fix truncated status stringsMikulas Patocka
commit fd7c092e711ebab55b2688d3859d95dfd0301f73 upstream. Avoid returning a truncated table or status string instead of setting the DM_BUFFER_FULL_FLAG when the last target of a table fills the buffer. When processing a table or status request, the function retrieve_status calls ti->type->status. If ti->type->status returns non-zero, retrieve_status assumes that the buffer overflowed and sets DM_BUFFER_FULL_FLAG. However, targets don't return non-zero values from their status method on overflow. Most targets returns always zero. If a buffer overflow happens in a target that is not the last in the table, it gets noticed during the next iteration of the loop in retrieve_status; but if a buffer overflow happens in the last target, it goes unnoticed and erroneously truncated data is returned. In the current code, the targets behave in the following way: * dm-crypt returns -ENOMEM if there is not enough space to store the key, but it returns 0 on all other overflows. * dm-thin returns errors from the status method if a disk error happened. This is incorrect because retrieve_status doesn't check the error code, it assumes that all non-zero values mean buffer overflow. * all the other targets always return 0. This patch changes the ti->type->status function to return void (because most targets don't use the return code). Overflow is detected in retrieve_status: if the status method fills up the remaining space completely, it is assumed that buffer overflow happened. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> [bwh: Backported to 3.2: - Adjust context - dm_status_fn doesn't take a status_flags parameter - Bump the last component of each current version (verified not to match any version used in mainline) - Drop changes to dm-verity] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-03-20md: raid0: fix error return from create_stripe_zones.NeilBrown
commit 58ebb34c49fcfcaa029e4b1c1453d92583900f9a upstream. Create_stripe_zones returns an error slightly differently to raid0_run and to raid0_takeover_*. The error returned used by the second was wrong and an error would result in mddev->private being set to NULL and sooner or later a crash. So never return NULL, return ERR_PTR(err), not NULL from create_stripe_zones. This bug has been present since 2.6.35 so the fix is suitable for any kernel since then. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-03-20md: fix two bugs when attempting to resize RAID0 array.NeilBrown
commit a64685399181780998281fe07309a94b25dd24c3 upstream. You cannot resize a RAID0 array (in terms of making the devices bigger), but the code doesn't entirely stop you. So: disable setting of the available size on each device for RAID0 and Linear devices. This must not change as doing so can change the effective layout of data. Make sure that the size that raid0_size() reports is accurate, but rounding devices sizes to chunk sizes. As the device sizes cannot change now, this isn't so important, but it is best to be safe. Without this change: mdadm --grow /dev/md0 -z max mdadm --grow /dev/md0 -Z max then read to the end of the array can cause a BUG in a RAID0 array. These bugs have been present ever since it became possible to resize any device, which is a long time. So the fix is suitable for any -stable kerenl. Signed-off-by: NeilBrown <neilb@suse.de> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-03-20md: protect against crash upon fsync on ro arraySebastian Riemer
commit bbfa57c0f2243a7c31fd248d22e9861a2802cad5 upstream. If an fsync occurs on a read-only array, we need to send a completion for the IO and may not increment the active IO count. Otherwise, we hit a bug trace and can't stop the MD array anymore. By advice of Christoph Hellwig we return success upon a flush request but we return -EROFS for other writes. We detect flush requests by checking if the bio has zero sectors. This patch is suitable to any -stable kernel to which it applies. Cc: Christoph Hellwig <hch@infradead.org> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: NeilBrown <neilb@suse.de> Signed-off-by: Sebastian Riemer <sebastian.riemer@profitbricks.com> Reported-by: Ben Hutchings <ben@decadent.org.uk> Acked-by: Paul Menzel <paulepanter@users.sourceforge.net> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-01-03dm ioctl: prevent unsafe change to dm_ioctl data_sizeAlasdair G Kergon
commit e910d7ebecd1aac43125944a8641b6cb1a0dfabe upstream. Abort dm ioctl processing if userspace changes the data_size parameter after we validated it but before we finished copying the data buffer from userspace. The dm ioctl parameters are processed in the following sequence: 1. ctl_ioctl() calls copy_params(); 2. copy_params() makes a first copy of the fixed-sized portion of the userspace parameters into the local variable "tmp"; 3. copy_params() then validates tmp.data_size and allocates a new structure big enough to hold the complete data and copies the whole userspace buffer there; 4. ctl_ioctl() reads userspace data the second time and copies the whole buffer into the pointer "param"; 5. ctl_ioctl() reads param->data_size without any validation and stores it in the variable "input_param_size"; 6. "input_param_size" is further used as the authoritative size of the kernel buffer. The problem is that userspace code could change the contents of user memory between steps 2 and 4. In particular, the data_size parameter can be changed to an invalid value after the kernel has validated it. This lets userspace force the kernel to access invalid kernel memory. The fix is to ensure that the size has not changed at step 4. This patch shouldn't have a security impact because CAP_SYS_ADMIN is required to run this code, but it should be fixed anyway. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-01-03dm persistent data: rename node to btree_nodeMikulas Patocka
commit 550929faf89e2e2cdb3e9945ea87d383989274cf upstream. This patch fixes a compilation failure on sparc32 by renaming struct node. struct node is already defined in include/linux/node.h. On sparc32, it happens to be included through other dependencies and persistent-data doesn't compile because of conflicting declarations. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-12-06dm: fix deadlock with request based dm and queue request_fn recursionJens Axboe
commit a8c32a5c98943d370ea606a2e7dc04717eb92206 upstream. Request based dm attempts to re-run the request queue off the request completion path. If used with a driver that potentially does end_io from its request_fn, we could deadlock trying to recurse back into request dispatch. Fix this by punting the request queue run to kblockd. Tested to fix a quickly reproducible deadlock in such a scenario. Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-12-06md: Avoid write invalid address if read_seqretry returned true.majianpeng
commit 35f9ac2dcec8f79d7059ce174fd7b7ee3290d620 upstream. If read_seqretry returned true and bbp was changed, it will write invalid address which can cause some serious problem. This bug was introduced by commit v3.0-rc7-130-g2699b67. So fix is suitable for 3.0.y thru 3.6.y. Reported-by: zhuwenfeng@kedacom.com Tested-by: zhuwenfeng@kedacom.com Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-12-06md: Reassigned the parameters if read_seqretry returned true in func ↵majianpeng
md_is_badblock. commit ab05613a0646dcc11049692d54bae76ca9ffa910 upstream. This bug was introduced by commit(v3.0-rc7-126-g2230dfe). So fix is suitable for 3.0.y thru 3.6.y. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-10-30md/raid10: use correct limit variableDan Carpenter
commit 91502f099dfc5a1e8812898e26ee280713e1d002 upstream. Clang complains that we are assigning a variable to itself. This should be using bad_sectors like the similar earlier check does. Bug has been present since 3.1-rc1. It is minor but could conceivably cause corruption or other bad behaviour. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-10-10md/raid10: fix "enough" function for detecting if array is failed.NeilBrown
commit 80b4812407c6b1f66a4f2430e69747a13f010839 upstream. The 'enough' function is written to work with 'near' arrays only in that is implicitly assumes that the offset from one 'group' of devices to the next is the same as the number of copies. In reality it is the number of 'near' copies. So change it to make this number explicit. This bug makes it possible to run arrays without enough drives present, which is dangerous. It is appropriate for an -stable kernel, but will almost certainly need to be modified for some of them. Reported-by: Jakub Husák <jakub@gooseman.cz> Signed-off-by: NeilBrown <neilb@suse.de> [bwh: Backported to 3.2: s/geo->/conf->/] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-10-10dm table: clear add_random unless all devices have it setMilan Broz
commit c3c4555edd10dbc0b388a0125b9c50de5e79af05 upstream. Always clear QUEUE_FLAG_ADD_RANDOM if any underlying device does not have it set. Otherwise devices with predictable characteristics may contribute entropy. QUEUE_FLAG_ADD_RANDOM specifies whether or not queue IO timings contribute to the random pool. For bio-based targets this flag is always 0 because such devices have no real queue. For request-based devices this flag was always set to 1 by default. Now set it according to the flags on underlying devices. If there is at least one device which should not contribute, set the flag to zero: If a device, such as fast SSD storage, is not suitable for supplying entropy, a request-based queue stacked over it will not be either. Because the checking logic is exactly same as for the rotational flag, share the iteration function with device_is_nonrot(). Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-10-10dm: handle requests beyond end of device instead of using BUG_ONMike Snitzer
commit ba1cbad93dd47223b1f3b8edd50dd9ef2abcb2ed upstream. The access beyond the end of device BUG_ON that was introduced to dm_request_fn via commit 29e4013de7ad950280e4b2208 ("dm: implement REQ_FLUSH/FUA support for request-based dm") was an overly drastic (but simple) response to this situation. I have received a report that this BUG_ON was hit and now think it would be better to use dm_kill_unmapped_request() to fail the clone and original request with -EIO. map_request() will assign the valid target returned by dm_table_find_target to tio->ti. But when the target isn't valid tio->ti is never assigned (because map_request isn't called); so add a check for tio->ti != NULL to dm_done(). Reported-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-09-12md: Don't truncate size at 4TB for RAID0 and LinearNeilBrown
commit 667a5313ecd7308d79629c0738b0db588b0b0a4e upstream. commit 27a7b260f71439c40546b43588448faac01adb93 md: Fix handling for devices from 2TB to 4TB in 0.90 metadata. changed 0.90 metadata handling to truncated size to 4TB as that is all that 0.90 can record. However for RAID0 and Linear, 0.90 doesn't need to record the size, so this truncation is not needed and causes working arrays to become too small. So avoid the truncation for RAID0 and Linear This bug was introduced in 3.1 and is suitable for any stable kernels from then onwards. As the offending commit was tagged for 'stable', any stable kernel that it was applied to should also get this patch. That includes at least 2.6.32, 2.6.33 and 3.0. (Thanks to Ben Hutchings for providing that list). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10md/raid1: don't abort a resync on the first badblock.NeilBrown
commit b7219ccb33aa0df9949a60c68b5e9f712615e56f upstream. If a resync of a RAID1 array with 2 devices finds a known bad block one device it will neither read from, or write to, that device for this block offset. So there will be one read_target (The other device) and zero write targets. This condition causes md/raid1 to abort the resync assuming that it has finished - without known bad blocks this would be true. When there are no write targets because of the presence of bad blocks we should only skip over the area covered by the bad block. RAID10 already gets this right, raid1 doesn't. Or didn't. As this can cause a 'sync' to abort early and appear to have succeeded it could lead to some data corruption, so it suitable for -stable. Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10dm thin: fix memory leak in process_prepared_mapping error pathsJoe Thornber
commit 905386f82d08f66726912f303f3e6605248c60a3 upstream. Fix memory leak in process_prepared_mapping by always freeing the dm_thin_new_mapping structs from the mapping_pool mempool on the error paths. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10dm thin: reduce endio_hook pool sizeAlasdair G Kergon
commit 7768ed33ccdc02801c4483fc5682dc66ace14aea upstream. Reduce the slab size used for the dm_thin_endio_hook mempool. Allocation has been seen to fail on machines with smaller amounts of memory due to fragmentation. lvm: page allocation failure. order:5, mode:0xd0 device-mapper: table: 253:38: thin-pool: Error creating pool's endio_hook mempool Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25dm raid1: set discard_zeroes_data_unsupportedMikulas Patocka
commit 7c8d3a42fe1c58a7e8fd3f6a013e7d7b474ff931 upstream. We can't guarantee that REQ_DISCARD on dm-mirror zeroes the data even if the underlying disks support zero on discard. So this patch sets ti->discard_zeroes_data_unsupported. For example, if the mirror is in the process of resynchronizing, it may happen that kcopyd reads a piece of data, then discard is sent on the same area and then kcopyd writes the piece of data to another leg. Consequently, the data is not zeroed. The flag was made available by commit 983c7db347db8ce2d8453fd1d89b7a4bb6920d56 (dm crypt: always disable discard_zeroes_data). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25dm raid1: fix crash with mirror recovery and discardMikulas Patocka
commit 751f188dd5ab95b3f2b5f2f467c38aae5a2877eb upstream. This patch fixes a crash when a discard request is sent during mirror recovery. Firstly, some background. Generally, the following sequence happens during mirror synchronization: - function do_recovery is called - do_recovery calls dm_rh_recovery_prepare - dm_rh_recovery_prepare uses a semaphore to limit the number simultaneously recovered regions (by default the semaphore value is 1, so only one region at a time is recovered) - dm_rh_recovery_prepare calls __rh_recovery_prepare, __rh_recovery_prepare asks the log driver for the next region to recover. Then, it sets the region state to DM_RH_RECOVERING. If there are no pending I/Os on this region, the region is added to quiesced_regions list. If there are pending I/Os, the region is not added to any list. It is added to the quiesced_regions list later (by dm_rh_dec function) when all I/Os finish. - when the region is on quiesced_regions list, there are no I/Os in flight on this region. The region is popped from the list in dm_rh_recovery_start function. Then, a kcopyd job is started in the recover function. - when the kcopyd job finishes, recovery_complete is called. It calls dm_rh_recovery_end. dm_rh_recovery_end adds the region to recovered_regions or failed_recovered_regions list (depending on whether the copy operation was successful or not). The above mechanism assumes that if the region is in DM_RH_RECOVERING state, no new I/Os are started on this region. When I/O is started, dm_rh_inc_pending is called, which increases reg->pending count. When I/O is finished, dm_rh_dec is called. It decreases reg->pending count. If the count is zero and the region was in DM_RH_RECOVERING state, dm_rh_dec adds it to the quiesced_regions list. Consequently, if we call dm_rh_inc_pending/dm_rh_dec while the region is in DM_RH_RECOVERING state, it could be added to quiesced_regions list multiple times or it could be added to this list when kcopyd is copying data (it is assumed that the region is not on any list while kcopyd does its jobs). This results in memory corruption and crash. There already exist bypasses for REQ_FLUSH requests: REQ_FLUSH requests do not belong to any region, so they are always added to the sync list in do_writes. dm_rh_inc_pending does not increase count for REQ_FLUSH requests. In mirror_end_io, dm_rh_dec is never called for REQ_FLUSH requests. These bypasses avoid the crash possibility described above. These bypasses were improperly implemented for REQ_DISCARD when the mirror target gained discard support in commit 5fc2ffeabb9ee0fc0e71ff16b49f34f0ed3d05b4 (dm raid1: support discard). In do_writes, REQ_DISCARD requests is always added to the sync queue and immediately dispatched (even if the region is in DM_RH_RECOVERING). However, dm_rh_inc and dm_rh_dec is called for REQ_DISCARD resusts. So it violates the rule that no I/Os are started on DM_RH_RECOVERING regions, and causes the list corruption described above. This patch changes it so that REQ_DISCARD requests follow the same path as REQ_FLUSH. This avoids the crash. Reference: https://bugzilla.redhat.com/837607 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25md/raid1: close some possible races on write errors during resyncNeilBrown
commit 58e94ae18478c08229626daece2fc108a4a23261 upstream. commit 4367af556133723d0f443e14ca8170d9447317cb md/raid1: clear bad-block record when write succeeds. Added a 'reschedule_retry' call possibility at the end of end_sync_write, but didn't add matching code at the end of sync_request_write. So if the writes complete very quickly, or scheduling makes it seem that way, then we can miss rescheduling the request and the resync could hang. Also commit 73d5c38a9536142e062c35997b044e89166e063b md: avoid races when stopping resync. Fix a race condition in this same code in end_sync_write but didn't make the change in sync_request_write. This patch updates sync_request_write to fix both of those. Patch is suitable for 3.1 and later kernels. Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Original-version-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25md: avoid crash when stopping md array races with closing other open fds.NeilBrown
commit a05b7ea03d72f36edb0cec05e8893803335c61a0 upstream. md will refuse to stop an array if any other fd (or mounted fs) is using it. When any fs is unmounted of when the last open fd is closed all pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put) so there will be no pending IO to worry about when the array is stopped. However in order to send the STOP_ARRAY ioctl to stop the array one must first get and open fd on the block device. If some fd is being used to write to the block device and it is closed after mdadm open the block device, but before mdadm issues the STOP_ARRAY ioctl, then there will be no last-close on the md device so __blkdev_put will not call sync_blockdev. If this happens, then IO can still be in-flight while md tears down the array and bad things can happen (use-after-free and subsequent havoc). So in the case where do_md_stop is being called from an open file descriptor, call sync_block after taking the mutex to ensure there will be no new openers. This is needed when setting a read-write device to read-only too. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25md/raid1: fix use-after-free bug in RAID1 data-check code.NeilBrown
commit 2d4f4f3384d4ef4f7c571448e803a1ce721113d5 upstream. This bug has been present ever since data-check was introduce in 2.6.16. However it would only fire if a data-check were done on a degraded array, which was only possible if the array has 3 or more devices. This is certainly possible, but is quite uncommon. Since hot-replace was added in 3.3 it can happen more often as the same condition can arise if not all possible replacements are present. The problem is that as soon as we submit the last read request, the 'r1_bio' structure could be freed at any time, so we really should stop looking at it. If the last device is being read from we will stop looking at it. However if the last device is not due to be read from, we will still check the bio pointer in the r1_bio, but the r1_bio might already be free. So use the read_targets counter to make sure we stop looking for bios to submit as soon as we have submitted them all. This fix is suitable for any -stable kernel since 2.6.16. Reported-by: Arnold Schulz <arnysch@gmx.net> Signed-off-by: NeilBrown <neilb@suse.de> [bwh: Backported to 3.2: no doubling of conf->raid_disks; we don't have hot-replace support] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25raid5: delayed stripe fixShaohua Li
commit fab363b5ff502d1b39ddcfec04271f5858d9f26e upstream. There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but the two bits have relationship. A delayed stripe can be moved to hold list only when preread active stripe count is below IO_THRESHOLD. If a stripe has both the bits set, such stripe will be in delayed list and preread count not 0, which will make such stripe never leave delayed list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-12dm persistent data: fix allocation failure in space map checker initMike Snitzer
commit b0239faaf87c38bb419c9264bf20817438ddc3a9 upstream. If CONFIG_DM_DEBUG_SPACE_MAPS is enabled and memory is fragmented and a sufficiently-large metadata device is used in a thin pool then the space map checker will fail to allocate the memory it requires. Switch from kmalloc to vmalloc to allow larger virtually contiguous allocations for the space map checker's internal count arrays. Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-12dm persistent data: handle space map checker creation failureMike Snitzer
commit 62662303e7f590fdfbb0070ab820a0ad4267c119 upstream. If CONFIG_DM_DEBUG_SPACE_MAPS is enabled and dm_sm_checker_create() fails, dm_tm_create_internal() would still return success even though it cleaned up all resources it was supposed to have created. This will lead to a kernel crash: general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC ... RIP: 0010:[<ffffffff81593659>] [<ffffffff81593659>] dm_bufio_get_block_size+0x9/0x20 Call Trace: [<ffffffff81599bae>] dm_bm_block_size+0xe/0x10 [<ffffffff8159b8b8>] sm_ll_init+0x78/0xd0 [<ffffffff8159c1a6>] sm_ll_new_disk+0x16/0xa0 [<ffffffff8159c98e>] dm_sm_disk_create+0xfe/0x160 [<ffffffff815abf6e>] dm_pool_metadata_open+0x16e/0x6a0 [<ffffffff815aa010>] pool_ctr+0x3f0/0x900 [<ffffffff8158d565>] dm_table_add_target+0x195/0x450 [<ffffffff815904c4>] table_load+0xe4/0x330 [<ffffffff815917ea>] ctl_ioctl+0x15a/0x2c0 [<ffffffff81591963>] dm_ctl_ioctl+0x13/0x20 [<ffffffff8116a4f8>] do_vfs_ioctl+0x98/0x560 [<ffffffff8116aa51>] sys_ioctl+0x91/0xa0 [<ffffffff81869f52>] system_call_fastpath+0x16/0x1b Fix the space map checker code to return an appropriate ERR_PTR and have dm_sm_disk_create() and dm_tm_create_internal() check for it with IS_ERR. Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-12dm persistent data: fix shadow_info_leak on dm_tm_destroyMike Snitzer
commit 25d7cd6faa7ae6ed2565617c3ee2500ccb8a9f7f upstream. Cleanup the shadow table before destroying the transaction manager. Reference: leak was identified with kmemleak when running test_discard_random_sectors in the thinp-test-suite. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-12md/raid10: fix failure when trying to repair a read error.NeilBrown
commit 055d3747dbf00ce85c6872ecca4d466638e80c22 upstream. commit 58c54fcca3bac5bf9290cfed31c76e4c4bfbabaf md/raid10: handle further errors during fix_read_error better. in 3.1 added "r10_sync_page_io" which takes an IO size in sectors. But we were passing the IO size in bytes!!! This resulting in bio_add_page failing, and empty request being sent down, and a consequent BUG_ON in scsi_lib. [fix missing space in error message at same time] This fix is suitable for 3.1.y and later. Reported-by: Christian Balzer <chibi@gol.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-12md/raid5: In ops_run_io, inc nr_pending before calling md_wait_for_blocked_rdevmajianpeng
commit 1850753d2e6d9ca7856581ca5d3cf09521e6a5d7 upstream. In ops_run_io(), the call to md_wait_for_blocked_rdev will decrement nr_pending so we lose the reference we hold on the rdev. So atomic_inc it first to maintain the reference. This bug was introduced by commit 73e92e51b7969ef5477d md/raid5. Don't write to known bad block on doubtful devices. which appeared in 3.0, so patch is suitable for stable kernels since then. Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-12md/raid5: Do not add data_offset before call to is_badblockmajianpeng
commit 6c0544e255dd6582a9899572e120fb55d9f672a4 upstream. In chunk_aligned_read() we are adding data_offset before calling is_badblock. But is_badblock also adds data_offset, so that is bad. So move the addition of data_offset to after the call to is_badblock. This bug was introduced by commit 31c176ecdf3563140e639 md/raid5: avoid reading from known bad blocks. which first appeared in 3.0. So that patch is suitable for any -stable kernel from 3.0.y onwards. However it will need minor revision for most of those (as the comment didn't appear until recently). Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> [bwh: Backported to 3.2: ignored missing comment] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-12md/raid10: Don't try to recovery unmatched (and unused) chunks.NeilBrown
commit fc448a18ae6219af9a73257b1fbcd009efab4a81 upstream. If a RAID10 has an odd number of chunks - as might happen when there are an odd number of devices - the last chunk has no pair and so is not mirrored. We don't store data there, but when recovering the last device in an array we retry to recover that last chunk from a non-existent location. This results in an error, and the recovery aborts. When we get to that last chunk we should just stop - there is nothing more to do anyway. This bug has been present since the introduction of RAID10, so the patch is appropriate for any -stable kernel. Reported-by: Christian Balzer <chibi@gol.com> Tested-by: Christian Balzer <chibi@gol.com> Signed-off-by: NeilBrown <neilb@suse.de> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-04dm thin: reinstate missing mempool_free in cell_release_singletonMike Snitzer
commit 03aaae7cdc71bc306888440b1f569d463e917b6d upstream. Fix a significant memory leak inadvertently introduced during simplification of cell_release_singleton() in commit 6f94a4c45a6f744383f9f695dde019998db3df55 ("dm thin: fix stacked bi_next usage"). A cell's hlist_del() must be accompanied by a mempool_free(). Use __cell_release() to do this, like before. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-31md: using GFP_NOIO to allocate bio for flush requestShaohua Li
commit b5e1b8cee7ad58a15d2fa79bcd7946acb592602d upstream. A flush request is usually issued in transaction commit code path, so using GFP_KERNEL to allocate memory for flush request bio falls into the classic deadlock issue. This is suitable for any -stable kernel to which it applies as it avoids a possible deadlock. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-20MD: Add del_timer_sync to mddev_suspend (fix nasty panic)Jonathan Brassow
commit 0d9f4f135eb6dea06bdcb7065b1e4ff78274a5e9 upstream. Use del_timer_sync to remove timer before mddev_suspend finishes. We don't want a timer going off after an mddev_suspend is called. This is especially true with device-mapper, since it can call the destructor function immediately following a suspend. This results in the removal (kfree) of the structures upon which the timer depends - resulting in a very ugly panic. Therefore, we add a del_timer_sync to mddev_suspend to prevent this. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-20dm mpath: check if scsi_dh module already loaded before trying to loadMike Snitzer
commit 510193a2d3d2e03ae53b95c0ae4f33cdff02cbf8 upstream. If the requested scsi_dh module is already loaded then skip request_module(). Multipath table loads can hang in an unnecessary __request_module. Reported-by: Ben Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-11md: fix possible corruption of array metadata on shutdown.NeilBrown
commit 30b8aa9172dfeaac6d77897c67ee9f9fc574cdbb upstream. commit c744a65c1e2d59acc54333ce8 md: don't set md arrays to readonly on shutdown. removed the possibility of a 'BUG' when data is written to an array that has just been switched to read-only, but also introduced the possibility that the array metadata could be corrupted. If, when md_notify_reboot gets the mddev lock, the array is in a state where it is assembled but hasn't been started (as can happen if the personality module is not available, or in other unusual situations), then incorrect metadata will be written out making it impossible to re-assemble the array. So only call __md_stop_writes() if the array has actually been activated. This patch is needed for any stable kernel which has had the above commit applied. Reported-by: Christoph Nelles <evilazrael@evilazrael.de> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-04-22md/bitmap: prevent bitmap_daemon_work running while initialising bitmapNeilBrown
commit afbaa90b80b1ec66e5137cc3824746bfdf559b18 upstream. If a bitmap is added while the array is active, it is possible for bitmap_daemon_work to run while the bitmap is being initialised. This is particularly a problem if bitmap_daemon_work sees bitmap->filemap as non-NULL before it has been filled in properly. So hold bitmap_info.mutex while filling in ->filemap to prevent problems. This patch is suitable for any -stable kernel, though it might not apply cleanly before about 3.1. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02dm thin: fix stacked bi_next usageJoe Thornber
commit 6f94a4c45a6f744383f9f695dde019998db3df55 upstream. Avoid using the bi_next field for the holder of a cell when deferring bios because a stacked device below might change it. Store the holder in a new field in struct cell instead. When a cell is created, the bio that triggered creation (the holder) was added to the same bio list as subsequent bios. In some cases we pass this holder bio directly to devices underneath. If those devices use the bi_next field there will be trouble... This also simplifies some code that had to work out which bio was the holder. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02dm persistent data: fix btree rebalancing after removeJoe Thornber
commit b0988900bae9ecf968a8a8d086a9eec671a9517a upstream. When we remove an entry from a node we sometimes rebalance with it's two neighbours. This wasn't being done correctly; in some cases entries have to move all the way from the right neighbour to the left neighbour, or vice versa. This patch pretty much re-writes the balancing code to fix it. This code is barely used currently; only when you delete a thin device, and then only if you have hundreds of them in the same pool. Once we have discard support, which removes mappings, this will be used much more heavily. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02dm exception store: fix init error pathAndrei Warkentin
commit aadbe266f2f89ccc68b52f4effc7b3a8b29521ef upstream. Call the correct exit function on failure in dm_exception_store_init. Signed-off-by: Andrei Warkentin <andrey.warkentin@gmail.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02dm crypt: add missing error handlingMikulas Patocka
commit 72c6e7afc43e19f68a31dea204fc366624d6eee9 upstream. Always set io->error to -EIO when an error is detected in dm-crypt. There were cases where an error code would be set only if we finish processing the last sector. If there were other encryption operations in flight, the error would be ignored and bio would be returned with success as if no error happened. This bug is present in kcryptd_crypt_write_convert, kcryptd_crypt_read_convert and kcryptd_async_done. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02dm crypt: fix mempool deadlockMikulas Patocka
commit aeb2deae2660a1773c83d3c6e9e6575daa3855d6 upstream. This patch fixes a possible deadlock in dm-crypt's mempool use. Currently, dm-crypt reserves a mempool of MIN_BIO_PAGES reserved pages. It allocates first MIN_BIO_PAGES with non-failing allocation (the allocation cannot fail and waits until the mempool is refilled). Further pages are allocated with different gfp flags that allow failing. Because allocations may be done in parallel, this code can deadlock. Example: There are two processes, each tries to allocate MIN_BIO_PAGES and the processes run simultaneously. It may end up in a situation where each process allocates (MIN_BIO_PAGES / 2) pages. The mempool is exhausted. Each process waits for more pages to be freed to the mempool, which never happens. To avoid this deadlock scenario, this patch changes the code so that only the first page is allocated with non-failing gfp mask. Allocation of further pages may fail. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02md: fix clearing of the 'changed' flags for the bad blocks list.NeilBrown
commit d0962936bff659d20522555b517582a2715fd23f upstream. In super_1_sync (the first hunk) we need to clear 'changed' before checking read_seqretry(), otherwise we might race with other code adding a bad block and so won't retry later. In md_update_sb (the second hunk), in the case where there is no metadata (neither persistent nor external), we treat any bad blocks as an error. However we need to clear the 'changed' flag before calling md_ack_all_badblocks, else it won't do anything. This patch is suitable for -stable release 3.0 and later. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02md/raid1,raid10: avoid deadlock during resync/recovery.NeilBrown
commit d6b42dcb995e6acd7cc276774e751ffc9f0ef4bf upstream. If RAID1 or RAID10 is used under LVM or some other stacking block device, it is possible to enter a deadlock during resync or recovery. This can happen if the upper level block device creates two requests to the RAID1 or RAID10. The first request gets processed, blocks recovery and queue requests for underlying requests in current->bio_list. A resync request then starts which will wait for those requests and block new IO. But then the second request to the RAID1/10 will be attempted and it cannot progress until the resync request completes, which cannot progress until the underlying device requests complete, which are on a queue behind that second request. So allow that second request to proceed even though there is a resync request about to start. This is suitable for any -stable kernel. Reported-by: Ray Morris <support@bettercgi.com> Tested-by: Ray Morris <support@bettercgi.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02md: don't set md arrays to readonly on shutdown.NeilBrown
commit c744a65c1e2d59acc54333ce80a5b0702a98010b upstream. It seems that with recent kernel, writeback can still be happening while shutdown is happening, and consequently data can be written after the md reboot notifier switches all arrays to read-only. This causes a BUG. So don't switch them to read-only - just mark them clean and set 'safemode' to '2' which mean that immediately after any write the array will be switch back to 'clean'. This could result in the shutdown happening when array is marked dirty, thus forcing a resync on reboot. However if you reboot without performing a "sync" first, you get to keep both halves. This is suitable for any stable kernel (though there might be some conflicts with obvious fixes in earlier kernels). Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>