path: root/block
AgeCommit message (Collapse)Author
2008-12-29cfq-iosched: fix race between exiting queue and exiting taskJens Axboe
Original patch from Nikanth Karthikesan <> When a queue exits the queue lock is taken and cfq_exit_queue() would free all the cic's associated with the queue. But when a task exits, cfq_exit_io_context() gets cic one by one and then locks the associated queue to call __cfq_exit_single_io_context. It looks like between getting a cic from the ioc and locking the queue, the queue might have exited on another cpu. Fix this by rechecking the cfq_io_context queue key inside the queue lock again, and not calling into __cfq_exit_single_io_context() if somebody beat us to it. Signed-off-by: Jens Axboe <>
2008-12-29Get rid of CONFIG_LSFJens Axboe
We have two seperate config entries for large devices/files. One is CONFIG_LBD that guards just the devices, the other is CONFIG_LSF that handles large files. This doesn't make a lot of sense, you typically want both or none. So get rid of CONFIG_LSF and change CONFIG_LBD wording to indicate that it covers both. Acked-by: Jean Delvare <> Signed-off-by: Jens Axboe <>
2008-12-29block: make blk_softirq_init() staticRoel Kluin
Sparse asked whether these could be static. Signed-off-by: Roel Kluin <> Signed-off-by: Jens Axboe <>
2008-12-29block: use min_not_zero in blk_queue_stack_limitsFUJITA Tomonori
zero is invalid for max_phys_segments, max_hw_segments, and max_segment_size. It's better to use use min_not_zero instead of min. min() works though (because the commit 0e435ac makes sure that these values are set to the default values, non zero, if a queue is initialized properly). With this patch, blk_queue_stack_limits does the almost same thing that dm's combine_restrictions_low() does. I think that it's easy to remove dm's combine_restrictions_low. Signed-off-by: FUJITA Tomonori <> Signed-off-by: Jens Axboe <>
2008-12-29block: add one-hit cache for disk partition lookupJens Axboe
disk_map_sector_rcu() returns a partition from a sector offset, which we use for IO statistics on a per-partition basis. The lookup itself is an O(N) list lookup, where N is the number of partitions. This actually hurts performance quite a bit, even on the lower end partitions. On higher numbered partitions, it can get pretty bad. Solve this by adding a one-hit cache for partition lookup. This makes the lookup O(1) for the case where we do most IO to one partition. Even for mixed partition workloads, amortized cost is pretty close to O(1) since the natural IO batching makes the one-hit cache last for lots of IOs. Signed-off-by: Jens Axboe <>
2008-12-29cfq-iosched: remove limit of dispatch depth of max 4 times quantumJens Axboe
This basically limits the hardware queue depth to 4*quantum at any point in time, which is 16 with the default settings. As CFQ uses other means to shrink the hardware queue when necessary in the first place, there's really no need for this extra heuristic. Additionally, it ends up hurting performance in some cases. Signed-off-by: Jens Axboe <>
2008-12-29block: get rid of elevator_t typedefJens Axboe
Just use struct elevator_queue everywhere instead. Signed-off-by: Jens Axboe <>
2008-12-29block: don't use plugging on SSD devicesJens Axboe
We just want to hand the first bits of IO to the device as fast as possible. Gains a few percent on the IOPS rate. Signed-off-by: Jens Axboe <>
2008-12-29block: fix empty barrier on write-through w/ ordered tagTejun Heo
Empty barrier on write-through (or no cache) w/ ordered tag has no command to execute and without any command to execute ordered tag is never issued to the device and the ordering is never achieved. Force draining for such cases. Signed-off-by: Tejun Heo <> Signed-off-by: Jens Axboe <>
2008-12-29block: simplify empty barrier implementationTejun Heo
Empty barrier required special handling in __elv_next_request() to complete it without letting the low level driver see it. With previous changes, barrier code is now flexible enough to skip the BAR step using the same barrier sequence selection mechanism. Drop the special handling and mask off q->ordered from start_ordered(). Remove blk_empty_barrier() test which now has no user. Signed-off-by: Tejun Heo <> Signed-off-by: Jens Axboe <>
2008-12-29block: make barrier completion more robustTejun Heo
Barrier completion had the following assumptions. * start_ordered() couldn't finish the whole sequence properly. If all actions are to be skipped, q->ordseq is set correctly but the actual completion was never triggered thus hanging the barrier request. * Drain completion in elv_complete_request() assumed that there's always at least one request in the queue when drain completes. Both assumptions are true but these assumptions need to be removed to improve empty barrier implementation. This patch makes the following changes. * Make start_ordered() use blk_ordered_complete_seq() to mark skipped steps complete and notify __elv_next_request() that it should fetch the next request if the whole barrier has completed inside start_ordered(). * Make drain completion path in elv_complete_request() check whether the queue is empty. Empty queue also indicates drain completion. * While at it, convert 0/1 return from blk_do_ordered() to false/true. Signed-off-by: Tejun Heo <> Signed-off-by: Jens Axboe <>
2008-12-29block: make every barrier action optionalTejun Heo
In all barrier sequences, the barrier write itself was always assumed to be issued and thus didn't have corresponding control flag. This patch adds QUEUE_ORDERED_DO_BAR and unify action mask handling in start_ordered() such that any barrier action can be skipped. This patch doesn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <> Signed-off-by: Jens Axboe <>
2008-12-29block: remove duplicate or unused barrier/discard error pathsTejun Heo
* Because barrier mode can be changed dynamically, whether barrier is supported or not can be determined only when actually issuing the barrier and there is no point in checking it earlier. Drop barrier support check in generic_make_request() and __make_request(), and update comment around the support check in blk_do_ordered(). * There is no reason to check discard support in both generic_make_request() and __make_request(). Drop the check in __make_request(). While at it, move error action block to the end of the function and add unlikely() to q existence test. * Barrier request, be it empty or not, is never passed to low level driver and thus it's meaningless to try to copy back req->sector to bio->bi_sector on error. In addition, the notion of failed sector doesn't make any sense for empty barrier to begin with. Drop the code block from __end_that_request_first(). Signed-off-by: Tejun Heo <> Signed-off-by: Jens Axboe <>
2008-12-29block: reorganize QUEUE_ORDERED_* constantsTejun Heo
Separate out ordering type (drain,) and action masks (preflush, postflush, fua) from visible ordering mode selectors (QUEUE_ORDERED_*). Ordering types are now named QUEUE_ORDERED_BY_* while action masks are named QUEUE_ORDERED_DO_*. This change is necessary to add QUEUE_ORDERED_DO_BAR and make it optional to improve empty barrier implementation. Signed-off-by: Tejun Heo <> Signed-off-by: Jens Axboe <>
2008-12-29block: use cancel_work_sync() instead of kblockd_flush_work()Cheng Renquan
After many improvements on kblockd_flush_work, it is now identical to cancel_work_sync, so a direct call to cancel_work_sync is suggested. The only difference is that cancel_work_sync is a GPL symbol, so no non-GPL modules anymore. Signed-off-by: Cheng Renquan <> Cc: Jens Axboe <> Signed-off-by: Jens Axboe <>
2008-12-29block: Supress Buffer I/O errors when SCSI REQ_QUIET flag setKeith Mannthey
Allow the scsi request REQ_QUIET flag to be propagated to the buffer file system layer. The basic ideas is to pass the flag from the scsi request to the bio (block IO) and then to the buffer layer. The buffer layer can then suppress needless printks. This patch declutters the kernel log by removed the 40-50 (per lun) buffer io error messages seen during a boot in my multipath setup . It is a good chance any real errors will be missed in the "noise" it the logs without this patch. During boot I see blocks of messages like " __ratelimit: 211 callbacks suppressed Buffer I/O error on device sdm, logical block 5242879 Buffer I/O error on device sdm, logical block 5242879 Buffer I/O error on device sdm, logical block 5242847 Buffer I/O error on device sdm, logical block 1 Buffer I/O error on device sdm, logical block 5242878 Buffer I/O error on device sdm, logical block 5242879 Buffer I/O error on device sdm, logical block 5242879 Buffer I/O error on device sdm, logical block 5242879 Buffer I/O error on device sdm, logical block 5242879 Buffer I/O error on device sdm, logical block 5242872 " in my logs. My disk environment is multipath fiber channel using the SCSI_DH_RDAC code and multipathd. This topology includes an "active" and "ghost" path for each lun. IO's to the "ghost" path will never complete and the SCSI layer, via the scsi device handler rdac code, quick returns the IOs to theses paths and sets the REQ_QUIET scsi flag to suppress the scsi layer messages. I am wanting to extend the QUIET behavior to include the buffer file system layer to deal with these errors as well. I have been running this patch for a while now on several boxes without issue. A few runs of bonnie++ show no noticeable difference in performance in my setup. Thanks for John Stultz for the quiet_error finalization. Submitted-by: Keith Mannthey <> Signed-off-by: Jens Axboe <>
2008-12-29block: don't take lock on changing ra_pagesWu Fengguang
There's no need to take queue_lock or kernel_lock when modifying bdi->ra_pages. So remove them. Also remove out of date comment for queue_max_sectors_store(). Signed-off-by: Wu Fengguang <> Signed-off-by: Jens Axboe <>
2008-12-29block/blk-tag.c: cleanup kernel-docQinghuang Feng
There is no argument named @tags in blk_init_tags, remove its' comment. Signed-off-by: Qinghuang Feng <> Signed-off-by: Jens Axboe <>
2008-12-29scsi-ioctl: use clock_t <> jiffiesMilton Miller
Convert the timeout ioctl scalling to use the clock_t functions which are much more accurate with some USER_HZ vs HZ combinations. Signed-off-by: Milton Miller <> Signed-off-by: Jens Axboe <>
2008-12-29block: leave the request timeout timer running even on an empty listJens Axboe
For sync IO, we'll often do them serialized. This means we'll be touching the queue timer for every IO, as opposed to only occasionally like we do for queued IO. Instead of deleting the timer when the last request is removed, just let continue running. If a new request comes up soon we then don't have to readd the timer again. If no new requests arrive, the timer will expire without side effect later. This improves high iops sync IO by ~1%. Signed-off-by: Jens Axboe <>
2008-12-29block: add comment in blk_rq_timed_out() about why next can not be 0Jens Axboe
Signed-off-by: Jens Axboe <>
2008-12-29block: optimizations in blk_rq_timed_out_timer()
Now the rq->deadline can't be zero if the request is in the timeout_list, so there is no need to have next_set. There is no need to access a request's deadline field if blk_rq_timed_out is called on it. Signed-off-by: Malahal Naineni <> Signed-off-by: Jens Axboe <>
2008-12-19Merge branches 'tracing/ftrace', 'tracing/ring-buffer' and 'tracing/urgent' ↵Ingo Molnar
into tracing/core Conflicts: include/linux/ftrace.h
2008-12-05Enforce a minimum SG_IO timeoutLinus Torvalds
There's no point in having too short SG_IO timeouts, since if the command does end up timing out, we'll end up through the reset sequence that is several seconds long in order to abort the command that timed out. As a result, shorter timeouts than a few seconds simply do not make sense, as the recovery would be longer than the timeout itself. Add a BLK_MIN_SG_TIMEOUT to match the existign BLK_DEFAULT_SG_TIMEOUT. Suggested-by: Alan Cox <> Acked-by: Tejun Heo <> Acked-by: Jens Axboe <> Cc: Jeff Garzik <> Signed-off-by: Linus Torvalds <>
2008-12-05Merge branches 'tracing/ftrace', 'tracing/function-graph-tracer' and ↵Ingo Molnar
'tracing/urgent' into tracing/core
2008-12-04[PATCH 1/2] kill FMODE_NDELAY_NOWChristoph Hellwig
Update FMODE_NDELAY before each ioctl call so that we can kill the magic FMODE_NDELAY_NOW. It would be even better to do this directly in setfl(), but for that we'd need to have FMODE_NDELAY for all files, not just block special files. Signed-off-by: Christoph Hellwig <> Signed-off-by: Al Viro <>
2008-12-04[PATCH] Fix block dev compat ioctl handlingAndreas Schwab
Commit 33c2dca4957bd0da3e1af7b96d0758d97e708ef6 (trim file propagation in block/compat_ioctl.c) removed the handling of some ioctls from compat_blkdev_driver_ioctl. That caused them to be rejected as unknown by the compat layer. Signed-off-by: Andreas Schwab <> Cc: Al Viro <> Signed-off-by: Al Viro <>
2008-12-03block: fix setting of max_segment_size and seg_boundary maskMilan Broz
Fix setting of max_segment_size and seg_boundary mask for stacked md/dm devices. When stacking devices (LVM over MD over SCSI) some of the request queue parameters are not set up correctly in some cases by default, namely max_segment_size and and seg_boundary mask. If you create MD device over SCSI, these attributes are zeroed. Problem become when there is over this mapping next device-mapper mapping - queue attributes are set in DM this way: request_queue max_segment_size seg_boundary_mask SCSI 65536 0xffffffff MD RAID1 0 0 LVM 65536 -1 (64bit) Unfortunately bio_add_page (resp. bio_phys_segments) calculates number of physical segments according to these parameters. During the generic_make_request() is segment cout recalculated and can increase bio->bi_phys_segments count over the allowed limit. (After bio_clone() in stack operation.) Thi is specially problem in CCISS driver, where it produce OOPS here BUG_ON(creq->nr_phys_segments > MAXSGENTRIES); (MAXSEGENTRIES is 31 by default.) Sometimes even this command is enough to cause oops: dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10 This command generates bios with 250 sectors, allocated in 32 4k-pages (last page uses only 1024 bytes). For LVM layer, it allocates bio with 31 segments (still OK for CCISS), unfortunatelly on lower layer it is recalculated to 32 segments and this violates CCISS restriction and triggers BUG_ON(). The patch tries to fix it by: * initializing attributes above in queue request constructor blk_queue_make_request() * make sure that blk_queue_stack_limits() inherits setting (DM uses its own function to set the limits because it blk_queue_stack_limits() was introduced later. It should probably switch to use generic stack limit function too.) * sets the default seg_boundary value in one place (blkdev.h) * use this mask as default in DM (instead of -1, which differs in 64bit) Bugs related to this: Signed-off-by: Milan Broz <> Reviewed-by: Alasdair G Kergon <> Cc: Neil Brown <> Cc: FUJITA Tomonori <> Cc: Tejun Heo <> Cc: Mike Miller <> Signed-off-by: Jens Axboe <>
2008-12-03block: internal dequeue shouldn't start timerTejun Heo
blkdev_dequeue_request() and elv_dequeue_request() are equivalent and both start the timeout timer. Barrier code dequeues the original barrier request but doesn't passes the request itself to lower level driver, only broken down proxy requests; however, as the original barrier code goes through the same dequeue path and timeout timer is started on it. If barrier sequence takes long enough, this timer expires but the low level driver has no idea about this request and oops follows. Timeout timer shouldn't have been started on the original barrier request as it never goes through actual IO. This patch unexports elv_dequeue_request(), which has no external user anyway, and makes it operate on elevator proper w/o adding the timer and make blkdev_dequeue_request() call elv_dequeue_request() and add timer. Internal users which don't pass the request to driver - barrier code and end_that_request_last() - are converted to use elv_dequeue_request(). Signed-off-by: Tejun Heo <> Cc: Mike Anderson <> Signed-off-by: Jens Axboe <>
2008-12-03block: set disk->node_id before it's being usedCheng Renquan
disk->node_id will be refered in allocating in disk_expand_part_tbl, so we should set it before disk->node_id is refered. Signed-off-by: Cheng Renquan <> Signed-off-by: Jens Axboe <>
2008-12-03When block layer fails to map iov, it calls bio_unmap_user to undoPetr Vandrovec
mapping. Which is good if pages were mapped - but if they were provided by someone else and just copied then bad things happen - pages are released once here, and once by caller, leading to user triggerable BUG at include/linux/mm.h:246. Signed-off-by: Petr Vandrovec <> Signed-off-by: Jens Axboe <>
2008-11-26blktrace: port to tracepoints, updateIngo Molnar
Port to the new tracepoints API: split DEFINE_TRACE() and DECLARE_TRACE() sites. Spread them out to the usage sites, as suggested by Mathieu Desnoyers. Signed-off-by: Ingo Molnar <> Acked-by: Mathieu Desnoyers <>
2008-11-26blktrace: port to tracepointsArnaldo Carvalho de Melo
This was a forward port of work done by Mathieu Desnoyers, I changed it to encode the 'what' parameter on the tracepoint name, so that one can register interest in specific events and not on classes of events to then check the 'what' parameter. Signed-off-by: Arnaldo Carvalho de Melo <> Signed-off-by: Jens Axboe <> Signed-off-by: Ingo Molnar <>
2008-11-18block: hold extra reference to bio in blk_rq_map_user_iov()Jens Axboe
If the size passed in is OK but we end up mapping too many segments, we call the unmap path directly like from IO completion. But from IO completion we have an extra reference to the bio, so this error case goes OOPS when it attempts to free and already free bio. Fix it by getting an extra reference to the bio before calling the unmap failure case. Reported-by: Petr Vandrovec <> Signed-off-by: Jens Axboe <>
2008-11-18block: fix boot failure with CONFIG_DEBUG_BLOCK_EXT_DEVT=y and nashZhang, Yanmin
We run into system boot failure with kernel 2.6.28-rc. We found it on a couple of machines, including T61 notebook, nehalem machine, and another HPC NX6325 notebook. All the machines use FedoraCore 8 or FedoraCore 9. With kernel prior to 2.6.28-rc, system boot doesn't fail. I debug it and locate the root cause. Pls. see As a matter of fact, there are 2 bugs. 1)root=/dev/sda1, system boot randomly fails. Mostly, boot for 5 times and fails once. nash has a bug. Some of its functions misuse return value 0. Sometimes, 0 means timeout and no uevent available. Sometimes, 0 means nash gets an uevent, but the uevent isn't block-related (for exmaple, usb). If by coincidence, kernel tells nash that uevents are available, but kernel also set timeout, nash might stops collecting other uevents in queue if current uevent isn't block-related. I work out a patch for nash to fix it. 2) root=LABEL=/, system always can't boot. initrd init reports switchroot fails. Here is an executation branch of nash when booting: (1) nash read /sys/block/sda/dev; Assume major is 8 (on my desktop) (2) nash query /proc/devices with the major number; It found line "8 sd"; (3) nash use 'sd' to search its own probe table to find device (DISK) type for the device and add it to its own list; (4) Later on, it probes all devices in its list to get filesystem labels; scsi register "8 sd" always. When major is 259, nash fails to find the device(DISK) type. I enables CONFIG_DEBUG_BLOCK_EXT_DEVT=y when compiling kernel, so 259 is picked up for device /dev/sda1, which causes nash to fail to find device (DISK) type. To fixing issue 2), I create a patch for nash and another patch for kernel. Below is the patch for kernel 2.6.28-rc4. It registers blkext, a new block device in proc/devices. With 2 patches on nash and 1 patch on kernel, I boot my machines for dozens of times without failure. Signed-off-by Zhang Yanmin <> Acked-by: Tejun Heo <> Signed-off-by: Jens Axboe <>
2008-11-18block: make add_partition() return pointer to hd_structTejun Heo
Make add_partition() return pointer to the new hd_struct on success and ERR_PTR() value on failure. This change will be used to fix md autodetection bug. Signed-off-by: Tejun Heo <> Cc: Neil Brown <> Signed-off-by: Jens Axboe <>
2008-11-06Block: use round_jiffies_up()Alan Stern
This patch (as1159b) changes the timeout routines in the block core to use round_jiffies_up(). There's no point in rounding the timer deadline down, since if it expires too early we will have to restart it. The patch also removes some unnecessary tests when a request is removed from the queue's timer list. Signed-off-by: Alan Stern <> Signed-off-by: Jens Axboe <>
2008-11-06blk: move blk_delete_timer call in end_that_request_lastMike Anderson
Move the calling blk_delete_timer to later in end_that_request_last to address an issue where blkdev_dequeue_request may have add a timer for the request. Signed-off-by: Mike Anderson <> Acked-by: Tejun Heo <> Signed-off-by: Jens Axboe <>
2008-11-06block: add timer on blkdev_dequeue_request() not elv_next_request()Tejun Heo
Block queue supports two usage models - one where block driver peeks at the front of queue using elv_next_request(), processes it and finishes it and the other where block driver peeks at the front of queue, dequeue the request using blkdev_dequeue_request() and finishes it. The latter is more flexible as it allows the driver to process multiple commands concurrently. These two inconsistent usage models affect the block layer implementation confusing. For some, elv_next_request() is considered the issue point while others consider blkdev_dequeue_request() the issue point. Till now the inconsistency mostly affect only accounting, so it didn't really break anything seriously; however, with block layer timeout, this inconsistency hits hard. Block layer considers elv_next_request() the issue point and adds timer but SCSI layer thinks it was just peeking and when the request can't process the command right away, it's just left there without further processing. This makes the request dangling on the timer list and, when the timer goes off, the request which the SCSI layer and below think is still on the block queue ends up in the EH queue, causing various problems - EH hang (failed count goes over busy count and EH never wakes up), WARN_ON() and oopses as low level driver trying to handle the unknown command, etc. depending on the timing. As SCSI midlayer is the only user of block layer timer at the moment, moving blk_add_timer() to elv_dequeue_request() fixes the problem; however, this two usage models definitely need to be cleaned up in the future. Signed-off-by: Tejun Heo <> Signed-off-by: Jens Axboe <>
2008-11-06block: remove unused ll_new_mergeable()FUJITA Tomonori
Signed-off-by: FUJITA Tomonori <> Signed-off-by: Jens Axboe <>
2008-10-23Merge branch 'proc' of ↵Linus Torvalds
git:// * 'proc' of git:// (35 commits) proc: remove fs/proc/proc_misc.c proc: move /proc/vmcore creation to fs/proc/vmcore.c proc: move pagecount stuff to fs/proc/page.c proc: move all /proc/kcore stuff to fs/proc/kcore.c proc: move /proc/schedstat boilerplate to kernel/sched_stats.h proc: move /proc/modules boilerplate to kernel/module.c proc: move /proc/diskstats boilerplate to block/genhd.c proc: move /proc/zoneinfo boilerplate to mm/vmstat.c proc: move /proc/vmstat boilerplate to mm/vmstat.c proc: move /proc/pagetypeinfo boilerplate to mm/vmstat.c proc: move /proc/buddyinfo boilerplate to mm/vmstat.c proc: move /proc/vmallocinfo to mm/vmalloc.c proc: move /proc/slabinfo boilerplate to mm/slub.c, mm/slab.c proc: move /proc/slab_allocators boilerplate to mm/slab.c proc: move /proc/interrupts boilerplate code to fs/proc/interrupts.c proc: move /proc/stat to fs/proc/stat.c proc: move rest of /proc/partitions code to block/genhd.c proc: move /proc/cpuinfo code to fs/proc/cpuinfo.c proc: move /proc/devices code to fs/proc/devices.c proc: move rest of /proc/locks to fs/locks.c ...
2008-10-23compat_blkdev_driver_ioctl: Remove unused variable warningLinus Torvalds
Variable 'ret' is no longer used. Don't declare it. Signed-off-by: Linus Torvalds <>
2008-10-23proc: move /proc/diskstats boilerplate to block/genhd.cAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <> Acked-by: Jens Axboe <>
2008-10-23proc: move rest of /proc/partitions code to block/genhd.cAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <> Acked-by: Jens Axboe <>
2008-10-21[PATCH] kill the rest of struct file propagation in block ioctlsAl Viro
Now we can switch blkdev_ioctl() block_device/mode Signed-off-by: Al Viro <>
2008-10-21[PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSETAl Viro
We need to do bd_claim() only if file hadn't been opened with O_EXCL and then we have no need to use file itself as owner. Signed-off-by: Al Viro <>
2008-10-21[PATCH] get rid of blkdev_locked_ioctl()Al Viro
Most of that stuff doesn't need BKL at all; expand in the (only) caller, merge the switch into one there and leave BKL only around the stuff that might actually need it. Signed-off-by: Al Viro <>
2008-10-21[PATCH] get rid of blkdev_driver_ioctl()Al Viro
convert remaining callers to __blkdev_driver_ioctl() Signed-off-by: Al Viro <>
2008-10-21[PATCH] trim file propagation in block/compat_ioctl.cAl Viro
... and remove the handling of cases when it falls back to native without changing arguments. Signed-off-by: Al Viro <>
2008-10-21[PATCH] end of methods switch: remove the old onesAl Viro
Signed-off-by: Al Viro <>