summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2013-02-14hlist: drop the node parameter from iteratorsSasha Levin
I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14Fix build error due to bio_endio_batchMinchan Kim
This patch fixes build error of recent mmotm. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Kent Overstreet <koverstreet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14block-aio-batch-completion-for-bios-kiocbs-fix-fix-fix-fixKent Overstreet
Fix broken build when CONFIG_BLOCK=n by pulling common stuff out into a new header. Signed-off-by: Kent Overstreet <koverstreet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14block-aio-batch-completion-for-bios-kiocbs-fix-fix-fixAndrew Morton
fix warnings Cc: Kent Overstreet <koverstreet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14block-aio-batch-completion-for-bios-kiocbs-fix-fixAndrew Morton
fs/aio.c needs bio.h, move bio_endio_batch() declaration somewhere rational Cc: Kent Overstreet <koverstreet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14block, aio: batch completion for bios/kiocbsKent Overstreet
When completing a kiocb, there's some fixed overhead from touching the kioctx's ring buffer the kiocb belongs to. Some newer high end block devices can complete multiple IOs per interrupt, much like many network interfaces have been for some time. This plumbs through infrastructure so we can take advantage of multiple completions at the interrupt level, and complete multiple kiocbs at the same time. Drivers have to be converted to take advantage of this, but it's a simple change and the next patches will convert a few drivers. To use it, an interrupt handler (or any code that completes bios or requests) declares and initializes a struct batch_complete: struct batch_complete batch; batch_complete_init(&batch); Then, instead of calling bio_endio(), it calls bio_endio_batch(bio, err, &batch). This just adds the bio to a list in the batch_complete. At the end, it calls batch_complete(&batch); This completes all the bios all at once, building up a list of kiocbs; then the list of kiocbs are completed all at once. Also, in order to batch up the kiocbs we have to add a different bio_endio function to struct bio, that takes a pointer to the batch_complete - this patch converts the dio code's bio_endio function. In order to avoid changing every bio_endio function in the kernel (there are many), we currently use a union and a flag to indicate what kind of bio endio function to call. This is admittedly a hack, but should suffice for now. For batching to work through say md or dm devices, the md/dm bio_endio functions would have to be converted, much like the dio code. That is left for future patches. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14aio: kill ki_retryKent Overstreet
Thanks to Zach Brown's work to rip out the retry infrastructure, we don't need this anymore - ki_retry was only called right after the kiocb was initialized. This also refactors and trims some duplicated code, as well as cleaning up the refcounting/error handling a bit. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14aio: kill ki_keyKent Overstreet
ki_key wasn't actually used for anything previously - it was always 0. Drop it to trim struct kiocb a bit. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14aio-dont-include-aioh-in-schedh-fix-3-fix-fixAndrew Morton
fix build Reported-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Kent Overstreet <koverstreet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14aio-dont-include-aioh-in-schedh-fix-3-fixAndrew Morton
fix build Reported-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Kent Overstreet <koverstreet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14aio: don't include aio.h in sched.hKent Overstreet
Faster kernel compiles by way of fewer unnecessary includes. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14generic-dynamic-per-cpu-refcounting-doc-fixAndrew Morton
a little tidy Cc: Kent Overstreet <koverstreet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14generic-dynamic-per-cpu-refcounting-docKent Overstreet
percpu-refcount: Documentation Signed-off-by: Kent Overstreet <koverstreet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14percpu-refcount: sparse fixesKent Overstreet
Here's some more fixes, the percpu refcount code is now sparse clean for me. It's kind of ugly, but I'm not sure it's really any uglier than it was before. Seem reasonable? Signed-off-by: Kent Overstreet <koverstreet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14generic dynamic per cpu refcountingKent Overstreet
This implements a refcount with similar semantics to atomic_get()/atomic_dec_and_test(), that starts out as just an atomic_t but dynamically switches to per cpu refcounting when the rate of gets/puts becomes too high. It also implements two stage shutdown, as we need it to tear down the percpu counts. Before dropping the initial refcount, you must call percpu_ref_kill(); this puts the refcount in "shutting down mode" and switches back to a single atomic refcount with the appropriate barriers (synchronize_rcu()). It's also legal to call percpu_ref_kill() multiple times - it only returns true once, so callers don't have to reimplement shutdown synchronization. For the sake of simplicity/efficiency, the heuristic is pretty simple - it just switches to percpu refcounting if there are more than x gets in one second (completely arbitrarily, 4096). It'd be more correct to count the number of cache misses or something else more profile driven, but doing so would require accessing the shared ref twice per get - by just counting the number of gets(), we can stick that counter in the high bits of the refcount and increment both with a single atomic64_add(). But I expect this'll be good enough in practice. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14aio: kill batch allocationKent Overstreet
Previously, allocating a kiocb required touching quite a few global (well, per kioctx) cachelines... so batching up allocation to amortize those was worthwhile. But we've gotten rid of some of those, and in another couple of patches kiocb allocation won't require writing to any shared cachelines, so that means we can just rip this code out. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14aio-use-cancellation-list-lazily-fixKent Overstreet
The cancellation changes were fubar - we can't cancel a kiocb if it doesn't actually have a cancellation callback. The use of xchg() in aio_complete() was right - there we're marking the kiocb as completed - but we need to use cmpxchg() in kiocb_cancel() - a lock isn't sufficient since we're synchronizing with aio_complete() which isn't taking any locks. Signed-off-by: Kent Overstreet <koverstreet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14aio: use cancellation list lazilyKent Overstreet
Cancelling kiocbs requires adding them to a per kioctx linked list, which is one of the few things we need to take the kioctx lock for in the fast path. But most kiocbs can't be cancelled - so if we just do this lazily, we can avoid quite a bit of locking overhead. While we're at it, instead of using a flag bit switch to using ki_cancel itself to indicate that a kiocb has been cancelled/completed. This lets us get rid of ki_flags entirely. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14wait-add-wait_event_hrtimeout-fixAndrew Morton
fix description of `timeout' arg Cc: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14wait: add wait_event_hrtimeout()Kent Overstreet
Analagous to wait_event_timeout() and friends, this adds wait_event_hrtimeout() and wait_event_interruptible_hrtimeout(). Note that unlike the versions that use regular timers, these don't return the amount of time remaining when they return - instead, they return 0 or -ETIME if they timed out. because I was uncomfortable with the semantics of doing it the other way (that I could get it right, anyways). If the timer expires, there's no real guarantee that expire_time - current_time would be <= 0 - due to timer slack certainly, and I'm not sure I want to know the implications of the different clock bases in hrtimers. If the timer does expire and the code calculates that the time remaining is nonnegative, that could be even worse if the calling code then reuses that timeout. Probably safer to just return 0 then, but I could imagine weird bugs or at least unintended behaviour arising from that too. I came to the conclusion that if other users end up actually needing the amount of time remaining, the sanest thing to do would be to create a version that uses absolute timeouts instead of relative. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14aio: make aio_put_req() locklessKent Overstreet
Freeing a kiocb needed to touch the kioctx for three things: * Pull it off the reqs_active list * Decrementing reqs_active * Issuing a wakeup, if the kioctx was in the process of being freed. This patch moves these to aio_complete(), for a couple reasons: * aio_complete() already has to issue the wakeup, so if we drop the kioctx refcount before aio_complete does its wakeup we don't have to do it twice. * aio_complete currently has to take the kioctx lock, so it makes sense for it to pull the kiocb off the reqs_active list too. * A later patch is going to change reqs_active to include unreaped completions - this will mean allocating a kiocb doesn't have to look at the ringbuffer. So taking the decrement of reqs_active out of kiocb_free() is useful prep work for that patch. This doesn't really affect cancellation, since existing (usb) code that implements a cancel function still calls aio_complete() - we just have to make sure that aio_complete does the necessary teardown for cancelled kiocbs. It does affect code paths where we free kiocbs that were never submitted; they need to decrement reqs_active and pull the kiocb off the reqs_active list. This occurs in two places: kiocb_batch_free(), which is going away in a later patch, and the error path in io_submit_one. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14aio: move private stuff out of aio.hKent Overstreet
Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14aio: kill return value of aio_complete()Kent Overstreet
Nothing used the return value, and it probably wasn't possible to use it safely for the locked versions (aio_complete(), aio_put_req()). Just kill it. Signed-off-by: Kent Overstreet <koverstreet@google.com> Acked-by: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14aio: remove retry-based AIOZach Brown
This removes the retry-based AIO infrastructure now that nothing in tree is using it. We want to remove retry-based AIO because it is fundemantally unsafe. It retries IO submission from a kernel thread that has only assumed the mm of the submitting task. All other task_struct references in the IO submission path will see the kernel thread, not the submitting task. This design flaw means that nothing of any meaningful complexity can use retry-based AIO. This removes all the code and data associated with the retry machinery. The most significant benefit of this is the removal of the locking around the unused run list in the submission path. This has only been compiled. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14aio: remove dead code from aio.hZach Brown
Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14include/linux/eventfd.h: fix incorrect filename is a commentMartin Sustrik
Comment in eventfd.h referred to 'include/asm-generic/fcntl.h' while the correct path is 'include/uapi/asm-generic/fcntl.h'. Signed-off-by: Martin Sustrik <sustrik@250bpm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14ipmi: remove superfluous kernel/userspace explanationRobert P. J. Day
Given the obvious distinction between kernel and userspace supported by uapi/, it seems unnecessary to comment on that. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14idr: implement idr_preload[_end]() and idr_alloc()Tejun Heo
The current idr interface is very cumbersome. * For all allocations, two function calls - idr_pre_get() and idr_get_new*() - should be made. * idr_pre_get() doesn't guarantee that the following idr_get_new*() will not fail from memory shortage. If idr_get_new*() returns -EAGAIN, the caller is expected to retry pre_get and allocation. * idr_get_new*() can't enforce upper limit. Upper limit can only be enforced by allocating and then freeing if above limit. * idr_layer buffer is unnecessarily per-idr. Each idr ends up keeping around MAX_IDR_FREE idr_layers. The memory consumed per idr is under two pages but it makes it difficult to make idr_layer larger. This patch implements the following new set of allocation functions. * idr_preload[_end]() - Similar to radix preload but doesn't fail. The first idr_alloc() inside preload section can be treated as if it were called with @gfp_mask used for idr_preload(). * idr_alloc() - Allocate an ID w/ lower and upper limits. Takes @gfp_flags and can be used w/o preloading. When used inside preloaded section, the allocation mask of preloading can be assumed. If idr_alloc() can be called from a context which allows sufficiently relaxed @gfp_mask, it can be used by itself. If, for example, idr_alloc() is called inside spinlock protected region, preloading can be used like the following. idr_preload(GFP_KERNEL); spin_lock(lock); id = idr_alloc(idr, ptr, start, end, GFP_NOWAIT); spin_unlock(lock); idr_preload_end(); if (id < 0) error; which is much simpler and less error-prone than idr_pre_get and idr_get_new*() loop. The new interface uses per-pcu idr_layer buffer and thus the number of idr's in the system doesn't affect the amount of memory used for preloading. idr_layer_alloc() is introduced to handle idr_layer allocations for both old and new ID allocation paths. This is a bit hairy now but the new interface is expected to replace the old and the internal implementation eventually will become simpler. v2: Improve idr_preload() comment a bit so that it's clear that preemption is disabled while preloaded. Make idr_layer_alloc() try kmem_cache first and then fall back to per-cpu buffer; otherwise, non-preloading idr_alloc()s empty per-cpu buffers makeing preloaders fill them, which, while not incorrect, still is a bit unfair. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14idr: remove _idr_rc_to_errno() hackTejun Heo
idr uses -1, IDR_NEED_TO_GROW and IDR_NOMORE_SPACE to communicate exception conditions internally. The return value is later translated to errno values using _idr_rc_to_errno(). This is confusing. Drop the custom ones and consistently use -EAGAIN for "tree needs to grow", -ENOMEM for "need more memory" and -ENOSPC for "ran out of ID space". Due to the weird memory preloading mechanism, [ra]_get_new*() return -EAGAIN on memory shortage, so we need to substitute -ENOMEM w/ -EAGAIN on those interface functions. They'll eventually be cleaned up and the translations will go away. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14idr: relocate idr_for_each_entry() and reorganize id[r|a]_get_new()Tejun Heo
* Move idr_for_each_entry() definition next to other idr related definitions. * Make id[r|a]_get_new() inline wrappers of id[r|a]_get_new_above(). This changes the implementation of idr_get_new() but the new implementation is trivial. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14idr: cosmetic updates to struct / initializer definitionsTejun Heo
* Tab align fields like a normal person. * Drop the unnecessary 0 inits from IDR_INIT(). This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14idr: deprecate idr_remove_all()Tejun Heo
There was only one legitimate use of idr_remove_all() and a lot more of incorrect uses (or lack of it). Now that idr_destroy() implies idr_remove_all() and all the in-kernel users updated not to use it, there's no reason to keep it around. Mark it deprecated so that we can later unexport it. idr_remove_all() is made an inline function calling __idr_remove_all() to avoid triggering deprecated warning on EXPORT_SYMBOL(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14fs/exec.c: make bprm_mm_init() staticYuanhan Liu
There is only one user of bprm_mm_init, and it's inside the same file. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14coredump: remove redundant defines for dumpable statesKees Cook
The existing SUID_DUMP_* defines duplicate the newer SUID_DUMPABLE_* defines introduced in 54b501992dd ("coredump: warn about unsafe suid_dumpable / core_pattern combo"). Remove the new ones, and use the prior values instead. Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Chen Gang <gang.chen@asianux.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alan Cox <alan@linux.intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: James Morris <james.l.morris@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14signalfd: add ability to return siginfo in a raw formatAndrey Vagin
signalfd should be called with the flag SFD_RAW for that. signalfd_siginfo is not full for siginfo with a negative si_code. copy_siginfo_to_user() is copied a full siginfo to user-space, if si_code is negative. signalfd_copyinfo() doesn't do that and can't be expanded, because it has not compatible format with siginfo_t. Another problem is that a constant __SI_* is removed from si_code. It's not a problem for usual applications, because they expect a defined type of siginfo (internal logic). When we want to dump pending signals, we can't predict a type of siginfo, so we should get it from kernel. The main idea of the raw format is that it should be enough for restoring exactly the same siginfo for the current process. This functionality is required for checkpointing pending signals. Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14fat: mark fs as dirty on mount and clean on umountOleksij Rempel
There is no documented methods to mark FAT as dirty. Unofficially MS started to use reserved Byte in boot sector for this purpose, at least since Win 2000. With Win 7 user is warned if fs is dirty and asked to clean it. Different versions of Win, handle it in different ways, but always have same meaning: - Win 2000 and XP, set it on write operations and remove it after operation was finnished - Win 7, set dirty flag on first write and remove it on umount. We will do it as follows: - set dirty flag on mount. If fs was initially dirty, warn user, remember it and do not do any changes to boot sector. - clean it on umount. If fs was initially dirty, leave it dirty. - do not do any thing if fs mounted read-only. - TODO: leave fs dirty if we found some error after mount. Signed-off-by: Oleksij Rempel <bug-track@fisher-privat.net> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14fat: add extended fileds to struct fat_boot_sectorOleksij Rempel
Later we will need "state" field to check if volume was cleanly unmounted. Signed-off-by: Oleksij Rempel <bug-track@fisher-privat.net> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14hfsplus: add osx.* prefix for handling namespace of Mac OS X extended attributesVyacheslav Dubeyko
hfsplus: reworked support of extended attributes. Current mainline implementation of hfsplus file system driver treats as extended attributes only two fields (fdType and fdCreator) of user_info field in file description record (struct hfsplus_cat_file). It is possible to get or set only these two fields as extended attributes. But HFS+ treats as com.apple.FinderInfo extended attribute an union of user_info and finder_info fields as for file (struct hfsplus_cat_file) as for folder (struct hfsplus_cat_folder). Moreover, current mainline implementation of hfsplus file system driver doesn't support special metadata file - attributes tree. Mac OS X 10.4 and later support extended attributes by making use of the HFS+ filesystem Attributes file B*-tree feature which allows for named forks. Mac OS X supports only inline extended attributes, limiting their size to 3802 bytes. Any regular file may have a list of extended attributes. HFS+ supports an arbitrary number of named forks. Each attribute is denoted by a name and the associated data. The name is a null-terminated Unicode string. It is possible to list, to get, to set, and to remove extended attributes from files or directories. It exists some peculiarity during getting of extended attributes list by means of getfattr utility. The getfattr utility expects prefix "user." before any extended attribute's name. So, it ignores any names that don't contained such prefix. Such behavior of getfattr utility results in unexpected empty output of extended attributes list even in the case when file (or folder) contains extended attributes. It needs to use empty string as regular expression pattern for names matching (getfattr --match=""). For support of extended attributes in HFS+: 1. It was added necessary on-disk layout declarations related to Attributes tree into hfsplus_raw.h file. 2. It was added attributes.c file with implementation of functionality of manipulation by records in Attributes tree. 3. It was reworked hfsplus_listxattr, hfsplus_getxattr, hfsplus_setxattr functions in ioctl.c. Moreover, it was added hfsplus_removexattr method. This patch: Add osx.* prefix for handling namespace of Mac OS X extended attributes. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Reported-by: Hin-Tak Leung <htl10@users.sourceforge.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14drivers/rtc/rtc-pxa.c: fix set time sync time issueLeo Song
Fix set time and sync time issue, add some delay when set pxa rtc timer according to spec Signed-off-by: Leo Song <liangs@marvell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14epoll: support for disabling items, and a self-test appPaton J. Lewis
It is not currently possible to reliably delete epoll items when using the same epoll set from multiple threads. After calling epoll_ctl with EPOLL_CTL_DEL, another thread might still be executing code related to an event for that epoll item (in response to epoll_wait). Therefore the deleting thread does not know when it is safe to delete resources pertaining to the associated epoll item because another thread might be using those resources. The deleting thread could wait an arbitrary amount of time after calling epoll_ctl with EPOLL_CTL_DEL and before deleting the item, but this is inefficient and could result in the destruction of resources before another thread is done handling an event returned by epoll_wait. This patch enhances epoll_ctl to support EPOLL_CTL_DISABLE, which disables an epoll item. If epoll_ctl returns -EBUSY in this case, then another thread may handling a return from epoll_wait for this item. Otherwise if epoll_ctl returns 0, then it is safe to delete the epoll item. This allows multiple threads to use a mutex to determine when it is safe to delete an epoll item and its associated resources, which allows epoll items to be deleted both efficiently and without error in a multi-threaded environment. Note that EPOLL_CTL_DISABLE is only useful in conjunction with EPOLLONESHOT, and using EPOLL_CTL_DISABLE on an epoll item without EPOLLONESHOT returns -EINVAL. This patch also adds a new test_epoll self-test program to both demonstrate the need for this feature and test it. Signed-off-by: Paton J. Lewis <palewis@adobe.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jason Baron <jbaron@redhat.com> Cc: Paul Holland <pholland@adobe.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14backlight-add-new-lp8788-backlight-driver-checkpatch-fixesAndrew Morton
WARNING: please write a paragraph that describes the config symbol fully #45: FILE: drivers/video/backlight/Kconfig:380: +config BACKLIGHT_LP8788 ERROR: code indent should use tabs where possible #446: FILE: include/linux/mfd/lp8788.h:242: +^I^I Only valid when bl_mode is LP8788_BL_COMB_PWM_BASED$ total: 1 errors, 1 warnings, 403 lines checked NOTE: whitespace errors detected, you may wish to use scripts/cleanpatch or scripts/cleanfile ./patches/backlight-add-new-lp8788-backlight-driver.patch has style problems, please review. If any of these errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: "Kim, Milo" <Milo.Kim@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14backlight: add new lp8788 backlight driverKim, Milo
TI LP8788 PMU supports regulators, battery charger, RTC, ADC, backlight dri= ver and current sinks. This patch enables LP8788 backlight module. (Brightness mode) The brightness is controlled by PWM input or I2C register. All modes are supported in the driver. (Platform data) Configurable data can be defined in the platform side. name : backlight driver name. (default: "lcd-backlight") initial_brightness : initial value of backlight brightness bl_mode : brightness control by PWM or lp8788 register dim_mode : dimming mode selection full_scale : full scale current setting rise_time : brightness ramp up step time fall_time : brightness ramp down step time pwm_pol : PWM polarity setting when bl_mode is PWM based period_ns : platform specific PWM period value. unit is nano. The default values are set in case no platform data is defined. Signed-off-by: Milo(Woogyom) Kim <milo.kim@ti.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Samuel Ortiz <sameo@linux.intel.com> Cc: Thierry Reding <thierry.reding@avionic-design.de> Cc: "devendra.aaru" <devendra.aaru@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14include/linux/fs.h: disable preempt when acquire i_size_seqcount write lockFan Du
Two rt tasks bind to one CPU core. The higher priority rt task A preempts a lower priority rt task B which has already taken the write seq lock, and then the higher priority rt task A try to acquire read seq lock, it's doomed to lockup. rt task A with lower priority: call write i_size_write rt task B with higher priority: call sync, and preempt task A write_seqcount_begin(&inode->i_size_seqcount); i_size_read inode->i_size = i_size; read_seqcount_begin <-- lockup here... So disable preempt when acquiring every i_size_seqcount *write* lock will cure the problem. Signed-off-by: Fan Du <fan.du@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14smp: make smp_call_function_many() use logic similar to ↵Shaohua Li
smp_call_function_single() I'm testing swapout workload in a two-socket Xeon machine. The workload has 10 threads, each thread sequentially accesses separate memory region. TLB flush overhead is very big in the workload. For each page, page reclaim need move it from active lru list and then unmap it. Both need a TLB flush. And this is a multthread workload, TLB flush happens in 10 CPUs. In X86, TLB flush uses generic smp_call)function. So this workload stress smp_call_function_many heavily. Without patch, perf shows: + 24.49% [k] generic_smp_call_function_interrupt - 21.72% [k] _raw_spin_lock - _raw_spin_lock + 79.80% __page_check_address + 6.42% generic_smp_call_function_interrupt + 3.31% get_swap_page + 2.37% free_pcppages_bulk + 1.75% handle_pte_fault + 1.54% put_super + 1.41% grab_super_passive + 1.36% __swap_duplicate + 0.68% blk_flush_plug_list + 0.62% swap_info_get + 6.55% [k] flush_tlb_func + 6.46% [k] smp_call_function_many + 5.09% [k] call_function_interrupt + 4.75% [k] default_send_IPI_mask_sequence_phys + 2.18% [k] find_next_bit swapout throughput is around 1300M/s. With the patch, perf shows: - 27.23% [k] _raw_spin_lock - _raw_spin_lock + 80.53% __page_check_address + 8.39% generic_smp_call_function_single_interrupt + 2.44% get_swap_page + 1.76% free_pcppages_bulk + 1.40% handle_pte_fault + 1.15% __swap_duplicate + 1.05% put_super + 0.98% grab_super_passive + 0.86% blk_flush_plug_list + 0.57% swap_info_get + 8.25% [k] default_send_IPI_mask_sequence_phys + 7.55% [k] call_function_interrupt + 7.47% [k] smp_call_function_many + 7.25% [k] flush_tlb_func + 3.81% [k] _raw_spin_lock_irqsave + 3.78% [k] generic_smp_call_function_single_interrupt swapout throughput is around 1400M/s. So there is around a 7% improvement, and total cpu utilization doesn't change. Without the patch, cfd_data is shared by all CPUs. generic_smp_call_function_interrupt does read/write cfd_data several times which will create a lot of cache ping-pong. With the patch, the data becomes per-cpu. The ping-pong is avoided. And from the perf data, this doesn't make call_single_queue lock contend. Next step is to remove generic_smp_call_function_interrupt() from arch code. Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14block: optionally snapshot page contents to provide stable pages during writeDarrick J. Wong
This provides a band-aid to provide stable page writes on jbd without needing to backport the fixed locking and page writeback bit handling schemes of jbd2. The band-aid works by using bounce buffers to snapshot page contents instead of waiting. For those wondering about the ext3 bandage -- fixing the jbd locking (which was done as part of ext4dev years ago) is a lot of surgery, and setting PG_writeback on data pages when we actually hold the page lock dropped ext3 performance by nearly an order of magnitude. If we're going to migrate iscsi and raid to use stable page writes, the complaints about high latency will likely return. We might as well centralize their page snapshotting thing to one place. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Artem Bityutskiy <dedekind1@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14mm: only enforce stable page writes if the backing device requires itDarrick J. Wong
Create a helper function to check if a backing device requires stable page writes and, if so, performs the necessary wait. Then, make it so that all points in the memory manager that handle making pages writable use the helper function. This should provide stable page write support to most filesystems, while eliminating unnecessary waiting for devices that don't require the feature. Before this patchset, all filesystems would block, regardless of whether or not it was necessary. ext3 would wait, but still generate occasional checksum errors. The network filesystems were left to do their own thing, so they'd wait too. After this patchset, all the disk filesystems except ext3 and btrfs will wait only if the hardware requires it. ext3 (if necessary) snapshots pages instead of blocking, and btrfs provides its own bdi so the mm will never wait. Network filesystems haven't been touched, so either they provide their own stable page guarantees or they don't block at all. The blocking behavior is back to what it was before 3.0 if you don't have a disk requiring stable page writes. Here's the result of using dbench to test latency on ext2: 3.8.0-rc3: Operation Count AvgLat MaxLat ---------------------------------------- WriteX 109347 0.028 59.817 ReadX 347180 0.004 3.391 Flush 15514 29.828 287.283 Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms 3.8.0-rc3 + patches: WriteX 105556 0.029 4.273 ReadX 335004 0.005 4.112 Flush 14982 30.540 298.634 Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms As you can see, the maximum write latency drops considerably with this patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave similarly, but see the cover letter for those results. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14bdi: allow block devices to say that they require stable page writesDarrick J. Wong
This patchset ("stable page writes, part 2") makes some key modifications to the original 'stable page writes' patchset. First, it provides creators (devices and filesystems) of a backing_dev_info a flag that declares whether or not it is necessary to ensure that page contents cannot change during writeout. It is no longer assumed that this is true of all devices (which was never true anyway). Second, the flag is used to relaxed the wait_on_page_writeback calls so that wait only occurs if the device needs it. Third, it fixes up the remaining disk-backed filesystems to use this improved conditional-wait logic to provide stable page writes on those filesystems. It is hoped that (for people not using checksumming devices, anyway) this patchset will give back unnecessary performance decreases since the original stable page write patchset went into 3.0. Sorry about not fixing it sooner. Complaints were registered by several people about the long write latencies introduced by the original stable page write patchset. Generally speaking, the kernel ought to allocate as little extra memory as possible to facilitate writeout, but for people who simply cannot wait, a second page stability strategy is (re)introduced: snapshotting page contents. The waiting behavior is still the default strategy; to enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be set. This flag is used to bandaid^Henable stable page writeback on ext3[1], and is not used anywhere else. Given that there are already a few storage devices and network FSes that have rolled their own page stability wait/page snapshot code, it would be nice to move towards consolidating all of these. It seems possible that iscsi and raid5 may wish to use the new stable page write support to enable zero-copy writeout. Thank you to Jan Kara for helping fix a couple more filesystems. Per Andrew Morton's request, here are the result of using dbench to measure latencies on ext2: 3.8.0-rc3: Operation Count AvgLat MaxLat ---------------------------------------- WriteX 109347 0.028 59.817 ReadX 347180 0.004 3.391 Flush 15514 29.828 287.283 Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms 3.8.0-rc3 + patches: WriteX 105556 0.029 4.273 ReadX 335004 0.005 4.112 Flush 14982 30.540 298.634 Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms As you can see, for ext2 the maximum write latency decreases from ~60ms on a laptop hard disk to ~4ms. I'm not sure why the flush latencies increase, though I suspect that being able to dirty pages faster gives the flusher more work to do. On ext4, the average write latency decreases as well as all the maximum latencies: 3.8.0-rc3: WriteX 85624 0.152 33.078 ReadX 272090 0.010 61.210 Flush 12129 36.219 168.260 Throughput 44.8618 MB/sec 4 clients 4 procs max_latency=168.276 ms 3.8.0-rc3 + patches: WriteX 86082 0.141 30.928 ReadX 273358 0.010 36.124 Flush 12214 34.800 165.689 Throughput 44.9941 MB/sec 4 clients 4 procs max_latency=165.722 ms XFS seems to exhibit similar latency improvements as ext2: 3.8.0-rc3: WriteX 125739 0.028 104.343 ReadX 399070 0.005 4.115 Flush 17851 25.004 131.390 Throughput 66.0024 MB/sec 4 clients 4 procs max_latency=131.406 ms 3.8.0-rc3 + patches: WriteX 123529 0.028 6.299 ReadX 392434 0.005 4.287 Flush 17549 25.120 188.687 Throughput 64.9113 MB/sec 4 clients 4 procs max_latency=188.704 ms ...and btrfs, just to round things out, also shows some latency decreases: 3.8.0-rc3: WriteX 67122 0.083 82.355 ReadX 212719 0.005 2.828 Flush 9547 47.561 147.418 Throughput 35.3391 MB/sec 4 clients 4 procs max_latency=147.433 ms 3.8.0-rc3 + patches: WriteX 64898 0.101 71.631 ReadX 206673 0.005 7.123 Flush 9190 47.963 219.034 Throughput 34.0795 MB/sec 4 clients 4 procs max_latency=219.044 ms Before this patchset, all filesystems would block, regardless of whether or not it was necessary. ext3 would wait, but still generate occasional checksum errors. The network filesystems were left to do their own thing, so they'd wait too. After this patchset, all the disk filesystems except ext3 and btrfs will wait only if the hardware requires it. ext3 (if necessary) snapshots pages instead of blocking, and btrfs provides its own bdi so the mm will never wait. Network filesystems haven't been touched, so either they provide their own wait code, or they don't block at all. The blocking behavior is back to what it was before 3.0 if you don't have a disk requiring stable page writes. This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and xfs. I've spot-checked 3.8.0-rc4 and seem to be getting the same results as -rc3. [1] The alternative fixes to ext3 include fixing the locking order and page bit handling like we did for ext4 (but then why not just use ext4?), or setting PG_writeback so early that ext3 becomes extremely slow. I tried that, but the number of write()s I could initiate dropped by nearly an order of magnitude. That was a bit much even for the author of the stable page series! :) This patch: Creates a per-backing-device flag that tracks whether or not pages must be held immutable during writeout. Eventually it will be used to waive wait_for_page_writeback() if nothing requires stable pages. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14mm: add vm event counters for balloon pages compactionRafael Aquini
Introduce a new set of vm event counters to keep track of ballooned pages compaction activity. Signed-off-by: Rafael Aquini <aquini@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <andi@firstfloor.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14net: change type of netns_ipvs->sysctl_sync_qlen_maxZhang Yanfei
This member of struct netns_ipvs is calculated from nr_free_buffer_pages so change its type to unsigned long in case of overflow. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Simon Horman <horms@verge.net.au> Cc: Julian Anastasov <ja@ssi.bg> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-02-14vmscan: change type of vm_total_pages to unsigned longZhang Yanfei
This variable is calculated from nr_free_pagecache_pages so change its type to unsigned long. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>