summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2013-02-23mm: vmscan: save work scanning (almost) empty LRU listsJohannes Weiner
In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum amount of pages is scanned from the LRU lists on each iteration, to make progress. Do not make this minimum bigger than the respective LRU list size, however, and save some busy work trying to isolate and reclaim pages that are not there. Empty LRU lists are quite common with memory cgroups in NUMA environments because there exists a set of LRU lists for each zone for each memory cgroup, while the memory of a single cgroup is expected to stay on just one node. The number of expected empty LRU lists is thus memcgs * (nodes - 1) * lru types Each attempt to reclaim from an empty LRU list does expensive size comparisons between lists, acquires the zone's lru lock etc. Avoid that. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Satoru Moriya <satoru.moriya@hds.com> Cc: Simon Jeons <simon.jeons@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: memcg: only evict file pages when we have plentyJohannes Weiner
Commit e9868505987a ("mm, vmscan: only evict file pages when we have plenty") makes a point of not going for anonymous memory while there is still enough inactive cache around. The check was added only for global reclaim, but it is just as useful to reduce swapping in memory cgroup reclaim: 200M-memcg-defconfig-j2 vanilla patched Real time 454.06 ( +0.00%) 453.71 ( -0.08%) User time 668.57 ( +0.00%) 668.73 ( +0.02%) System time 128.92 ( +0.00%) 129.53 ( +0.46%) Swap in 1246.80 ( +0.00%) 814.40 ( -34.65%) Swap out 1198.90 ( +0.00%) 827.00 ( -30.99%) Pages allocated 16431288.10 ( +0.00%) 16434035.30 ( +0.02%) Major faults 681.50 ( +0.00%) 593.70 ( -12.86%) THP faults 237.20 ( +0.00%) 242.40 ( +2.18%) THP collapse 241.20 ( +0.00%) 248.50 ( +3.01%) THP splits 157.30 ( +0.00%) 161.40 ( +2.59%) Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Satoru Moriya <satoru.moriya@hds.com> Cc: Simon Jeons <simon.jeons@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23CMA: make putback_lru_pages() call conditionalSrinivas Pandruvada
As per documentation and other places calling putback_lru_pages(), putback_lru_pages() is called on error only. Make the CMA code behave consistently. [akpm@linux-foundation.org: remove a test-n-branch in the wrapup code] Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm/hugetlb.c: convert to pr_foo()Andrew Morton
Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm/memcontrol.c: convert printk(KERN_FOO) to pr_foo()Andrew Morton
Acked-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23memcg, oom: provide more precise dump info while memcg oom happeningSha Zhengju
Currently when a memcg oom is happening the oom dump messages is still global state and provides few useful info for users. This patch prints more pointed memcg page statistics for memcg-oom and take hierarchy into consideration: Based on Michal's advice, we take hierarchy into consideration: supppose we trigger an OOM on A's limit root_memcg | A (use_hierachy=1) / \ B C | D then the printed info will be: Memory cgroup stats for /A:... Memory cgroup stats for /A/B:... Memory cgroup stats for /A/C:... Memory cgroup stats for /A/B/D:... Following are samples of oom output: (1) Before change: mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fbfb>] dump_header+0x83/0x1ca ..... (call trace) [<ffffffff8168a818>] page_fault+0x28/0x30 <<<<<<<<<<<<<<<<<<<<< memcg specific information Task in /A/B/D killed as a result of limit of /A memory: usage 101376kB, limit 101376kB, failcnt 57 memory+swap: usage 101376kB, limit 101376kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat Mem-Info: Node 0 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 ...... CPU 3: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 173 ...... CPU 3: hi: 186, btch: 31 usd: 130 <<<<<<<<<<<<<<<<<<<<< print global page state active_anon:92963 inactive_anon:40777 isolated_anon:0 active_file:33027 inactive_file:51718 isolated_file:0 unevictable:0 dirty:3 writeback:0 unstable:0 free:729995 slab_reclaimable:6897 slab_unreclaimable:6263 mapped:20278 shmem:35971 pagetables:5885 bounce:0 free_cma:0 <<<<<<<<<<<<<<<<<<<<< print per zone page state Node 0 DMA free:15836kB ... all_unreclaimable? no lowmem_reserve[]: 0 3175 3899 3899 Node 0 DMA32 free:2888564kB ... all_unrelaimable? no lowmem_reserve[]: 0 0 724 724 lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB 120710 total pagecache pages 0 pages in swap cache <<<<<<<<<<<<<<<<<<<<< print global swap cache stat Swap cache stats: add 0, delete 0, find 0/0 Free swap = 499708kB Total swap = 499708kB 1040368 pages RAM 58678 pages reserved 169065 pages shared 173632 pages non-shared [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2693] 0 2693 6005 1324 17 0 0 god [ 2754] 0 2754 6003 1320 16 0 0 god [ 2811] 0 2811 5992 1304 18 0 0 god [ 2874] 0 2874 6005 1323 18 0 0 god [ 2935] 0 2935 8720 7742 21 0 0 mal-30 [ 2976] 0 2976 21520 17577 42 0 0 mal-80 Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB We can see that messages dumped by show_free_areas() are longsome and can provide so limited info for memcg that just happen oom. (2) After change mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fd0b>] dump_header+0x83/0x1d1 .......(call trace) [<ffffffff8168a918>] page_fault+0x28/0x30 Task in /A/B/D killed as a result of limit of /A <<<<<<<<<<<<<<<<<<<<< memcg specific information memory: usage 102400kB, limit 102400kB, failcnt 140 memory+swap: usage 102400kB, limit 102400kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2260] 0 2260 6006 1325 18 0 0 god [ 2383] 0 2383 6003 1319 17 0 0 god [ 2503] 0 2503 6004 1321 18 0 0 god [ 2622] 0 2622 6004 1321 16 0 0 god [ 2695] 0 2695 8720 7741 22 0 0 mal-30 [ 2704] 0 2704 21520 17839 43 0 0 mal-80 Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB This version provides more pointed info for memcg in "Memory cgroup stats for XXX" section. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-22clean shmem_file_setup() a bitAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22fs: Preserve error code in get_empty_filp(), part 2Anatol Pomozov
Allocating a file structure in function get_empty_filp() might fail because of several reasons: - not enough memory for file structures - operation is not allowed - user is over its limit Currently the function returns NULL in all cases and we loose the exact reason of the error. All callers of get_empty_filp() assume that the function can fail with ENFILE only. Return error through pointer. Change all callers to preserve this error code. [AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit (things remaining here deal with alloc_file()), removed pipe(2) behaviour change] Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22new helper: file_inode(file)Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-21Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm changes from Peter Anvin: "This is a huge set of several partly interrelated (and concurrently developed) changes, which is why the branch history is messier than one would like. The *really* big items are two humonguous patchsets mostly developed by Yinghai Lu at my request, which completely revamps the way we create initial page tables. In particular, rather than estimating how much memory we will need for page tables and then build them into that memory -- a calculation that has shown to be incredibly fragile -- we now build them (on 64 bits) with the aid of a "pseudo-linear mode" -- a #PF handler which creates temporary page tables on demand. This has several advantages: 1. It makes it much easier to support things that need access to data very early (a followon patchset uses this to load microcode way early in the kernel startup). 2. It allows the kernel and all the kernel data objects to be invoked from above the 4 GB limit. This allows kdump to work on very large systems. 3. It greatly reduces the difference between Xen and native (Xen's equivalent of the #PF handler are the temporary page tables created by the domain builder), eliminating a bunch of fragile hooks. The patch series also gets us a bit closer to W^X. Additional work in this pull is the 64-bit get_user() work which you were also involved with, and a bunch of cleanups/speedups to __phys_addr()/__pa()." * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (105 commits) x86, mm: Move reserving low memory later in initialization x86, doc: Clarify the use of asm("%edx") in uaccess.h x86, mm: Redesign get_user with a __builtin_choose_expr hack x86: Be consistent with data size in getuser.S x86, mm: Use a bitfield to mask nuisance get_user() warnings x86/kvm: Fix compile warning in kvm_register_steal_time() x86-32: Add support for 64bit get_user() x86-32, mm: Remove reference to alloc_remap() x86-32, mm: Remove reference to resume_map_numa_kva() x86-32, mm: Rip out x86_32 NUMA remapping code x86/numa: Use __pa_nodebug() instead x86: Don't panic if can not alloc buffer for swiotlb mm: Add alloc_bootmem_low_pages_nopanic() x86, 64bit, mm: hibernate use generic mapping_init x86, 64bit, mm: Mark data/bss/brk to nx x86: Merge early kernel reserve for 32bit and 64bit x86: Add Crash kernel low reservation x86, kdump: Remove crashkernel range find limit for 64bit memblock: Add memblock_mem_size() x86, boot: Not need to check setup_header version for setup_data ...
2013-02-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 update from Martin Schwidefsky: "The most prominent change in this patch set is the software dirty bit patch for s390. It removes __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY and the page_test_and_clear_dirty primitive which makes the common memory management code a bit less obscure. Heiko fixed most of the PCI related fallout, more often than not missing GENERIC_HARDIRQS dependencies. Notable is one of the 3270 patches which adds an export to tty_io to be able to resize a tty. The rest is the usual bunch of cleanups and bug fixes." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits) s390/module: Add missing R_390_NONE relocation type drivers/gpio: add missing GENERIC_HARDIRQ dependency drivers/input: add couple of missing GENERIC_HARDIRQS dependencies s390/cleanup: rename SPP to LPP s390/mm: implement software dirty bits s390/mm: Fix crst upgrade of mmap with MAP_FIXED s390/linker skript: discard exit.data at runtime drivers/media: add missing GENERIC_HARDIRQS dependency s390/bpf,jit: add vlan tag support drivers/net,AT91RM9200: add missing GENERIC_HARDIRQS dependency iucv: fix kernel panic at reboot s390/Kconfig: sort list of arch selected config options phylib: remove !S390 dependeny from Kconfig uio: remove !S390 dependency from Kconfig dasd: fix sysfs cleanup in dasd_generic_remove s390/pci: fix hotplug module init s390/pci: cleanup clp page allocation s390/pci: cleanup clp inline assembly s390/perf: cpum_cf: fallback to software sampling events s390/mm: provide PAGE_SHARED define ...
2013-02-21Merge branch 'akpm' (incoming from Andrew)Linus Torvalds
Merge misc patches from Andrew Morton: - Florian has vanished so I appear to have become fbdev maintainer again :( - Joel and Mark are distracted to welcome to the new OCFS2 maintainer - The backlight queue - Small core kernel changes - lib/ updates - The rtc queue - Various random bits * akpm: (164 commits) rtc: rtc-davinci: use devm_*() functions rtc: rtc-max8997: use devm_request_threaded_irq() rtc: rtc-max8907: use devm_request_threaded_irq() rtc: rtc-da9052: use devm_request_threaded_irq() rtc: rtc-wm831x: use devm_request_threaded_irq() rtc: rtc-tps80031: use devm_request_threaded_irq() rtc: rtc-lp8788: use devm_request_threaded_irq() rtc: rtc-coh901331: use devm_clk_get() rtc: rtc-vt8500: use devm_*() functions rtc: rtc-tps6586x: use devm_request_threaded_irq() rtc: rtc-imxdi: use devm_clk_get() rtc: rtc-cmos: use dev_warn()/dev_dbg() instead of printk()/pr_debug() rtc: rtc-pcf8583: use dev_warn() instead of printk() rtc: rtc-sun4v: use pr_warn() instead of printk() rtc: rtc-vr41xx: use dev_info() instead of printk() rtc: rtc-rs5c313: use pr_err() instead of printk() rtc: rtc-at91rm9200: use dev_dbg()/dev_err() instead of printk()/pr_debug() rtc: rtc-rs5c372: use dev_dbg()/dev_warn() instead of printk()/pr_debug() rtc: rtc-ds2404: use dev_err() instead of printk() rtc: rtc-efi: use dev_err()/dev_warn()/pr_err() instead of printk() ...
2013-02-21block: optionally snapshot page contents to provide stable pages during writeDarrick J. Wong
This provides a band-aid to provide stable page writes on jbd without needing to backport the fixed locking and page writeback bit handling schemes of jbd2. The band-aid works by using bounce buffers to snapshot page contents instead of waiting. For those wondering about the ext3 bandage -- fixing the jbd locking (which was done as part of ext4dev years ago) is a lot of surgery, and setting PG_writeback on data pages when we actually hold the page lock dropped ext3 performance by nearly an order of magnitude. If we're going to migrate iscsi and raid to use stable page writes, the complaints about high latency will likely return. We might as well centralize their page snapshotting thing to one place. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Artem Bityutskiy <dedekind1@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21mm: only enforce stable page writes if the backing device requires itDarrick J. Wong
Create a helper function to check if a backing device requires stable page writes and, if so, performs the necessary wait. Then, make it so that all points in the memory manager that handle making pages writable use the helper function. This should provide stable page write support to most filesystems, while eliminating unnecessary waiting for devices that don't require the feature. Before this patchset, all filesystems would block, regardless of whether or not it was necessary. ext3 would wait, but still generate occasional checksum errors. The network filesystems were left to do their own thing, so they'd wait too. After this patchset, all the disk filesystems except ext3 and btrfs will wait only if the hardware requires it. ext3 (if necessary) snapshots pages instead of blocking, and btrfs provides its own bdi so the mm will never wait. Network filesystems haven't been touched, so either they provide their own stable page guarantees or they don't block at all. The blocking behavior is back to what it was before 3.0 if you don't have a disk requiring stable page writes. Here's the result of using dbench to test latency on ext2: 3.8.0-rc3: Operation Count AvgLat MaxLat ---------------------------------------- WriteX 109347 0.028 59.817 ReadX 347180 0.004 3.391 Flush 15514 29.828 287.283 Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms 3.8.0-rc3 + patches: WriteX 105556 0.029 4.273 ReadX 335004 0.005 4.112 Flush 14982 30.540 298.634 Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms As you can see, the maximum write latency drops considerably with this patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave similarly, but see the cover letter for those results. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21bdi: allow block devices to say that they require stable page writesDarrick J. Wong
This patchset ("stable page writes, part 2") makes some key modifications to the original 'stable page writes' patchset. First, it provides creators (devices and filesystems) of a backing_dev_info a flag that declares whether or not it is necessary to ensure that page contents cannot change during writeout. It is no longer assumed that this is true of all devices (which was never true anyway). Second, the flag is used to relaxed the wait_on_page_writeback calls so that wait only occurs if the device needs it. Third, it fixes up the remaining disk-backed filesystems to use this improved conditional-wait logic to provide stable page writes on those filesystems. It is hoped that (for people not using checksumming devices, anyway) this patchset will give back unnecessary performance decreases since the original stable page write patchset went into 3.0. Sorry about not fixing it sooner. Complaints were registered by several people about the long write latencies introduced by the original stable page write patchset. Generally speaking, the kernel ought to allocate as little extra memory as possible to facilitate writeout, but for people who simply cannot wait, a second page stability strategy is (re)introduced: snapshotting page contents. The waiting behavior is still the default strategy; to enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be set. This flag is used to bandaid^Henable stable page writeback on ext3[1], and is not used anywhere else. Given that there are already a few storage devices and network FSes that have rolled their own page stability wait/page snapshot code, it would be nice to move towards consolidating all of these. It seems possible that iscsi and raid5 may wish to use the new stable page write support to enable zero-copy writeout. Thank you to Jan Kara for helping fix a couple more filesystems. Per Andrew Morton's request, here are the result of using dbench to measure latencies on ext2: 3.8.0-rc3: Operation Count AvgLat MaxLat ---------------------------------------- WriteX 109347 0.028 59.817 ReadX 347180 0.004 3.391 Flush 15514 29.828 287.283 Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms 3.8.0-rc3 + patches: WriteX 105556 0.029 4.273 ReadX 335004 0.005 4.112 Flush 14982 30.540 298.634 Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms As you can see, for ext2 the maximum write latency decreases from ~60ms on a laptop hard disk to ~4ms. I'm not sure why the flush latencies increase, though I suspect that being able to dirty pages faster gives the flusher more work to do. On ext4, the average write latency decreases as well as all the maximum latencies: 3.8.0-rc3: WriteX 85624 0.152 33.078 ReadX 272090 0.010 61.210 Flush 12129 36.219 168.260 Throughput 44.8618 MB/sec 4 clients 4 procs max_latency=168.276 ms 3.8.0-rc3 + patches: WriteX 86082 0.141 30.928 ReadX 273358 0.010 36.124 Flush 12214 34.800 165.689 Throughput 44.9941 MB/sec 4 clients 4 procs max_latency=165.722 ms XFS seems to exhibit similar latency improvements as ext2: 3.8.0-rc3: WriteX 125739 0.028 104.343 ReadX 399070 0.005 4.115 Flush 17851 25.004 131.390 Throughput 66.0024 MB/sec 4 clients 4 procs max_latency=131.406 ms 3.8.0-rc3 + patches: WriteX 123529 0.028 6.299 ReadX 392434 0.005 4.287 Flush 17549 25.120 188.687 Throughput 64.9113 MB/sec 4 clients 4 procs max_latency=188.704 ms ...and btrfs, just to round things out, also shows some latency decreases: 3.8.0-rc3: WriteX 67122 0.083 82.355 ReadX 212719 0.005 2.828 Flush 9547 47.561 147.418 Throughput 35.3391 MB/sec 4 clients 4 procs max_latency=147.433 ms 3.8.0-rc3 + patches: WriteX 64898 0.101 71.631 ReadX 206673 0.005 7.123 Flush 9190 47.963 219.034 Throughput 34.0795 MB/sec 4 clients 4 procs max_latency=219.044 ms Before this patchset, all filesystems would block, regardless of whether or not it was necessary. ext3 would wait, but still generate occasional checksum errors. The network filesystems were left to do their own thing, so they'd wait too. After this patchset, all the disk filesystems except ext3 and btrfs will wait only if the hardware requires it. ext3 (if necessary) snapshots pages instead of blocking, and btrfs provides its own bdi so the mm will never wait. Network filesystems haven't been touched, so either they provide their own wait code, or they don't block at all. The blocking behavior is back to what it was before 3.0 if you don't have a disk requiring stable page writes. This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and xfs. I've spot-checked 3.8.0-rc4 and seem to be getting the same results as -rc3. [1] The alternative fixes to ext3 include fixing the locking order and page bit handling like we did for ext4 (but then why not just use ext4?), or setting PG_writeback so early that ext3 becomes extremely slow. I tried that, but the number of write()s I could initiate dropped by nearly an order of magnitude. That was a bit much even for the author of the stable page series! :) This patch: Creates a per-backing-device flag that tracks whether or not pages must be held immutable during writeout. Eventually it will be used to waive wait_for_page_writeback() if nothing requires stable pages. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21Merge tag 'driver-core-3.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core patches from Greg Kroah-Hartman: "Here is the big driver core merge for 3.9-rc1 There are two major series here, both of which touch lots of drivers all over the kernel, and will cause you some merge conflicts: - add a new function called devm_ioremap_resource() to properly be able to check return values. - remove CONFIG_EXPERIMENTAL Other than those patches, there's not much here, some minor fixes and updates" Fix up trivial conflicts * tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits) base: memory: fix soft/hard_offline_page permissions drivercore: Fix ordering between deferred_probe and exiting initcalls backlight: fix class_find_device() arguments TTY: mark tty_get_device call with the proper const values driver-core: constify data for class_find_device() firmware: Ignore abort check when no user-helper is used firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER firmware: Make user-mode helper optional firmware: Refactoring for splitting user-mode helper code Driver core: treat unregistered bus_types as having no devices watchdog: Convert to devm_ioremap_resource() thermal: Convert to devm_ioremap_resource() spi: Convert to devm_ioremap_resource() power: Convert to devm_ioremap_resource() mtd: Convert to devm_ioremap_resource() mmc: Convert to devm_ioremap_resource() mfd: Convert to devm_ioremap_resource() media: Convert to devm_ioremap_resource() iommu: Convert to devm_ioremap_resource() drm: Convert to devm_ioremap_resource() ...
2013-02-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: "The biggest part of this pull request is a patch series from Maxim Patlasov to optimize scatter-gather direct IO. There's also the addition of a "readdirplus" API, poll events and various fixes and cleanups. There's a one line change outside of fuse to mm/filemap.c which makes the argument of iov_iter_single_seg_count() const, required by Maxim's patches." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (22 commits) fuse: allow control of adaptive readdirplus use Synchronize fuse header with one used in library fuse: send poll events fuse: don't WARN when nlink is zero fuse: avoid out-of-scope stack access fuse: bump version for READDIRPLUS FUSE: Adapt readdirplus to application usage patterns Do not use RCU for current process credentials fuse: cleanup fuse_direct_io() fuse: optimize __fuse_direct_io() fuse: optimize fuse_get_user_pages() fuse: pass iov[] to fuse_get_user_pages() mm: minor cleanup of iov_iter_single_seg_count() fuse: use req->page_descs[] for argpages cases fuse: add per-page descriptor <offset, length> to fuse_req fuse: rework fuse_do_ioctl() fuse: rework fuse_perform_write() fuse: rework fuse_readpages() fuse: rework fuse_retrieve() fuse: categorize fuse_get_req() ...
2013-02-19Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "Main changes: - scheduler side full-dynticks (user-space execution is undisturbed and receives no timer IRQs) preparation changes that convert the cputime accounting code to be full-dynticks ready, from Frederic Weisbecker. - Initial sched.h split-up changes, by Clark Williams - select_idle_sibling() performance improvement by Mike Galbraith: " 1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs " - sched_rr_get_interval() ABI fix/change. We think this detail is not used by apps (so it's not an ABI in practice), but lets keep it under observation. - misc RT scheduling cleanups, optimizations" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h> cputime: Remove irqsave from seqlock readers sched, powerpc: Fix sched.h split-up build failure cputime: Restore CPU_ACCOUNTING config defaults for PPC64 sched/rt: Move rt specific bits into new header file sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice sched: Move sched.h sysctl bits into separate header sched: Fix signedness bug in yield_to() sched: Fix select_idle_sibling() bouncing cow syndrome sched/rt: Further simplify pick_rt_task() sched/rt: Do not account zero delta_exec in update_curr_rt() cputime: Safely read cputime of full dynticks CPUs kvm: Prepare to add generic guest entry/exit callbacks cputime: Use accessors to read task cputime stats cputime: Allow dynamic switch between tick/virtual based cputime accounting cputime: Generic on-demand virtual cputime accounting cputime: Move default nsecs_to_cputime() to jiffies based cputime file cputime: Librarize per nsecs resolution cputime definitions cputime: Avoid multiplication overflow on utime scaling context_tracking: Export context state for generic vtime ... Fix up conflict in kernel/context_tracking.c due to comment additions.
2013-02-18mm: fix pageblock bitmap allocationLinus Torvalds
Commit c060f943d092 ("mm: use aligned zone start for pfn_to_bitidx calculation") fixed out calculation of the index into the pageblock bitmap when a !SPARSEMEM zome was not aligned to pageblock_nr_pages. However, the _allocation_ of that bitmap had never taken this alignment requirement into accout, so depending on the exact size and alignment of the zone, the use of that index could then access past the allocation, resulting in some very subtle memory corruption. This was reported (and bisected) by Ingo Molnar: one of his random config builds would hang with certain very specific kernel command line options. In the meantime, commit c060f943d092 has been marked for stable, so this fix needs to be back-ported to the stable kernels that backported the commit to use the right alignment. Bisected-and-tested-by: Ingo Molnar <mingo@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-14s390/mm: implement software dirty bitsMartin Schwidefsky
The s390 architecture is unique in respect to dirty page detection, it uses the change bit in the per-page storage key to track page modifications. All other architectures track dirty bits by means of page table entries. This property of s390 has caused numerous problems in the past, e.g. see git commit ef5d437f71afdf4a "mm: fix XFS oops due to dirty pages without buffers on s390". To avoid future issues in regard to per-page dirty bits convert s390 to a fault based software dirty bit detection mechanism. All user page table entries which are marked as clean will be hardware read-only, even if the pte is supposed to be writable. A write by the user process will trigger a protection fault which will cause the user pte to be marked as dirty and the hardware read-only bit is removed. With this change the dirty bit in the storage key is irrelevant for Linux as a host, but the storage key is still required for KVM guests. The effect is that page_test_and_clear_dirty and the related code can be removed. The referenced bit in the storage key is still used by the page_test_and_clear_young primitive to provide page age information. For page cache pages of mappings with mapping_cap_account_dirty there will not be any change in behavior as the dirty bit tracking already uses read-only ptes to control the amount of dirty pages. Only for swap cache pages and pages of mappings without mapping_cap_account_dirty there can be additional protection faults. To avoid an excessive number of additional faults the mk_pte primitive checks for PageDirty if the pgprot value allows for writes and pre-dirties the pte. That avoids all additional faults for tmpfs and shmem pages until these pages are added to the swap cache. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-02-12mm: cma: fix accounting of CMA pages placed in high memoryMarek Szyprowski
The total number of low memory pages is determined as totalram_pages - totalhigh_pages, so without this patch all CMA pageblocks placed in highmem were accounted to low memory. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Acked-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-12memcg: fix kmemcg registration for late cachesGlauber Costa
The designed workflow for the caches in kmemcg is: register it with memcg_register_cache() if kmemcg is already available or later on when a new kmemcg appears at memcg_update_cache_sizes() which will handle all caches in the system. The caches created at boot time will be handled by the later, and the memcg-caches as well as any system caches that are registered later on by the former. There is a bug, however, in memcg_register_cache: we correctly set up the array size, but do not mark the cache as a root cache. This means that allocations for any cache appearing late in the game will see memcg->memcg_params->is_root_cache == false, and in particular, trigger VM_BUG_ON(!cachep->memcg_params->is_root_cache) in __memcg_kmem_cache_get. The obvious fix is to include the missing assignment. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-12mm: don't overwrite mm->def_flags in do_mlockall()Gerald Schaefer
With commit 8e72033f2a48 ("thp: make MADV_HUGEPAGE check for mm->def_flags") the VM_NOHUGEPAGE flag may be set on s390 in mm->def_flags for certain processes, to prevent future thp mappings. This would be overwritten by do_mlockall(), which sets it back to 0 with an optional VM_LOCKED flag set. To fix this, instead of overwriting mm->def_flags in do_mlockall(), only the VM_LOCKED flag should be set or cleared. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Reported-by: Vivek Goyal <vgoyal@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-07sched/rt: Move rt specific bits into new header fileClark Williams
Move rt scheduler definitions out of include/linux/sched.h into new file include/linux/sched/rt.h Signed-off-by: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-07sched: Move sched.h sysctl bits into separate headerClark Williams
Move the sysctl-related bits from include/linux/sched.h into a new file: include/linux/sched/sysctl.h. Then update source files requiring access to those bits by including the new header file. Signed-off-by: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-06mm/sl[au]b: correct allocation type check in kmalloc_slab()Joonsoo Kim
commit "slab: Common Kmalloc cache determination" made mistake in kmalloc_slab(). SLAB_CACHE_DMA is for kmem_cache creation, not for allocation. For allocation, we should use GFP_XXX to identify type of allocation. So, change SLAB_CACHE_DMA to GFP_DMA. Acked-by: Christoph Lameter <cl@linux.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-06slab: Fixup CONFIG_PAGE_ALLOC/DEBUG_SLAB_LEAK sectionsChristoph Lameter
Variables were not properly converted and the conversion caused a naming conflict. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-05mm: fix wrong comments about anon_vma lockYuanhan Liu
We use rwsem since commit 5a505085f043 ("mm/rmap: Convert the struct anon_vma::mutex to an rwsem"). And most of comments are converted to the new rwsem lock; while just 2 more missed from: $ git grep 'anon_vma->mutex' Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-05mm/hugetlb: set PTE as huge in hugetlb_change_protection and ↵Tony Lu
remove_migration_pte When setting a huge PTE, besides calling pte_mkhuge(), we also need to call arch_make_huge_pte(), which we indeed do in make_huge_pte(), but we forget to do in hugetlb_change_protection() and remove_migration_pte(). Signed-off-by: Zhigang Lu <zlu@tilera.com> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-05thp: avoid dumping huge zero pageKirill A. Shutemov
No reason to preserve the huge zero page in core dumps. Reported-by: Michel Lespinasse <walken@google.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-01slab: Common definition for kmem_cache_nodeChristoph Lameter
Put the definitions for the kmem_cache_node structures together so that we have one structure. That will allow us to create more common fields in the future which could yield more opportunities to share code. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Rename list3/l3 to nodeChristoph Lameter
The list3 or l3 pointers are pointing to per node structures. Reflect that in the names of variables used. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Common Kmalloc cache determinationChristoph Lameter
Extract the optimized lookup functions from slub and put them into slab_common.c. Then make slab use these functions as well. Joonsoo notes that this fixes some issues with constant folding which also reduces the code size for slub. https://lkml.org/lkml/2012/10/20/82 Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Common function to create the kmalloc arrayChristoph Lameter
The kmalloc array is created in similar ways in both SLAB and SLUB. Create a common function and have both allocators call that function. V1->V2: Whitespace cleanup Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Common definition for the array of kmalloc cachesChristoph Lameter
Have a common definition fo the kmalloc cache arrays in SLAB and SLUB Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Common constants for kmalloc boundariesChristoph Lameter
Standardize the constants that describe the smallest and largest object kept in the kmalloc arrays for SLAB and SLUB. Differentiate between the maximum size for which a slab cache is used (KMALLOC_MAX_CACHE_SIZE) and the maximum allocatable size (KMALLOC_MAX_SIZE, KMALLOC_MAX_ORDER). Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Rename nodelists to nodeChristoph Lameter
Have a common naming between both slab caches for future changes. Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Common name for the per node structuresChristoph Lameter
Rename the structure used for the per node structures in slab to have a name that expresses that fact. Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Use common kmalloc_index/kmalloc_size functionsChristoph Lameter
Make slab use the common functions. We can get rid of a lot of old ugly stuff as a results. Among them the sizes array and the weird include/linux/kmalloc_sizes file and some pretty bad #include statements in slab_def.h. The one thing that is different in slab is that the 32 byte cache will also be created for arches that have page sizes larger than 4K. There are numerous smaller allocations that SLOB and SLUB can handle better because of their support for smaller allocation sizes so lets keep the 32 byte slab also for arches with > 4K pages. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Use proper formatting specs for unsigned size_tChristoph Lameter
Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-01-29mm: Add alloc_bootmem_low_pages_nopanic()Yinghai Lu
We don't need to panic in some case, like for swiotlb preallocating. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-35-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29memblock: Add memblock_mem_size()Yinghai Lu
Use it to get mem size under the limit_pfn. to replace local version in x86 reserved_initrd. -v2: remove not needed cast that is pointed out by HPA. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-29-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29Merge remote-tracking branch 'origin/x86/boot' into x86/mm2H. Peter Anvin
Coming patches to x86/mm2 require the changes and advanced baseline in x86/boot. Resolved Conflicts: arch/x86/kernel/setup.c mm/nobootmem.c Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-26userns: Allow the userns root to mount tmpfs.Eric W. Biederman
There is no backing store to tmpfs and file creation rules are the same as for any other filesystem so it is semantically safe to allow unprivileged users to mount it. ramfs is safe for the same reasons so allow either flavor of tmpfs to be mounted by a user namespace root user. The memory control group successfully limits how much memory tmpfs can consume on any system that cares about a user namespace root using tmpfs to exhaust memory the memory control group can be deployed. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-01-24mm: minor cleanup of iov_iter_single_seg_count()Maxim Patlasov
The function does not modify iov_iter which 'i' points to. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24Negative (setpoint-dirty) in bdi_position_ratio()paul.szabo@sydney.edu.au
In bdi_position_ratio(), get difference (setpoint-dirty) right even when negative. Both setpoint and dirty are unsigned long, the difference was zero-padded thus wrongly sign-extended to s64. This issue affects all 32-bit architectures, does not affect 64-bit architectures where long and s64 are equivalent. In this function, dirty is between freerun and limit, the pseudo-float x is between [-1,1], expected to be negative about half the time. With zero-padding, instead of a small negative x we obtained a large positive one so bdi_position_ratio() returned garbage. Casting the difference to s64 also prevents overflow with left-shift; though normally these numbers are small and I never observed a 32-bit overflow there. (This patch does not solve the PAE OOM issue.) Paul Szabo psz@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics University of Sydney Australia Reviewed-by: Jan Kara <jack@suse.cz> Reported-by: Paul Szabo <psz@maths.usyd.edu.au> Reference: http://bugs.debian.org/695182 Signed-off-by: Paul Szabo <psz@maths.usyd.edu.au> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2013-01-21taint: add explicit flag to show whether lock dep is still OK.Rusty Russell
Fix up all callers as they were before, with make one change: an unsigned module taints the kernel, but doesn't turn off lockdep. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-01-17Merge 3.9-rc4 into driver-core-nextGreg Kroah-Hartman
This is to fix up a build problem with a wireless driver due to the dynamic-debug patches in this branch. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17mm: remove depends on CONFIG_EXPERIMENTALKees Cook
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: Andrew Morton <akpm@linux-foundation.org> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> CC: Jan Beulich <JBeulich@novell.com> CC: Mel Gorman <mel@csn.ul.ie> CC: Seth Jennings <sjenning@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-14writeback: add more tracepointsTejun Heo
Add tracepoints for page dirtying, writeback_single_inode start, inode dirtying and writeback. For the latter two inode events, a pair of events are defined to denote start and end of the operations (the starting one has _start suffix and the one w/o suffix happens after the operation is complete). These inode ops are FS specific and can be non-trivial and having enclosing tracepoints is useful for external tracers. This is part of tracepoint additions to improve visiblity into dirtying / writeback operations for io tracer and userland. v2: writeback_dirty_inode[_start] TPs may be called for files on pseudo FSes w/ unregistered bdi. Check whether bdi->dev is %NULL before dereferencing. v3: buffer dirtying moved to a block TP. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>