Age | Commit message | Author
2014-05-22 | Add linux-next specific files for 20140522 [next-20140522] | Stephen Rothwell
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
2014-05-22 | Merge branch 'akpm/master' | Stephen Rothwell
2014-05-22 | mm: add strictlimit knob | Maxim Patlasov
The "strictlimit" feature was introduced to enforce per-bdi dirty limits for FUSE, which sets bdi max_ratio to 1% by default: http://article.gmane.org/gmane.linux.kernel.mm/105809 However, the feature can be useful for other relatively slow or untrusted BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the feature: echo 1 > /sys/class/bdi/X:Y/strictlimit Once enabled, the feature enforces the bdi max_ratio limit even if the global (10%) dirty limit is not reached. Of course, the effect is not visible until /sys/class/bdi/X:Y/max_ratio is decreased to some reasonable value. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Artem S. Tashkinov" <t.artem@lycos.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
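A minimal user-space sketch of what the `echo` in the message above does, assuming the knob lives at the sysfs path quoted there. The helper names `bdi_knob_path` and `bdi_write_knob` are hypothetical, and writing the file needs root and a kernel that carries this patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "/sys/class/bdi/<major>:<minor>/<knob>" into buf.
 * Returns 0 on success, -1 if buf is too small. Hypothetical helper. */
int bdi_knob_path(char *buf, size_t size, unsigned major, unsigned minor,
                  const char *knob)
{
    int n;

    assert(buf != NULL && knob != NULL);
    n = snprintf(buf, size, "/sys/class/bdi/%u:%u/%s", major, minor, knob);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Write a decimal value to a per-bdi knob, like "echo value > path".
 * Returns 0 on success, -1 on any failure. Hypothetical helper. */
int bdi_write_knob(unsigned major, unsigned minor, const char *knob,
                   unsigned value)
{
    char path[128];
    FILE *f;
    int ok;

    if (bdi_knob_path(path, sizeof(path), major, minor, knob) != 0)
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = fprintf(f, "%u\n", value) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For a device numbered 8:16 (an arbitrary example), one would first lower the ratio with `bdi_write_knob(8, 16, "max_ratio", 1)` and then call `bdi_write_knob(8, 16, "strictlimit", 1)`, since the message notes the knob has no visible effect until max_ratio is reduced.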
2014-05-22 | ufs: sb mutex merge + mutex_destroy | Fabian Frederick
788257d6101d9 ("ufs: remove the BKL") replaced BKL with mutex protection using functions lock_ufs, unlock_ufs and struct mutex 'mutex' in sb_info. b6963327e052 ("ufs: drop lock/unlock super") removed lock/unlock super and added struct mutex 's_lock' in sb_info. Those 2 mutexes are generally locked/unlocked at the same time except in allocation (balloc, ialloc). This patch merges the 2 mutexes and propagates first commit solution. It also adds mutex destruction before kfree during ufs_fill_super failure and ufs_put_super. [akpm@linux-foundation.org: avoid ifdefs, return -EROFS not -EINVAL] Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: "Chen, Jet" <jet.chen@intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | drivers/w1/w1_int.c: call put_device if device_register fails | Levente Kurusa
Currently, the device is memset and kfree'd when device_register fails, which is bad behaviour. The device will still have a reference count of 1 and hence can cause trouble because it has been kfree'd. The proper way to handle a failed device_register is to call put_device right after it fails. Signed-off-by: Levente Kurusa <levex@linux.com> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | mm: convert some level-less printks to pr_* | Mitchel Humpherys
printk is meant to be used with an associated log level. There are some instances of printk scattered around the mm code where the log level is missing. Add a log level and adhere to suggestions by scripts/checkpatch.pl by moving to the pr_* macros. Also add the typical pr_fmt definition so that print statements can be easily traced back to the modules where they occur, correlated one with another, etc. This will require the removal of some (now redundant) prefixes on a few print statements. Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
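The effect of the "typical pr_fmt definition" can be shown outside the kernel. This is only a sketch: `KBUILD_MODNAME` is defined by hand here (the kernel build normally supplies it), the value "demo" is made up, and `pr_info_buf` is a hypothetical stand-in that formats into a buffer instead of the kernel log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the module name that the kernel build system provides. */
#define KBUILD_MODNAME "demo"

/* The usual definition: prefix every format string with the module name. */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

/* Hypothetical user-space stand-in for pr_info(): same prefixing through
 * pr_fmt(), but the result lands in buf. Uses the GNU ##__VA_ARGS__ form,
 * as the kernel's own macros do. */
#define pr_info_buf(buf, size, fmt, ...) \
    snprintf((buf), (size), pr_fmt(fmt), ##__VA_ARGS__)
```

Because the prefix comes from the macro, call sites no longer need to spell it out, which is why the message mentions removing now redundant prefixes from a few print statements.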
2014-05-22 | MAINTAINERS: adi-buildroot-devel is moderated | Richard Weinberger
Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | MAINTAINERS: add linux-api for review of API/ABI changes | Josh Triplett
This makes it more likely that patch submitters will CC API/ABI changes to the linux-api list, and tools like get_maintainer.pl will do so automatically. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Michael Kerrisk <mtk.man-pages@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | bio-modify-__bio_add_page-to-accept-pages-that-dont-start-a-new-segment-v3 | Maurizio Lombardi
Changes in V3: In case of error, V2 restored the previous number of segments but left the BIO_SEG_FLAG set. To avoid problems, after the page is removed from the bio vec, V3 performs a recount of the segments in the error code path. Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | bio: modify __bio_add_page() to accept pages that don't start a new segment | Maurizio Lombardi
The original behaviour is to refuse to add a new page if the maximum number of segments has been reached, regardless of whether the page we are going to add can be merged into the last segment. Unfortunately, when the system runs under heavy memory fragmentation conditions, a driver may try to add multiple pages to the last segment. The original code won't accept them and EBUSY will be reported to userspace. This patch modifies the function so that it refuses to add a page only if the page starts a new segment and the maximum number of segments has already been reached. The bug can be easily reproduced with the st driver:
1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16
2) modprobe st buffer_kbs=1024
3) # dd if=/dev/zero of=/dev/st0 bs=1M count=10
   dd: error writing `/dev/st0': Device or resource busy
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
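The changed rule can be illustrated with a toy model. This is not the kernel's `struct bio` or `__bio_add_page()`: `struct toy_bio` and `toy_bio_add_page` are hypothetical, and "mergeable" is simplified to "physically contiguous with the last segment", ignoring the segment size and boundary limits a real queue also applies.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: tracks only the segment count and where the last one ends. */
struct toy_bio {
    unsigned long last_end;  /* address just past the last segment */
    unsigned int nr_segs;    /* segments used so far */
    unsigned int max_segs;   /* segment limit */
};

/* Returns len if the page was accepted, 0 if it was refused. */
unsigned int toy_bio_add_page(struct toy_bio *bio, unsigned long phys,
                              unsigned int len)
{
    int merges;

    assert(bio != NULL);
    /* A page starting exactly where the last segment ends extends it. */
    merges = bio->nr_segs > 0 && phys == bio->last_end;

    /* Patched rule: refuse only if a new segment is needed and the
     * limit has already been reached. */
    if (!merges && bio->nr_segs >= bio->max_segs)
        return 0;

    if (!merges)
        bio->nr_segs++;
    bio->last_end = phys + len;
    return len;
}
```

Under the original behaviour described above, the refusal would not depend on `merges`, so a contiguous page would be rejected once the limit was reached.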
2014-05-22 | mm/kmemleak-test.c: use pr_fmt for logging | Fabian Frederick
Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | fs/dlm/debug_fs.c: replace seq_printf by seq_puts | Fabian Frederick
Replace seq_printf where possible. This patch also fixes the following checkpatch warning "unnecessary whitespace before a quoted newline" Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Christine Caulfield <ccaulfie@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | fs/dlm/lockspace.c: convert simple_str to kstr | Fabian Frederick
Replace obsolete functions. Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Christine Caulfield <ccaulfie@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | fs/dlm/config.c: convert simple_str to kstr | Fabian Frederick
Replace the obsolete functions: simple_strtoul with kstrtouint and simple_strtol with kstrtoint (the kstr functions are __must_check, so the right function has to be applied). Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Christine Caulfield <ccaulfie@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
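To show why the kstr family is stricter, here is a user-space approximation of the kstrtouint() contract built on strtoul(). The name `parse_uint_strict` is hypothetical and the details are a sketch, not the kernel implementation: the whole string must be a number, one trailing newline is tolerated, and the result is 0 or a negative errno that the caller is expected to check.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Returns 0 and stores the value in *res, or -EINVAL / -ERANGE. */
int parse_uint_strict(const char *s, unsigned int base, unsigned int *res)
{
    char *end;
    unsigned long val;

    assert(s != NULL && res != NULL);

    /* strtoul() accepts leading blanks and a minus sign; reject both. */
    if (!isalnum((unsigned char)s[0]) && s[0] != '+')
        return -EINVAL;

    errno = 0;
    val = strtoul(s, &end, (int)base);
    if (end == s)
        return -EINVAL;              /* no digits at all */
    if (errno == ERANGE || val > UINT_MAX)
        return -ERANGE;              /* does not fit in unsigned int */

    if (*end == '\n')                /* allow one trailing newline */
        end++;
    if (*end != '\0')
        return -EINVAL;              /* trailing garbage */

    *res = (unsigned int)val;
    return 0;
}
```

A simple_strtoul()-style parser would instead stop at the first invalid character and hand back whatever it had parsed, so input such as "42abc" would be silently accepted as 42.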
2014-05-22 | nmi-provide-the-option-to-issue-an-nmi-back-trace-to-every-cpu-but-current-fix | Stephen Rothwell
undo C99ism Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Don Zickus <dzickus@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | nmi: provide the option to issue an NMI back trace to every cpu but current | Aaron Tomlin
Sometimes it is preferable not to use the trigger_all_cpu_backtrace() routine when one wants to avoid capturing a back trace for current, for instance if one was captured recently. This patch provides a new routine, namely trigger_allbutself_cpu_backtrace(), which offers the flexibility to issue an NMI to every cpu but current and capture a back trace accordingly. Patch x86 and sparc to support the new routine. [dzickus@redhat.com: add stub in #else clause] [dzickus@redhat.com: don't print message in single processor case, wrap with get/put_cpu based on Oleg's suggestion] Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | Documentation/filesystems/vfat.txt: update the limitation for fat fallocate | Namjae Jeon
Update the limitation for fat fallocate. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | fat: permit to return phy block number by fibmap in fallocated region | Namjae Jeon
Make the fibmap call return the proper physical block number for any offset request in the fallocated range. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | fat: fallback to buffered write in case of fallocated region on direct IO | Namjae Jeon
In the normal case of a direct IO write, trying to seek to a location greater than the file size makes it fall back to a buffered write to fill that region. Similarly, for a write into a fallocated region, make it fall back to a buffered write. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | fat: zero out seek range on _fat_get_block | Namjae Jeon
For normal buffered write operations, if we try to write to an offset greater than the file size, a cont_expand_zero is done up to that offset. In the case of fallocated regions, the blocks are already allocated, so zero out the buffers for those blocks up to the seek'ed offset. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | fat: add fat_fallocate operation | Namjae Jeon
Implement preallocation via the fallocate syscall on VFAT partitions. This patch is based on an earlier patch of the same name which had some issues, detailed below, and did not get accepted. Refer to https://lkml.org/lkml/2007/12/22/130. a) The preallocated space was not persistent when the FALLOC_FL_KEEP_SIZE flag was set. It will deallocate clusters at evict time. b) There was no need to zero out the clusters when the flag was set. Instead of doing an expanding truncate, just allocate clusters and add them to the fat chain. This reduces preallocation time. Compatibility with Windows: There are no issues when FALLOC_FL_KEEP_SIZE is not set, because it just does an expanding truncate. Thus reading from the preallocated area on Windows returns null until data is written to it. When a file with a preallocated area created using FALLOC_FL_KEEP_SIZE was written to on Windows, the Windows driver freed up the preallocated clusters and allocated new clusters for the new data. The freed-up clusters are reflected in the free space available for the partition, which can be seen in the volume properties. The Windows chkdsk tool also does not report any errors on a disk containing files with preallocated space. There is also no issue with the Linux fat fsck, because it discards preallocated clusters at repair time. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
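From user space, the FALLOC_FL_KEEP_SIZE case discussed above looks like the sketch below. It uses the standard Linux fallocate(2) call; the helper names `prealloc_keep_size` and `demo_prealloc` are hypothetical, and the demo runs on an anonymous temporary file on whatever filesystem holds the temp directory, not on a VFAT volume.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* fallocate() and FALLOC_FL_KEEP_SIZE are Linux-specific */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Reserve len bytes starting at offset 0 without changing the visible
 * file size. Returns 0 on success or a negative errno value. */
int prealloc_keep_size(int fd, off_t len)
{
    assert(len > 0);
    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, len) != 0)
        return -errno;
    return 0;
}

/* Preallocate 1 MiB on an empty temporary file.
 * Returns 0 if the size is still 0 afterwards, 1 if the filesystem
 * refused the request (for example with EOPNOTSUPP), -1 otherwise. */
int demo_prealloc(void)
{
    FILE *f = tmpfile();
    struct stat st;
    int result;

    if (f == NULL)
        return -1;
    if (prealloc_keep_size(fileno(f), 1 << 20) != 0)
        result = 1;
    else if (fstat(fileno(f), &st) == 0 && st.st_size == 0)
        result = 0;
    else
        result = -1;
    fclose(f);
    return result;
}
```

Without the flag (mode 0), the same call would extend the file size to cover the range, which corresponds to the expanding-truncate behaviour the message describes.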
2014-05-22 | fat: add i_disksize to represent uninitialized size | Namjae Jeon
Add i_disksize to represent the uninitialized allocated size; mmu_private represents the initialized allocated size. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | memcg-deprecate-memoryforce_empty-knob-fix | Andrew Morton
- s/pr_info/pr_info_once/ - fix garbled printk text Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | memcg: deprecate memory.force_empty knob | Michal Hocko
force_empty was introduced primarily to drop memory before it gets reparented on group removal. This alone doesn't sound fully justified, because reparented pages which are not in use can also be reclaimed later when there is memory pressure on the parent level. Mark the knob CFTYPE_INSANE, which tells the cgroup core that it shouldn't create the knob with the experimental sane_behavior. Other users will get informed about the deprecation and asked to tell us more, because I do not expect most users will use the sane_behavior cgroups mode very soon. Anyway, I expect that most users will simply be cgroup remove handlers which have always done that without having any good reason for it. If somebody really cares because reparented pages, which would be dropped otherwise, push out more important ones, then we should fix the reparenting code and put pages to the tail. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | mm: remap_file_pages: grab file ref to prevent race while mmaping | Sasha Levin
A file reference should be held while a file is mmaped, otherwise it might be freed while being used. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Suggested-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | mm: remap_file_pages: initialize populate before usage | Sasha Levin
'populate' wasn't initialized before being used in error paths, causing panics when mm_populate() would get called with invalid values. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | mm-replace-remap_file_pages-syscall-with-emulation-fix | Andrew Morton
fix spello Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Armin Rigo <arigo@tunes.org> Cc: Dave Jones <davej@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | mm: replace remap_file_pages() syscall with emulation | Kirill A. Shutemov
remap_file_pages(2) was invented to be able to efficiently map parts of a huge file into a limited 32-bit virtual address space, such as in database workloads. Nonlinear mappings are a pain to support and it seems there are no legitimate use-cases nowadays since 64-bit systems are widely available. Let's drop it and get rid of all this special-cased code.

The patch replaces the syscall with an emulation which creates a new VMA on each remap_file_pages(), unless it can be merged with an adjacent one.

I didn't find *any* real code that uses remap_file_pages(2) to test the emulation impact on. I've checked Debian code search and the source of all packages in ALT Linux. No real users: libc wrappers, mentions in strace, gdb, valgrind and this kind of stuff. There are a few basic tests in LTP for the syscall. They work just fine with the emulation.

To test the performance impact, I've written a small test case which demonstrates pretty much the worst case scenario: map a 4G shmfs file, write the pgoff of each page to the beginning of that page, remap the pages in reverse order, read every page. The test creates 1 million VMAs if the emulation is in use, so I had to set vm.max_map_count to 1100000 to avoid -ENOMEM.

Before: 23.3 ( +- 4.31% ) seconds
After: 43.9 ( +- 0.85% ) seconds
Slowdown: 1.88x

I believe we can live with that.

Test case:

    #define _GNU_SOURCE
    #include <assert.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define MB (1024UL * 1024)
    #define SIZE (4096 * MB)

    int main(int argc, char **argv)
    {
        unsigned long *p;
        long i, pass;

        for (pass = 0; pass < 10; pass++) {
            p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                perror("mmap");
                return -1;
            }

            /* store each page's index at the start of that page */
            for (i = 0; i < SIZE / 4096; i++)
                p[i * 4096 / sizeof(*p)] = i;

            /* remap the pages in reverse order */
            for (i = 0; i < SIZE / 4096; i++) {
                if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                     0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                    perror("remap_file_pages");
                    return -1;
                }
            }

            /* read every page and check that it was reversed */
            for (i = SIZE / 4096 - 1; i >= 0; i--)
                assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);

            munmap(p, SIZE);
        }

        return 0;
    }

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dave Jones <davej@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Armin Rigo <arigo@tunes.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
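The idea behind the emulation, one ordinary mapping per remapped range, can be sketched from user space with mmap(2) and MAP_FIXED. This is a conceptual illustration only, not the kernel-side code of the patch; `remap_emulated` and `demo_remap` are hypothetical names and the demo uses a two-page temporary file.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Make the pages at addr show file page pgoff by mapping over them.
 * Each call may create a new VMA. Returns 0 on success, -1 on failure. */
int remap_emulated(void *addr, size_t size, int fd, size_t pgoff,
                   size_t page_size)
{
    void *p;

    assert(page_size > 0 && size % page_size == 0);
    p = mmap(addr, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
             fd, (off_t)(pgoff * page_size));
    return p == MAP_FAILED ? -1 : 0;
}

/* Map a two-page file, then point the first window page at file page 1.
 * Returns 0 if both window pages then show file page 1, 1 if not,
 * -1 on setup failure. */
int demo_remap(void)
{
    long page = sysconf(_SC_PAGESIZE);
    FILE *f = tmpfile();
    char *win;
    int ok;

    if (f == NULL)
        return -1;
    if (page <= 0 || ftruncate(fileno(f), 2 * page) != 0) {
        fclose(f);
        return -1;
    }
    win = mmap(NULL, 2 * (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED,
               fileno(f), 0);
    if (win == MAP_FAILED) {
        fclose(f);
        return -1;
    }
    win[0] = 'A';       /* lands in file page 0 */
    win[page] = 'B';    /* lands in file page 1 */

    ok = remap_emulated(win, (size_t)page, fileno(f), 1, (size_t)page) == 0
         && win[0] == 'B' && win[page] == 'B';

    munmap(win, 2 * (size_t)page);
    fclose(f);
    return ok ? 0 : 1;
}
```

Remapping every page of a large region this way is what drives the VMA count up, which matches the need to raise vm.max_map_count in the benchmark described in the message.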
2014-05-22 | mm-mark-remap_file_pages-syscall-as-deprecated-fix | Andrew Morton
fix spello Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Armin Rigo <arigo@tunes.org> Cc: Dave Jones <davej@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | mm: mark remap_file_pages() syscall as deprecated | Kirill A. Shutemov
The remap_file_pages() system call is used to create a nonlinear mapping, that is, a mapping in which the pages of the file are mapped into a nonsequential order in memory. The advantage of using remap_file_pages() over using repeated calls to mmap(2) is that the former approach does not require the kernel to create additional VMA (Virtual Memory Area) data structures. Supporting nonlinear mappings requires a significant amount of non-trivial code in the kernel virtual memory subsystem, including hot paths. Also, to make nonlinear mappings work, the kernel needs a way to distinguish normal page table entries from entries with a file offset (pte_file). The kernel reserves a flag in the PTE for this purpose. PTE flags are a scarce resource, especially on some CPU architectures. It would be nice to free up the flag for other usage. Fortunately, there are not many users of remap_file_pages() in the wild. It's only known that one enterprise RDBMS implementation uses the syscall on 32-bit systems to map files bigger than can linearly fit into the 32-bit virtual address space. This use-case is not critical anymore since 64-bit systems are widely available. The plan is to deprecate the syscall and replace it with an emulation. The emulation will create new VMAs instead of nonlinear mappings. It's going to work slower for rare users of remap_file_pages() but the ABI is preserved. One side effect of the emulation (apart from performance) is that the user can hit the vm.max_map_count limit more easily due to the additional VMAs. See the comment for DEFAULT_MAX_MAP_COUNT for more details on the limit. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dave Jones <davej@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Armin Rigo <arigo@tunes.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | mm: memcontrol: remove unnecessary memcg argument from soft limit functions | Johannes Weiner
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | mm: memcontrol: clean up memcg zoneinfo lookup | Jianyu Zhan
Memcg zoneinfo lookup sites have either the page, the zone, or the node id and zone index, but sites that only have the zone have to look up the node id and zone index themselves, whereas sites that already have those two integers use a function for a simple pointer chase. Provide mem_cgroup_zone_zoneinfo() that takes a zone pointer and let sites that already have node id and zone index - all for each node, for each zone iterators - use &memcg->nodeinfo[nid]->zoneinfo[zid]. Rename page_cgroup_zoneinfo() to mem_cgroup_page_zoneinfo() to match. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | mm/memblock.c: call kmemleak directly from memblock_(alloc|free) | Catalin Marinas
Kmemleak could ignore memory blocks allocated via memblock_alloc() leading to false positives during scanning. This patch adds the corresponding callbacks and removes kmemleak_free_* calls in mm/nobootmem.c to avoid duplication. The kmemleak_alloc() in mm/nobootmem.c is kept since __alloc_memory_core_early() does not use memblock_alloc() directly. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | mm/mempool.c: update the kmemleak stack trace for mempool allocations | Catalin Marinas
When mempool_alloc() returns an existing pool object, kmemleak_alloc() is no longer called and the stack trace corresponds to the original object allocation. This patch updates the kmemleak allocation stack trace for such objects to make it more useful for debugging. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations | Catalin Marinas
Since radix_tree_preload() stack trace is not always useful for debugging an actual radix tree memory leak, this patch updates the kmemleak allocation stack trace in the radix_tree_node_alloc() function. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | mm: introduce kmemleak_update_trace() | Catalin Marinas
The memory allocation stack trace is not always useful for debugging a memory leak (e.g. radix_tree_preload). This function, when called, updates the stack trace for an already allocated object. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | mm/kmemleak.c: use %u to print ->checksum | Jianpeng Ma
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | vmscan: memcg: always use swappiness of the reclaimed memcg | Michal Hocko
Memory reclaim always uses swappiness of the reclaim target memcg (origin of the memory pressure) or vm_swappiness for global memory reclaim. This behavior was consistent (except for difference between global and hard limit reclaim) because swappiness was enforced to be consistent within each memcg hierarchy. After "mm: memcontrol: remove hierarchy restrictions for swappiness and oom_control" each memcg can have its own swappiness independent of hierarchical parents, though, so the consistency guarantee is gone. This can lead to an unexpected behavior. Say that a group is explicitly configured to not swapout by memory.swappiness=0 but its memory gets swapped out anyway when the memory pressure comes from its parent with a It is also unexpected that the knob is meaningless without setting the hard limit which would trigger the reclaim and enforce the swappiness. There are setups where the hard limit is configured higher in the hierarchy by an administrator and children groups are under control of somebody else who is interested in the swapout behavior but not necessarily about the memory limit. From a semantic point of view swappiness is an attribute defining anon vs. file proportional scanning of LRU which is memcg specific (unlike charges which are propagated up the hierarchy) so it should be applied to the particular memcg's LRU regardless where the memory pressure comes from. This patch removes vmscan_swappiness() and stores the swappiness into the scan_control structure. mem_cgroup_swappiness is then used to provide the correct value before shrink_lruvec is called. The global vm_swappiness is used for the root memcg. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
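The behavioural change can be captured in a small model. Everything here is hypothetical and much simpler than the kernel: `struct swp_memcg` stands in for a memcg, `TOY_VM_SWAPPINESS` stands in for the global vm_swappiness, and the two functions contrast "swappiness of the reclaim target" with "swappiness of the memcg whose LRU is being scanned".

```c
#include <assert.h>
#include <stddef.h>

#define TOY_VM_SWAPPINESS 60 /* stand-in for the global vm_swappiness */

struct swp_memcg {
    const struct swp_memcg *parent; /* NULL for the root memcg */
    int swappiness;                 /* per-group setting */
};

/* The root memcg uses the global value, other groups use their own. */
int swp_effective(const struct swp_memcg *memcg)
{
    assert(memcg != NULL);
    return memcg->parent ? memcg->swappiness : TOY_VM_SWAPPINESS;
}

/* Before: the value comes from the origin of the memory pressure. */
int swappiness_before(const struct swp_memcg *target,
                      const struct swp_memcg *scanned)
{
    (void)scanned;
    return swp_effective(target);
}

/* After: the value comes from the group that is actually scanned. */
int swappiness_after(const struct swp_memcg *target,
                     const struct swp_memcg *scanned)
{
    (void)target;
    return swp_effective(scanned);
}
```

With a child configured to swappiness 0 and pressure coming from its parent, `swappiness_before` yields the parent's value while `swappiness_after` yields 0, which is the case the message calls unexpected.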
2014-05-22 | vmscan: memcg: check whether the low limit should be ignored | Michal Hocko
Low-limit (aka guarantee) is ignored when there is no group scanned during the first round of __shrink_zone. This approach doesn't work when multiple reclaimers race and reclaim the same hierarchy (e.g. kswapd vs. direct reclaim or multiple tasks hitting the hard limit) because the memcg iterator makes sure that multiple reclaimers are interleaved in the hierarchy. This means that some reclaimers can see 0 scanned groups although there are groups which are above the low-limit and they were reclaimed on behalf of other reclaimers. This leads to a premature low-limit break. This patch adds mem_cgroup_all_within_guarantee() which will check whether all the groups in the reclaimed hierarchy are within their low limit, and shrink_zone will allow the fallback reclaim only when that is true. This alone is still not sufficient, however, because it would lead to another problem. If a reclaimer constantly fails to scan anything because it sees only groups within their guarantees while others do the reclaim, then the reclaim priority would drop down very quickly. shrink_zone has to be careful to preserve the "scan at least one group" semantic, so __shrink_zone has to be retried until at least one group is scanned. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | memcg: document memory.low_limit_in_bytes | Michal Hocko
Describe low_limit_in_bytes and its effect. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | memcg-doc-clarify-global-vs-limit-reclaims-fix | Michal Hocko
update doc as per Johannes Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | memcg, doc: clarify global vs. limit reclaims | Michal Hocko
Be explicit about global and hard limit reclaims in our documentation. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | memcg: allow setting low_limit | Michal Hocko
Export memory.low_limit_in_bytes knob with the same rules as the hard limit represented by limit_in_bytes knob (e.g. no limit to be set for the root cgroup). There is no memsw alternative for low_limit_in_bytes because the primary motivation behind this limit is to protect the working set of the group and so considering swap doesn't make much sense. There is also no kmem variant exported because we do not have any easy way to protect kernel allocations now. Please note that the low limit might exceed the hard limit which basically means that the group is not reclaimable if there is other reclaim target in the hierarchy under pressure. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | memcg-mm-introduce-lowlimit-reclaim-fix | Michal Hocko
Rename mem_cgroup_reclaim_eligible -> mem_cgroup_within_guarantee and follow_low_limit -> honor_memcg_guarantee, as suggested by Johannes. Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22 | memcg, mm: introduce lowlimit reclaim | Michal Hocko
Previous discussions have shown that soft limits cannot be reformed (http://lwn.net/Articles/555249/). This series introduces an alternative approach for protecting memory allocated to processes executing within a memory cgroup controller. It is based on a new tunable that was discussed with Johannes and Tejun during the kernel summit 2013 and at LSF 2014.

This patchset introduces such a low limit, which is functionally similar to a minimum guarantee. Memcgs which are under their lowlimit are not considered eligible for reclaim (both global and hardlimit) unless all groups under the reclaimed hierarchy are below the low limit, in which case all of them are considered eligible.

The previous version of the patchset, posted as an RFC (http://marc.info/?l=linux-mm&m=138677140628677&w=2), suggested a hard guarantee without any fallback. More discussions led me to reconsider the default behavior and come up with a more relaxed one. The hard requirement can be added later based on a use case which really requires it. It would be controlled by a memory.reclaim_flags knob which would specify whether to OOM or fall back (default) when all groups are below the low limit.

The default value of the limit is 0, so all groups are eligible by default and an interested party has to explicitly set the limit.

The primary use case is to protect an amount of memory allocated to a workload without it being reclaimed by an unrelated activity. In some cases this requirement can be fulfilled by mlock, but it is not suitable for many loads and generally requires application awareness. Such application awareness can be complex. It effectively forbids the use of memory overcommit, as the application must explicitly manage memory residency. With the low limit, such workloads can be placed in a memcg with a low limit that protects the estimated working set.

The hierarchical behavior of the lowlimit is described in the first patch. The second patch allows setting the lowlimit. The last 2 patches clarify documentation about the memcg reclaim in general (3rd patch) and low limit (4th patch).

This patch (of 5):

This patch introduces low limit reclaim. The low_limit acts as a reclaim protection because groups which are under their low_limit are considered ineligible for reclaim. While hardlimit protects from using more memory than allowed, lowlimit protects from getting below the memory assigned to the group due to external memory pressure.

More precisely, a group is considered eligible for reclaim under a specific hierarchy represented by its root only if the group is above its low limit and the same applies to all parents up the hierarchy to the root. Nevertheless, the limit still might be ignored if all groups under the reclaimed hierarchy are under their low limits. This favors avoiding OOM over protecting the memory.

Consider the following hierarchy with memory pressure coming from the group A (hard limit reclaim - l-low_limit_in_bytes, u-usage_in_bytes, h-limit_in_bytes):

          root_mem_cgroup
                 .
           _____/
          /
         A (l = 80 u=90 h=90)
        /
       / \_________
      /            \
     B (l=0 u=50)   C (l=50 u=40)
                     \
                      D (l=0 u=30)

A and B are reclaimable but C and D are not (D is protected by C).

The low_limit is 0 by default so every group is eligible. This patch doesn't provide a way to set the limit yet although the core infrastructure is there already.

Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
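The eligibility rule from the lowlimit commit message above can be modelled in a few lines of userspace Python. This is only an illustrative sketch, not the kernel code: the names Group, reclaim_eligible and reclaim_targets are hypothetical, "above its low limit" is taken to mean usage strictly greater than the limit, and a group that is not under the reclaim root is treated as ineligible (an assumption the message does not spell out). It reproduces the A/B/C/D example and the fallback taken when no group is eligible.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Group:
    """Hypothetical stand-in for a memory cgroup (not a kernel structure)."""
    name: str
    low: int                      # low_limit_in_bytes ("l" in the example)
    usage: int                    # usage_in_bytes ("u" in the example)
    parent: Optional["Group"] = None


def reclaim_eligible(group: Group, root: Group) -> bool:
    """Eligible only if the group and every ancestor up to and including
    the reclaim root are above their low limit."""
    node: Optional[Group] = group
    while node is not None:
        if node.usage <= node.low:   # at or under the low limit: protected
            return False
        if node is root:             # reached the reclaimed hierarchy's root
            return True
        node = node.parent
    # Assumption: a group outside the root's subtree is never a target.
    return False


def reclaim_targets(groups: List[Group], root: Group) -> List[Group]:
    """Return the eligible groups; if none are eligible, ignore the low
    limits and return every group (fallback instead of OOM)."""
    eligible = [g for g in groups if reclaim_eligible(g, root)]
    return eligible if eligible else list(groups)


# The hierarchy from the commit message, with pressure coming from A.
a = Group("A", low=80, usage=90)
b = Group("B", low=0, usage=50, parent=a)
c = Group("C", low=50, usage=40, parent=a)
d = Group("D", low=0, usage=30, parent=c)

targets = [g.name for g in reclaim_targets([a, b, c, d], a)]
print(targets)
```

With the values from the example only A and B are selected; D is skipped because its parent C is under its low limit.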
2014-05-22sysrq,rcu: suppress RCU stall warnings while sysrq runsRik van Riel
Some sysrq handlers can run for a long time, because they dump a lot of data onto a serial console. Having RCU stall warnings pop up in the middle of them only makes the problem worse. This patch temporarily disables RCU stall warnings while a sysrq request is handled. Signed-off-by: Rik van Riel <riel@redhat.com> Suggested-by: Paul McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Madper Xie <cxie@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22sysrq: rcu-ify __handle_sysrqRik van Riel
Echoing values into /proc/sysrq-trigger seems to be a popular way to get information out of the kernel. However, dumping information about thousands of processes, or hundreds of CPUs, to a serial console can result in IRQs being blocked for minutes, resulting in various kinds of cascade failures. The most common failure is due to interrupts being blocked for a very long time. This can lead to things like failed IO requests, and other things the system cannot easily recover from. This problem is easily fixable by making __handle_sysrq use RCU instead of spin_lock_irqsave. This leaves the warning that RCU grace periods have not elapsed for a long time, but the system will come back from that automatically. It also leaves sysrq-from-irq-context when the sysrq keys are pressed, but that is probably desired since people want that to work in situations where the system is already hosed. The callers of register_sysrq_key and unregister_sysrq_key appear to be capable of sleeping. Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Madper Xie <cxie@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22kernel/kprobes.c: convert printk to pr_foo()Fabian Frederick
Also fixes some checkpatch warnings: static initialization and lines over 80 characters. Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22x86,vdso: fix an OOPS accessing the hpet mapping w/o an hpetAndy Lutomirski
The oops can be triggered in qemu using -no-hpet (but not nohpet) by reading a couple of pages past the end of the vdso text. This should send SIGBUS instead of OOPSing. The bug was introduced by commit 7a59ed415f5b57469e22e41fc4188d5399e0b194 ("x86, vdso: Add 32 bit VDSO time support for 32 bit kernel", Author: Stefani Seibold <stefani@seibold.net>, Date: Mon Mar 17 23:22:09 2014 +0100), which is new in 3.15. This will be fixed separately in 3.15, but that patch will not apply to tip/x86/vdso. This is the equivalent fix for tip/x86/vdso and, presumably, 3.16. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stefani Seibold <stefani@seibold.net> Cc: <stable@vger.kernel.org> [needs rework for 3.15 and earlier] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-05-22rwsem-support-optimistic-spinning-fixDavidlohr Bueso
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>