summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2010-01-01Merge remote branch 'percpu/for-next'Stephen Rothwell
2010-01-01Merge remote branch 'limits/writable_limits'Stephen Rothwell
Conflicts: arch/x86/ia32/ia32entry.S arch/x86/include/asm/unistd_32.h arch/x86/include/asm/unistd_64.h arch/x86/kernel/syscall_table_32.S
2010-01-01Merge remote branch 'trivial/for-next'Stephen Rothwell
2010-01-01Merge remote branch 'slab/for-next'Stephen Rothwell
2009-12-30Merge branch 'slab/urgent' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: SLAB: Fix lockdep annotation breakage
2009-12-30mm: move sys_mmap_pgoff from util.cHugh Dickins
Move sys_mmap_pgoff() from mm/util.c to mm/mmap.c and mm/nommu.c, where we'd expect to find such code: especially now that it contains the MAP_HUGETLB handling. Revert mm/util.c to how it was in 2.6.32. This patch just ignores MAP_HUGETLB in the nommu case, as in 2.6.32, whereas 2.6.33-rc2 reported -ENOSYS. Perhaps validate_mmap_request() should reject it with -EINVAL? Add that later if necessary. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-28SLAB: Fix lockdep annotation breakagePekka Enberg
Commit ce79ddc8e2376a9a93c7d42daf89bfcbb9187e62 ("SLAB: Fix lockdep annotations for CPU hotplug") broke init_node_lock_keys() off-slab logic which causes lockdep false positives. Fix that up by reverting the logic back to original while keeping CPU hotplug fixes intact. Reported-and-tested-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reported-and-tested-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-12-24Merge branch 'sysctl' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc-2.6 * 'sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc-2.6: SYSCTL: Add a mutex to the page_alloc zone order sysctl SYSCTL: Print binary sysctl warnings (nearly) only once
2009-12-24Merge branch 'hwpoison' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: HWPOISON: Add PROC_FS dependency to hwpoison injector v2
2009-12-23SYSCTL: Add a mutex to the page_alloc zone order sysctlAndi Kleen
The zone list code clearly cannot tolerate concurrent writers (I couldn't find any locks for that), so simply add a global mutex. No need for RCU in this case. Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-12-22Merge branch 'merge' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (36 commits) powerpc/gc/wii: Remove get_irq_desc() powerpc/gc/wii: hlwd-pic: convert irq_desc.lock to raw_spinlock powerpc/gamecube/wii: Fix off-by-one error in ugecon/usbgecko_udbg powerpc/mpic: Fix problem that affinity is not updated powerpc/mm: Fix stupid bug in subpge protection handling powerpc/iseries: use DECLARE_COMPLETION_ONSTACK for non-constant completion powerpc: Fix MSI support on U4 bridge PCIe slot powerpc: Handle VSX alignment faults correctly in little-endian mode powerpc/mm: Fix typo of cpumask_clear_cpu() powerpc/mm: Fix hash_utils_64.c compile errors with DEBUG enabled. powerpc: Convert BUG() to use unreachable() powerpc/pseries: Make declarations of cpu_hotplug_driver_lock() ANSI compatible. powerpc/pseries: Don't panic when H_PROD fails during cpu-online. powerpc/mm: Fix a WARN_ON() with CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_VM powerpc/defconfigs: Set HZ=100 on pseries and ppc64 defconfigs powerpc/defconfigs: Disable token ring in powerpc defconfigs powerpc/defconfigs: Reduce 64bit vmlinux by making acenic and cramfs modules powerpc/pseries: Select XICS and PCI_MSI PSERIES powerpc/85xx: Wrong variable returned on error powerpc/iseries: Convert to proc_fops ...
2009-12-22MM: use helpers for rlimitsJiri Slaby
Make sure compiler won't do weird things with limits. E.g. fetching them twice may return 2 different values after writable limits are implemented. I.e. either use newly added rlimit helpers or ACCESS_ONCE if not applicable. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: James Morris <jmorris@namei.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: linux-mm@kvack.org
2009-12-21HWPOISON: Add PROC_FS dependency to hwpoison injector v2Andi Kleen
The injector filter requires stable_page_flags() which is supplied by procfs. So make it dependent on that. Also add ifdefs around the filter code in memory-failure.c so that when the filter is disabled due to missing dependencies the whole code still builds. Reported-by: Ingo Molnar Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-12-21Merge branch 'master' into percpuTejun Heo
Conflicts: arch/powerpc/platforms/pseries/hvCall.S include/linux/percpu.h
2009-12-20SLUB: Make slub statistics use this_cpu_incChristoph Lameter
this_cpu_inc() translates into a single instruction on x86 and does not need any register. So use it in stat(). We also want to avoid the calculation of the per cpu kmem_cache_cpu structure pointer. So pass a kmem_cache pointer instead of a kmem_cache_cpu pointer. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-12-20SLUB: this_cpu: Remove slub kmem_cache fieldsChristoph Lameter
Remove the fields in struct kmem_cache_cpu that were used to cache data from struct kmem_cache when they were in different cachelines. The cacheline that holds the per cpu array pointer now also holds these values. We can cut down the struct kmem_cache_cpu size to almost half. The get_freepointer() and set_freepointer() functions that used to be only intended for the slow path now are also useful for the hot path since access to the size field does not require accessing an additional cacheline anymore. This results in consistent use of functions for setting the freepointer of objects throughout SLUB. Also we initialize all possible kmem_cache_cpu structures when a slab is created. No need to initialize them when a processor or node comes online. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-12-20SLUB: Get rid of dynamic DMA kmalloc cache allocationChristoph Lameter
Dynamic DMA kmalloc cache allocation is troublesome since the new percpu allocator does not support allocations in atomic contexts. Reserve some statically allocated kmalloc_cpu structures instead. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-12-20SLUB: Use this_cpu operations in slubChristoph Lameter
Using per cpu allocations removes the needs for the per cpu arrays in the kmem_cache struct. These could get quite big if we have to support systems with thousands of cpus. The use of this_cpu_xx operations results in: 1. The size of kmem_cache for SMP configuration shrinks since we will only need 1 pointer instead of NR_CPUS. The same pointer can be used by all processors. Reduces cache footprint of the allocator. 2. We can dynamically size kmem_cache according to the actual nodes in the system meaning less memory overhead for configurations that may potentially support up to 1k NUMA nodes / 4k cpus. 3. We can remove the diddle widdle with allocating and releasing of kmem_cache_cpu structures when bringing up and shutting down cpus. The cpu alloc logic will do it all for us. Removes some portions of the cpu hotplug functionality. 4. Fastpath performance increases since per cpu pointer lookups and address calculations are avoided. V7-V8 - Convert missed get_cpu_slab() under CONFIG_SLUB_STATS Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-12-19Fix misspelling of "should" and "shouldn't" in comments.Adam Buchbinder
Some comments misspell "should" or "shouldn't"; this fixes them. No code changes. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-12-19Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, irq: Allow 0xff for /proc/irq/[n]/smp_affinity on an 8-cpu system Makefile: Unexport LC_ALL instead of clearing it x86: Fix objdump version check in arch/x86/tools/chkobjdump.awk x86: Reenable TSC sync check at boot, even with NONSTOP_TSC x86: Don't use POSIX character classes in gen-insn-attr-x86.awk Makefile: set LC_CTYPE, LC_COLLATE, LC_NUMERIC to C x86: Increase MAX_EARLY_RES; insufficient on 32-bit NUMA x86: Fix checking of SRAT when node 0 ram is not from 0 x86, cpuid: Add "volatile" to asm in native_cpuid() x86, msr: msrs_alloc/free for CONFIG_SMP=n x86, amd: Get multi-node CPU info from NodeId MSR instead of PCI config space x86: Add IA32_TSC_AUX MSR and use it x86, msr/cpuid: Register enough minors for the MSR and CPUID drivers initramfs: add missing decompressor error check bzip2: Add missing checks for malloc returning NULL bzip2/lzma/gzip: pre-boot malloc doesn't return NULL on failure
2009-12-18mm: Add notifier in pageblock isolation for balloon driversRobert Jennings
Memory balloon drivers can allocate a large amount of memory which is not movable but could be freed to accomodate memory hotplug remove. Prior to calling the memory hotplug notifier chain the memory in the pageblock is isolated. Currently, if the migrate type is not MIGRATE_MOVABLE the isolation will not proceed, causing the memory removal for that page range to fail. Rather than failing pageblock isolation if the migrateteype is not MIGRATE_MOVABLE, this patch checks if all of the pages in the pageblock, and not on the LRU, are owned by a registered balloon driver (or other entity) using a notifier chain. If all of the non-movable pages are owned by a balloon, they can be freed later through the memory notifier chain and the range can still be isolated in set_migratetype_isolate(). Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Ingo Molnar <mingo@elte.hu> Cc: Brian King <brking@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Gerald Schaefer <geralds@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-12-17Merge branch 'cpumask-cleanups' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * 'cpumask-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: cpumask: rename tsk_cpumask to tsk_cpus_allowed cpumask: don't recommend set_cpus_allowed hack in Documentation/cpu-hotplug.txt cpumask: avoid dereferencing struct cpumask cpumask: convert drivers/idle/i7300_idle.c to cpumask_var_t cpumask: use modern cpumask style in drivers/scsi/fcoe/fcoe.c cpumask: avoid deprecated function in mm/slab.c cpumask: use cpu_online in kernel/perf_event.c
2009-12-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: Keys: KEYCTL_SESSION_TO_PARENT needs TIF_NOTIFY_RESUME architecture support NOMMU: Optimise away the {dac_,}mmap_min_addr tests security/min_addr.c: make init_mmap_min_addr() static keys: PTR_ERR return of wrong pointer in keyctl_get_security()
2009-12-17Merge branch 'kmemleak' of git://linux-arm.org/linux-2.6Linus Torvalds
* 'kmemleak' of git://linux-arm.org/linux-2.6: kmemleak: fix kconfig for crc32 build error kmemleak: Reduce the false positives by checking for modified objects kmemleak: Show the age of an unreferenced object kmemleak: Release the object lock before calling put_object() kmemleak: Scan the _ftrace_events section in modules kmemleak: Simplify the kmemleak_scan_area() function prototype kmemleak: Do not use off-slab management with SLAB_NOLEAKTRACE
2009-12-17readahead: add blk_run_backing_devHisashi Hifumi
I added blk_run_backing_dev on page_cache_async_readahead so readahead I/O is unpluged to improve throughput on especially RAID environment. The normal case is, if page N become uptodate at time T(N), then T(N) <= T(N+1) holds. With RAID (and NFS to some degree), there is no strict ordering, the data arrival time depends on runtime status of individual disks, which breaks that formula. So in do_generic_file_read(), just after submitting the async readahead IO request, the current page may well be uptodate, so the page won't be locked, and the block device won't be implicitly unplugged: if (PageReadahead(page)) page_cache_async_readahead() if (!PageUptodate(page)) goto page_not_up_to_date; //... page_not_up_to_date: lock_page_killable(page); Therefore explicit unplugging can help. Following is the test result with dd. #dd if=testdir/testfile of=/dev/null bs=16384 -2.6.30-rc6 1048576+0 records in 1048576+0 records out 17179869184 bytes (17 GB) copied, 224.182 seconds, 76.6 MB/s -2.6.30-rc6-patched 1048576+0 records in 1048576+0 records out 17179869184 bytes (17 GB) copied, 206.465 seconds, 83.2 MB/s (7Disks RAID-0 Array) -2.6.30-rc6 1054976+0 records in 1054976+0 records out 17284726784 bytes (17 GB) copied, 212.233 seconds, 81.4 MB/s -2.6.30-rc6-patched 1054976+0 records out 17284726784 bytes (17 GB) copied, 198.878 seconds, 86.9 MB/s (7Disks RAID-5 Array) The patch was found to improve performance with the SCST scsi target driver. See http://sourceforge.net/mailarchive/forum.php?thread_name=a0272b440906030714g67eabc5k8f847fb1e538cc62%40mail.gmail.com&forum_name=scst-devel [akpm@linux-foundation.org: unbust comment layout] [akpm@linux-foundation.org: "fix" CONFIG_BLOCK=n] Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Ronald <intercommit@gmail.com> Cc: Bart Van Assche <bart.vanassche@gmail.com> Cc: Vladislav Bolkhovitin <vst@vlnb.net> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-17cpumask: avoid deprecated function in mm/slab.cRusty Russell
These days we use cpumask_empty() which takes a pointer. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Christoph Lameter <cl@linux-foundation.org>
2009-12-16Fix breakage in shmem.cAl Viro
Replacing error = 0; if (error) op with nothing is not quite an equivalent transformation ;-) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-16x86: Fix checking of SRAT when node 0 ram is not from 0Yinghai Lu
Found one system that boot from socket1 instead of socket0, SRAT get rejected... [ 0.000000] SRAT: Node 1 PXM 0 0-a0000 [ 0.000000] SRAT: Node 1 PXM 0 100000-80000000 [ 0.000000] SRAT: Node 1 PXM 0 100000000-2080000000 [ 0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000 [ 0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000 [ 0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000 [ 0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000 [ 0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000 [ 0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000 [ 0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000 ... [ 0.000000] NUMA: Allocated memnodemap from 500000 - 701040 [ 0.000000] NUMA: Using 20 for the hash shift. [ 0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used [ 0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used [ 0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used [ 0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used [ 0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used [ 0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used [ 0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used [ 0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used [ 0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used [ 0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used [ 0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used. [ 0.000000] SRAT: SRAT not used. the early_node_map is not sorted because node0 with non zero start come first. so try to sort it right away after all regions are registered. also fixs refression by 8716273c (x86: Export srat physical topology) -v2: make it more solid to handle cross node case like node0 [0,4g), [8,12g) and node1 [4g, 8g), [12g, 16g) -v3: update comments. Reported-and-tested-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4B2579D2.3010201@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-12-17NOMMU: Optimise away the {dac_,}mmap_min_addr testsDavid Howells
In NOMMU mode clamp dac_mmap_min_addr to zero to cause the tests on it to be skipped by the compiler. We do this as the minimum mmap address doesn't make any sense in NOMMU mode. mmap_min_addr and round_hint_to_min() can be discarded entirely in NOMMU mode. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2009-12-16Merge branch 'hwpoison' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits) HWPOISON: Remove stray phrase in a comment HWPOISON: Try to allocate migration page on the same node HWPOISON: Don't do early filtering if filter is disabled HWPOISON: Add a madvise() injector for soft page offlining HWPOISON: Add soft page offline support HWPOISON: Undefine short-hand macros after use to avoid namespace conflict HWPOISON: Use new shake_page in memory_failure HWPOISON: Use correct name for MADV_HWPOISON in documentation HWPOISON: mention HWPoison in Kconfig entry HWPOISON: Use get_user_page_fast in hwpoison madvise HWPOISON: add an interface to switch off/on all the page filters HWPOISON: add memory cgroup filter memcg: add accessor to mem_cgroup.css memcg: rename and export try_get_mem_cgroup_from_page() HWPOISON: add page flags filter mm: export stable page flags HWPOISON: limit hwpoison injector to known page types HWPOISON: add fs/device filters HWPOISON: return 0 to indicate success reliably HWPOISON: make semantics of IGNORED/DELAYED clear ...
2009-12-16Merge branch 'master' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (38 commits) direct I/O fallback sync simplification ocfs: stop using do_sync_mapping_range cleanup blockdev_direct_IO locking make generic_acl slightly more generic sanitize xattr handler prototypes libfs: move EXPORT_SYMBOL for d_alloc_name vfs: force reval of target when following LAST_BIND symlinks (try #7) ima: limit imbalance msg Untangling ima mess, part 3: kill dead code in ima Untangling ima mess, part 2: deal with counters Untangling ima mess, part 1: alloc_file() O_TRUNC open shouldn't fail after file truncation ima: call ima_inode_free ima_inode_free IMA: clean up the IMA counts updating code ima: only insert at inode creation time ima: valid return code from ima_inode_alloc fs: move get_empty_filp() deffinition to internal.h Sanitize exec_permission_lite() Kill cached_lookup() and real_lookup() Kill path_lookup_open() ... Trivial conflicts in fs/direct-io.c
2009-12-16direct I/O fallback sync simplificationChristoph Hellwig
In the case of direct I/O falling back to buffered I/O we sync data twice currently: once at the end of generic_file_buffered_write using filemap_write_and_wait_range and once a little later in __generic_file_aio_write using do_sync_mapping_range with all flags set. The wait before write of the do_sync_mapping_range call does not make any sense, so just keep the filemap_write_and_wait_range call and move it to the right spot. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-16make generic_acl slightly more genericChristoph Hellwig
Now that we cache the ACL pointers in the generic inode all the generic_acl cruft can go away and generic_acl.c can directly implement xattr handlers dealing with the full Posix ACL semantics for in-memory filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-16sanitize xattr handler prototypesChristoph Hellwig
Add a flags argument to struct xattr_handler and pass it to all xattr handler methods. This allows using the same methods for multiple handlers, e.g. for the ACL methods which perform exactly the same action for the access and default ACLs, just using a different underlying attribute. With a little more groundwork it'll also allow sharing the methods for the regular user/trusted/secure handlers in extN, ocfs2 and jffs2 like it's already done for xfs in this patch. Also change the inode argument to the handlers to a dentry to allow using the handlers mechnism for filesystems that require it later, e.g. cifs. [with GFS2 bits updated by Steven Whitehouse <swhiteho@redhat.com>] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jmorris@namei.org> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-16Untangling ima mess, part 1: alloc_file()Al Viro
There are 2 groups of alloc_file() callers: * ones that are followed by ima_counts_get * ones giving non-regular files So let's pull that ima_counts_get() into alloc_file(); it's a no-op in case of non-regular files. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-16switch alloc_file() to passing struct pathAl Viro
... and have the caller grab both mnt and dentry; kill leak in infiniband, while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-16switch shmem_file_setup() to alloc_file()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-16memcg: code clean, remove unused variable in mem_cgroup_resize_limit()Bob Liu
Variable `progress' isn't used in mem_cgroup_resize_limit() any more. Remove it. [akpm@linux-foundation.org: cleanup] Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: remove memcg_tasklistDaisuke Nishimura
memcg_tasklist was introduced at commit 7f4d454d(memcg: avoid deadlock caused by race between oom and cpuset_attach) instead of cgroup_mutex to fix a deadlock problem. The cgroup_mutex, which was removed by the commit, in mem_cgroup_out_of_memory() was originally introduced at commit c7ba5c9e (Memory controller: OOM handling). IIUC, the intention of this cgroup_mutex was to prevent task move during select_bad_process() so that situations like below can be avoided. Assume cgroup "foo" has exceeded its limit and is about to trigger oom. 1. Process A, which has been in cgroup "baa" and uses large memory, is just moved to cgroup "foo". Process A can be the candidates for being killed. 2. Process B, which has been in cgroup "foo" and uses large memory, is just moved from cgroup "foo". Process B can be excluded from the candidates for being killed. But these race window exists anyway even if we hold a lock, because __mem_cgroup_try_charge() decides wether it should trigger oom or not outside of the lock. So the original cgroup_mutex in mem_cgroup_out_of_memory and thus current memcg_tasklist has no use. And IMHO, those races are not so critical for users. This patch removes it and make codes simpler. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: avoid oom-killing innocent task in case of use_hierarchyDaisuke Nishimura
task_in_mem_cgroup(), which is called by select_bad_process() to check whether a task can be a candidate for being oom-killed from memcg's limit, checks "curr->use_hierarchy"("curr" is the mem_cgroup the task belongs to). But this check return true(it's false positive) when: <some path>/aa use_hierarchy == 0 <- hitting limit <some path>/aa/00 use_hierarchy == 1 <- the task belongs to This leads to killing an innocent task in aa/00. This patch is a fix for this bug. And this patch also fixes the arg for mem_cgroup_print_oom_info(). We should print information of mem_cgroup which the task being killed, not current, belongs to. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: cleanup mem_cgroup_move_parent()Daisuke Nishimura
mem_cgroup_move_parent() calls try_charge first and cancel_charge on failure. IMHO, charge/uncharge(especially charge) is high cost operation, so we should avoid it as far as possible. This patch tries to delay try_charge in mem_cgroup_move_parent() by re-ordering checks it does. And this patch renames mem_cgroup_move_account() to __mem_cgroup_move_account(), changes the return value of __mem_cgroup_move_account() from int to void, and adds a new wrapper(mem_cgroup_move_account()), which checks whether a @pc is valid for moving account and calls __mem_cgroup_move_account(). This patch removes the last caller of trylock_page_cgroup(), so removes its definition too. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: add mem_cgroup_cancel_charge()Daisuke Nishimura
There are some places calling both res_counter_uncharge() and css_put() to cancel the charge and the refcnt we have got by mem_cgroup_tyr_charge(). This patch introduces mem_cgroup_cancel_charge() and call it in those places. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: make memcg's file mapped consistent with global VMKAMEZAWA Hiroyuki
In global VM, FILE_MAPPED is used but memcg uses MAPPED_FILE. This makes grep difficult. Replace memcg's MAPPED_FILE with FILE_MAPPED And in global VM, mapped shared memory is accounted into FILE_MAPPED. But memcg doesn't. fix it. Note: page_is_file_cache() just checks SwapBacked or not. So, we need to check PageAnon. Cc: Balbir Singh <balbir@in.ibm.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: coalesce charging via percpu storageKAMEZAWA Hiroyuki
This is a patch for coalescing access to res_counter at charging by percpu caching. At charge, memcg charges 64pages and remember it in percpu cache. Because it's cache, drain/flush if necessary. This version uses public percpu area. 2 benefits for using public percpu area. 1. Sum of stocked charge in the system is limited to # of cpus not to the number of memcg. This shows better synchonization. 2. drain code for flush/cpuhotplug is very easy (and quick) The most important point of this patch is that we never touch res_counter in fast path. The res_counter is system-wide shared counter which is modified very frequently. We shouldn't touch it as far as we can for avoiding false sharing. On x86-64 8cpu server, I tested overheads of memcg at page fault by running a program which does map/fault/unmap in a loop. Running a task per a cpu by taskset and see sum of the number of page faults in 60secs. [without memcg config] 40156968 page-faults # 0.085 M/sec ( +- 0.046% ) 27.67 cache-miss/faults [root cgroup] 36659599 page-faults # 0.077 M/sec ( +- 0.247% ) 31.58 cache miss/faults [in a child cgroup] 18444157 page-faults # 0.039 M/sec ( +- 0.133% ) 69.96 cache miss/faults [ + coalescing uncharge patch] 27133719 page-faults # 0.057 M/sec ( +- 0.155% ) 47.16 cache miss/faults [ + coalescing uncharge patch + this patch ] 34224709 page-faults # 0.072 M/sec ( +- 0.173% ) 34.69 cache miss/faults Changelog (since Oct/2): - updated comments - replaced get_cpu_var() with __get_cpu_var() if possible. - removed mutex for system-wide drain. adds a counter instead of it. - removed CONFIG_HOTPLUG_CPU Changelog (old): - rebased onto the latest mmotm - moved charge size check before __GFP_WAIT check for avoiding unnecesary - added asynchronous flush routine. - fixed bugs pointed out by Nishimura-san. [akpm@linux-foundation.org: tweak comments] [nishimura@mxp.nes.nec.co.jp: don't do INIT_WORK() repeatedly against the same work_struct] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: coalesce uncharge during unmap/truncateKAMEZAWA Hiroyuki
In massive parallel enviroment, res_counter can be a performance bottleneck. One strong techinque to reduce lock contention is reducing calls by coalescing some amount of calls into one. Considering charge/uncharge chatacteristic, - charge is done one by one via demand-paging. - uncharge is done by - in chunk at munmap, truncate, exit, execve... - one by one via vmscan/paging. It seems we have a chance to coalesce uncharges for improving scalability at unmap/truncation. This patch is a for coalescing uncharge. For avoiding scattering memcg's structure to functions under /mm, this patch adds memcg batch uncharge information to the task. A reason for per-task batching is for making use of caller's context information. We do batched uncharge (deleyed uncharge) when truncation/unmap occurs but do direct uncharge when uncharge is called by memory reclaim (vmscan.c). The degree of coalescing depends on callers - at invalidate/trucate... pagevec size - at unmap ....ZAP_BLOCK_SIZE (memory itself will be freed in this degree.) Then, we'll not coalescing too much. On x86-64 8cpu server, I tested overheads of memcg at page fault by running a program which does map/fault/unmap in a loop. Running a task per a cpu by taskset and see sum of the number of page faults in 60secs. [without memcg config] 40156968 page-faults # 0.085 M/sec ( +- 0.046% ) 27.67 cache-miss/faults [root cgroup] 36659599 page-faults # 0.077 M/sec ( +- 0.247% ) 31.58 miss/faults [in a child cgroup] 18444157 page-faults # 0.039 M/sec ( +- 0.133% ) 69.96 miss/faults [child with this patch] 27133719 page-faults # 0.057 M/sec ( +- 0.155% ) 47.16 miss/faults We can see some amounts of improvement. (root cgroup doesn't affected by this patch) Another patch for "charge" will follow this and above will be improved more. Changelog(since 2009/10/02): - renamed filed of memcg_batch (as pages to bytes, memsw to memsw_bytes) - some clean up and commentary/description updates. - added initialize code to copy_process(). (possible bug fix) Changelog(old): - fixed !CONFIG_MEM_CGROUP case. - rebased onto the latest mmotm + softlimit fix patches. - unified patch for callers - added commetns. - make ->do_batch as bool. - removed css_get() at el. We don't need it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16memcg: fix memory.memsw.usage_in_bytes for root cgroupKirill A. Shutemov
A memory cgroup has a memory.memsw.usage_in_bytes file. It shows the sum of the usage of pages and swapents in the cgroup. Presently the root cgroup's memsw.usage_in_bytes shows the wrong value - the number of swapents are not added. So take MEM_CGROUP_STAT_SWAPOUT into account. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16oom-kill: fix NUMA constraint check with nodemaskKAMEZAWA Hiroyuki
Fix node-oriented allocation handling in oom-kill.c I myself think of this as a bugfix not as an ehnancement. In these days, things are changed as - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask(). - mempolicy don't maintain its own private zonelists. (And cpuset doesn't use nodemask for __alloc_pages_nodemask()) So, current oom-killer's check function is wrong. This patch does - check nodemask, if nodemask && nodemask doesn't cover all node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY. - Scan all zonelist under nodemask, if it hits cpuset's wall this faiulre is from cpuset. And - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE. This doesn't change "current" behavior. If callers use __GFP_THISNODE it should handle "page allocation failure" by itself. - handle __GFP_NOFAIL+__GFP_THISNODE path. This is something like a FIXME but this gfpmask is not used now. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16oom-kill: show virtual size and rss information of the killed processKOSAKI Motohiro
In a typical oom analysis scenario, we frequently want to know whether the killed process has a memory leak or not at the first step. This patch adds vsz and rss information to the oom log to help this analysis. To save time for the debugging. example: =================================================================== rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0 Pid: 1308, comm: rsyslogd Not tainted 2.6.32-rc6 #24 Call Trace: [<ffffffff8132e35b>] ?_spin_unlock+0x2b/0x40 [<ffffffff810f186e>] oom_kill_process+0xbe/0x2b0 (snip) 492283 pages non-shared Out of memory: kill process 2341 (memhog) score 527276 or a child Killed process 2341 (memhog) vsz:1054552kB, anon-rss:970588kB, file-rss:4kB =========================================================================== ^ | here [rientjes@google.com: fix race, add pid & comm to message] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16HWPOISON: Remove stray phrase in a commentAndi Kleen
Better to have complete sentences. Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-12-16HWPOISON: Try to allocate migration page on the same nodeAndi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com>