summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2008-12-01memcg: memory hotplug fix for notifier callbackKAMEZAWA Hiroyuki
Fixes for memcg/memory hotplug. While memory hotplug allocate/free memmap, page_cgroup doesn't free page_cgroup at OFFLINE when page_cgroup is allocated via bootomem. (Because freeing bootmem requires special care.) Then, if page_cgroup is allocated by bootmem and memmap is freed/allocated by memory hotplug, page_cgroup->page == page is no longer true. But current MEM_ONLINE handler doesn't check it and update page_cgroup->page if it's not necessary to allocate page_cgroup. (This was not found because memmap is not freed if SPARSEMEM_VMEMMAP is y.) And I noticed that MEM_ONLINE can be called against "part of section". So, freeing page_cgroup at CANCEL_ONLINE will cause trouble. (freeing used page_cgroup) Don't rollback at CANCEL. One more, current memory hotplug notifier is stopped by slub because it sets NOTIFY_STOP_MASK to return vaule. So, page_cgroup's callback never be called. (low priority than slub now.) I think this slub's behavior is not intentional(BUG). and fixes it. Another way to be considered about page_cgroup allocation: - free page_cgroup at OFFLINE even if it's from bootmem and remove specieal handler. But it requires more changes. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=12041 Signed-off-by: KAMEZAWA Hiruyoki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Tested-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01mm: vmalloc fix lazy unmapping cache aliasingNick Piggin
Jim Radford has reported that the vmap subsystem rewrite was sometimes causing his VIVT ARM system to behave strangely (seemed like going into infinite loops trying to fault in pages to userspace). We determined that the problem was most likely due to a cache aliasing issue. flush_cache_vunmap was only being called at the moment the page tables were to be taken down, however with lazy unmapping, this can happen after the page has subsequently been freed and allocated for something else. The dangling alias may still have dirty data attached to it. The fix for this problem is to do the cache flushing when the caller has called vunmap -- it would be a bug for them to write anything else to the mapping at that point. That appeared to solve Jim's problems. Reported-by: Jim Radford <radford@blackbean.org> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01vmscan: protect zone rotation stats by lru lockJohannes Weiner
The zone's rotation statistics must not be accessed without the corresponding LRU lock held. Fix an unprotected write in shrink_active_list(). Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-30meminit section warningsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19vmscan: fix get_scan_ratio() commentRik van Riel
Fix the old comment on the scan ratio calculations. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19vmscan: let GFP_NOFS go to swap againHugh Dickins
In the past, GFP_NOFS (but of course not GFP_NOIO) was allowed to reclaim by writing to swap. That got partially broken in 2.6.23, when may_enter_fs initialization was moved up before the allocation of swap, so its PageSwapCache test was failing the first time around, Fix it by setting may_enter_fs when add_to_swap() succeeds with __GFP_IO. In fact, check __GFP_IO before calling add_to_swap(): allocating swap we're not ready to use just increases disk seeking. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19migration: fix writepage errorHugh Dickins
Page migration's writeout() has got understandably confused by the nasty AOP_WRITEPAGE_ACTIVATE case: as in normal success, a writepage() error has unlocked the page, so writeout() then needs to relock it. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19mm: vmalloc search restart fixGlauber Costa
Current vmalloc restart search for a free area in case we can't find one. The reason is there are areas which are lazily freed, and could be possibly freed now. However, current implementation start searching the tree from the last failing address, which is pretty much by definition at the end of address space. So, we fail. The proposal of this patch is to restart the search from the beginning of the requested vstart address. This fixes the regression in running KVM virtual machines for me, described in http://lkml.org/lkml/2008/10/28/349, caused by commit db64fe02258f1507e13fe5212a989922323685ce. Signed-off-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19mm: vmalloc failure flush fixNick Piggin
An initial vmalloc failure should start off a synchronous flush of lazy areas, in case someone is in progress flushing them already, which could cause us to return an allocation failure even if there is plenty of KVA free. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19mm: vmalloc allocator off by oneNick Piggin
Fix off by one bug in the KVA allocator that can leave gaps in the address space. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19cpuset: update top cpuset's mems after adding a nodeMiao Xie
After adding a node into the machine, top cpuset's mems isn't updated. By reviewing the code, we found that the update function cpuset_track_online_nodes() was invoked after node_states[N_ONLINE] changes. It is wrong because N_ONLINE just means node has pgdat, and if node has/added memory, we use N_HIGH_MEMORY. So, We should invoke the update function after node_states[N_HIGH_MEMORY] changes, just like its commit says. This patch fixes it. And we use notifier of memory hotplug instead of direct calling of cpuset_track_online_nodes(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Paul Menage <menage@google.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-16unitialized return value in mm/mlock.c: __mlock_vma_pages_range()Helge Deller
Fix an unitialized return value when compiling on parisc (with CONFIG_UNEVICTABLE_LRU=y): mm/mlock.c: In function `__mlock_vma_pages_range': mm/mlock.c:165: warning: `ret' might be used uninitialized in this function Signed-off-by: Helge Deller <deller@gmx.de> [ It isn't ever really used uninitialized, since no caller should ever call this function with an empty range. But the compiler is correct that from a local analysis standpoint that is impossible to see, and fixing the warning is appropriate. ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-15mm: remove unevictable's show_page_pathKOSAKI Motohiro
Hugh Dickins reported show_page_path() is buggy and unsafe because - lack dput() against d_find_alias() - don't concern vma->vm_mm->owner == NULL - lack lock_page() it was only for debugging, so rather than trying to fix it, just remove it now. Reported-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Lee Schermerhorn <Lee.Schermerhorn@hp.com> CC: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-12memcg: bugfix for memory hotplugKAMEZAWA Hiroyuki
The start pfn calculation in page_cgroup's memory hotplug notifier chain is wrong. Tested-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-12mm: remove lru_add_drain_all() from the munlock pathKOSAKI Motohiro
lockdep warns about following message at boot time on one of my test machine. Then, schedule_on_each_cpu() sholdn't be called when the task have mmap_sem. Actually, lru_add_drain_all() exist to prevent the unevictalble pages stay on reclaimable lru list. but currenct unevictable code can rescue unevictable pages although it stay on reclaimable list. So removing is better. In addition, this patch add lru_add_drain_all() to sys_mlock() and sys_mlockall(). it isn't must. but it reduce the failure of moving to unevictable list. its failure can rescue in vmscan later. but reducing is better. Note, if above rescuing happend, the Mlocked and the Unevictable field mismatching happend in /proc/meminfo. but it doesn't cause any real trouble. ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.28-rc2-mm1 #2 ------------------------------------------------------- lvm/1103 is trying to acquire lock: (&cpu_hotplug.lock){--..}, at: [<c0130789>] get_online_cpus+0x29/0x50 but task is already holding lock: (&mm->mmap_sem){----}, at: [<c01878ae>] sys_mlockall+0x4e/0xb0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&mm->mmap_sem){----}: [<c0153da2>] check_noncircular+0x82/0x110 [<c0185e6a>] might_fault+0x4a/0xa0 [<c0156161>] validate_chain+0xb11/0x1070 [<c0185e6a>] might_fault+0x4a/0xa0 [<c0156923>] __lock_acquire+0x263/0xa10 [<c015714c>] lock_acquire+0x7c/0xb0 (*) grab mmap_sem [<c0185e6a>] might_fault+0x4a/0xa0 [<c0185e9b>] might_fault+0x7b/0xa0 [<c0185e6a>] might_fault+0x4a/0xa0 [<c0294dd0>] copy_to_user+0x30/0x60 [<c01ae3ec>] filldir+0x7c/0xd0 [<c01e3a6a>] sysfs_readdir+0x11a/0x1f0 (*) grab sysfs_mutex [<c01ae370>] filldir+0x0/0xd0 [<c01ae370>] filldir+0x0/0xd0 [<c01ae4c6>] vfs_readdir+0x86/0xa0 (*) grab i_mutex [<c01ae75b>] sys_getdents+0x6b/0xc0 [<c010355a>] syscall_call+0x7/0xb [<ffffffff>] 0xffffffff -> #2 (sysfs_mutex){--..}: [<c0153da2>] check_noncircular+0x82/0x110 [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c0156161>] validate_chain+0xb11/0x1070 [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c0156923>] __lock_acquire+0x263/0xa10 [<c015714c>] lock_acquire+0x7c/0xb0 (*) grab sysfs_mutex [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c04f8b55>] mutex_lock_nested+0xa5/0x2f0 [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c01e422f>] create_dir+0x3f/0x90 [<c01e42a9>] sysfs_create_dir+0x29/0x50 [<c04faaf5>] _spin_unlock+0x25/0x40 [<c028f21d>] kobject_add_internal+0xcd/0x1a0 [<c028f37a>] kobject_set_name_vargs+0x3a/0x50 [<c028f41d>] kobject_init_and_add+0x2d/0x40 [<c019d4d2>] sysfs_slab_add+0xd2/0x180 [<c019d580>] sysfs_add_func+0x0/0x70 [<c019d5dc>] sysfs_add_func+0x5c/0x70 (*) grab slub_lock [<c01400f2>] run_workqueue+0x172/0x200 [<c014008f>] run_workqueue+0x10f/0x200 [<c0140bd0>] worker_thread+0x0/0xf0 [<c0140c6c>] worker_thread+0x9c/0xf0 [<c0143c80>] autoremove_wake_function+0x0/0x50 [<c0140bd0>] worker_thread+0x0/0xf0 [<c0143972>] kthread+0x42/0x70 [<c0143930>] kthread+0x0/0x70 [<c01042db>] kernel_thread_helper+0x7/0x1c [<ffffffff>] 0xffffffff -> #1 (slub_lock){----}: [<c0153d2d>] check_noncircular+0xd/0x110 [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0 [<c0156161>] validate_chain+0xb11/0x1070 [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0 [<c015433d>] mark_lock+0x35d/0xd00 [<c0156923>] __lock_acquire+0x263/0xa10 [<c015714c>] lock_acquire+0x7c/0xb0 [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0 [<c04f93a3>] down_read+0x43/0x80 [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0 (*) grab slub_lock [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0 [<c04fd9ac>] notifier_call_chain+0x3c/0x70 [<c04f5454>] _cpu_up+0x84/0x110 [<c04f552b>] cpu_up+0x4b/0x70 (*) grab cpu_hotplug.lock [<c06d1530>] kernel_init+0x0/0x170 [<c06d15e5>] kernel_init+0xb5/0x170 [<c06d1530>] kernel_init+0x0/0x170 [<c01042db>] kernel_thread_helper+0x7/0x1c [<ffffffff>] 0xffffffff -> #0 (&cpu_hotplug.lock){--..}: [<c0155bff>] validate_chain+0x5af/0x1070 [<c040f7e0>] dev_status+0x0/0x50 [<c0156923>] __lock_acquire+0x263/0xa10 [<c015714c>] lock_acquire+0x7c/0xb0 [<c0130789>] get_online_cpus+0x29/0x50 [<c04f8b55>] mutex_lock_nested+0xa5/0x2f0 [<c0130789>] get_online_cpus+0x29/0x50 [<c0130789>] get_online_cpus+0x29/0x50 [<c017bc30>] lru_add_drain_per_cpu+0x0/0x10 [<c0130789>] get_online_cpus+0x29/0x50 (*) grab cpu_hotplug.lock [<c0140cf2>] schedule_on_each_cpu+0x32/0xe0 [<c0187095>] __mlock_vma_pages_range+0x85/0x2c0 [<c0156945>] __lock_acquire+0x285/0xa10 [<c0188f09>] vma_merge+0xa9/0x1d0 [<c0187450>] mlock_fixup+0x180/0x200 [<c0187548>] do_mlockall+0x78/0x90 (*) grab mmap_sem [<c01878e1>] sys_mlockall+0x81/0xb0 [<c010355a>] syscall_call+0x7/0xb [<ffffffff>] 0xffffffff other info that might help us debug this: 1 lock held by lvm/1103: #0: (&mm->mmap_sem){----}, at: [<c01878ae>] sys_mlockall+0x4e/0xb0 stack backtrace: Pid: 1103, comm: lvm Not tainted 2.6.28-rc2-mm1 #2 Call Trace: [<c01555fc>] print_circular_bug_tail+0x7c/0xd0 [<c0155bff>] validate_chain+0x5af/0x1070 [<c040f7e0>] dev_status+0x0/0x50 [<c0156923>] __lock_acquire+0x263/0xa10 [<c015714c>] lock_acquire+0x7c/0xb0 [<c0130789>] get_online_cpus+0x29/0x50 [<c04f8b55>] mutex_lock_nested+0xa5/0x2f0 [<c0130789>] get_online_cpus+0x29/0x50 [<c0130789>] get_online_cpus+0x29/0x50 [<c017bc30>] lru_add_drain_per_cpu+0x0/0x10 [<c0130789>] get_online_cpus+0x29/0x50 [<c0140cf2>] schedule_on_each_cpu+0x32/0xe0 [<c0187095>] __mlock_vma_pages_range+0x85/0x2c0 [<c0156945>] __lock_acquire+0x285/0xa10 [<c0188f09>] vma_merge+0xa9/0x1d0 [<c0187450>] mlock_fixup+0x180/0x200 [<c0187548>] do_mlockall+0x78/0x90 [<c01878e1>] sys_mlockall+0x81/0xb0 [<c010355a>] syscall_call+0x7/0xb Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh@veritas.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-12cpusets: update mems allowed in page allocatorDavid Rientjes
If all allowable memory is unreclaimable, it is possible to loop forever in the page allocator for ~__GFP_NORETRY allocations. During this time, it is also possible for a task's cpuset to expand its set of allowable nodes so that it now includes free memory. The cached copy of this set, current->mems_allowed, is stale, however, since there has not been a subsequent call to cpuset_update_task_memory_state(). The cached copy of the set of allowable nodes is now updated in the page allocator's slow path so the additional memory is available to get_page_from_freelist(). [akpm@linux-foundation.org: add comment] Signed-off-by: David Rientjes <rientjes@google.com> Cc: Paul Menage <menage@google.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-12hugetlb: make unmap_ref_private multi-size-awareAdam Litke
Oops. Part of the hugetlb private reservation code was not fully converted to use hstates. When a huge page must be unmapped from VMAs due to a failed COW, HPAGE_SIZE is used in the call to unmap_hugepage_range() regardless of the page size being used. This works if the VMA is using the default huge page size. Otherwise we might unmap too much, too little, or trigger a BUG_ON. Rare but serious -- fix it. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-12parisc: fix find_extend_vma() breakageDenys Vlasenko
The STACK_GROWSUP case of stack expansion was missing a test for 'prev', which got removed by commit cb8f488c33539f096580e202f5438a809195008f ("mmap.c: deinline a few functions") by mistake. I found my original email in "sent" folder. The patch in that mail does NOT remove !prev. That change had beed added by someone else. Ok, I think we are not much interested in who did it, let's fix it for good. [ "It looks like this was caused by me fixing rejects. That was the fancy include-lots-of-context-so-it-wont-apply patch." - akpm ] Reported-and-bisected-by: Helge Deller <deller@gmx.de> Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-07vmap: cope with vm_unmap_aliases before vmalloc_init()Jeremy Fitzhardinge
Xen can end up calling vm_unmap_aliases() before vmalloc_init() has been called. In this case its safe to make it a simple no-op. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Linux Memory Management List <linux-mm@kvack.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-06Merge master.kernel.org:/home/rmk/linux-2.6-armLinus Torvalds
* master.kernel.org:/home/rmk/linux-2.6-arm: [ARM] xsc3: fix xsc3_l2_inv_range [ARM] mm: fix page table initialization [ARM] fix naming of MODULE_START / MODULE_END ARM: OMAP: Fix define for twl4030 irqs ARM: OMAP: Fix get_irqnr_and_base to clear spurious interrupt bits ARM: OMAP: Fix debugfs_create_*'s error checking method for arm/plat-omap ARM: OMAP: Fix compiler warnings in gpmc.c [ARM] fix VFP+softfloat binaries
2008-11-06memory hotplug: fix page_zone() calculation in test_pages_isolated()Gerald Schaefer
My last bugfix here (adding zone->lock) introduced a new problem: Using page_zone(pfn_to_page(pfn)) to get the zone after the for() loop is wrong. pfn will then be >= end_pfn, which may be in a different zone or not present at all. This may lead to an addressing exception in page_zone() or spin_lock_irqsave(). Now I use __first_valid_page() again after the loop to find a valid page for page_zone(). Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: Nathan Fontenot <nfont@austin.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-06mm/oom_kill.c: fix badness() kerneldocQinghuang Feng
Paramter @mem has been removed since v2.6.26, now delete it's comment. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-06vmemmap: warn about page_structs with remote distanceDavid Rientjes
It's insufficient to simply compare node ids when warning about offnode page_structs since it's possible to still have local affinity. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-06mm: move migrate_prep out from under mmap_semChristoph Lameter
Move the migrate_prep outside the mmap_sem for the following system calls 1. sys_move_pages 2. sys_migrate_pages 3. sys_mbind() It really does not matter when we flush the lru. The system is free to add pages onto the lru even during migration which will make the page migration either skip the page (mbind, migrate_pages) or return a busy state (move_pages). Fixes this lockdep warning (and potential deadlock): Some VM place has mmap_sem -> kevent_wq via lru_add_drain_all() net/core/dev.c::dev_ioctl() has rtnl_lock -> mmap_sem (*) the ioctl has copy_from_user() and it can do page fault. linkwatch_event has kevent_wq -> rtnl_lock Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh@veritas.com> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-06oom: do not dump task state for non thread group leadersDavid Rientjes
When /proc/sys/vm/oom_dump_tasks is enabled, it's only necessary to dump task state information for thread group leaders. The kernel log gets quickly overwhelmed on machines with a massive number of threads by dumping non-thread group leaders. Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-06hugetlb: pull gigantic page initialisation out of the default pathAndy Whitcroft
As we can determine exactly when a gigantic page is in use we can optimise the common regular page cases by pulling out gigantic page initialisation into its own function. As gigantic pages are never released to buddy we do not need a destructor. This effectivly reverts the previous change to the main buddy allocator. It also adds a paranoid check to ensure we never release gigantic pages from hugetlbfs to the main buddy. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@kernel.org> [2.6.27.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-06hugetlbfs: handle pages higher order than MAX_ORDERAndy Whitcroft
When working with hugepages, hugetlbfs assumes that those hugepages are smaller than MAX_ORDER. Specifically it assumes that the mem_map is contigious and uses that to optimise access to the elements of the mem_map that represent the hugepage. Gigantic pages (such as 16GB pages on powerpc) by definition are of greater order than MAX_ORDER (larger than MAX_ORDER_NR_PAGES in size). This means that we can no longer make use of the buddy alloctor guarentees for the contiguity of the mem_map, which ensures that the mem_map is at least contigious for maximmally aligned areas of MAX_ORDER_NR_PAGES pages. This patch adds new mem_map accessors and iterator helpers which handle any discontiguity at MAX_ORDER_NR_PAGES boundaries. It then uses these to implement gigantic page versions of copy_huge_page and clear_huge_page, and to allow follow_hugetlb_page handle gigantic pages. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@kernel.org> [2.6.27.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-06[ARM] fix naming of MODULE_START / MODULE_ENDRussell King
As of 73bdf0a60e607f4b8ecc5aec597105976565a84f, the kernel needs to know where modules are located in the virtual address space. On ARM, we located this region between MODULE_START and MODULE_END. Unfortunately, everyone else calls it MODULES_VADDR and MODULES_END. Update ARM to use the same naming, so is_vmalloc_or_module_addr() can work properly. Also update the comment on mm/vmalloc.c to reflect that ARM also places modules in a separate region from the vmalloc space. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2008-10-30nfsd: fix vm overcommit crashAlan Cox
Junjiro R. Okajima reported a problem where knfsd crashes if you are using it to export shmemfs objects and run strict overcommit. In this situation the current->mm based modifier to the overcommit goes through a NULL pointer. We could simply check for NULL and skip the modifier but we've caught other real bugs in the past from mm being NULL here - cases where we did need a valid mm set up (eg the exec bug about a year ago). To preserve the checks and get the logic we want shuffle the checking around and add a new helper to the vm_ security wrappers Also fix a current->mm reference in nommu that should use the passed mm [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix build] Reported-by: Junjiro R. Okajima <hooanon05@yahoo.co.jp> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-30mm: fix kernel-doc function notationRandy Dunlap
Delete excess kernel-doc notation in mm/ subdirectory. Actually this is a kernel-doc notation fix. Warning(/var/linsrc/linux-2.6.27-git10//mm/vmalloc.c:902): Excess function parameter or struct member 'returns' description in 'vm_map_ram' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-30fs: remove prepare_write/commit_writeNick Piggin
Nothing uses prepare_write or commit_write. Remove them from the tree completely. [akpm@linux-foundation.org: schedule simple_prepare_write() for unexporting] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-23Merge branch 'proc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc * 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc: (35 commits) proc: remove fs/proc/proc_misc.c proc: move /proc/vmcore creation to fs/proc/vmcore.c proc: move pagecount stuff to fs/proc/page.c proc: move all /proc/kcore stuff to fs/proc/kcore.c proc: move /proc/schedstat boilerplate to kernel/sched_stats.h proc: move /proc/modules boilerplate to kernel/module.c proc: move /proc/diskstats boilerplate to block/genhd.c proc: move /proc/zoneinfo boilerplate to mm/vmstat.c proc: move /proc/vmstat boilerplate to mm/vmstat.c proc: move /proc/pagetypeinfo boilerplate to mm/vmstat.c proc: move /proc/buddyinfo boilerplate to mm/vmstat.c proc: move /proc/vmallocinfo to mm/vmalloc.c proc: move /proc/slabinfo boilerplate to mm/slub.c, mm/slab.c proc: move /proc/slab_allocators boilerplate to mm/slab.c proc: move /proc/interrupts boilerplate code to fs/proc/interrupts.c proc: move /proc/stat to fs/proc/stat.c proc: move rest of /proc/partitions code to block/genhd.c proc: move /proc/cpuinfo code to fs/proc/cpuinfo.c proc: move /proc/devices code to fs/proc/devices.c proc: move rest of /proc/locks to fs/locks.c ...
2008-10-23memcg: fix page_cgroup allocationKAMEZAWA Hiroyuki
page_cgroup_init() is called from mem_cgroup_init(). But at this point, we cannot call alloc_bootmem(). (and this caused panic at boot.) This patch moves page_cgroup_init() to init/main.c. Time table is following: == parse_args(). # we can trust mem_cgroup_subsys.disabled bit after this. .... cgroup_init_early() # "early" init of cgroup. .... setup_arch() # memmap is allocated. ... page_cgroup_init(); mem_init(); # we cannot call alloc_bootmem after this. .... cgroup_init() # mem_cgroup is initialized. == Before page_cgroup_init(), mem_map must be initialized. So, I added page_cgroup_init() to init/main.c directly. (*) maybe this is not very clean but - cgroup_init_early() is too early - in cgroup_init(), we have to use vmalloc instead of alloc_bootmem(). use of vmalloc area in x86-32 is important and we should avoid very large vmalloc() in x86-32. So, we want to use alloc_bootmem() and added page_cgroup_init() directly to init/main.c [akpm@linux-foundation.org: remove unneeded/bad mem_cgroup_subsys declaration] [akpm@linux-foundation.org: fix build] Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-23mm: page_cgroup needs linux/vmalloc.h for vmalloc_node()/vfree().Paul Mundt
mm/page_cgroup.c: In function 'init_section_page_cgroup': mm/page_cgroup.c:111: error: implicit declaration of function 'vmalloc_node' mm/page_cgroup.c:111: warning: assignment makes pointer from integer without a cast mm/page_cgroup.c: In function '__free_page_cgroup': mm/page_cgroup.c:140: error: implicit declaration of function 'vfree' Signed-off-by: Paul Mundt <lethal@linux-sh.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-23proc: move /proc/zoneinfo boilerplate to mm/vmstat.cAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org>
2008-10-23proc: move /proc/vmstat boilerplate to mm/vmstat.cAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org>
2008-10-23proc: move /proc/pagetypeinfo boilerplate to mm/vmstat.cAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23proc: move /proc/buddyinfo boilerplate to mm/vmstat.cAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-23proc: move /proc/vmallocinfo to mm/vmalloc.cAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org>
2008-10-23proc: move /proc/slabinfo boilerplate to mm/slub.c, mm/slab.cAlexey Dobriyan
Lose dummy ->write hook in case of SLUB, it's possible now. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-10-23proc: move /proc/slab_allocators boilerplate to mm/slab.cAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-10-23proc: switch /proc/meminfo to seq_fileAlexey Dobriyan
and move it to fs/proc/meminfo.c while I'm at it. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2008-10-20mm: remove duplicated #include'sHuang Weiyi
Removed duplicated #include <linux/vmalloc.h> in mm/vmalloc.c and "internal.h" in mm/memory.c. Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20Export tiny shmem_file_setup for DRM-GEMHugh Dickins
We're trying to keep the !CONFIG_SHMEM tiny-shmem.c (using ramfs without swap) in synch with CONFIG_SHMEM shmem.c (and mpm is preparing patches to combine them). I was glad to see EXPORT_SYMBOL_GPL(shmem_file_setup) go into shmem.c, but why not support DRM-GEM when !CONFIG_SHMEM too? But caution says still depend on MMU, since !CONFIG_MMU is.. different. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Matt Mackall <mpm@selenic.com> Acked-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86 ACPI: fix breakage of resume on 64-bit UP systems with SMP kernel Introduce is_vmalloc_or_module_addr() and use with DEBUG_VIRTUAL
2008-10-20make mm/rmap.c:anon_vma_cachep staticAdrian Bunk
This patch makes the needlessly global anon_vma_cachep static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20memcg: allocate all page_cgroup at bootKAMEZAWA Hiroyuki
Allocate all page_cgroup at boot and remove page_cgroup poitner from struct page. This patch adds an interface as struct page_cgroup *lookup_page_cgroup(struct page*) All FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG is supported. Remove page_cgroup pointer reduces the amount of memory by - 4 bytes per PAGE_SIZE. - 8 bytes per PAGE_SIZE if memory controller is disabled. (even if configured.) On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory. On my x86-64 server with 48GB of memory, this saves 96MB of memory. I think this reduction makes sense. By pre-allocation, kmalloc/kfree in charge/uncharge are removed. This means - we're not necessary to be afraid of kmalloc faiulre. (this can happen because of gfp_mask type.) - we can avoid calling kmalloc/kfree. - we can avoid allocating tons of small objects which can be fragmented. - we can know what amount of memory will be used for this extra-lru handling. I added printk message as "allocated %ld bytes of page_cgroup" "please try cgroup_disable=memory option if you don't want" maybe enough informative for users. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20memcg: atomic ops for page_cgroup->flagsKAMEZAWA Hiroyuki
This patch makes page_cgroup->flags to be atomic_ops and define functions (and macros) to access it. Before trying to modify memory resource controller, this atomic operation on flags is necessary. Most of flags in this patch is for LRU and modfied under mz->lru_lock but we'll add another flags which is not for LRU soon. For example, we'll place LOCK bit on flags field. We need atomic operation to modify LRU bit without LOCK. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20memcg: optimize per-cpu statisticsKAMEZAWA Hiroyuki
Some obvious optimization to memcg. I found mem_cgroup_charge_statistics() is a little big (in object) and does unnecessary address calclation. This patch is for optimization to reduce the size of this function. And res_counter_charge() is 'likely' to succeed. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20memcg: avoid accounting special pagesKAMEZAWA Hiroyuki
There are not-on-LRU pages which can be mapped and they are not worth to be accounted. (becasue we can't shrink them and need dirty codes to handle specical case) We'd like to make use of usual objrmap/radix-tree's protcol and don't want to account out-of-vm's control pages. When special_mapping_fault() is called, page->mapping is tend to be NULL and it's charged as Anonymous page. insert_page() also handles some special pages from drivers. This patch is for avoiding to account special pages. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>