summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2008-06-30Merge commit 'slab/for-next'Stephen Rothwell
2008-06-30Merge commit 'trivial/next'Stephen Rothwell
2008-06-30Merge commit 'x86/auto-x86-next'Stephen Rothwell
Conflicts: arch/x86/kernel/entry_32.S arch/x86/kernel/process_32.c arch/x86/kernel/process_64.c
2008-06-30Merge commit 'ftrace/auto-ftrace-next'Stephen Rothwell
2008-06-30Merge commit 'cpus4096/auto-cpus4096-next'Stephen Rothwell
2008-06-25manual merge of x86/setup-memoryIngo Molnar
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-25Merge branch 'x86/xen' into auto-x86-nextIngo Molnar
2008-06-25mm: add a ptep_modify_prot transaction abstractionJeremy Fitzhardinge
This patch adds an API for doing read-modify-write updates to a pte's protection bits which may race against hardware updates to the pte. After reading the pte, the hardware may asynchonously set the accessed or dirty bits on a pte, which would be lost when writing back the modified pte value. The existing technique to handle this race is to use ptep_get_and_clear() atomically fetch the old pte value and clear it in memory. This has the effect of marking the pte as non-present, which will prevent the hardware from updating its state. When the new value is written back, the pte will be present again, and the hardware can resume updating the access/dirty flags. When running in a virtualized environment, pagetable updates are relatively expensive, since they generally involve some trap into the hypervisor. To mitigate the cost of these updates, we tend to batch them. However, because of the atomic nature of ptep_get_and_clear(), it is inherently non-batchable. This new interface allows batching by giving the underlying implementation enough information to open a transaction between the read and write phases: ptep_modify_prot_start() returns the current pte value, and puts the pte entry into a state where either the hardware will not update the pte, or if it does, the updates will be preserved on commit. ptep_modify_prot_commit() writes back the updated pte, makes sure that any hardware updates made since ptep_modify_prot_start() are preserved. ptep_modify_prot_start() and _commit() must be exactly paired, and used while holding the appropriate pte lock. They do not protect against other software updates of the pte in any way. The current implementations of ptep_modify_prot_start and _commit are functionally unchanged from before: _start() uses ptep_get_and_clear() fetch the pte and zero the entry, preventing any hardware updates. _commit() simply writes the new pte value back knowing that the hardware has not updated the pte in the meantime. The only current user of this interface is mprotect Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-25Merge branch 'x86/xen' into x86/setup-memoryIngo Molnar
2008-06-25Merge branch 'linus' into x86/mpparseIngo Molnar
2008-06-25Merge branch 'linus' into tracing/ftracetip-tracing-ftrace-2008-06-25_10.27_WedIngo Molnar
2008-06-25Merge branch 'linus' into cpus4096Ingo Molnar
2008-06-25Merge branch 'linus' into core/miscIngo Molnar
2008-06-24x86 boot: more consistently use type int for node idsPaul Jackson
Everywhere I look, node id's are of type 'int', except in this one case, which has 'unsigned long'. Change this one to 'int' as well. There is nothing special about the way this variable 'nid' is used in this routine to justify using an unusual type here. Signed-off-by: Paul Jackson <pj@sgi.com> Cc: "Yinghai Lu" <yhlu.kernel@gmail.com> Cc: "Jack Steiner" <steiner@sgi.com> Cc: "Mike Travis" <travis@sgi.com> Cc: "Huang Cc: Ying" <ying.huang@intel.com> Cc: "Andi Kleen" <andi@firstfloor.org> Cc: "Andrew Morton" <akpm@linux-foundation.org> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-24x86 boot: show pfn addresses in hex not decimal in some kernel info printksPaul Jackson
Page frame numbers (the portion of physical addresses above the low order page offsets) are displayed in several kernel debug and info prints in decimal, not hex. Decimal addresse are unreadable. Use hex. Signed-off-by: Paul Jackson <pj@sgi.com> Cc: "Yinghai Lu" <yhlu.kernel@gmail.com> Cc: "Jack Steiner" <steiner@sgi.com> Cc: "Mike Travis" <travis@sgi.com> Cc: "Huang Cc: Ying" <ying.huang@intel.com> Cc: "Andi Kleen" <andi@firstfloor.org> Cc: "Andrew Morton" <akpm@linux-foundation.org> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23slub: current is always validAlexey Dobriyan
Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-06-23mm: fix race in COW logicNick Piggin
There is a race in the COW logic. It contains a shortcut to avoid the COW and reuse the page if we have the sole reference on the page, however it is possible to have two racing do_wp_page()ers with one causing the other to mistakenly believe it is safe to take the shortcut when it is not. This could lead to data corruption. Process 1 and process2 each have a wp pte of the same anon page (ie. one forked the other). The page's mapcount is 2. Then they both attempt to write to it around the same time... proc1 proc2 thr1 proc2 thr2 CPU0 CPU1 CPU3 do_wp_page() do_wp_page() trylock_page() can_share_swap_page() load page mapcount (==2) reuse = 0 pte unlock copy page to new_page pte lock page_remove_rmap(page); trylock_page() can_share_swap_page() load page mapcount (==1) reuse = 1 ptep_set_access_flags (allow W) write private key into page read from page ptep_clear_flush() set_pte_at(pte of new_page) Fix this by moving the page_remove_rmap of the old page after the pte clear and flush. Potentially the entire branch could be moved down here, but in order to stay consistent, I won't (should probably move all the *_mm_counter stuff with one patch). Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Hugh Dickins <hugh@veritas.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-23Fix ZERO_PAGE breakage with vmwareLinus Torvalds
Commit 89f5b7da2a6bad2e84670422ab8192382a5aeb9f ("Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIP") broke vmware, as reported by Jeff Chua: "This broke vmware 6.0.4. Jun 22 14:53:03.845: vmx| NOT_IMPLEMENTED /build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774" and the reason seems to be that there's an old bug in how we handle do FOLL_ANON on VM_SHARED areas in get_user_pages(), but since it only triggered if the whole page table was missing, nobody had apparently hit it before. The recent changes to 'follow_page()' made the FOLL_ANON logic trigger not just for whole missing page tables, but for individual pages as well, and exposed this problem. This fixes it by making the test for when FOLL_ANON is used more careful, and also makes the code easier to read and understand by moving the logic to a separate inline function. Reported-and-tested-by: Jeff Chua <jeff.chua.linux@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-23Merge branch 'linus' into x86/mpparseIngo Molnar
Conflicts: arch/x86/kernel/setup_32.c Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23Merge branch 'linus' into x86/mm-remove-arch-get-ram-rangeIngo Molnar
Conflicts: arch/x86/kernel/setup_32.c Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23Merge branch 'linus' into tracing/ftracetip-tracing-ftrace-2008-06-23_09.11_MonIngo Molnar
2008-06-22slub: Add check for kfree() of non slab objects.Christoph Lameter
We can detect kfree()s on non slab objects by checking for PageCompound(). Works in the same way as for ksize. This helped me catch an invalid kfree(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-06-21Slab: Fix memory leak in fallback_alloc()tip-x86-idle-2008-06-23_09.16_Montip-core-urgent-2008-06-23_08.56_MonChristoph Lameter
The zonelist patches caused the loop that checks for available objects in permitted zones to not terminate immediately. One object per zone per allocation may be allocated and then abandoned. Break the loop when we have successfully allocated one object. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-21Add return value to reserve_bootmem_node()Bernhard Walle
This patch changes the function reserve_bootmem_node() from void to int, returning -ENOMEM if the allocation fails. This fixes a build problem on x86 with CONFIG_KEXEC=y and CONFIG_NEED_MULTIPLE_NODES=y Signed-off-by: Bernhard Walle <bwalle@suse.de> Reported-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-20Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIPLinus Torvalds
KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit 557ed1fa2620dc119adb86b34c614e152a629a80 ("remove ZERO_PAGE") removed the ZERO_PAGE from the VM mappings, any users of get_user_pages() will generally now populate the VM with real empty pages needlessly. We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but since fault handling no longer uses ZERO_PAGE for new anonymous pages, we now need to handle that special case in follow_page() instead. In particular, the removal of ZERO_PAGE effectively removed the core file writing optimization where we would skip writing pages that had not been populated at all, and increased memory pressure a lot by allocating all those useless newly zeroed pages. This reinstates the optimization by making the unmapped PTE case the same as for a non-existent page table, which already did this correctly. While at it, this also fixes the XIP case for follow_page(), where the caller could not differentiate between the case of a page that simply could not be used (because it had no "struct page" associated with it) and a page that just wasn't mapped. We do that by simply returning an error pointer for pages that could not be turned into a "struct page *". The error is arbitrarily picked to be EFAULT, since that was what get_user_pages() already used for the equivalent IO-mapped page case. [ Also removed an impossible test for pte_offset_map_lock() failing: that's not how that function works ] Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-20Merge branch 'x86/numa' into x86/mpparseIngo Molnar
Conflicts: arch/x86/kernel/mpparse.c arch/x86/kernel/setup.c arch/x86/kernel/setup_32.c arch/x86/mm/init_64.c include/asm-x86/proto.h Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-18RFC x86: try to remove arch_get_ram_rangeYinghai Lu
want to remove arch_get_ram_range, and use early_node_map instead. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-16Merge branch 'linus' into tracing/ftracetip-tracing-ftrace-2008-06-16_09.26_Montip-tracing-ftrace-2008-06-16_09.16_MonIngo Molnar
2008-06-16Merge branch 'linus' into cpus4096Ingo Molnar
2008-06-16Merge branch 'linus' into core/miscIngo Molnar
2008-06-16Merge branch 'linus' into x86/numaIngo Molnar
2008-06-16Merge branch 'linus' into x86/mpparseIngo Molnar
2008-06-16x86, mm: use add_highpages_with_active_regions() for high pages init v2Yinghai Lu
use early_node_map to init high pages, so we can remove page_is_ram() and page_is_reserved_early() in the big loop with add_one_highpage also remove page_is_reserved_early(), it is not needed anymore. v2: fix the build of other platforms Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-14x86: replace shrink_active_range() with remove_active_range()Yinghai Lu
in case we have kva before ramdisk on a node, we still need to use those ranges. v2: reserve_early kva ram area, in case there are holes in highmem, to avoid those area could be treat as free high pages. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-12pagemap: pass mm into pagewalkersDave Hansen
We need this at least for huge page detection for now, because powerpc needs the vm_area_struct to be able to determine whether a virtual address is referring to a huge page (its pmd_huge() doesn't work). It might also come in handy for some of the other users. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-12mm: fix incorrect variable type in do_try_to_free_pages()kosaki.motohiro@jp.fujitsu.com
"Smarter retry of costly-order allocations" patch series change behaver of do_try_to_free_pages(). But unfortunately ret variable type was unchanged. Thus an overflow is possible. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-12nommu: Correct kobjsize() page validity checks.Paul Mundt
This implements a few changes on top of the recent kobjsize() refactoring introduced by commit 6cfd53fc03670c7a544a56d441eb1a6cc800d72b. As Christoph points out: virt_to_head_page cannot return NULL. virt_to_page also does not return NULL. pfn_valid() needs to be used to figure out if a page is valid. Otherwise the page struct reference that was returned may have PageReserved() set to indicate that it is not a valid page. As discussed further in the thread, virt_addr_valid() is the preferable way to validate the object pointer in this case. In addition to fixing up the reserved page case, it also has the benefit of encapsulating the hack introduced by commit 4016a1390d07f15b267eecb20e76a48fd5c524ef on the impacted platforms, allowing us to get rid of the extra checking in kobjsize() for the platforms that don't perform this type of bizarre memory_end abuse (every nommu platform that isn't blackfin). If blackfin decides to get in line with every other platform and use PageReserved for the DMA pages in question, kobjsize() will also continue to work fine. It also turns out that compound_order() will give us back 0-order for non-head pages, so we can get rid of the PageCompound check and just use compound_order() directly. Clean that up while we're at it. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Reviewed-by: Christoph Lameter <clameter@sgi.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-10bootmem: add return value to reserve_bootmem_node()Bernhard Walle
This patch changes the function reserve_bootmem_node() from void to int, returning -ENOMEM if the allocation fails. Signed-off-by: Bernhard Walle <bwalle@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10bootmem: add return value to reserve_bootmem_node()Bernhard Walle
This patch changes the function reserve_bootmem_node() from void to int, returning -ENOMEM if the allocation fails. Signed-off-by: Bernhard Walle <bwalle@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() failsOleg Nesterov
If lock_page_killable() fails because the task was killed by SIGKILL or any other fatal signal, do_generic_file_read() returns -EIO. This seems to be OK, because in fact the userspace won't see this error, the task will dequeue SIGKILL and exit. However, /sbin/init is different, it will dequeue SIGKILL, ignore it, and return to the user-space with the bogus -EIO. Change the code to return the error code from lock_page_killable(), -EINTR. This doesn't fix the bug, but perhaps makes sense anyway. Imho, with this change the code looks a bit more logical, and the "good" init should handle the spurious EINTR or short read. Afaics we can also change lock_page_killable() to return -ERESTARTNOINTR, but this can't prevent the short reads. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10mm, x86: shrink_active_range() should check allYinghai Lu
Now we are using register_e820_active_regions() instead of add_active_range() directly. So end_pfn could be different between the value in early_node_map to node_end_pfn. So we need to make shrink_active_range() smarter. shrink_active_range() is a generic MM function in mm/page_alloc.c but it is only used on 32-bit x86. Should we move it back to some file in arch/x86? Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-09mm: Minor clean-up of page flags in mm/page_alloc.cRuss Anderson
Minor source code cleanup of page flags in mm/page_alloc.c. Move the definition of the groups of bits to page-flags.h. The purpose of this clean up is that the next patch will conditionally add a page flag to the groups. Doing that in a header file is cleaner than adding #ifdefs to the C code. Signed-off-by: Russ Anderson <rja@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06nommu: fix kobjsize() for SLOB and SLUBPaul Mundt
kobjsize() has been abusing page->index as a method for sorting out compound order, which blows up both for page cache pages, and SLOB's reuse of the index in struct slob_page. Presently we are not able to accurately size arbitrary pointers that don't come from kmalloc(), so the best we can do is sort out the compound order from the head page if it's a compound page, or default to 0-order if it's impossible to ksize() the object. Obviously this leaves quite a bit to be desired in terms of object sizing accuracy, but the behaviour is unchanged over the existing implementation, while fixing the page->index oopses originally reported here: http://marc.info/?l=linux-mm&m=121127773325245&w=2 Accuracy could also be improved by having SLUB and SLOB both set PG_slab on ksizeable pages, rather than just handling the __GFP_COMP cases irregardless of the PG_slab setting, as made possibly with Pekka's patches: http://marc.info/?l=linux-kernel&m=121139439900534&w=2 http://marc.info/?l=linux-kernel&m=121139440000537&w=2 http://marc.info/?l=linux-kernel&m=121139440000540&w=2 This is primarily a bugfix for nommu systems for 2.6.26, with the aim being to gradually kill off kobjsize() and its particular brand of object abuse entirely. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06brk: make sys_brk() honor COMPAT_BRK when computing lower boundJiri Kosina
Fix a regression introduced by commit 4cc6028d4040f95cdb590a87db478b42b8be0508 Author: Jiri Kosina <jkosina@suse.cz> Date: Wed Feb 6 22:39:44 2008 +0100 brk: check the lower bound properly The check in sys_brk() on minimum value the brk might have must take CONFIG_COMPAT_BRK setting into account. When this option is turned on (i.e. we support ancient legacy binaries, e.g. libc5-linked stuff), the lower bound on brk value is mm->end_code, otherwise the brk start is allowed to be arbitrarily shifted. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-06hugetlb: fix lockdep errorNick Piggin
============================================= [ INFO: possible recursive locking detected ] 2.6.26-rc4 #30 --------------------------------------------- heap-overflow/2250 is trying to acquire lock: (&mm->page_table_lock){--..}, at: [<c0000000000cf2e8>] .copy_hugetlb_page_range+0x108/0x280 but task is already holding lock: (&mm->page_table_lock){--..}, at: [<c0000000000cf2dc>] .copy_hugetlb_page_range+0xfc/0x280 other info that might help us debug this: 3 locks held by heap-overflow/2250: #0: (&mm->mmap_sem){----}, at: [<c000000000050e44>] .dup_mm+0x134/0x410 #1: (&mm->mmap_sem/1){--..}, at: [<c000000000050e54>] .dup_mm+0x144/0x410 #2: (&mm->page_table_lock){--..}, at: [<c0000000000cf2dc>] .copy_hugetlb_page_range+0xfc/0x280 stack backtrace: Call Trace: [c00000003b2774e0] [c000000000010ce4] .show_stack+0x74/0x1f0 (unreliable) [c00000003b2775a0] [c0000000003f10e0] .dump_stack+0x20/0x34 [c00000003b277620] [c0000000000889bc] .__lock_acquire+0xaac/0x1080 [c00000003b277740] [c000000000089000] .lock_acquire+0x70/0xb0 [c00000003b2777d0] [c0000000003ee15c] ._spin_lock+0x4c/0x80 [c00000003b277870] [c0000000000cf2e8] .copy_hugetlb_page_range+0x108/0x280 [c00000003b277950] [c0000000000bcaa8] .copy_page_range+0x558/0x790 [c00000003b277ac0] [c000000000050fe0] .dup_mm+0x2d0/0x410 [c00000003b277ba0] [c000000000051d24] .copy_process+0xb94/0x1020 [c00000003b277ca0] [c000000000052244] .do_fork+0x94/0x310 [c00000003b277db0] [c000000000011240] .sys_clone+0x60/0x80 [c00000003b277e30] [c0000000000078c4] .ppc_clone+0x8/0xc Fix is the same way that mm/memory.c copy_page_range does the lockdep annotation. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Adam Litke <agl@us.ibm.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-03x86, numa, 32-bit: print out debug info on all kvasYinghai Lu
also fix the print out of node_remap_end_vaddr Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: slub: ksize() abuse checks slob: Fix to return wrong pointer
2008-05-24memory hotplug: fix early allocation handlingHeiko Carstens
Trying to add memory via add_memory() from within an initcall function results in bootmem alloc of 163840 bytes failed! Kernel panic - not syncing: Out of memory This is caused by zone_wait_table_init() which uses system_state to decide if it should use the bootmem allocator or not. When initcalls are handled the system_state is still SYSTEM_BOOTING but the bootmem allocator doesn't work anymore. So the allocation will fail. To fix this use slab_is_available() instead as indicator like we do it everywhere else. [akpm@linux-foundation.org: coding-style fix] Reviewed-by: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24zonelists: handle a node zonelist with no applicable entriesAndy Whitcroft
When booting 2.6.26-rc3 on a multi-node x86_32 numa system we are seeing panics when trying node local allocations: BUG: unable to handle kernel NULL pointer dereference at 0000034c IP: [<c1042507>] get_page_from_freelist+0x4a/0x18e *pdpt = 00000000013a7001 *pde = 0000000000000000 Oops: 0000 [#1] SMP Modules linked in: Pid: 0, comm: swapper Not tainted (2.6.26-rc3-00003-g5abc28d #82) EIP: 0060:[<c1042507>] EFLAGS: 00010282 CPU: 0 EIP is at get_page_from_freelist+0x4a/0x18e EAX: c1371ed8 EBX: 00000000 ECX: 00000000 EDX: 00000000 ESI: f7801180 EDI: 00000000 EBP: 00000000 ESP: c1371ec0 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 Process swapper (pid: 0, ti=c1370000 task=c12f5b40 task.ti=c1370000) Stack: 00000000 00000000 00000000 00000000 000612d0 000412d0 00000000 000412d0 f7801180 f7c0101c f7c01018 c10426e4 f7c01018 00000001 00000044 00000000 00000001 c12f5b40 00000001 00000010 00000000 000412d0 00000286 000412d0 Call Trace: [<c10426e4>] __alloc_pages_internal+0x99/0x378 [<c10429ca>] __alloc_pages+0x7/0x9 [<c105e0e8>] kmem_getpages+0x66/0xef [<c105ec55>] cache_grow+0x8f/0x123 [<c105f117>] ____cache_alloc_node+0xb9/0xe4 [<c105f427>] kmem_cache_alloc_node+0x92/0xd2 [<c122118c>] setup_cpu_cache+0xaf/0x177 [<c105e6ca>] kmem_cache_create+0x2c8/0x353 [<c13853af>] kmem_cache_init+0x1ce/0x3ad [<c13755c5>] start_kernel+0x178/0x1ee This occurs when we are scanning the zonelists looking for a ZONE_NORMAL page. In this system there is only ZONE_DMA and ZONE_NORMAL memory on node 0, all other nodes are mapped above 4GB physical. Here is a dump of the zonelists from this system: zonelists pgdat=c1400000 0: c14006c0:2 f7c006c0:2 f7e006c0:2 c1400360:1 c1400000:0 1: c14006c0:2 c1400360:1 c1400000:0 zonelists pgdat=f7c00000 0: f7c006c0:2 f7e006c0:2 c14006c0:2 c1400360:1 c1400000:0 1: f7c006c0:2 zonelists pgdat=f7e00000 0: f7e006c0:2 c14006c0:2 f7c006c0:2 c1400360:1 c1400000:0 1: f7e006c0:2 When performing a node local allocation we call get_page_from_freelist() looking for a page. It in turn calls first_zones_zonelist() which returns a preferred_zone. Where there are no applicable zones this will be NULL. However we use this unconditionally, leading to this panic. Where there are no applicable zones there is no possibility of a successful allocation, so simply fail the allocation. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24mm: fix atomic_t overflow in vmAlan Cox
The atomic_t type is 32bit but a 64bit system can have more than 2^32 pages of virtual address space available. Without this we overflow on ludicrously large mappings Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>