summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2014-04-23Merge branch 'akpm-current/current'Stephen Rothwell
2014-04-23Merge remote-tracking branch 'percpu/for-next'Stephen Rothwell
2014-04-23Merge remote-tracking branch 'tip/auto-latest'Stephen Rothwell
2014-04-23Merge remote-tracking branch 'vfs/for-next'Stephen Rothwell
2014-04-23Merge remote-tracking branch 's390/features'Stephen Rothwell
2014-04-23mm/util.c: add kstrimdup()Sebastian Capella
kstrimdup() creates a whitespace-trimmed duplicate of the passed in null-terminated string. This is useful for strings coming from sysfs that often include trailing whitespace due to user input. Thanks to Joe Perches for this implementation. Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org> Cc: Joe Perches <joe@perches.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23do_shared_fault(): check that mmap_sem is heldAndrew Morton
mmap_sem() is required to protect the vma, which holds ->vm_file, which pins fault_page->mapping. Cc: Andi Kleen <ak@linux.intel.com> Cc: Bob Liu <lliubbo@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm-vmscan-do-not-throttle-based-on-pfmemalloc-reserves-if-node-has-no-zone_n ↵Andrew Morton
ormal-checkpatch-fixes ERROR: code indent should use tabs where possible #66: FILE: mm/vmscan.c:2585: + for_each_zone_zonelist_nodemask(zone, z, zonelist,$ WARNING: please, no spaces at the start of a line #66: FILE: mm/vmscan.c:2585: + for_each_zone_zonelist_nodemask(zone, z, zonelist,$ ERROR: code indent should use tabs where possible #67: FILE: mm/vmscan.c:2586: + gfp_mask, nodemask) {$ WARNING: please, no spaces at the start of a line #67: FILE: mm/vmscan.c:2586: + gfp_mask, nodemask) {$ total: 2 errors, 2 warnings, 56 lines checked NOTE: whitespace errors detected, you may wish to use scripts/cleanpatch or scripts/cleanfile ./patches/mm-vmscan-do-not-throttle-based-on-pfmemalloc-reserves-if-node-has-no-zone_normal.patch has style problems, please review. If any of these errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ↵Mel Gorman
ZONE_NORMAL throttle_direct_reclaim() is meant to trigger during swap-over-network during which the min watermark is treated as a pfmemalloc reserve. It throttes on the first node in the zonelist but this is flawed. On a NUMA machine running a 32-bit kernel (I know) allocation requests freom CPUs on node 1 would detect no pfmemalloc reserves and the process gets throttled. This patch adjusts throttling of direct reclaim to throttle based on the first node in the zonelist that has a usable ZONE_NORMAL or lower zone. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm/mmap.c: remove the first mapping checkHuang Shijie
Remove the first mapping check for vma_link. Move the mutex_lock into the braces when vma->vm_file is true. Signed-off-by: Huang Shijie <b32955@freescale.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm/swap.c: clean up *lru_cache_add* functionsJianyu Zhan
In mm/swap.c, __lru_cache_add() is exported, but actually there are no users outside this file. This patch unexports __lru_cache_add(), and makes it static. It also exports lru_cache_add_file(), as it is use by cifs and fuse, which can loaded as modules. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Bob Liu <bob.liu@oracle.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm-memcontrolc-introduce-helper-mem_cgroup_zoneinfo_zone-checkpatch-fixesAndrew Morton
WARNING: please, no spaces at the start of a line #29: FILE: mm/memcontrol.c:689: + int nid = zone_to_nid(zone);$ WARNING: please, no spaces at the start of a line #30: FILE: mm/memcontrol.c:690: + int zid = zone_idx(zone);$ WARNING: please, no spaces at the start of a line #32: FILE: mm/memcontrol.c:692: + return mem_cgroup_zoneinfo(memcg, nid, zid);$ total: 0 errors, 3 warnings, 35 lines checked ./patches/mm-memcontrolc-introduce-helper-mem_cgroup_zoneinfo_zone.patch has style problems, please review. If any of these errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm/memcontrol.c: introduce helper mem_cgroup_zoneinfo_zone()Jianyu Zhan
Introduce helper mem_cgroup_zoneinfo_zone(). This makes mem_cgroup_iter() code more compact. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm: debug: make bad_range() output more usable and readableDave Hansen
Nobody outputs memory addresses in decimal. PFNs are essentially addresses, and they're gibberish in decimal. Output them in hex. Also, add the nid and zone name to give a little more context to the message. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm-compaction-cleanup-isolate_freepages-fix 2Vlastimil Babka
Cleanup detection of compaction scanners crossing in isolate_freepages(). To make sure compact_finished() observes scanners crossing, we can just set free_pfn to migrate_pfn instead of confusing max() construct. Suggested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dongjun Shin <d.j.shin@samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sunghwan Yun <sunghwan.yun@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm-compaction-cleanup-isolate_freepages-fixAndrew Morton
comment fixes, per Minchan Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dongjun Shin <d.j.shin@samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sunghwan Yun <sunghwan.yun@samsung.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm/compaction: cleanup isolate_freepages()Vlastimil Babka
isolate_freepages() is currently somewhat hard to follow thanks to many looks like it is related to the 'low_pfn' variable, but in fact it is not. This patch renames the 'high_pfn' variable to a hopefully less confusing name, and slightly changes its handling without a functional change. A comment made obsolete by recent changes is also updated. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: Dongjun Shin <d.j.shin@samsung.com> Cc: Sunghwan Yun <sunghwan.yun@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm/compaction: clean up unused code linesHeesub Shin
Remove code lines currently not in use or never called. Signed-off-by: Heesub Shin <heesub.shin@samsung.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dongjun Shin <d.j.shin@samsung.com> Cc: Sunghwan Yun <sunghwan.yun@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: Dongjun Shin <d.j.shin@samsung.com> Cc: Sunghwan Yun <sunghwan.yun@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm-gupc-tweaks-fixAndrew Morton
fix x86_64 allnoconfig build mm/gup.c: In function 'follow_page_pte': mm/gup.c:100: error: implicit declaration of function 'trylock_page' mm/gup.c:109: error: implicit declaration of function 'unlock_page' mm/gup.c: In function '__get_user_pages': mm/gup.c:467: error: implicit declaration of function 'flush_anon_page' mm/gup.c:468: error: implicit declaration of function 'flush_dcache_page' Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm/gup.c: tweaksAndrew Morton
- include some more header files, but many are still missed - fix some 80-col overflows by removing unneeded `inline' Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm: cleanup __get_user_pages()Kirill A. Shutemov
Get rid of two nested loops over nr_pages and other random cleanups. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm-extract-code-to-fault-in-a-page-from-__get_user_pages-fixAndrew Morton
s/BUILD_BUG/BUG/ Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm: extract code to fault in a page from __get_user_pages()Kirill A. Shutemov
Nesting level in __get_user_pages() is just insane. Let's try to fix it a bit. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm: cleanup follow_page_mask()Kirill A. Shutemov
Cleanups: - move pte-related code to separate function. It's about half of the function; - get rid of some goto-logic; - use 'return NULL' instead of 'return page' where page can only be NULL; Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm: extract in_gate_area() case from __get_user_pages()Kirill A. Shutemov
The case is special and disturb from reading main __get_user_pages() code path. Let's move it to separate function. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm: move get_user_pages()-related code to separate fileKirill A. Shutemov
mm/memory.c is overloaded: over 4k lines. get_user_pages() code is pretty much self-contained let's move it to separate file. No other changes made. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pagesVlastimil Babka
For the MIGRATE_RESERVE pages, it is important they do not get misplaced on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE pageblock might be changed to other migratetype in try_to_steal_freepages(). For MIGRATE_CMA, the pages also must not go to a different free_list, otherwise they could get allocated as unmovable and result in CMA failure. This is ensured by setting the freepage_migratetype appropriately when placing pages on pcp lists, and using the information when releasing them back to free_list. It is also assumed that CMA and RESERVE pageblocks are created only in the init phase. This patch adds DEBUG_VM checks to catch any regressions introduced for this invariant. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Yong-Taek Lee <ytk.lee@samsung.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm/page_alloc: prevent MIGRATE_RESERVE pages from being misplacedVlastimil Babka
For the MIGRATE_RESERVE pages, it is important they do not get misplaced on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE pageblock might be changed to other migratetype in try_to_steal_freepages(). Currently, it is however possible for this to happen when MIGRATE_RESERVE page is allocated on pcplist through rmqueue_bulk() as a fallback for other desired migratetype, and then later freed back through free_pcppages_bulk() without being actually used. This happens because free_pcppages_bulk() uses get_freepage_migratetype() to choose the free_list, and rmqueue_bulk() calls set_freepage_migratetype() with the *desired* migratetype and not the page's original MIGRATE_RESERVE migratetype. This patch fixes the problem by moving the call to set_freepage_migratetype() from rmqueue_bulk() down to __rmqueue_smallest() and __rmqueue_fallback() where the actual page's migratetype (e.g. from which free_list the page is taken from) is used. Note that this migratetype might be different from the pageblock's migratetype due to freepage stealing decisions. This is OK, as page stealing never uses MIGRATE_RESERVE as a fallback, and also takes care to leave all MIGRATE_CMA pages on the correct freelist. Therefore, as an additional benefit, the call to get_pageblock_migratetype() from rmqueue_bulk() when CMA is enabled, can be removed completely. This relies on the fact that MIGRATE_CMA pageblocks are created only during system init, and the above. The related is_migrate_isolate() check is also unnecessary, as memory isolation has other ways to move pages between freelists, and drain pcp lists containing pages that should be isolated. The buffered_rmqueue() can also benefit from calling get_freepage_migratetype() instead of get_pageblock_migratetype(). A separate patch will add VM_BUG_ON checks for the invariant that for MIGRATE_RESERVE and MIGRATE_CMA pageblocks, freepage_migratetype must equal to pageblock_migratetype so that these pages always go to the correct free_list. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Yong-Taek Lee <ytk.lee@samsung.com> Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Suggested-by: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23slab: get_online_mems for kmem_cache_{create,destroy,shrink}Vladimir Davydov
When we create a sl[au]b cache, we allocate kmem_cache_node structures for each online NUMA node. To handle nodes taken online/offline, we register memory hotplug notifier and allocate/free kmem_cache_node corresponding to the node that changes its state for each kmem cache. To synchronize between the two paths we hold the slab_mutex during both the cache creationg/destruction path and while tuning per-node parts of kmem caches in memory hotplug handler, but that's not quite right, because it does not guarantee that a newly created cache will have all kmem_cache_nodes initialized in case it races with memory hotplug. For instance, in case of slub: CPU0 CPU1 ---- ---- kmem_cache_create: online_pages: __kmem_cache_create: slab_memory_callback: slab_mem_going_online_callback: lock slab_mutex for each slab_caches list entry allocate kmem_cache node unlock slab_mutex lock slab_mutex init_kmem_cache_nodes: for_each_node_state(node, N_NORMAL_MEMORY) allocate kmem_cache node add kmem_cache to slab_caches list unlock slab_mutex online_pages (continued): node_states_set_node As a result we'll get a kmem cache with not all kmem_cache_nodes allocated. To avoid issues like that we should hold get/put_online_mems() during the whole kmem cache creation/destruction/shrink paths, just like we deal with cpu hotplug. This patch does the trick. Note, that after it's applied, there is no need in taking the slab_mutex for kmem_cache_shrink any more, so it is removed from there. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mem-hotplug: implement get/put_online_memsVladimir Davydov
kmem_cache_{create,destroy,shrink} need to get a stable value of cpu/node online mask, because they init/destroy/access per-cpu/node kmem_cache parts, which can be allocated or destroyed on cpu/mem hotplug. To protect against cpu hotplug, these functions use {get,put}_online_cpus. However, they do nothing to synchronize with memory hotplug - taking the slab_mutex does not eliminate the possibility of race as described in patch 2. What we need there is something like get_online_cpus, but for memory. We already have lock_memory_hotplug, which serves for the purpose, but it's a bit of a hammer right now, because it's backed by a mutex. As a result, it imposes some limitations to locking order, which are not desirable, and can't be used just like get_online_cpus. That's why in patch 1 I substitute it with get/put_online_mems, which work exactly like get/put_online_cpus except they block not cpu, but memory hotplug. [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by myself, because it used an rw semaphore for get/put_online_mems, making them dead lock prune. ] This patch (of 2): {un}lock_memory_hotplug, which is used to synchronize against memory hotplug, is currently backed by a mutex, which makes it a bit of a hammer - threads that only want to get a stable value of online nodes mask won't be able to proceed concurrently. Also, it imposes some strong locking ordering rules on it, which narrows down the set of its usage scenarios. This patch introduces get/put_online_mems, which are the same as get/put_online_cpus, but for memory hotplug, i.e. executing a code inside a get/put_online_mems section will guarantee a stable value of online nodes, present pages, etc. lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23memcg: un-export __memcg_kmem_get_cacheVladimir Davydov
It is only used in slab and should not be used anywhere else so there is no need in exporting it. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm-page_alloc-do-not-cache-reclaim-distances-fixAndrew Morton
restore a hunk which got rejected and not fixed Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm: page_alloc: do not cache reclaim distancesMel Gorman
pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by zone_reclaim due to its distance. As it is expected that zone_reclaim_mode will be rarely enabled it is unreasonable for all machines to take a penalty. Fortunately, the zone_reclaim_mode() path is already slow and it is the path that takes the hit. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm: disable zone_reclaim_mode by defaultMel Gorman
When it was introduced, zone_reclaim_mode made sense as NUMA distances punished and workloads were generally partitioned to fit into a NUMA node. NUMA machines are now common but few of the workloads are NUMA-aware and it's routine to see major performance degradation due to zone_reclaim_mode being enabled but relatively few can identify the problem. Those that require zone_reclaim_mode are likely to be able to detect when it needs to be enabled and tune appropriately so lets have a sensible default for the bulk of users. This patch (of 2): zone_reclaim_mode causes processes to prefer reclaiming memory from local node instead of spilling over to other nodes. This made sense initially when NUMA machines were almost exclusively HPC and the workload was partitioned into nodes. The NUMA penalties were sufficiently high to justify reclaiming the memory. On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default. Users that are sophisticated enough to know they need zone_reclaim_mode will detect it. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23hugetlb-add-support-for-gigantic-page-allocation-at-runtime-checkpatch-fixesAndrew Morton
WARNING: braces {} are not necessary for any arm of this statement #282: FILE: mm/hugetlb.c:1650: + if (hstate_is_gigantic(h)) { [...] + } else { [...] total: 0 errors, 1 warnings, 219 lines checked ./patches/hugetlb-add-support-for-gigantic-page-allocation-at-runtime.patch has style problems, please review. If any of these errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23hugetlb: add support for gigantic page allocation at runtimeLuiz Capitulino
HugeTLB is limited to allocating hugepages whose size are less than MAX_ORDER order. This is so because HugeTLB allocates hugepages via the buddy allocator. Gigantic pages (that is, pages whose size is greater than MAX_ORDER order) have to be allocated at boottime. However, boottime allocation has at least two serious problems. First, it doesn't support NUMA and second, gigantic pages allocated at boottime can't be freed. This commit solves both issues by adding support for allocating gigantic pages during runtime. It works just like regular sized hugepages, meaning that the interface in sysfs is the same, it supports NUMA, and gigantic pages can be freed. For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G gigantic pages on node 1, one can do: # echo 2 > \ /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages And to free them all: # echo 0 > \ /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages The one problem with gigantic page allocation at runtime is that it can't be serviced by the buddy allocator. To overcome that problem, this commit scans all zones from a node looking for a large enough contiguous region. When one is found, it's allocated by using CMA, that is, we call alloc_contig_range() to do the actual allocation. For example, on x86_64 we scan all zones looking for a 1GB contiguous region. When one is found, it's allocated by alloc_contig_range(). One expected issue with that approach is that such gigantic contiguous regions tend to vanish as runtime goes by. The best way to avoid this for now is to make gigantic page allocations very early during system boot, say from a init script. Other possible optimization include using compaction, which is supported by CMA but is not explicitly used by this commit. It's also important to note the following: 1. Gigantic pages allocated at boottime by the hugepages= command-line option can be freed at runtime just fine 2. This commit adds support for gigantic pages only to x86_64. The reason is that I don't have access to nor experience with other archs. The code is arch indepedent though, so it should be simple to add support to different archs 3. I didn't add support for hugepage overcommit, that is allocating a gigantic page on demand when /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't think it's reasonable to do the hard and long work required for allocating a gigantic page at fault time. But it should be simple to add this if wanted Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23hugetlb: move helpers up in the fileLuiz Capitulino
Next commit will add new code which will want to call for_each_node_mask_to_alloc() macro. Move it, its buddy for_each_node_mask_to_free() and their dependencies up in the file so the new code can use them. This is just code movement, no logic change. Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23hugetlb: update_and_free_page(): don't clear PG_reserved bitLuiz Capitulino
Hugepages pages never get the PG_reserved bit set, so don't clear it. However, note that if the bit gets mistakenly set free_pages_check() will catch it. Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23hugetlb: add hstate_is_gigantic()Luiz Capitulino
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23hugetlb: prep_compound_gigantic_page(): drop __init markerLuiz Capitulino
The HugeTLB subsystem uses the buddy allocator to allocate hugepages during runtime. This means that hugepages allocation during runtime is limited to MAX_ORDER order. For archs supporting gigantic pages (that is, page sizes greater than MAX_ORDER), this in turn means that those pages can't be allocated at runtime. HugeTLB supports gigantic page allocation during boottime, via the boot allocator. To this end the kernel provides the command-line options hugepagesz= and hugepages=, which can be used to instruct the kernel to allocate N gigantic pages during boot. For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages can be allocated and freed at runtime. If one wants to allocate 1G gigantic pages, this has to be done at boot via the hugepagesz= and hugepages= command-line options. Now, gigantic page allocation at boottime has two serious problems: 1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel evenly distributes boottime allocated hugepages among nodes. For example, suppose you have a four-node NUMA machine and want to allocate four 1G gigantic pages at boottime. The kernel will allocate one gigantic page per node. On the other hand, we do have users who want to be able to specify which NUMA node gigantic pages should allocated from. So that they can place virtual machines on a specific NUMA node. 2. Gigantic pages allocated at boottime can't be freed At this point it's important to observe that regular hugepages allocated at runtime don't have those problems. This is so because HugeTLB interface for runtime allocation in sysfs supports NUMA and runtime allocated pages can be freed just fine via the buddy allocator. This series adds support for allocating gigantic pages at runtime. It does so by allocating gigantic pages via CMA instead of the buddy allocator. Releasing gigantic pages is also supported via CMA. As this series builds on top of the existing HugeTLB interface, it makes gigantic page allocation and releasing just like regular sized hugepages. This also means that NUMA support just works. For example, to allocate two 1G gigantic pages on node 1, one can do: # echo 2 > \ /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages And, to release all gigantic pages on the same node: # echo 0 > \ /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages Please, refer to patch 5/5 for full technical details. Finally, please note that this series is a follow up for a previous series that tried to extend the command-line options set to be NUMA aware: http://marc.info/?l=linux-mm&m=139593335312191&w=2 During the discussion of that series it was agreed that having runtime allocation support for gigantic pages was a better solution. This patch (of 5): This function is going to be used by non-init code in a future commit. Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: David Rientjes <rientjes@google.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm/mmap.c: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERODuan Jiong
Fix a coccinelle error regarding usage of IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23slab: document kmalloc_orderVladimir Davydov
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm: memcontrol: remove hierarchy restrictions for swappiness and oom_controlJohannes Weiner
Per-memcg swappiness and oom killing can currently not be tweaked on a memcg that is part of a hierarchy, but not the root of that hierarchy. Users have complained that they can't configure this when they turned on hierarchy mode. In fact, with hierarchy mode becoming the default, this restriction disables the tunables entirely. But there is no good reason for this restriction. The settings for swappiness and OOM killing are taken from whatever memcg whose limit triggered reclaim and OOM invocation, regardless of its position in the hierarchy tree. Allow setting swappiness on any group. The knob on the root memcg already reads the global VM swappiness, make it writable as well. Allow disabling the OOM killer on any non-root memcg. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm-mempool-warn-about-__gfp_zero-usage-fixAndrew Morton
use VM_WARN_ON_ONCE Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm/mempool: warn about __GFP_ZERO usageSebastian Ott
Memory obtained via mempool_alloc is not always zeroed even when called with __GFP_ZERO. Add a note and VM_BUG_ON statement to make that clear. Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm: convert some level-less printks to pr_*Mitchel Humpherys
printk is meant to be used with an associated log level. There are some instances of printk scattered around the mm code where the log level is missing. Add a log level and adhere to suggestions by scripts/checkpatch.pl by moving to the pr_* macros. Also add the typical pr_fmt definition so that print statements can be easily traced back to the modules where they occur, correlated one with another, etc. This will require the removal of some (now redundant) prefixes on a few print statements. Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm/huge_memory.c: complete conversion to pr_foo()Andrew Morton
It was using a mix of pr_foo() and printk(KERN_ERR ...). Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23thp: consolidate assert checks in __split_huge_page()Kirill A. Shutemov
It doesn't make sense to have two assert checks for each invariant: one for printing and one for BUG(). Let's trigger BUG() if we print error message. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23memblock: introduce memblock_alloc_range()Akinobu Mita
This introduces memblock_alloc_range() which allocates memblock from the specified range of physical address. I would like to use this function to specify the location of CMA. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Don Dutile <ddutile@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2014-04-23mm,vmacache: optimize overflow system-wide flushingDavidlohr Bueso
For single threaded workloads, we can avoid flushing and iterating through the entire list of tasks, making the whole function a lot faster, requiring only a single atomic read for the mm_users. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>