Age | Commit message (Collapse) | Author |
|
|
|
|
|
|
|
|
|
|
|
kstrimdup() creates a whitespace-trimmed duplicate of the passed in
null-terminated string. This is useful for strings coming from sysfs that
often include trailing whitespace due to user input.
Thanks to Joe Perches for this implementation.
Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org>
Cc: Joe Perches <joe@perches.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
mmap_sem() is required to protect the vma, which holds ->vm_file, which
pins fault_page->mapping.
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
ormal-checkpatch-fixes
ERROR: code indent should use tabs where possible
#66: FILE: mm/vmscan.c:2585:
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,$
WARNING: please, no spaces at the start of a line
#66: FILE: mm/vmscan.c:2585:
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,$
ERROR: code indent should use tabs where possible
#67: FILE: mm/vmscan.c:2586:
+ gfp_mask, nodemask) {$
WARNING: please, no spaces at the start of a line
#67: FILE: mm/vmscan.c:2586:
+ gfp_mask, nodemask) {$
total: 2 errors, 2 warnings, 56 lines checked
NOTE: whitespace errors detected, you may wish to use scripts/cleanpatch or
scripts/cleanfile
./patches/mm-vmscan-do-not-throttle-based-on-pfmemalloc-reserves-if-node-has-no-zone_normal.patch has style problems, please review.
If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
ZONE_NORMAL
throttle_direct_reclaim() is meant to trigger during swap-over-network
during which the min watermark is treated as a pfmemalloc reserve. It
throttes on the first node in the zonelist but this is flawed.
On a NUMA machine running a 32-bit kernel (I know) allocation requests
freom CPUs on node 1 would detect no pfmemalloc reserves and the process
gets throttled. This patch adjusts throttling of direct reclaim to
throttle based on the first node in the zonelist that has a usable
ZONE_NORMAL or lower zone.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Remove the first mapping check for vma_link. Move the mutex_lock into the
braces when vma->vm_file is true.
Signed-off-by: Huang Shijie <b32955@freescale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In mm/swap.c, __lru_cache_add() is exported, but actually there are no
users outside this file.
This patch unexports __lru_cache_add(), and makes it static. It also
exports lru_cache_add_file(), as it is use by cifs and fuse, which can
loaded as modules.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
WARNING: please, no spaces at the start of a line
#29: FILE: mm/memcontrol.c:689:
+ int nid = zone_to_nid(zone);$
WARNING: please, no spaces at the start of a line
#30: FILE: mm/memcontrol.c:690:
+ int zid = zone_idx(zone);$
WARNING: please, no spaces at the start of a line
#32: FILE: mm/memcontrol.c:692:
+ return mem_cgroup_zoneinfo(memcg, nid, zid);$
total: 0 errors, 3 warnings, 35 lines checked
./patches/mm-memcontrolc-introduce-helper-mem_cgroup_zoneinfo_zone.patch has style problems, please review.
If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Introduce helper mem_cgroup_zoneinfo_zone(). This makes mem_cgroup_iter()
code more compact.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Nobody outputs memory addresses in decimal. PFNs are essentially
addresses, and they're gibberish in decimal. Output them in hex.
Also, add the nid and zone name to give a little more context to the
message.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Cleanup detection of compaction scanners crossing in isolate_freepages().
To make sure compact_finished() observes scanners crossing, we can just
set free_pfn to migrate_pfn instead of confusing max() construct.
Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dongjun Shin <d.j.shin@samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
comment fixes, per Minchan
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dongjun Shin <d.j.shin@samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
isolate_freepages() is currently somewhat hard to follow thanks to many
looks like it is related to the 'low_pfn' variable, but in fact it is not.
This patch renames the 'high_pfn' variable to a hopefully less confusing name,
and slightly changes its handling without a functional change. A comment made
obsolete by recent changes is also updated.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dongjun Shin <d.j.shin@samsung.com>
Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Remove code lines currently not in use or never called.
Signed-off-by: Heesub Shin <heesub.shin@samsung.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dongjun Shin <d.j.shin@samsung.com>
Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dongjun Shin <d.j.shin@samsung.com>
Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
fix x86_64 allnoconfig build
mm/gup.c: In function 'follow_page_pte':
mm/gup.c:100: error: implicit declaration of function 'trylock_page'
mm/gup.c:109: error: implicit declaration of function 'unlock_page'
mm/gup.c: In function '__get_user_pages':
mm/gup.c:467: error: implicit declaration of function 'flush_anon_page'
mm/gup.c:468: error: implicit declaration of function 'flush_dcache_page'
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
- include some more header files, but many are still missed
- fix some 80-col overflows by removing unneeded `inline'
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Get rid of two nested loops over nr_pages and other random cleanups.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
s/BUILD_BUG/BUG/
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Nesting level in __get_user_pages() is just insane. Let's try to fix it
a bit.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Cleanups:
- move pte-related code to separate function. It's about half of the
function;
- get rid of some goto-logic;
- use 'return NULL' instead of 'return page' where page can only be
NULL;
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The case is special and disturb from reading main __get_user_pages()
code path. Let's move it to separate function.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
mm/memory.c is overloaded: over 4k lines. get_user_pages() code is
pretty much self-contained let's move it to separate file.
No other changes made.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
For the MIGRATE_RESERVE pages, it is important they do not get misplaced
on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
pageblock might be changed to other migratetype in
try_to_steal_freepages(). For MIGRATE_CMA, the pages also must not go to
a different free_list, otherwise they could get allocated as unmovable and
result in CMA failure.
This is ensured by setting the freepage_migratetype appropriately when
placing pages on pcp lists, and using the information when releasing them
back to free_list. It is also assumed that CMA and RESERVE pageblocks are
created only in the init phase. This patch adds DEBUG_VM checks to catch
any regressions introduced for this invariant.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Yong-Taek Lee <ytk.lee@samsung.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
For the MIGRATE_RESERVE pages, it is important they do not get misplaced
on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
pageblock might be changed to other migratetype in
try_to_steal_freepages().
Currently, it is however possible for this to happen when MIGRATE_RESERVE
page is allocated on pcplist through rmqueue_bulk() as a fallback for
other desired migratetype, and then later freed back through
free_pcppages_bulk() without being actually used. This happens because
free_pcppages_bulk() uses get_freepage_migratetype() to choose the
free_list, and rmqueue_bulk() calls set_freepage_migratetype() with the
*desired* migratetype and not the page's original MIGRATE_RESERVE
migratetype.
This patch fixes the problem by moving the call to
set_freepage_migratetype() from rmqueue_bulk() down to
__rmqueue_smallest() and __rmqueue_fallback() where the actual page's
migratetype (e.g. from which free_list the page is taken from) is used.
Note that this migratetype might be different from the pageblock's
migratetype due to freepage stealing decisions. This is OK, as page
stealing never uses MIGRATE_RESERVE as a fallback, and also takes care to
leave all MIGRATE_CMA pages on the correct freelist.
Therefore, as an additional benefit, the call to
get_pageblock_migratetype() from rmqueue_bulk() when CMA is enabled, can
be removed completely. This relies on the fact that MIGRATE_CMA
pageblocks are created only during system init, and the above. The
related is_migrate_isolate() check is also unnecessary, as memory
isolation has other ways to move pages between freelists, and drain pcp
lists containing pages that should be isolated. The buffered_rmqueue()
can also benefit from calling get_freepage_migratetype() instead of
get_pageblock_migratetype().
A separate patch will add VM_BUG_ON checks for the invariant that for
MIGRATE_RESERVE and MIGRATE_CMA pageblocks, freepage_migratetype must
equal to pageblock_migratetype so that these pages always go to the
correct free_list.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When we create a sl[au]b cache, we allocate kmem_cache_node structures for
each online NUMA node. To handle nodes taken online/offline, we register
memory hotplug notifier and allocate/free kmem_cache_node corresponding to
the node that changes its state for each kmem cache.
To synchronize between the two paths we hold the slab_mutex during both
the cache creationg/destruction path and while tuning per-node parts of
kmem caches in memory hotplug handler, but that's not quite right, because
it does not guarantee that a newly created cache will have all
kmem_cache_nodes initialized in case it races with memory hotplug. For
instance, in case of slub:
CPU0 CPU1
---- ----
kmem_cache_create: online_pages:
__kmem_cache_create: slab_memory_callback:
slab_mem_going_online_callback:
lock slab_mutex
for each slab_caches list entry
allocate kmem_cache node
unlock slab_mutex
lock slab_mutex
init_kmem_cache_nodes:
for_each_node_state(node, N_NORMAL_MEMORY)
allocate kmem_cache node
add kmem_cache to slab_caches list
unlock slab_mutex
online_pages (continued):
node_states_set_node
As a result we'll get a kmem cache with not all kmem_cache_nodes
allocated.
To avoid issues like that we should hold get/put_online_mems() during the
whole kmem cache creation/destruction/shrink paths, just like we deal with
cpu hotplug. This patch does the trick.
Note, that after it's applied, there is no need in taking the slab_mutex
for kmem_cache_shrink any more, so it is removed from there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
kmem_cache_{create,destroy,shrink} need to get a stable value of cpu/node
online mask, because they init/destroy/access per-cpu/node kmem_cache
parts, which can be allocated or destroyed on cpu/mem hotplug. To protect
against cpu hotplug, these functions use {get,put}_online_cpus. However,
they do nothing to synchronize with memory hotplug - taking the slab_mutex
does not eliminate the possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory. We
already have lock_memory_hotplug, which serves for the purpose, but it's a
bit of a hammer right now, because it's backed by a mutex. As a result,
it imposes some limitations to locking order, which are not desirable, and
can't be used just like get_online_cpus. That's why in patch 1 I
substitute it with get/put_online_mems, which work exactly like
get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems, making
them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a hammer
- threads that only want to get a stable value of online nodes mask won't
be able to proceed concurrently. Also, it imposes some strong locking
ordering rules on it, which narrows down the set of its usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code inside
a get/put_online_mems section will guarantee a stable value of online
nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It is only used in slab and should not be used anywhere else so there is
no need in exporting it.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
restore a hunk which got rejected and not fixed
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by
zone_reclaim due to its distance. As it is expected that zone_reclaim_mode
will be rarely enabled it is unreasonable for all machines to take a penalty.
Fortunately, the zone_reclaim_mode() path is already slow and it is the path
that takes the hit.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When it was introduced, zone_reclaim_mode made sense as NUMA distances
punished and workloads were generally partitioned to fit into a NUMA node.
NUMA machines are now common but few of the workloads are NUMA-aware and
it's routine to see major performance degradation due to zone_reclaim_mode
being enabled but relatively few can identify the problem.
Those that require zone_reclaim_mode are likely to be able to detect when
it needs to be enabled and tune appropriately so lets have a sensible
default for the bulk of users.
This patch (of 2):
zone_reclaim_mode causes processes to prefer reclaiming memory from local
node instead of spilling over to other nodes. This made sense initially when
NUMA machines were almost exclusively HPC and the workload was partitioned
into nodes. The NUMA penalties were sufficiently high to justify reclaiming
the memory. On current machines and workloads it is often the case that
zone_reclaim_mode destroys performance but not all users know how to detect
this. Favour the common case and disable it by default. Users that are
sophisticated enough to know they need zone_reclaim_mode will detect it.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
WARNING: braces {} are not necessary for any arm of this statement
#282: FILE: mm/hugetlb.c:1650:
+ if (hstate_is_gigantic(h)) {
[...]
+ } else {
[...]
total: 0 errors, 1 warnings, 219 lines checked
./patches/hugetlb-add-support-for-gigantic-page-allocation-at-runtime.patch has style problems, please review.
If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
HugeTLB is limited to allocating hugepages whose size are less than
MAX_ORDER order. This is so because HugeTLB allocates hugepages via the
buddy allocator. Gigantic pages (that is, pages whose size is greater
than MAX_ORDER order) have to be allocated at boottime.
However, boottime allocation has at least two serious problems. First, it
doesn't support NUMA and second, gigantic pages allocated at boottime
can't be freed.
This commit solves both issues by adding support for allocating gigantic
pages during runtime. It works just like regular sized hugepages, meaning
that the interface in sysfs is the same, it supports NUMA, and gigantic
pages can be freed.
For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
gigantic pages on node 1, one can do:
# echo 2 > \
/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
And to free them all:
# echo 0 > \
/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
The one problem with gigantic page allocation at runtime is that it can't
be serviced by the buddy allocator. To overcome that problem, this commit
scans all zones from a node looking for a large enough contiguous region.
When one is found, it's allocated by using CMA, that is, we call
alloc_contig_range() to do the actual allocation. For example, on x86_64
we scan all zones looking for a 1GB contiguous region. When one is found,
it's allocated by alloc_contig_range().
One expected issue with that approach is that such gigantic contiguous
regions tend to vanish as runtime goes by. The best way to avoid this for
now is to make gigantic page allocations very early during system boot,
say from a init script. Other possible optimization include using
compaction, which is supported by CMA but is not explicitly used by this
commit.
It's also important to note the following:
1. Gigantic pages allocated at boottime by the hugepages= command-line
option can be freed at runtime just fine
2. This commit adds support for gigantic pages only to x86_64. The
reason is that I don't have access to nor experience with other archs.
The code is arch indepedent though, so it should be simple to add
support to different archs
3. I didn't add support for hugepage overcommit, that is allocating
a gigantic page on demand when
/proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
think it's reasonable to do the hard and long work required for
allocating a gigantic page at fault time. But it should be simple
to add this if wanted
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Next commit will add new code which will want to call
for_each_node_mask_to_alloc() macro. Move it, its buddy
for_each_node_mask_to_free() and their dependencies up in the file so the
new code can use them. This is just code movement, no logic change.
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Hugepages pages never get the PG_reserved bit set, so don't clear it.
However, note that if the bit gets mistakenly set free_pages_check() will
catch it.
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The HugeTLB subsystem uses the buddy allocator to allocate hugepages
during runtime. This means that hugepages allocation during runtime is
limited to MAX_ORDER order. For archs supporting gigantic pages (that is,
page sizes greater than MAX_ORDER), this in turn means that those pages
can't be allocated at runtime.
HugeTLB supports gigantic page allocation during boottime, via the boot
allocator. To this end the kernel provides the command-line options
hugepagesz= and hugepages=, which can be used to instruct the kernel to
allocate N gigantic pages during boot.
For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages
can be allocated and freed at runtime. If one wants to allocate 1G
gigantic pages, this has to be done at boot via the hugepagesz= and
hugepages= command-line options.
Now, gigantic page allocation at boottime has two serious problems:
1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel
evenly distributes boottime allocated hugepages among nodes.
For example, suppose you have a four-node NUMA machine and want
to allocate four 1G gigantic pages at boottime. The kernel will
allocate one gigantic page per node.
On the other hand, we do have users who want to be able to specify
which NUMA node gigantic pages should allocated from. So that they
can place virtual machines on a specific NUMA node.
2. Gigantic pages allocated at boottime can't be freed
At this point it's important to observe that regular hugepages allocated
at runtime don't have those problems. This is so because HugeTLB
interface for runtime allocation in sysfs supports NUMA and runtime
allocated pages can be freed just fine via the buddy allocator.
This series adds support for allocating gigantic pages at runtime. It
does so by allocating gigantic pages via CMA instead of the buddy
allocator. Releasing gigantic pages is also supported via CMA. As this
series builds on top of the existing HugeTLB interface, it makes gigantic
page allocation and releasing just like regular sized hugepages. This
also means that NUMA support just works.
For example, to allocate two 1G gigantic pages on node 1, one can do:
# echo 2 > \
/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
And, to release all gigantic pages on the same node:
# echo 0 > \
/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
Please, refer to patch 5/5 for full technical details.
Finally, please note that this series is a follow up for a previous series
that tried to extend the command-line options set to be NUMA aware:
http://marc.info/?l=linux-mm&m=139593335312191&w=2
During the discussion of that series it was agreed that having runtime
allocation support for gigantic pages was a better solution.
This patch (of 5):
This function is going to be used by non-init code in a future
commit.
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Fix a coccinelle error regarding usage of IS_ERR and PTR_ERR instead of
PTR_ERR_OR_ZERO.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Per-memcg swappiness and oom killing can currently not be tweaked on a
memcg that is part of a hierarchy, but not the root of that hierarchy.
Users have complained that they can't configure this when they turned on
hierarchy mode. In fact, with hierarchy mode becoming the default, this
restriction disables the tunables entirely.
But there is no good reason for this restriction. The settings for
swappiness and OOM killing are taken from whatever memcg whose limit
triggered reclaim and OOM invocation, regardless of its position in the
hierarchy tree.
Allow setting swappiness on any group. The knob on the root memcg already
reads the global VM swappiness, make it writable as well.
Allow disabling the OOM killer on any non-root memcg.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
use VM_WARN_ON_ONCE
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Memory obtained via mempool_alloc is not always zeroed even when
called with __GFP_ZERO. Add a note and VM_BUG_ON statement to make
that clear.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
printk is meant to be used with an associated log level. There are some
instances of printk scattered around the mm code where the log level is
missing. Add a log level and adhere to suggestions by
scripts/checkpatch.pl by moving to the pr_* macros.
Also add the typical pr_fmt definition so that print statements can be
easily traced back to the modules where they occur, correlated one with
another, etc. This will require the removal of some (now redundant)
prefixes on a few print statements.
Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It was using a mix of pr_foo() and printk(KERN_ERR ...).
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It doesn't make sense to have two assert checks for each invariant: one
for printing and one for BUG().
Let's trigger BUG() if we print error message.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This introduces memblock_alloc_range() which allocates memblock from the
specified range of physical address. I would like to use this function to
specify the location of CMA.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
For single threaded workloads, we can avoid flushing and iterating through
the entire list of tasks, making the whole function a lot faster,
requiring only a single atomic read for the mm_users.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|