Age | Commit message (Collapse) | Author |
|
The "strictlimit" feature was introduced to enforce per-bdi dirty limits
for FUSE which sets bdi max_ratio to 1% by default:
http://article.gmane.org/gmane.linux.kernel.mm/105809
However the feature can be useful for other relatively slow or untrusted
BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable
the feature:
echo 1 > /sys/class/bdi/X:Y/strictlimit
Being enabled, the feature enforces bdi max_ratio limit even if global
(10%) dirty limit is not reached. Of course, the effect is not visible
until /sys/class/bdi/X:Y/max_ratio is decreased to some reasonable value.
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Artem S. Tashkinov" <t.artem@lycos.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
A file reference should be held while a file is mmaped, otherwise it might
be freed while being used.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
'populate' wasn't initialized before being used in error paths,
causing panics when mm_populate() would get called with invalid
values.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
fix spello
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Armin Rigo <arigo@tunes.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
remap_file_pages(2) was invented to be able efficiently map parts of huge
file into limited 32-bit virtual address space such as in database
workloads.
Nonlinear mappings are pain to support and it seems there's no legitimate
use-cases nowadays since 64-bit systems are widely available.
Let's drop it and get rid of all these special-cased code.
The patch replaces the syscall with emulation which creates new VMA on
each remap_file_pages(), unless they it can be merged with an adjacent
one.
I didn't find *any* real code that uses remap_file_pages(2) to test
emulation impact on. I've checked Debian code search and source of all
packages in ALT Linux. No real users: libc wrappers, mentions in strace,
gdb, valgrind and this kind of stuff.
There are few basic tests in LTP for the syscall. They work just fine
with emulation.
To test performance impact, I've written small test case which demonstrate
pretty much worst case scenario: map 4G shmfs file, write to begin of
every page pgoff of the page, remap pages in reverse order, read every
page.
The test creates 1 million of VMAs if emulation is in use, so I had to set
vm.max_map_count to 1100000 to avoid -ENOMEM.
Before: 23.3 ( +- 4.31% ) seconds
After: 43.9 ( +- 0.85% ) seconds
Slowdown: 1.88x
I believe we can live with that.
Test case:
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#define MB (1024UL * 1024)
#define SIZE (4096 * MB)
int main(int argc, char **argv)
{
unsigned long *p;
long i, pass;
for (pass = 0; pass < 10; pass++) {
p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}
for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;
for (i = 0; i < SIZE / 4096; i++) {
if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
perror("remap_file_pages");
return -1;
}
}
for (i = SIZE / 4096 - 1; i >= 0; i--)
assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
munmap(p, SIZE);
}
return 0;
}
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Armin Rigo <arigo@tunes.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Conflicts:
arch/x86/syscalls/syscall_64.tbl
|
|
|
|
|
|
|
|
|
|
|
|
kstrimdup() creates a whitespace-trimmed duplicate of the passed in
null-terminated string. This is useful for strings coming from sysfs that
often include trailing whitespace due to user input.
Thanks to Joe Perches for this implementation.
Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org>
Cc: Joe Perches <joe@perches.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
mmap_sem() is required to protect the vma, which holds ->vm_file, which
pins fault_page->mapping.
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Change zswap to use the zpool api instead of directly using zbud. Add a
boot-time param to allow selecting which zpool implementation to use, with
zbud as the default.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Weijie Yang <weijie.yang@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Update zbud and zsmalloc to implement the zpool api.
[fengguang.wu@intel.com: make functions static]
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Weijie Yang <weijie.yang@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add zpool api.
zpool provides an interface for memory storage, typically of compressed
memory. Users can select what backend to use; currently the only
implementations are zbud, a low density implementation with up to two
compressed pages per storage page, and zsmalloc, a higher density
implementation with multiple compressed pages per storage page.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Weijie Yang <weijie.yang@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Change the type of the zbud_alloc() size param from unsigned int
to size_t.
Technically, this should not make any difference, as the zbud
implementation already restricts the size to well within either type's
limits; but as zsmalloc (and kmalloc) use size_t, and zpool will use
size_t, this brings the size parameter type in line with zsmalloc/zpool.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Acked-by: Seth Jennings <sjennings@variantweb.net>
Cc: Weijie Yang <weijie.yang@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
zswap_entry_cache_destroy() is only called by __init init_zswap().
This patch also fixes function name zswap_entry_cache_ s/destory/destroy
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Seth Jennings <sjennings@variantweb.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
- remove unneeded `continue'
- expand the scope if the `if (isloated)' test, to optimise a code path
which is rarely actually taken.
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
add comment which may not be true :(
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
vmstat workers are used for folding counter differentials into the zone,
per node and global counters at certain time intervals. They currently
run at defined intervals on all processors which will cause some holdoff
for processors that need minimal intrusion by the OS.
The current vmstat_update mechanism depends on a deferrable timer firing
every other second by default which registers a work queue item that runs
on the local CPU, with the result that we have 1 interrupt and one
additional schedulable task on each CPU every 2 seconds If a workload
indeed causes VM activity or multiple tasks are running on a CPU, then
there are probably bigger issues to deal with.
However, some workloads dedicate a CPU for a single CPU bound task. This
is done in high performance computing, in high frequency financial
applications, in networking (Intel DPDK, EZchip NPS) and with the advent
of systems with more and more CPUs over time, this may become more and
more common to do since when one has enough CPUs one cares less about
efficiently sharing a CPU with other tasks and more about efficiently
monopolizing a CPU per task.
The difference of having this timer firing and workqueue kernel thread
scheduled per second can be enormous. An artificial test measuring the
worst case time to do a simple "i++" in an endless loop on a bare metal
system and under Linux on an isolated CPU with dynticks and with and
without this patch, have Linux match the bare metal performance (~700
cycles) with this patch and loose by couple of orders of magnitude (~200k
cycles) without it[*]. The loss occurs for something that just calculates
statistics. For networking applications, for example, this could be the
difference between dropping packets or sustaining line rate.
Statistics are important and useful, but it would be great if there would
be a way to not cause statistics gathering produce a huge performance
difference. This patche does just that.
This patch creates a vmstat shepherd worker that monitors the per cpu
differentials on all processors. If there are differentials on a
processor then a vmstat worker local to the processors with the
differentials is created. That worker will then start folding the diffs
in regular intervals. Should the worker find that there is no work to be
done then it will make the shepherd worker monitor the differentials
again.
With this patch it is possible then to have periods longer than
2 seconds without any OS event on a "cpu" (hardware thread).
The patch shows a very minor increased in system performance.
hackbench -s 512 -l 2000 -g 15 -f 25 -P
Results before the patch:
Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
Each sender will pass 2000 messages of 512 bytes
Time: 4.992
Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
Each sender will pass 2000 messages of 512 bytes
Time: 4.971
Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
Each sender will pass 2000 messages of 512 bytes
Time: 5.063
Hackbench after the patch:
Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
Each sender will pass 2000 messages of 512 bytes
Time: 4.973
Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
Each sender will pass 2000 messages of 512 bytes
Time: 4.990
Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
Each sender will pass 2000 messages of 512 bytes
Time: 4.993
Signed-off-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The gfp arg is not used in shmem_add_to_page_cache. Remove this unused
arg.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add a comment describing the circumstances in which __lock_page_or_retry()
will or will not release the mmap_sem when returning 0.
Add comments to lock_page_or_retry()'s callers (filemap_fault(),
do_swap_page()) noting the impact on VM_FAULT_RETRY returns.
Add comments on up the call tree, particularly replacing the false "We
return with mmap_sem still held" comments.
Signed-off-by: Paul Cassella <cassella@cray.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The fair zone allocation policy round-robins allocations between zones
within a node to avoid age inversion problems during reclaim. If the
first allocation fails, the batch counts are reset and a second attempt
made before entering the slow path.
One assumption made with this scheme is that batches expire at roughly the
same time and the resets each time are justified. This assumption does
not hold when zones reach their low watermark as the batches will be
consumed at uneven rates. Allocation failure due to watermark depletion
result in additional zonelist scans for the reset and another watermark
check before hitting the slowpath.
On UMA, the benefit is negligible -- around 0.25%. On 4-socket NUMA
machine it's variable due to the variability of measuring overhead with
the vmstat changes. The system CPU overhead comparison looks like
3.16.0-rc3 3.16.0-rc3 3.16.0-rc3
vanilla vmstat-v5 lowercost-v5
User 746.94 774.56 802.00
System 65336.22 32847.27 40852.33
Elapsed 27553.52 27415.04 27368.46
However it is worth noting that the overall benchmark still completed
faster and intuitively it makes sense to take as few passes as possible
through the zonelists.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
encountered
The purpose of numa_zonelist_order=zone is to preserve lower zones for use
with 32-bit devices. If locality is preferred then the
numa_zonelist_order=node policy should be used. Unfortunately, the fair
zone allocation policy overrides this by skipping zones on remote nodes
until the lower one is found. While this makes sense from a page aging
and performance perspective, it breaks the expected zonelist policy. This
patch restores the expected behaviour for zone-list ordering.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When kswapd is awake reclaiming, the per-cpu stat thresholds are lowered
to get more accurate counts to avoid breaching watermarks. This threshold
update iterates over all possible CPUs which is unnecessary. Only online
CPUs need to be updated. If a new CPU is onlined,
refresh_zone_stat_thresholds() will set the thresholds correctly.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
zone->pages_scanned is a write-intensive cache line during page reclaim
and it's also updated during page free. Move the counter into vmstat to
take advantage of the per-cpu updates and do not update it in the free
paths unless necessary.
On a small UMA machine running tiobench the difference is marginal. On a
4-node machine the overhead is more noticable. Note that automatic NUMA
balancing was disabled for this test as otherwise the system CPU overhead
is unpredictable.
3.16.0-rc3 3.16.0-rc3 3.16.0-rc3
vanillarearrange-v5 vmstat-v5
User 746.94 759.78 774.56
System 65336.22 58350.98 32847.27
Elapsed 27553.52 27282.02 27415.04
Note that the overhead reduction will vary depending on where exactly
pages are allocated and freed.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
reclaim lines
The arrangement of struct zone has changed over time and now it has
reached the point where there is some inappropriate sharing going on. On
x86-64 for example
o The zone->node field is shared with the zone lock and zone->node is
accessed frequently from the page allocator due to the fair zone
allocation policy.
o span_seqlock is almost never used by shares a line with free_area
o Some zone statistics share a cache line with the LRU lock so
reclaim-intensive and allocator-intensive workloads can bounce the cache
line on a stat update
This patch rearranges struct zone to put read-only and read-mostly fields
together and then splits the page allocator intensive fields, the zone
statistics and the page reclaim intensive fields into their own cache
lines. Note that the type of lowmem_reserve changes due to the watermark
calculations being signed and avoiding a signed/unsigned conversion there.
On the test configuration I used the overall size of struct zone shrunk by
one cache line. On smaller machines, this is not likely to be noticable.
However, on a 4-node NUMA machine running tiobench the system CPU overhead
is reduced by this patch.
3.16.0-rc3 3.16.0-rc3
vanillarearrange-v5r9
User 746.94 759.78
System 65336.22 58350.98
Elapsed 27553.52 27282.02
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This was formerly the series "Improve sequential read throughput" which
noted some major differences in performance of tiobench since 3.0. While
there are a number of factors, two that dominated were the introduction of
the fair zone allocation policy and changes to CFQ.
The behaviour of fair zone allocation policy makes more sense than
tiobench as a benchmark and CFQ defaults were not changed due to
insufficient benchmarking.
This series is what's left. It's one functional fix to the fair zone
allocation policy when used on NUMA machines and a reduction of overhead
in general. tiobench was used for the comparison despite its flaws as an
IO benchmark as in this case we are primarily interested in the overhead
of page allocator and page reclaim activity.
On UMA, it makes little difference to overhead
3.16.0-rc3 3.16.0-rc3
vanilla lowercost-v5
User 383.61 386.77
System 403.83 401.74
Elapsed 5411.50 5413.11
On a 4-socket NUMA machine it's a bit more noticable
3.16.0-rc3 3.16.0-rc3
vanilla lowercost-v5
User 746.94 802.00
System 65336.22 40852.33
Elapsed 27553.52 27368.46
This patch (of 6):
The LRU insertion and activate tracepoints take PFN as a parameter forcing
the overhead to the caller. Move the overhead to the tracepoint
fast-assign method to ensure the cost is only incurred when the tracepoint
is active.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
vm_total_pages is calculated by nr_free_pagecache_pages(), which counts
the number of pages which are beyond the high watermark within all zones.
So vm_total_pages is not equal to total number of pages which the VM
controls.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Otherwise we may not notice that pte was softdirty because
pte_mksoft_dirty helper _returns_ new pte but doesn't modify the argument.
In case if page fault happend on dirty filemapping the newly created pte
may loose softdirty bit thus if a userspace program is tracking memory
changes with help of a memory tracker (CONFIG_MEM_SOFT_DIRTY) it might
miss modification of a memory page (which in worts case may lead to
data inconsistency).
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently map_vm_area() takes (struct page *** pages) as third argument,
and after mapping, it moves (*pages) to point to (*pages +
nr_mappped_pages).
It looks like this kind of increment is useless to its caller these days.
The callers don't care about the increments and actually they're trying to
avoid this by passing another copy to map_vm_area().
The caller can always guarantee all the pages can be mapped into vm_area
as specified in first argument and the caller only cares about whether
map_vm_area() fails or not.
This patch cleans up the pointer movement in map_vm_area() and updates
its callers accordingly.
Signed-off-by: WANG Chao <chaowang@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 71e3aac0724f ("thp: transparent hugepage core") adds copy_pte_range
prototype to huge_mm.h. I'm not sure why (or if) this function have been
used outside of memory.c, but it currently isn't. This patch makes
copy_pte_range() static again.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
They are unnecessary: "zero" can be used in place of "hugetlb_zero" and
passing extra2 == NULL is equivalent to infinity.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Three different interfaces alter the maximum number of hugepages for an
hstate:
- /proc/sys/vm/nr_hugepages for global number of hugepages of the default
hstate,
- /sys/kernel/mm/hugepages/hugepages-X/nr_hugepages for global number of
hugepages for a specific hstate, and
- /sys/kernel/mm/hugepages/hugepages-X/nr_hugepages/mempolicy for number of
hugepages for a specific hstate over the set of allowed nodes.
Generalize the code so that a single function handles all of these writes
instead of duplicating the code in two different functions.
This decreases the number of lines of code, but also reduces the size of
.text by about half a percent since set_max_huge_pages() can be inlined.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When a hwpoison page is locked it could change state due to parallel
modifications. The original compound page can be torn down and then this
4k page becomes part of a differently-size compound page is is a
standalone regular page.
Check after the lock if the page is still the same compound page.
We could go back, grab the new head page and try again but it should be
quite rare, so I thought this was safest. A retry loop would be more
difficult to test and may have more side effects.
The hwpoison code by design only tries to handle cases that are reasonably
common in workloads, as visible in page-flags.
I'm not really that concerned about handling this (likely rare case), just
not crashing on it.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When returning from hugetlb_cow(), we always (1) put back the refcount for
each referenced page -- always 'old', and 'new' if allocation was
successful. And (2) retake the page table lock right before returning, as
the callers expects. This logic can be simplified and encapsulated, as
proposed in this patch. In addition to cleaner code, we also shave a few
bytes off the instruction text:
text data bss dec hex filename
28399 462 41328 70189 1122d mm/hugetlb.o-baseline
28367 462 41328 70157 1120d mm/hugetlb.o-patched
Passes libhugetlbfs testcases.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This function always returns 1, thus no need to check return value in
hugetlb_cow(). By doing so, we can get rid of the unnecessary WARN_ON
call. While this logic perhaps existed as a way of identifying future
unmap_ref_private() mishandling, reality is it serves no apparent purpose.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Do we really need an exported alias for __SetPageReferenced()?
Its callers better know what they're doing, in which case the page
would not be already marked referenced. Kill init_page_accessed(),
just __SetPageReferenced() inline.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
debugfs_remove_recursive
Fix checkpatch warning:
"WARNING: debugfs_remove_recursive(NULL) is safe this check is probably not required"
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Historically, we exported shared pages to userspace via sysinfo(2)
sharedram and /proc/meminfo's "MemShared" fields. With the advent of
tmpfs, from kernel v2.4 onward, that old way for accounting shared mem was
deemed inaccurate and we started to export a hard-coded 0 for
sysinfo.sharedram. Later on, during the 2.6 timeframe, "MemShared" got
re-introduced to /proc/meminfo re-branded as "Shmem", but we're still
reporting sysinfo.sharedmem as that old hard-coded zero, which makes the
"shared memory" report inconsistent across interfaces.
This patch leverages the addition of explicit accounting for pages used by
shmem/tmpfs -- "4b02108 mm: oom analysis: add shmem vmstat" -- in order to
make the users of sysinfo(2) and si_meminfo*() friends aware of that
vmstat entry and make them report it consistently across the interfaces,
as well to make sysinfo(2) returned data consistent with our current API
documentation states.
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use VM_WARN_ONCE, per Dave.
Cc: Hugh Dickins <hughd@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Print a warning (if CONFIG_DEBUG_VM=y) when memory commitment becomes too
negative.
This shouldn't happen any more - the previous two patches fixed the
committed_as underflow issues.
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
A shared anonymous mapping created without MAP_NORESERVE holds memory
reservation for whole range of shmem segment. Usually there is no way to
change its size, but /proc/<pid>/map_files/... (available if
CONFIG_CHECKPOINT_RESTORE=y) allows that.
This patch adjusts the memory reservation in shmem_setattr().
exploit:
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>
int main(int argc, char **argv)
{
unsigned long addr;
char path[100];
/* charge 4KiB */
addr = (unsigned long)mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0);
sprintf(path, "/proc/self/map_files/%lx-%lx", addr, addr + 4096);
truncate(path, 1 << 30);
/* uncharge 1GiB */
}
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If __shmem_file_setup() fails on struct file allocation it uncharges
memory commitment twice: first by shmem_unacct_size() and second time
implicitly in shmem_evict_inode() when it kills the newly created inode.
This patch removes shmem_unacct_size() from error path if the inode was
already there.
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
tmp_mask in the __vmalloc_area_node() iteration never changes so it can be
moved into function scope and marked with const. This causes the movl and
orl to only be done once per call rather than area->nr_pages times.
nested_gfp can also be marked const.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
only do it for __GFP_WAIT, per David
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It is not uncommon on busy servers to get stuck hundred of ms in vmalloc()
calls (like file descriptor expansions).
Add a cond_resched() to __vmalloc_area_node() to be gentle to
other tasks.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|