Age | Commit message (Collapse) | Author |
|
Add madvise MADV_NOHUGEPAGE to mark regions that are not important to be
hugepage backed. Return -EINVAL if the vma is not of an anonymous type, or the
feature isn't built into the kernel. Never silently return success.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
Define MADV_NOHUGEPAGE.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
Read compound_trans_order safe. Noop for CONFIG_TRANSPARENT_HUGEPAGE=n.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
hugetlbfs was changed to allow memory failure to migrate the hugetlbfs pages
and that broke THP as split_huge_page was then called on hugetlbfs pages too.
compound_head/order was also run unsafe on THP pages that can be splitted at
any time.
All compound_head() invocations in memory-failure.c that are run on pages that
aren't pinned and that can be freed and reused from under us (while
compound_head is running) are buggy because compound_head can return a dangling
pointer, but I'm not fixing this as this is a generic memory-failure bug not
specific to THP but it applies to hugetlbfs too, so I can fix it later after
THP is merged upstream.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
Add debug checks for invariants that if broken could lead to mapcount vs
page_mapcount debug checks to trigger later in split_huge_page.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
Add hugepage statistics to per-node sysfs meminfo
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
Make sure we scale up nr_rotated when we encounter a referenced
transparent huge page. This ensures pageout scanning balance
is not distorted when there are huge pages on the LRU.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU
statistics, so the Active(anon) and Inactive(anon) statistics in
/proc/meminfo are correct.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
On small systems, the extra memory used by the anti-fragmentation
memory reserve and simply because huge pages are smaller than large
pages can easily outweigh the benefits of less TLB misses.
A less obvious concern is if run on a NUMA machine with asymmetric node sizes
and one of them is very small. The reserve could make the node unusable.
In case of the crashdump kernel, OOMs have been observed due to
the anti-fragmentation memory reserve taking up a large fraction
of the crashdump image.
This patch disables transparent hugepages on systems with less
than 1GB of RAM, but the hugepage subsystem is fully initialized
so administrators can enable THP through /sys if desired.
v2: reduce the limit to 512MB
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Avi Kiviti <avi@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
|
|
It makes no sense not to enable compaction for small order pages as we don't
want to end up with bad order 2 allocations and good and graceful order 9
allocations.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
This takes advantage of memory compaction to properly generate pages of order >
0 if regular page reclaim fails and priority level becomes more severe and we
don't reach the proper watermarks.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
It's unclear why schedule friendly kernel threads can't be taken away by the
CPU through the scheduler itself. It's safer to stop them as they can trigger
memory allocation, if kswapd also freezes itself to avoid generating I/O they
have too.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
For GRU and EPT, we need gup-fast to set referenced bit too (this is why it's
correct to return 0 when shadow_access_mask is zero, it requires gup-fast to
set the referenced bit). qemu-kvm access already sets the young bit in the pte
if it isn't zero-copy, if it's zero copy or a shadow paging EPT minor fault we
relay on gup-fast to signal the page is in use...
We also need to check the young bits on the secondary pagetables for NPT and
not nested shadow mmu as the data may never get accessed again by the primary
pte.
Without this closer accuracy, we'd have to remove the heuristic that avoids
collapsing hugepages in hugepage virtual regions that have not even a single
subpage in use.
->test_young is full backwards compatible with GRU and other usages that don't
have young bits in pagetables set by the hardware and that should nuke the
secondary mmu mappings when ->clear_flush_young runs just like EPT does.
Removing the heuristic that checks the young bit in
khugepaged/collapse_huge_page completely isn't so bad either probably but I
thought it was worth it and this makes it reliable.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
Archs implementing Transparent Hugepage Support must implement a
function called has_transparent_hugepage to be sure the virtual or
physical CPU supports Transparent Hugepages.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
An huge pmd can only be mapped if the corresponding 2M virtual range
is fully contained in the vma. At times the VM calls split_vma twice,
if the first split_vma succeeds and the second fail, the first
split_vma remains in effect and it's not rolled back. For split_vma or
vma_adjust to fail an allocation failure is needed so it's a very
unlikely event (the out of memory killer would normally fire before
any allocation failure is visible to kernel and userland and if an out
of memory condition happens it's unlikely to happen exactly
here). Nevertheless it's safer to ensure that no huge pmd can be left
around if the vma is adjusted in a way that can't fit hugepages
anymore at the new vm_start/vm_end address.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
It's not worth migrating transparent hugepages during compaction. Those
hugepages don't create fragmentation.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
|
|
With transparent hugepage support we need compaction for the "defrag" sysfs
controls to be effective.
At the moment THP hangs the system if COMPACTION isn't selected, as without
COMPACTION lumpy reclaim wouldn't be entirely disabled. So at the moment it's
not orthogonal. When lumpy will be removed from the VM I can remove the select
COMPACTION in theory, but then 99% of THP users would be still doing a mistake
in disabling compaction, even if the mistake won't return in fatal runtime but
just slightly degraded performance. So from a theoretical standpoing forcing
the below select is not needed (the dependency isn't strict nor at compile time
nor at runtime) but from a practical standpoint it is safer.
If anybody really wants THP to run without compaction, it'd be such a weird
setup that editing the Kconfig file to allow it will be surely not a problem.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
|
|
Allow to choose between the always|madvise default for page faults and
khugepaged at config time. madvise guarantees zero risk of higher memory
footprint for applications (applications using madvise(MADV_HUGEPAGE) won't
risk to use any more memory by backing their virtual regions with hugepages).
Initially set the default to N and don't depend on EMBEDDED.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
This tries to be more friendly to filesystem in userland, with userland
backends that allocate memory in the I/O paths and that could deadlock if
khugepaged holds the mmap_sem write mode of the userland backend while
allocating memory. Memory allocation may wait for writeback I/O completion from
the daemon that may be blocked in the mmap_sem read mode if a page fault
happens and the daemon wasn't using mlock for the memory required for the I/O
submission and completion.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
It's mostly a matter of replacing alloc_pages with alloc_pages_vma after
introducing alloc_pages_vma. khugepaged needs special handling as the
allocation has to happen inside collapse_huge_page where the vma is known and
an error has to be returned to the outer loop to sleep alloc_sleep_millisecs in
case of failure. But it retains the more efficient logic of handling allocation
failures in khugepaged in case of CONFIG_NUMA=n.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
With memory compaction in, and lumpy-reclaim disabled, it seems safe enough to
defrag memory during the (synchronous) transparent hugepage page faults
(TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) and not only during khugepaged (async)
hugepage allocations that was already enabled even before memory compaction was
in (TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
|
|
If transparent hugepage is enabled initialize min_free_kbytes to an optimal
value by default. This moves the hugeadm algorithm in kernel.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
Natively handle huge pmds when changing page tables on behalf of
mprotect().
I left out update_mmu_cache() because we do not need it on x86 anyway
but more importantly the interface works on ptes, not pmds.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
|
|
Flushing the tlb for huge pmds requires the vma's anon_vma,
so pass along the vma instead of the mm, we can always get the latter
when we need it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
|
|
Add pmd_modify() for use with mprotect() on huge pmds.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
|
|
Handle transparent huge page pmd entries natively instead of splitting
them into subpages.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
|
|
Add support for transparent hugepages to x86 32bit.
Share the same VM_ bitflag for VM_MAPPED_COPY. mm/nommu.c will never support
transparent hugepages.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
|
|
PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can be
added to page->flags without overflowing (because of the sparse section bits
increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also has to move
the memory hotplug code from _mapcount to lru.next to avoid any risk of
clashes. We can't use lru.next for PG_buddy removal, but memory hotplug can use
lru.next even more easily than the mapcount instead.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
Skip transhuge pages in ksm for now.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
|
|
register in khugepaged if the vma grows.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
Add khugepaged to relocate fragmented pages into hugepages if new hugepages
become available. (this is indipendent of the defrag logic that will have to
make new hugepages available)
The fundamental reason why khugepaged is unavoidable, is that some
memory can be fragmented and not everything can be relocated. So when
a virtual machine quits and releases gigabytes of hugepages, we want
to use those freely available hugepages to create huge-pmd in the
other virtual machines that may be running on fragmented memory, to
maximize the CPU efficiency at all times. The scan is slow, it takes
nearly zero cpu time, except when it copies data (in which case it
means we definitely want to pay for that cpu time) so it seems a good
tradeoff.
In addition to the hugepages being released by other process releasing memory,
we have the strong suspicion that the performance impact of potentially
defragmenting hugepages during or before each page fault could lead to more
performance inconsistency than allocating small pages at first and having them
collapsed into large pages later... if they prove themselfs to be long lived
mappings (khugepaged scan is slow so short lived mappings have low probability
to run into khugepaged if compared to long lived mappings).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
|
|
Add hugepage stat information to /proc/vmstat and /proc/meminfo.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
|
|
Add memcg charge/uncharge to hugepage faults in huge_memory.c.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
|
|
By this patch, when a transparent hugepage is charged, not only the head page but
also all the tail pages are committed, IOW pc->mem_cgroup and pc->flags of tail
pages are set.
Without this patch:
- Tail pages are not linked to any memcg's LRU at splitting. This causes many
problems, for example, the charged memcg's directory can never be rmdir'ed
because it doesn't have enough pages to scan to make the usage decrease to 0.
- "rss" field in memory.stat would be incorrect. Moreover, usage_in_bytes in
root cgroup is calculated by the stat not by res_counter(since 2.6.32),
it would be incorrect too.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
Teach memcg to charge/uncharge compound pages.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
|
|
No pmd_trans_huge should ever materialize in migration ptes areas, because
we split the hugepage before migration ptes are instantiated.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
|
|
Remove branches when CONFIG_TRANSPARENT_HUGEPAGE=n.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
|
|
Add madvise MADV_HUGEPAGE to mark regions that are important to be hugepage
backed. Return -EINVAL if the vma is not of an anonymous type, or the feature
isn't built into the kernel. Never silently return success.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
|
|
pte_trans_huge must not leak in certain vmas like the mmio special pfn or
filebacked mappings.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
|
|
This documents how split_huge_page is safe vs new vma inserctions into
the anon_vma that may have already released the anon_vma->lock but not
established pmds yet when split_huge_page starts.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
|
|
Lately I've been working to make KVM use hugepages transparently
without the usual restrictions of hugetlbfs. Some of the restrictions
I'd like to see removed:
1) hugepages have to be swappable or the guest physical memory remains
locked in RAM and can't be paged out to swap
2) if a hugepage allocation fails, regular pages should be allocated
instead and mixed in the same vma without any failure and without
userland noticing
3) if some task quits and more hugepages become available in the
buddy, guest physical memory backed by regular pages should be
relocated on hugepages automatically in regions under
madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
not null)
4) avoidance of reservation and maximization of use of hugepages whenever
possible. Reservation (needed to avoid runtime fatal faliures) may be ok for
1 machine with 1 database with 1 database cache with 1 database cache size
known at boot time. It's definitely not feasible with a virtualization
hypervisor usage like RHEV-H that runs an unknown number of virtual machines
with an unknown size of each virtual machine with an unknown amount of
pagecache that could be potentially useful in the host for guest not using
O_DIRECT (aka cache=off).
hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization, becasue
with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
case only the hypervisor uses transparent hugepages, and they decrease the
tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and
the linux guest both uses this patch (though the guest will limit the addition
speedup to anonymous regions only for now...). Even more important is that the
tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow
paging or no-virtualization scenario. So maximizing the amount of virtual
memory cached by the TLB pays off significantly more with NPT/EPT than without
(even if there would be no significant speedup in the tlb-miss runtime).
The first (and more tedious) part of this work requires allowing the VM to
handle anonymous hugepages mixed with regular pages transparently on regular
anonymous vmas. This is what this patch tries to achieve in the least intrusive
possible way. We want hugepages and hugetlb to be used in a way so that all
applications can benefit without changes (as usual we leverage the KVM
virtualization design: by improving the Linux VM at large, KVM gets the
performance boost too).
The most important design choice is: always fallback to 4k allocation
if the hugepage allocation fails! This is the _very_ opposite of some
large pagecache patches that failed with -EIO back then if a 64k (or
similar) allocation failed...
Second important decision (to reduce the impact of the feature on the
existing pagetable handling code) is that at any time we can split an
hugepage into 512 regular pages and it has to be done with an
operation that can't fail. This way the reliability of the swapping
isn't decreased (no need to allocate memory when we are short on
memory to swap) and it's trivial to plug a split_huge_page* one-liner
where needed without polluting the VM. Over time we can teach
mprotect, mremap and friends to handle pmd_trans_huge natively without
calling split_huge_page*. The fact it can't fail isn't just for swap:
if split_huge_page would return -ENOMEM (instead of the current void)
we'd need to rollback the mprotect from the middle of it (ideally
including undoing the split_vma) which would be a big change and in
the very wrong direction (it'd likely be simpler not to call
split_huge_page at all and to teach mprotect and friends to handle
hugepages instead of rolling them back from the middle). In short the
very value of split_huge_page is that it can't fail.
The collapsing and madvise(MADV_HUGEPAGE) part will remain separated
and incremental and it'll just be an "harmless" addition later if this
initial part is agreed upon. It also should be noted that locking-wise
replacing regular pages with hugepages is going to be very easy if
compared to what I'm doing below in split_huge_page, as it will only
happen when page_count(page) matches page_mapcount(page) if we can
take the PG_lock and mmap_sem in write mode. collapse_huge_page will
be a "best effort" that (unlike split_huge_page) can fail at the
minimal sign of trouble and we can try again later. collapse_huge_page
will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will
work similar to madvise(MADV_MERGEABLE).
The default I like is that transparent hugepages are used at page fault time.
This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The
control knob can be set to three values "always", "madvise", "never" which
mean respectively that hugepages are always used, or only inside
madvise(MADV_HUGEPAGE) regions, or never used.
/sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage
allocation should defrag memory aggressively "always", only inside "madvise"
regions, or "never".
The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
put_page (from get_user_page users that can't use mmu notifier like
O_DIRECT) that runs against a __split_huge_page_refcount instead was a
pain to serialize in a way that would result always in a coherent page
count for both tail and head. I think my locking solution with a
compound_lock taken only after the page_first is valid and is still a
PageHead should be safe but it surely needs review from SMP race point
of view. In short there is no current existing way to serialize the
O_DIRECT final put_page against split_huge_page_refcount so I had to
invent a new one (O_DIRECT loses knowledge on the mapping status by
the time gup_fast returns so...). And I didn't want to impact all
gup/gup_fast users for now, maybe if we change the gup interface
substantially we can avoid this locking, I admit I didn't think too
much about it because changing the gup unpinning interface would be
invasive.
If we ignored O_DIRECT we could stick to the existing compound
refcounting code, by simply adding a
get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu
notifier user) would call it without FOLL_GET (and if FOLL_GET isn't
set we'd just BUG_ON if nobody registered itself in the current task
mmu notifier list yet). But O_DIRECT is fundamental for decent
performance of virtualized I/O on fast storage so we can't avoid it to
solve the race of put_page against split_huge_page_refcount to achieve
a complete hugepage feature for KVM.
Swap and oom works fine (well just like with regular pages ;). MMU
notifier is handled transparently too, with the exception of the young
bit on the pmd, that didn't have a range check but I think KVM will be
fine because the whole point of hugepages is that EPT/NPT will also
use a huge pmd when they notice gup returns pages with PageCompound set,
so they won't care of a range and there's just the pmd young bit to
check in that case.
NOTE: in some cases if the L2 cache is small, this may slowdown and
waste memory during COWs because 4M of memory are accessed in a single
fault instead of 8k (the payoff is that after COW the program can run
faster). So we might want to switch the copy_huge_page (and
clear_huge_page too) to not temporal stores. I also extensively
researched ways to avoid this cache trashing with a full prefault
logic that would cow in 8k/16k/32k/64k up to 1M (I can send those
patches that fully implemented prefault) but I concluded they're not
worth it and they add an huge additional complexity and they remove all tlb
benefits until the full hugepage has been faulted in, to save a little bit of
memory and some cache during app startup, but they still don't improve
substantially the cache-trashing during startup if the prefault happens in >4k
chunks. One reason is that those 4k pte entries copied are still mapped on a
perfectly cache-colored hugepage, so the trashing is the worst one can generate
in those copies (cow of 4k page copies aren't so well colored so they trashes
less, but again this results in software running faster after the page fault).
Those prefault patches allowed things like a pte where post-cow pages were
local 4k regular anon pages and the not-yet-cowed pte entries were pointing in
the middle of some hugepage mapped read-only. If it doesn't payoff
substantially with todays hardware it will payoff even less in the future with
larger l2 caches, and the prefault logic would blot the VM a lot. If one is
emebdded transparent_hugepage can be disabled during boot with sysfs or with
the boot commandline parameter transparent_hugepage=0 (or
transparent_hugepage=2 to restrict hugepages inside madvise regions) that will
ensure not a single hugepage is allocated at boot time. It is simple enough to
just disable transparent hugepage globally and let transparent hugepages be
allocated selectively by applications in the MADV_HUGEPAGE region (both at page
fault time, and if enabled with the collapse_huge_page too through the kernel
daemon).
This patch supports only hugepages mapped in the pmd, archs that have
smaller hugepages will not fit in this patch alone. Also some archs like power
have certain tlb limits that prevents mixing different page size in the same
regions so they will not fit in this framework that requires "graceful
fallback" to basic PAGE_SIZE in case of physical memory fragmentation.
hugetlbfs remains a perfect fit for those because its software limits happen to
match the hardware limits. hugetlbfs also remains a perfect fit for hugepage
sizes like 1GByte that cannot be hoped to be found not fragmented after a
certain system uptime and that would be very expensive to defragment with
relocation, so requiring reservation. hugetlbfs is the "reservation way", the
point of transparent hugepages is not to have any reservation at all and
maximizing the use of cache and hugepages at all times automatically.
Some performance result:
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
ages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988
============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#define SIZE (3UL*1024*1024*1024)
int main()
{
char *p = malloc(SIZE), *p2;
struct timeval before, after;
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset page fault %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset second tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
for (p2 = p; p2 < p+SIZE; p2 += 4096)
*p2 = 0;
gettimeofday(&after, NULL);
printf("random access tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
for (p2 = p; p2 < p+SIZE; p2 += 4096)
*p2 = 0;
gettimeofday(&after, NULL);
printf("random access second tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
return 0;
}
============
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
|
|
Not worth throwing away the precious reserved free memory pool for allocations
that can fail gracefully (either through mempool or because they're transhuge
allocations later falling back to 4k allocations).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
|
|
Transparent hugepage allocations must be allowed not to invoke kswapd or any
other kind of indirect reclaim (especially when the defrag sysfs is control
disabled). It's unacceptable to swap out anonymous pages (potentially
anonymous transparent hugepages) in order to create new transparent hugepages.
This is true for the MADV_HUGEPAGE areas too (swapping out a kvm virtual
machine and so having it suffer an unbearable slowdown, so another one with
guest physical memory marked MADV_HUGEPAGE can run 30% faster if it is running
memory intensive workloads, makes no sense). If a transparent hugepage
allocation fails the slowdown is minor and there is total fallback, so kswapd
should never be asked to swapout memory to allow the high order allocation to
succeed.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
|
|
This should work for both hugetlbfs and transparent hugepages.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
|
|
Move the copy/clear_huge_page functions to common code to share between
hugetlb.c and huge_memory.c.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
|
|
Paging logic that splits the page before it is unmapped and added to swap to
ensure backwards compatibility with the legacy swap code. Eventually swap
should natively pageout the hugepages to increase performance and decrease
seeking and fragmentation of swap space. swapoff can just skip over huge pmd as
they cannot be part of swap yet. In add_to_swap be careful to split the page
only if we got a valid swap entry so we don't split hugepages with a full swap.
In theory we could split pages before isolating them during the lru scan, but
for khugepaged to be safe, I'm relying on either mmap_sem write mode, or
PG_lock taken, so split_huge_page has to run either with mmap_sem read/write
mode or PG_lock taken. Calling it from isolate_lru_page would make locking more
complicated, in addition to that split_huge_page would deadlock if called by
__isolate_lru_page because it has to take the lru lock to add the tail pages.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
|
|
split_huge_page_pmd compat code. Each one of those would need to be expanded to
hundred of lines of complex code without a fully reliable
split_huge_page_pmd design.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
|
|
This increase the size of the mm struct a bit but it is needed to preallocate
one pte for each hugepage so that split_huge_page will not require a fail path.
Guarantee of success is a fundamental property of split_huge_page to avoid
decrasing swapping reliability and to avoid adding -ENOMEM fail paths that
would otherwise force the hugepage-unaware VM code to learn rolling back in the
middle of its pte mangling operations (if something we need it to learn
handling pmd_trans_huge natively rather being capable of rollback). When
split_huge_page runs a pte is needed to succeed the split, to map the newly
splitted regular pages with a regular pte. This way all existing VM code
remains backwards compatible by just adding a split_huge_page* one liner. The
memory waste of those preallocated ptes is negligible and so it is worth it.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
|
|
split_huge_page must transform a compound page to a regular page and needs
ClearPageCompound.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
|
|
Add mmu notifier helpers to handle pmd huge operations.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
|