summaryrefslogtreecommitdiff
path: root/mm/mempolicy.c
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2024-05-19 09:21:03 -0700
committerLinus Torvalds <torvalds@linux-foundation.org>2024-05-19 09:21:03 -0700
commit61307b7be41a1f1039d1d1368810a1d92cb97b44 (patch)
tree639e233e177f8618cd5f86daeb7efc6b095890f0 /mm/mempolicy.c
parent0450d2083be6bdcd18c9535ac50c55266499b2df (diff)
parent76edc534cc289308130272a2ac28694fc9b72a03 (diff)
Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
Diffstat (limited to 'mm/mempolicy.c')
-rw-r--r--mm/mempolicy.c104
1 files changed, 61 insertions, 43 deletions
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0fe77738d971..aec756ae5637 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -509,8 +509,8 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
qp->nr_failed++;
return;
}
- folio = pfn_folio(pmd_pfn(*pmd));
- if (is_huge_zero_page(&folio->page)) {
+ folio = pmd_folio(*pmd);
+ if (is_huge_zero_folio(folio)) {
walk->action = ACTION_CONTINUE;
return;
}
@@ -642,12 +642,11 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio.
* Choosing not to migrate a shared folio is not counted as a failure.
*
- * To check if the folio is shared, ideally we want to make sure
- * every page is mapped to the same process. Doing that is very
- * expensive, so check the estimated sharers of the folio instead.
+ * See folio_likely_mapped_shared() on possible imprecision when we
+ * cannot easily detect if a folio is shared.
*/
if ((flags & MPOL_MF_MOVE_ALL) ||
- (folio_estimated_sharers(folio) == 1 && !hugetlb_pmd_shared(pte)))
+ (!folio_likely_mapped_shared(folio) && !hugetlb_pmd_shared(pte)))
if (!isolate_hugetlb(folio, qp->pagelist))
qp->nr_failed++;
unlock:
@@ -1032,11 +1031,10 @@ static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
* Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio.
* Choosing not to migrate a shared folio is not counted as a failure.
*
- * To check if the folio is shared, ideally we want to make sure
- * every page is mapped to the same process. Doing that is very
- * expensive, so check the estimated sharers of the folio instead.
+ * See folio_likely_mapped_shared() on possible imprecision when we
+ * cannot easily detect if a folio is shared.
*/
- if ((flags & MPOL_MF_MOVE_ALL) || folio_estimated_sharers(folio) == 1) {
+ if ((flags & MPOL_MF_MOVE_ALL) || !folio_likely_mapped_shared(folio)) {
if (folio_isolate_lru(folio)) {
list_add_tail(&folio->lru, foliolist);
node_stat_mod_folio(folio,
@@ -1070,6 +1068,7 @@ static long migrate_to_node(struct mm_struct *mm, int source, int dest,
struct migration_target_control mtc = {
.nid = dest,
.gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
+ .reason = MR_SYSCALL,
};
nodes_clear(nmask);
@@ -1227,7 +1226,8 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
h = folio_hstate(src);
gfp = htlb_alloc_mask(h);
nodemask = policy_nodemask(gfp, pol, ilx, &nid);
- return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp);
+ return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp,
+ htlb_allow_alloc_fallback(MR_MEMPOLICY_MBIND));
}
if (folio_test_large(src))
@@ -1504,9 +1504,10 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
return -EINVAL;
if (*flags & MPOL_F_NUMA_BALANCING) {
- if (*mode != MPOL_BIND)
+ if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY)
+ *flags |= (MPOL_F_MOF | MPOL_F_MORON);
+ else
return -EINVAL;
- *flags |= (MPOL_F_MOF | MPOL_F_MORON);
}
return 0;
}
@@ -2200,9 +2201,9 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
*/
preferred_gfp = gfp | __GFP_NOWARN;
preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
- page = __alloc_pages(preferred_gfp, order, nid, nodemask);
+ page = __alloc_pages_noprof(preferred_gfp, order, nid, nodemask);
if (!page)
- page = __alloc_pages(gfp, order, nid, NULL);
+ page = __alloc_pages_noprof(gfp, order, nid, NULL);
return page;
}
@@ -2217,7 +2218,7 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
*
* Return: The page on success or NULL if allocation fails.
*/
-struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
+struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order,
struct mempolicy *pol, pgoff_t ilx, int nid)
{
nodemask_t *nodemask;
@@ -2248,7 +2249,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
* First, try to allocate THP only on local node, but
* don't reclaim unnecessarily, just compact.
*/
- page = __alloc_pages_node(nid,
+ page = __alloc_pages_node_noprof(nid,
gfp | __GFP_THISNODE | __GFP_NORETRY, order);
if (page || !(gfp & __GFP_DIRECT_RECLAIM))
return page;
@@ -2261,7 +2262,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
}
}
- page = __alloc_pages(gfp, order, nid, nodemask);
+ page = __alloc_pages_noprof(gfp, order, nid, nodemask);
if (unlikely(pol->mode == MPOL_INTERLEAVE) && page) {
/* skip NUMA_INTERLEAVE_HIT update if numa stats is disabled */
@@ -2292,7 +2293,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
*
* Return: The folio on success or NULL if allocation fails.
*/
-struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
+struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, bool hugepage)
{
struct mempolicy *pol;
@@ -2300,12 +2301,12 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
struct page *page;
pol = get_vma_policy(vma, addr, order, &ilx);
- page = alloc_pages_mpol(gfp | __GFP_COMP, order,
- pol, ilx, numa_node_id());
+ page = alloc_pages_mpol_noprof(gfp | __GFP_COMP, order,
+ pol, ilx, numa_node_id());
mpol_cond_put(pol);
return page_rmappable_folio(page);
}
-EXPORT_SYMBOL(vma_alloc_folio);
+EXPORT_SYMBOL(vma_alloc_folio_noprof);
/**
* alloc_pages - Allocate pages.
@@ -2321,7 +2322,7 @@ EXPORT_SYMBOL(vma_alloc_folio);
* flags are used.
* Return: The page on success or NULL if allocation fails.
*/
-struct page *alloc_pages(gfp_t gfp, unsigned int order)
+struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order)
{
struct mempolicy *pol = &default_policy;
@@ -2332,16 +2333,16 @@ struct page *alloc_pages(gfp_t gfp, unsigned int order)
if (!in_interrupt() && !(gfp & __GFP_THISNODE))
pol = get_task_policy(current);
- return alloc_pages_mpol(gfp, order,
- pol, NO_INTERLEAVE_INDEX, numa_node_id());
+ return alloc_pages_mpol_noprof(gfp, order, pol, NO_INTERLEAVE_INDEX,
+ numa_node_id());
}
-EXPORT_SYMBOL(alloc_pages);
+EXPORT_SYMBOL(alloc_pages_noprof);
-struct folio *folio_alloc(gfp_t gfp, unsigned int order)
+struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
{
- return page_rmappable_folio(alloc_pages(gfp | __GFP_COMP, order));
+ return page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order));
}
-EXPORT_SYMBOL(folio_alloc);
+EXPORT_SYMBOL(folio_alloc_noprof);
static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
struct mempolicy *pol, unsigned long nr_pages,
@@ -2360,13 +2361,13 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
for (i = 0; i < nodes; i++) {
if (delta) {
- nr_allocated = __alloc_pages_bulk(gfp,
+ nr_allocated = alloc_pages_bulk_noprof(gfp,
interleave_nodes(pol), NULL,
nr_pages_per_node + 1, NULL,
page_array);
delta--;
} else {
- nr_allocated = __alloc_pages_bulk(gfp,
+ nr_allocated = alloc_pages_bulk_noprof(gfp,
interleave_nodes(pol), NULL,
nr_pages_per_node, NULL, page_array);
}
@@ -2503,11 +2504,11 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
preferred_gfp = gfp | __GFP_NOWARN;
preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
- nr_allocated = __alloc_pages_bulk(preferred_gfp, nid, &pol->nodes,
+ nr_allocated = alloc_pages_bulk_noprof(preferred_gfp, nid, &pol->nodes,
nr_pages, NULL, page_array);
if (nr_allocated < nr_pages)
- nr_allocated += __alloc_pages_bulk(gfp, numa_node_id(), NULL,
+ nr_allocated += alloc_pages_bulk_noprof(gfp, numa_node_id(), NULL,
nr_pages - nr_allocated, NULL,
page_array + nr_allocated);
return nr_allocated;
@@ -2519,7 +2520,7 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
* It can accelerate memory allocation especially interleaving
* allocate memory.
*/
-unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
+unsigned long alloc_pages_bulk_array_mempolicy_noprof(gfp_t gfp,
unsigned long nr_pages, struct page **page_array)
{
struct mempolicy *pol = &default_policy;
@@ -2543,8 +2544,8 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
nid = numa_node_id();
nodemask = policy_nodemask(gfp, pol, NO_INTERLEAVE_INDEX, &nid);
- return __alloc_pages_bulk(gfp, nid, nodemask,
- nr_pages, NULL, page_array);
+ return alloc_pages_bulk_noprof(gfp, nid, nodemask,
+ nr_pages, NULL, page_array);
}
int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
@@ -2718,7 +2719,7 @@ static void sp_free(struct sp_node *n)
* mpol_misplaced - check whether current folio node is valid in policy
*
* @folio: folio to be checked
- * @vma: vm area where folio mapped
+ * @vmf: structure describing the fault
* @addr: virtual address in @vma for shared policy lookup and interleave policy
*
* Lookup current policy node id for vma,addr and "compare to" folio's
@@ -2728,18 +2729,24 @@ static void sp_free(struct sp_node *n)
* Return: NUMA_NO_NODE if the page is in a node that is valid for this
* policy, or a suitable node ID to allocate a replacement folio from.
*/
-int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
+int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
unsigned long addr)
{
struct mempolicy *pol;
pgoff_t ilx;
struct zoneref *z;
int curnid = folio_nid(folio);
+ struct vm_area_struct *vma = vmf->vma;
int thiscpu = raw_smp_processor_id();
- int thisnid = cpu_to_node(thiscpu);
+ int thisnid = numa_node_id();
int polnid = NUMA_NO_NODE;
int ret = NUMA_NO_NODE;
+ /*
+ * Make sure ptl is held so that we don't preempt and we
+ * have a stable smp processor id
+ */
+ lockdep_assert_held(vmf->ptl);
pol = get_vma_policy(vma, addr, folio_order(folio), &ilx);
if (!(pol->flags & MPOL_F_MOF))
goto out;
@@ -2764,15 +2771,26 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
break;
case MPOL_BIND:
- /* Optimize placement among multiple nodes via NUMA balancing */
+ case MPOL_PREFERRED_MANY:
+ /*
+ * Even though MPOL_PREFERRED_MANY can allocate pages outside
+ * policy nodemask we don't allow numa migration to nodes
+ * outside policy nodemask for now. This is done so that if we
+ * want demotion to slow memory to happen, before allocating
+ * from some DRAM node say 'x', we will end up using a
+ * MPOL_PREFERRED_MANY mask excluding node 'x'. In such scenario
+ * we should not promote to node 'x' from slow memory node.
+ */
if (pol->flags & MPOL_F_MORON) {
+ /*
+ * Optimize placement among multiple nodes
+ * via NUMA balancing
+ */
if (node_isset(thisnid, pol->nodes))
break;
goto out;
}
- fallthrough;
- case MPOL_PREFERRED_MANY:
/*
* use current page if in policy nodemask,
* else select nearest allowed node, if any.
@@ -2781,7 +2799,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
if (node_isset(curnid, pol->nodes))
goto out;
z = first_zones_zonelist(
- node_zonelist(numa_node_id(), GFP_HIGHUSER),
+ node_zonelist(thisnid, GFP_HIGHUSER),
gfp_zone(GFP_HIGHUSER),
&pol->nodes);
polnid = zone_to_nid(z->zone);