2024-05-07 mm/damon/paddr: do page level access check for pageout DAMOS action on its own (SeongJae Park)
'pageout' DAMOS action implementation of 'paddr' DAMON operations set asks reclaim_pages() to do page level access check if the user is not asking DAMOS to do that on its own. Simplify the logic by making the check always be done by 'paddr'. Link: https://lkml.kernel.org/r/20240429224451.67081-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 mm/damon/paddr: avoid unnecessary page level access check for pageout DAMOS action (SeongJae Park)
Patch series "mm/damon/paddr: simplify page level access re-check for pageout". The 'pageout' DAMOS action implementation of 'paddr' asks reclaim_pages() to do the page level access check again. But the user can ask 'paddr' to do the page level access check on its own, using a DAMOS filter of 'young page' type. Meanwhile, 'paddr' is the only user of reclaim_pages() that asks for the page level access check. Make 'paddr' always do the page level access check on its own, and simplify reclaim_pages() by removing the page level access check request handling logic. As a result of the change for reclaim_pages(), reclaim_folio_list(), which is called by reclaim_pages(), also no longer needs to do the page level access check. Simplify the function, too. This patch (of 4): The 'pageout' DAMOS action implementation of 'paddr' asks reclaim_pages() to do the page level access check. The user could ask DAMOS to do the page level access check on its own using a 'young page' type DAMOS filter. In that case, the pageout DAMOS action unnecessarily asks reclaim_pages() to do the check again. Ask for the page level access check only if the scheme does not have the filter. Link: https://lkml.kernel.org/r/20240429224451.67081-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240429224451.67081-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 mm/gup: fix hugepd handling in hugetlb rework (Peter Xu)
Commit a12083d721d7 added hugepd handling for gup-slow, reusing gup-fast functions. follow_hugepd() correctly took the vma pointer in, however didn't pass it over into the lower functions, which was overlooked. The issue is gup_fast_hugepte() uses the vma pointer to make the correct decision on whether an unshare is needed for a FOLL_PIN|FOLL_LONGTERM. Now without the vma pointer it will constantly return "true" (needs an unshare) for a page cache, even though in the SHARED case it will be wrong to unshare. The other problem is, even if an unshare is needed, it now returns 0 rather than -EMLINK, which will not trigger a follow-up FAULT_FLAG_UNSHARE fault. That will need to be fixed too when the unshare is wanted. gup_longterm test didn't expose this issue in the past because it didn't yet test R/O unshare in this case; another separate patch will enable that in future tests. Fix it by passing vma correctly to the bottom, rename gup_fast_hugepte() back to gup_hugepte() as it is shared between the fast/slow paths, and also allow -EMLINK to be returned properly by gup_hugepte() even though gup-fast will take it the same as zero. Link: https://lkml.kernel.org/r/20240430131303.264331-1-peterx@redhat.com Fixes: a12083d721d7 ("mm/gup: handle hugepd for follow_page()") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 selftests: mm: gup_longterm: test unsharing logic when R/O pinning (David Hildenbrand)
In our FOLL_LONGTERM tests, we prefault the page tables for the GUP-fast test cases to be able to find a PTE and exercise the "longterm pinning allowed" logic on the GUP-fast path where possible. For now, we always prefault the page tables writable, resulting in PTEs that are writable. Let's cover more cases to also test if our unsharing logic works as expected (and is able to make progress when there is nothing to unshare) by mprotect'ing the range R/O when R/O-pinning, so we don't get PTEs that are writable. This change would have found an issue introduced by commit a12083d721d7 ("mm/gup: handle hugepd for follow_page()"), whereby R/O pinning was not able to make progress in all cases, because unsharing logic was not provided with the VMA to decide at some point that long-term R/O pinning a !anon page is fine. Link: https://lkml.kernel.org/r/20240430131508.86924-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 mm/hugetlb: align cma on allocation order, not demotion order (Frank van der Linden)
Align the CMA area for hugetlb gigantic pages to their size, not the size that they can be demoted to. Otherwise there might be misaligned sections at the start and end of the CMA area that will never be used for hugetlb page allocations. Link: https://lkml.kernel.org/r/20240430161437.2100295-1-fvdl@google.com Fixes: a01f43901cfb ("hugetlb: be sure to free demoted CMA pages to CMA") Signed-off-by: Frank van der Linden <fvdl@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 dax/bus.c: use the right locking mode (read vs write) in size_show (Vishal Verma)
In size_show(), the dax_dev_rwsem only needs a read lock, but was acquiring a write lock. Change it to down_read_interruptible() so it doesn't unnecessarily hold a write lock. Link: https://lkml.kernel.org/r/20240430-vv-dax_abi_fixes-v3-4-e3dcd755774c@intel.com Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem") Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 dax/bus.c: don't use down_write_killable for non-user processes (Vishal Verma)
Change an instance of down_write_killable() to a simple down_write() where there is no user process that might want to interrupt the operation. Link: https://lkml.kernel.org/r/20240430-vv-dax_abi_fixes-v3-3-e3dcd755774c@intel.com Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem") Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 dax/bus.c: fix locking for unregister_dax_dev / unregister_dax_mapping paths (Vishal Verma)
Commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem") aimed to undo device_lock() abuses for protecting changes to dax-driver internal data-structures like the dax_region resource tree to device-dax-instance range structures. However, the device_lock() was legitimately enforcing that devices to be deleted were not currently actively attached to any driver nor assigned any capacity from the region. As a result of the device_lock restoration in delete_store(), the conditional locking in unregister_dev_dax() and unregister_dax_mapping() can be removed. Link: https://lkml.kernel.org/r/20240430-vv-dax_abi_fixes-v3-2-e3dcd755774c@intel.com Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem") Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts (Vishal Verma)
Patch series "dax/bus.c: Fixups for dax-bus locking", v3. Commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem") introduced a few problems that this series aims to fix. Add back device_lock() where it was correctly used (during device manipulation operations), remove conditional locking in unregister_dax_dev() and unregister_dax_mapping(), use non-interruptible versions of rwsem locks when not called from a user process, and fix up a write vs. read usage of an rwsem. This patch (of 4): In [1], Dan points out that all of the WARN_ON_ONCE() usage in the referenced patch should be replaced with lockdep_assert_held() or lockdep_assert_held_write(). Replace these as appropriate. Link: https://lkml.kernel.org/r/20240430-vv-dax_abi_fixes-v3-0-e3dcd755774c@intel.com Link: https://lore.kernel.org/r/65f0b5ef41817_aa222941a@dwillia2-mobl3.amr.corp.intel.com.notmuch [1] Link: https://lkml.kernel.org/r/20240430-vv-dax_abi_fixes-v3-1-e3dcd755774c@intel.com Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem") Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 mm: memcg: use READ_ONCE()/WRITE_ONCE() to access stock->nr_pages (Breno Leitao)
A memcg pointer in the per-cpu stock can be accessed by drain_all_stock() and consume_stock() in parallel, causing a potential race, which is believed to be harmless. KCSAN shows this data-race clearly in the splat below:

BUG: KCSAN: data-race in drain_all_stock.part.0 / try_charge_memcg

write to 0xffff88903f8b0788 of 4 bytes by task 35901 on cpu 2:
  try_charge_memcg (mm/memcontrol.c:2323 mm/memcontrol.c:2746)
  __mem_cgroup_charge (mm/memcontrol.c:7287 mm/memcontrol.c:7301)
  do_anonymous_page (mm/memory.c:1054 mm/memory.c:4375 mm/memory.c:4433)
  __handle_mm_fault (mm/memory.c:3878 mm/memory.c:5300 mm/memory.c:5441)
  handle_mm_fault (mm/memory.c:5606)
  do_user_addr_fault (arch/x86/mm/fault.c:1363)
  exc_page_fault (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1563)
  asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)

read to 0xffff88903f8b0788 of 4 bytes by task 287 on cpu 27:
  drain_all_stock.part.0 (mm/memcontrol.c:2433)
  mem_cgroup_css_offline (mm/memcontrol.c:5398 mm/memcontrol.c:5687)
  css_killed_work_fn (kernel/cgroup/cgroup.c:5521 kernel/cgroup/cgroup.c:5794)
  process_one_work (kernel/workqueue.c:3254)
  worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416)
  kthread (kernel/kthread.c:388)
  ret_from_fork (arch/x86/kernel/process.c:147)
  ret_from_fork_asm (arch/x86/entry/entry_64.S:257)

value changed: 0x00000014 -> 0x00000013

This happens because drain_all_stock() is reading stock->nr_pages, while consume_stock() might be updating the same address, causing a potential data-race. Make the shared addresses bulletproof with regard to reads and writes, similarly to what is done for stock->cached_objcg and stock->cached. Annotate all accesses to stock->nr_pages with READ_ONCE()/WRITE_ONCE().
Link: https://lkml.kernel.org/r/20240501095420.679208-1-leitao@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 mm: fix race between __split_huge_pmd_locked() and GUP-fast (Ryan Roberts)
__split_huge_pmd_locked() can be called for a present THP, devmap or (non-present) migration entry. It calls pmdp_invalidate() unconditionally on the pmdp and only determines if it is present or not based on the returned old pmd. This is a problem for the migration entry case because pmd_mkinvalid(), called by pmdp_invalidate(), must only be called for a present pmd. On arm64 at least, pmd_mkinvalid() will mark the pmd such that any future call to pmd_present() will return true. And therefore any lockless pgtable walker could see the migration entry pmd in this state and start interpreting the fields as if it were present, leading to BadThings (TM). GUP-fast appears to be one such lockless pgtable walker. x86 does not suffer the above problem, but instead pmd_mkinvalid() will corrupt the offset field of the swap entry within the swap pte. See link below for discussion of that problem. Fix all of this by only calling pmdp_invalidate() for a present pmd. And for good measure let's add a warning to all implementations of pmdp_invalidate[_ad](). I've manually reviewed all other pmdp_invalidate[_ad]() call sites and believe all others to be conformant. This is a theoretical bug found during code review. I don't have any test case to trigger it in practice.
Link: https://lkml.kernel.org/r/20240501143310.1381675-1-ryan.roberts@arm.com Link: https://lore.kernel.org/all/0dd7827a-6334-439a-8fd0-43c98e6af22b@arm.com/ Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 mm/debug_vm_pgtable: test pmd_leaf() behavior with pmd_mkinvalid() (Ryan Roberts)
An invalidated pmd should still cause pmd_leaf() to return true. Let's test for that to ensure all arches remain consistent. Link: https://lkml.kernel.org/r/20240501144439.1389048-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 memcg: use proper type for mod_memcg_state (Shakeel Butt)
The memcg stats update functions can take an arbitrary integer, but the only input which makes sense is enum memcg_stat_item and we don't want these functions to be called with an arbitrary integer. So replace the parameter type with enum memcg_stat_item, and the compiler will be able to warn if memcg stat update functions are called with an incorrect index value. Link: https://lkml.kernel.org/r/20240501172617.678560-9-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: T.J. Mercier <tjmercier@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 memcg: warn for unexpected events and stats (Shakeel Butt)
To reduce memory usage by the memcg events and stats, the kernel uses an indirection table and only allocates the stats and events which are being used by the memcg code. To make this more robust, let's add warnings where unexpected stats and events indexes are used. Link: https://lkml.kernel.org/r/20240501172617.678560-8-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: T.J. Mercier <tjmercier@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 mm: cleanup WORKINGSET_NODES in workingset (Shakeel Butt)
WORKINGSET_NODES is not exposed in the memcg stats and thus there is no need to use the memcg specific stat update functions for it. In future if we decide to expose WORKINGSET_NODES in the memcg stats, we can revert this patch. Link: https://lkml.kernel.org/r/20240501172617.678560-7-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 memcg: cleanup __mod_memcg_lruvec_state (Shakeel Butt)
There are no memcg specific stats for NR_SHMEM_PMDMAPPED and NR_FILE_PMDMAPPED. Let's remove them. Link: https://lkml.kernel.org/r/20240501172617.678560-6-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 memcg: reduce memory for the lruvec and memcg stats (Shakeel Butt)
At the moment, the amount of memory allocated for stats related structs in the mem_cgroup corresponds to the size of enum node_stat_item. However not all fields in enum node_stat_item have corresponding memcg stats. So, let's use indirection mechanism similar to the one used for memcg vmstats management. For a given x86_64 config, the size of stats with and without patch is:

structs                        size in bytes
                               w/o     with
struct lruvec_stats            1128    648
struct lruvec_stats_percpu     752     432
struct memcg_vmstats           1832    1352
struct memcg_vmstats_percpu    1280    960

The memory savings are further compounded by the fact that these structs are allocated for each cpu and for each node. To be precise, for each memcg the memory saved would be:

Memory saved = ((21 * 3 * NR_NODES) + (21 * 2 * NR_NODES * NR_CPUS) + (21 * 3) + (21 * 2 * NR_CPUS)) * sizeof(long)

Where 21 is the number of fields eliminated. Link: https://lkml.kernel.org/r/20240501172617.678560-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: T.J. Mercier <tjmercier@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 mm: memcg: account memory used for memcg vmstats and lruvec stats (Roman Gushchin)
The percpu memory used by memcg's memory statistics is already accounted. For consistency, let's enable accounting for vmstats and lruvec stats as well. Link: https://lkml.kernel.org/r/20240501172617.678560-4-shakeel.butt@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 memcg: dynamically allocate lruvec_stats (Shakeel Butt)
To decouple the dependency of lruvec_stats on NR_VM_NODE_STAT_ITEMS, we need to dynamically allocate lruvec_stats in the mem_cgroup_per_node structure. Also move the definition of lruvec_stats_percpu and lruvec_stats and related functions to the memcontrol.c to facilitate later patches. No functional changes in the patch. Link: https://lkml.kernel.org/r/20240501172617.678560-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: T.J. Mercier <tjmercier@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 memcg: reduce memory size of mem_cgroup_events_index (Shakeel Butt)
Patch series "memcg: reduce memory consumption by memcg stats", v4. Most of the memory overhead of a memcg object is due to memcg stats maintained by the kernel. Since stats updates happen in performance critical codepaths, the stats are maintained per-cpu and numa specific stats are maintained per-node * per-cpu. This drastically increases the overhead on large machines, i.e. those with a large number of CPUs and multiple numa nodes. This patch series tries to reduce the overhead by at least not allocating the memory for stats which are not memcg specific. This patch (of 8): mem_cgroup_events_index is a translation table to get the right index of the memcg relevant entry for the general vm_event_item. At the moment, it is defined as an integer array. However on a typical system the max entry of vm_event_item (NR_VM_EVENT_ITEMS) is 113, so we don't need to use int as the storage type of the array. For now just use int8_t as the type and add a BUILD_BUG_ON(). Another benefit of this change is that the translation table fits in 2 cachelines while previously it would require 8 cachelines (assuming 64 bytes cacheline). Link: https://lkml.kernel.org/r/20240501172617.678560-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20240501172617.678560-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 selftests/memfd: fix spelling mistakes (Saurav Shah)
Fix spelling mistakes in the comments. Link: https://lkml.kernel.org/r/20240501231317.24648-1-sauravshah.31@gmail.com Signed-off-by: Saurav Shah <sauravshah.31@gmail.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jeff Xu <jeffxu@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 mm/hugetlb: document why hugetlb uses folio_mapcount() for COW reuse decisions (David Hildenbrand)
Let's document why hugetlb still uses folio_mapcount() and is prone to leaking memory between processes, for example using vmsplice() that still uses FOLL_GET. More details can be found in [1], especially around how hugetlb pages cannot really be overcommitted, and why we don't particularly care about these vmsplice() leaks for hugetlb -- in contrast to ordinary memory. [1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/ Link: https://lkml.kernel.org/r/20240502085259.103784-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Suggested-by: Peter Xu <peterx@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL (David Hildenbrand)
Patch series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". The failing hugetlb vmsplice() COW tests keep confusing people, and having tests that have been failing for years and likely will keep failing for years to come because nobody cares enough is rather suboptimal. Let's mark them as XFAIL and document why fixing them is not as easy as it would appear at first sight. More details can be found in [1], especially around how hugetlb pages cannot really be overcommitted, and why we don't particularly care about these vmsplice() leaks for hugetlb -- in contrast to ordinary memory. [1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/ This patch (of 2): The vmsplice() hugetlb tests have been failing right from the start, and we documented that in the introducing commit 7dad331be781 ("selftests/vm: anon_cow: hugetlb tests"): Note that some tests cases still fail. This will, for example, be fixed once vmsplice properly uses FOLL_PIN instead of FOLL_GET for pinning. With 2 MiB and 1 GiB hugetlb on x86_64, the expected failures are: Until vmsplice() is changed, these tests will likely keep failing: hugetlb COW reuse logic is harder to change, because using the same COW reuse logic as we use for !hugetlb could harm other (sane) users when running out of free hugetlb pages. More details can be found in [1], especially around how hugetlb pages cannot really be overcommitted, and why we don't particularly care about these vmsplice() leaks for hugetlb -- in contrast to ordinary memory. These (expected) failures keep confusing people, so flag them accordingly.

Before:
  $ ./cow
  [...]
  Bail out! 8 out of 778 tests failed
  # Totals: pass:769 fail:8 xfail:0 xpass:0 skip:1 error:0
  $ echo $?
  1

After:
  $ ./cow
  [...]
  # Totals: pass:769 fail:0 xfail:8 xpass:0 skip:1 error:0
  $ echo $?
  0

[1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/ Link: https://lkml.kernel.org/r/20240502085259.103784-1-david@redhat.com Link: https://lkml.kernel.org/r/20240502085259.103784-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 mm/swapfile: mark racy access on si->highest_bit (linke li)
In scan_swap_map_slots(), si->highest_bit can be changed by swap_range_alloc() concurrently. All reads of si->highest_bit except one are either protected by a lock or done using READ_ONCE. So mark the one racy read of si->highest_bit as benign using READ_ONCE. This patch is aimed at reducing the number of benign races reported by KCSAN in order to focus future debugging effort on harmful races. Link: https://lkml.kernel.org/r/tencent_912BC3E8B0291DA4A0028AB424076375DA07@qq.com Signed-off-by: linke li <lilinke99@qq.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 mm/rmap: change the type of we_locked from int to bool (Hao Ge)
Change the type of we_locked from int to bool because folio_trylock() returns bool. Link: https://lkml.kernel.org/r/20240428012049.8182-1-gehao@kylinos.cn Signed-off-by: Hao Ge <gehao@kylinos.cn> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 mm/pagemap: make trylock_page return bool (Hao Ge)
Make trylock_page() return bool to match the return type of folio_trylock(); this also corresponds to its comment. Link: https://lkml.kernel.org/r/20240428014711.11169-1-gehao@kylinos.cn Signed-off-by: Hao Ge <gehao@kylinos.cn> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 mm/rmap: do not add fully unmapped large folio to deferred split list (Zi Yan)
In __folio_remove_rmap(), a large folio is added to deferred split list if any page in a folio loses its final mapping. But it is possible that the folio is fully unmapped and adding it to deferred split list is unnecessary. For PMD-mapped THPs, that was not really an issue, because removing the last PMD mapping in the absence of PTE mappings would not have added the folio to the deferred split queue. However, for PTE-mapped THPs, which are now more prominent due to mTHP, they are always added to the deferred split queue. One side effect is that the THP_DEFERRED_SPLIT_PAGE stat for a PTE-mapped folio can be unintentionally increased, making it look like there are many partially mapped folios -- although the whole folio is fully unmapped stepwise. Core-mm now tries batch-unmapping consecutive PTEs of PTE-mapped THPs where possible starting from commit b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"). When it happens, a whole PTE-mapped folio is unmapped in one go and can avoid being added to deferred split list, reducing the THP_DEFERRED_SPLIT_PAGE noise. But there will still be noise when we cannot batch-unmap a complete PTE-mapped folio in one go -- or where this type of batching is not implemented yet, e.g., migration. To avoid the unnecessary addition, folio->_nr_pages_mapped is checked to tell if the whole folio is unmapped. If the folio is already on deferred split list, it will be skipped, too. Note: commit 98046944a159 ("mm: huge_memory: add the missing folio_test_pmd_mappable() for THP split statistics") tried to exclude mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not fix the above issue. A fully unmapped PTE-mapped order-9 THP was still added to deferred split list and counted as THP_DEFERRED_SPLIT_PAGE, since nr is 512 (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio() the order-9 folio is folio_test_pmd_mappable(). 
Link: https://lkml.kernel.org/r/20240502132852.862138-1-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lance Yang <ioworker0@gmail.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 Docs/ABI/damon: update for 'youg page' type DAMOS filter (SeongJae Park)
Update DAMON ABI document for the newly added DAMOS filter type, 'young page'. Link: https://lkml.kernel.org/r/20240426195247.100306-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 Docs/admin-guide/mm/damon/usage: update for young page type DAMOS filter (SeongJae Park)
Update DAMON usage document for the newly added DAMOS filter type, 'young page'. Link: https://lkml.kernel.org/r/20240426195247.100306-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 Docs/mm/damon/design: document 'young page' type DAMOS filter (SeongJae Park)
Update DAMON design document for the newly added DAMOS filter type, 'young page'. Link: https://lkml.kernel.org/r/20240426195247.100306-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 mm/damon/paddr: implement DAMOS filter type YOUNG (SeongJae Park)
DAMOS filter of type YOUNG is defined, but not yet implemented by any DAMON operations set. Add the implementation on 'paddr', the DAMON operations set for the physical address space. Link: https://lkml.kernel.org/r/20240426195247.100306-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Tested-by: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm/damon: add DAMOS filter type YOUNGSeongJae Park
Define yet another DAMOS filter type, YOUNG. Like anon and memcg, the type of filter will be applied to each page in the memory region, and see if the page is accessed since the last check. Based on the 'matching' parameter, the page is filtered out or in. Note that this commit is adding only the type definition. The implementation should be made by DAMON operations sets. A commit for the implementation on 'paddr' DAMON operations set will follow. Link: https://lkml.kernel.org/r/20240426195247.100306-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Tested-by: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
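The filter semantics described above can be sketched as a small user-space model. This is only an illustration of the "filtered out or in, based on 'matching'" rule from the changelog, not kernel code: the `Page` class, the `young` field, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a page in a DAMOS target region."""
    pfn: int
    young: bool  # True if the page was accessed since the last check


def young_filter_out(page: Page, matching: bool) -> bool:
    """Return True if the toy 'young' filter removes this page.

    Assumption: the page is filtered out when its young state equals
    `matching`, so matching=True filters young pages out and
    matching=False filters them in.
    """
    return page.young == matching


def pages_left_for_action(pages, matching: bool):
    """PFNs that survive the filter and would receive the DAMOS action."""
    return [p.pfn for p in pages if not young_filter_out(p, matching)]
```

With `matching=True`, an action such as pageout would only see pages that were not accessed since the last check, which is the page-granularity recheck the series is after.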
2024-05-05mm/damon/paddr: implement damon_folio_mkold()SeongJae Park
damon_pa_mkold() receives a physical address, gets the folio covering the address, and marks the folio as old. A following commit will reuse the internal logic for marking a given folio as old. To avoid duplication of the code, split the internal logic. Also, change the rmap walker function's name from __damon_pa_mkold() to damon_folio_mkold_one(), following the change of the caller's name and the naming rule that is more commonly used by other rmap walkers. Link: https://lkml.kernel.org/r/20240426195247.100306-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Tested-by: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm/damon/paddr: implement damon_folio_young()SeongJae Park
Patch series "mm/damon: add a DAMOS filter type for page granularity access recheck". DAMON provides its best-effort accuracy-overhead tradeoff under the user-defined ranges of acceptable level of the monitoring accuracy and overhead. A recent discussion for tiered memory management support from DAMON[1] concluded that first finding memory regions of a specific access pattern via DAMON with low overhead, despite the low accuracy, and then double-checking the access of the region at a finer (e.g., page) granularity could be a useful strategy for some DAMOS schemes. Add a new type of DAMOS filter, namely 'young', for such a case. It checks whether each page of the DAMOS target region has been accessed since the last check, and filters the page out or in if the 'matching' parameter is 'true' or 'false', respectively. Because this is a filter type that is applied at page granularity, the support depends on the DAMON operations set, similar to the 'anon' and 'memcg' DAMOS filter types. Implement the support on the DAMON operations set for the physical address space, 'paddr', since one of the expected usages[1] is based on the physical address space. [1] https://lore.kernel.org/r/20240227235121.153277-1-sj@kernel.org This patch (of 7): damon_pa_young() receives a physical address, gets the folio covering the address, and shows whether the folio has been accessed since the last check. A following commit will reuse the internal logic for checking access to a given folio. To avoid duplication of the code, split the internal logic. Also, change the rmap walker function's name from __damon_pa_young() to damon_folio_young_one(), following the change of the caller's name and the naming rule that is more commonly used by other rmap walkers.
Link: https://lkml.kernel.org/r/20240426195247.100306-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240426195247.100306-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Tested-by: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: optimise vmf_anon_prepare() for VMAs without an anon_vmaMatthew Wilcox (Oracle)
If the mmap_lock can be taken for read, we can call __anon_vma_prepare() while holding it, saving ourselves a trip back through the fault handler. Link: https://lkml.kernel.org/r/20240426144506.1290619-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: fix some minor per-VMA lock issues in userfaultfdMatthew Wilcox (Oracle)
Rename lock_vma() to uffd_lock_vma() because it really is uffd specific. Remove comment referencing unlock_vma() which doesn't exist. Fix the comment about lock_vma_under_rcu() which I just made incorrect. Link: https://lkml.kernel.org/r/20240426144506.1290619-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: delay the check for a NULL anon_vmaMatthew Wilcox (Oracle)
Instead of checking the anon_vma early in the fault path where all page faults pay the cost, delay it until we know we're going to need the anon_vma to be filled in. This will have a slight negative effect on the first fault in an anonymous VMA, but it shortens every other page fault. It also makes the code slightly cleaner as the anon and file backed fault handling look more similar. The Intel kernel test bot reports a 3x improvement in vm-scalability throughput with the small-allocs-mt test. This is clearly an extreme situation that won't be replicated in any real-world workload, but it's a nice win. https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/ Link: https://lkml.kernel.org/r/20240426144506.1290619-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: assert the mmap_lock is held in __anon_vma_prepare()Matthew Wilcox (Oracle)
Patch series "Improve anon_vma scalability for anon VMAs". We have a 3x throughput improvement reported by Intel's kernel test robot: https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/ This is from delaying taking the mmap_lock for page faults until we actually need the mmap_lock in order to assign an anon_vma to the vma. It cleans up the page fault path a little by making the anon fault handler more similar to the file fault handler. This patch (of 4): Convert the comment into an assertion. Link: https://lkml.kernel.org/r/20240426144506.1290619-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240426144506.1290619-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: simplify thp_vma_allowable_orderMatthew Wilcox
Combine the three boolean arguments into one flags argument for readability. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
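The readability argument can be illustrated with a short sketch. The flag names and the `decode` helper below are hypothetical and chosen only for illustration; the changelog does not name the flags or say what each boolean meant.

```python
from enum import IntFlag


class AllowFlags(IntFlag):
    """Hypothetical flag bits replacing three positional booleans."""
    SMAPS = 1 << 0
    IN_PF = 1 << 1
    ENFORCE_SYSFS = 1 << 2


def decode(flags: AllowFlags):
    """Recover the three booleans a callee would test individually."""
    return (
        bool(flags & AllowFlags.SMAPS),
        bool(flags & AllowFlags.IN_PF),
        bool(flags & AllowFlags.ENFORCE_SYSFS),
    )


# Old style: check(vma, False, True, True) says nothing at the call site.
# New style: the call site names exactly what is being requested.
in_fault_path = AllowFlags.IN_PF | AllowFlags.ENFORCE_SYSFS
```

A call site written with named flags documents itself, and adding a fourth option later does not change the function signature.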
2024-05-05mm: remove stale comment __folio_mark_dirtyKemeng Shi
The __folio_mark_dirty will not mark inode dirty any longer. Remove the stale comment of it. Link: https://lkml.kernel.org/r/20240425131724.36778-5-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Howard Cochran <hcochran@kernelspring.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: call __wb_calc_thresh instead of wb_calc_thresh in wb_over_bg_threshKemeng Shi
Call __wb_calc_thresh to calculate wb bg_thresh of gdtc in wb_over_bg_thresh to remove unnecessary wrap in wb_calc_thresh. Link: https://lkml.kernel.org/r/20240425131724.36778-4-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Howard Cochran <hcochran@kernelspring.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: correct calculation of wb's bg_thresh in cgroup domainKemeng Shi
wb_calc_thresh() is calculating wb's share of bg_thresh in the global domain. However in case of cgroup writeback this is not the right thing to do. Consider the following domain hierarchy:

                global domain (> 20G)
                /                 \
          cgroup1 (10G)     cgroup2 (10G)
                |                 |
bdi            wb1               wb2

and assume wb1 and wb2 have the same bandwidth and the background threshold is set at 10%. The bg_thresh of cgroup1 and cgroup2 is going to be 1G. Now because wb_calc_thresh(mdtc->wb, mdtc->bg_thresh) calculates per-wb threshold in the global domain as (wb bandwidth) / (domain bandwidth) it returns bg_thresh for wb1 as 0.5G although it has nobody to compete against in cgroup1.

Fix the problem by calculating wb's share of bg_thresh in the cgroup domain.

Test as follows:

/* make it easier to observe the issue */
echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 100 > /proc/sys/vm/dirty_writeback_centisecs

/* run fio in wb1 */
cd /sys/fs/cgroup
echo "+memory +io" > cgroup.subtree_control
mkdir group1
cd group1
echo 10G > memory.high
echo 10G > memory.max
echo $$ > cgroup.procs
mkfs.ext4 -F /dev/vdb
mount /dev/vdb /bdi1/
fio -name test -filename=/bdi1/file -size=600M -ioengine=libaio -bs=4K \
-iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0

/* run fio in wb2 with a new shell */
cd /sys/fs/cgroup
mkdir group2
cd group2
echo 10G > memory.high
echo 10G > memory.max
echo $$ > cgroup.procs
mkfs.ext4 -F /dev/vdc
mount /dev/vdc /bdi2/
fio -name test -filename=/bdi2/file -size=600M -ioengine=libaio -bs=4K \
-iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0

Before the fix, the written pages of wb1 and wb2 reported by tools/writeback/wb_monitor.py keep growing. After the fix, few written pages accumulate. There is no obvious change in the fio result.
[jack@suse.cz: changelog rewording] Link: https://lkml.kernel.org/r/20240425131724.36778-3-shikemeng@huaweicloud.com Fixes: 74d369443325 ("writeback: Fix performance regression in wb_over_bg_thresh()") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Howard Cochran <hcochran@kernelspring.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
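The arithmetic in the example above can be worked through in a few lines. This is a simplified model that uses only the "(wb bandwidth) / (domain bandwidth)" share stated in the changelog; the function name, the bandwidth units, and the omission of any other factors the kernel applies are assumptions made for illustration.

```python
GIB = 1 << 30


def wb_share_of_thresh(thresh: int, wb_bw: int, domain_bw: int) -> int:
    """Toy model: a wb gets thresh scaled by its share of domain bandwidth."""
    return thresh * wb_bw // domain_bw


# Each cgroup has 10G and the background threshold is 10%, so 1G per cgroup.
cgroup_bg_thresh = 10 * GIB * 10 // 100

# wb1 and wb2 have the same bandwidth (arbitrary units).
wb1_bw = wb2_bw = 100

# Before the fix: the share is computed against the global domain,
# where wb1 competes with wb2.
before = wb_share_of_thresh(cgroup_bg_thresh, wb1_bw, wb1_bw + wb2_bw)

# After the fix: the share is computed against the cgroup domain,
# where wb1 is the only writer.
after = wb_share_of_thresh(cgroup_bg_thresh, wb1_bw, wb1_bw)
```

Under these assumptions `before` is 0.5G and `after` is 1G, matching the numbers in the changelog: wb1 no longer loses half of its background threshold to a wb it does not compete with inside cgroup1.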
2024-05-05mm: enable __wb_calc_thresh to calculate dirty background thresholdKemeng Shi
Patch series "Fix and cleanups to page-writeback", v2. This series contains some random cleanups and a fix to correct calculation of wb's bg_thresh in cgroup domain. More details can be found in the respective patches. This patch (of 4): Originally, __wb_calc_thresh always calculated wb's share of the dirty throttling threshold. By getting the thresh of wb_domain from the caller, __wb_calc_thresh can be used for both the dirty throttling and dirty background thresholds. This is a preparation to correct threshold calculation of wb in cgroup. Link: https://lkml.kernel.org/r/20240425131724.36778-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20240425131724.36778-2-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Howard Cochran <hcochran@kernelspring.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pagesKemeng Shi
Commit 8d92890bd6b85 ("mm/writeback: discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead") removed NR_UNSTABLE_NFS, so nr_reclaimable now contains only dirty pages. Rename nr_reclaimable to nr_dirty accordingly. Link: https://lkml.kernel.org/r/20240423034643.141219-6-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Brian Foster <bfoster@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05writeback: add wb_monitor.py script to monitor writeback info on bdiKemeng Shi
Add wb_monitor.py script to monitor writeback information on backing dev, which makes it easier and more convenient to observe writeback behaviors of a running system. The wb_monitor.py script is written based on wq_monitor.py.

The following domain hierarchy is tested:

                global domain (320G)
                /                 \
        cgroup domain1(10G)     cgroup domain2(10G)
                |                 |
bdi            wb1               wb2

The wb_monitor.py script output is as follows:

./wb_monitor.py 252:16 -c
                  writeback  reclaimable     dirtied     written    avg_bw
252:16_1                  0            0           0           0    102400
252:16_4284             672       820064     9230368     8410304    685612
252:16_4325             896       819840    10491264     9671648    652348
252:16                 1568      1639904    19721632    18081952   1440360

                  writeback  reclaimable     dirtied     written    avg_bw
252:16_1                  0            0           0           0    102400
252:16_4284             672       820064     9230368     8410304    685612
252:16_4325             896       819840    10491264     9671648    652348
252:16                 1568      1639904    19721632    18081952   1440360
...

Link: https://lkml.kernel.org/r/20240423034643.141219-5-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Suggested-by: Tejun Heo <tj@kernel.org> Cc: Brian Foster <bfoster@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
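One way to read the sample output is that the last row (`252:16`, the bdi itself) is the column-wise sum of the per-wb rows. The changelog does not state this, so treat it as an observation about the sample only. The sketch below checks it using the numbers copied from the output; the variable and function names are made up for this example.

```python
# Columns: writeback, reclaimable, dirtied, written, avg_bw
PER_WB_ROWS = {
    "252:16_1": (0, 0, 0, 0, 102400),
    "252:16_4284": (672, 820064, 9230368, 8410304, 685612),
    "252:16_4325": (896, 819840, 10491264, 9671648, 652348),
}

# The bdi-wide row as printed by the sample run.
BDI_ROW = (1568, 1639904, 19721632, 18081952, 1440360)


def column_sums(rows):
    """Sum each column across all per-wb rows."""
    return tuple(sum(col) for col in zip(*rows.values()))


totals = column_sums(PER_WB_ROWS)
```

For this sample, `totals` equals `BDI_ROW` in every column, including `avg_bw`.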
2024-05-05writeback: support retrieving per group debug writeback stats of bdiKemeng Shi
Add /sys/kernel/debug/bdi/xxx/wb_stats to show per group writeback stats of bdi.

The following domain hierarchy is tested:

                global domain (320G)
                /                 \
        cgroup domain1(10G)     cgroup domain2(10G)
                |                 |
bdi            wb1               wb2

/* per wb writeback info of bdi is collected */
cat wb_stats
WbCgIno:                    1
WbWriteback:                0 kB
WbReclaimable:              0 kB
WbDirtyThresh:              0 kB
WbDirtied:                  0 kB
WbWritten:                  0 kB
WbWriteBandwidth:      102400 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  0
b_dirty_time:               0
state:                      1
WbCgIno:                 4091
WbWriteback:             1792 kB
WbReclaimable:         820512 kB
WbDirtyThresh:        6004692 kB
WbDirtied:            1820448 kB
WbWritten:             999488 kB
WbWriteBandwidth:      169020 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  1
b_dirty_time:               0
state:                      5
WbCgIno:                 4131
WbWriteback:             1120 kB
WbReclaimable:         820064 kB
WbDirtyThresh:        6004728 kB
WbDirtied:            1822688 kB
WbWritten:            1002400 kB
WbWriteBandwidth:      153520 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  1
b_dirty_time:               0
state:                      5

[shikemeng@huaweicloud.com: fix build problems] Link: https://lkml.kernel.org/r/20240423034643.141219-4-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20240423034643.141219-3-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Brian Foster <bfoster@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
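A reader who wants to consume wb_stats from a script could split it into one record per writeback group. The sketch below assumes the file prints one `Key: value` pair per line and that every record starts with `WbCgIno`, which is how the sample in this changelog appears to be laid out; the function name and the sample string are illustrative, and units such as `kB` are simply dropped.

```python
def parse_wb_stats(text: str):
    """Split wb_stats text into a list of dicts, one per WbCgIno record."""
    records = []
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "WbCgIno":
            records.append({})  # a new per-wb record begins here
        fields = value.split()
        if not records or not fields:
            continue  # skip anything before the first record or without a value
        records[-1][key] = int(fields[0])  # keep the number, drop the unit
    return records


SAMPLE = """\
WbCgIno: 1
WbWriteback: 0 kB
WbWriteBandwidth: 102400 kBps
state: 1
WbCgIno: 4091
WbWriteback: 1792 kB
WbWriteBandwidth: 169020 kBps
state: 5
"""

records = parse_wb_stats(SAMPLE)
```

On a real system the text would come from reading `/sys/kernel/debug/bdi/<bdi>/wb_stats`, which normally requires root and a mounted debugfs.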
2024-05-05writeback: collect stats of all wb of bdi in bdi_debug_stats_showKemeng Shi
Patch series "Improve visibility of writeback", v5. This series tries to improve visibility of writeback. Patch 1 makes /sys/kernel/debug/bdi/xxx/stats show writeback info of the whole bdi instead of only writeback info in the root cgroup. Patch 2 adds a new debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per wb writeback info. Patch 3 adds wb_monitor.py to monitor basic writeback info of a running system; more info could be added on demand. Patch 4 is a random cleanup. More details can be found in the respective patches.

The following domain hierarchy is tested:

                global domain (320G)
                /                 \
        cgroup domain1(10G)     cgroup domain2(10G)
                |                 |
bdi            wb1               wb2

/* all writeback info of bdi is successfully collected */
cat stats
BdiWriteback:             4704 kB
BdiReclaimable:        1294496 kB
BdiDirtyThresh:      204208088 kB
DirtyThresh:         195259944 kB
BackgroundThresh:     32503588 kB
BdiDirtied:           48519296 kB
BdiWritten:           47225696 kB
BdiWriteBandwidth:     1173892 kBps
b_dirty:                     1
b_io:                        0
b_more_io:                   1
b_dirty_time:                0
bdi_list:                    1
state:                       1

/* per wb writeback info of bdi is collected */
cat /sys/kernel/debug/bdi/252:16/wb_stats
WbCgIno:                    1
WbWriteback:                0 kB
WbReclaimable:              0 kB
WbDirtyThresh:              0 kB
WbDirtied:                  0 kB
WbWritten:                  0 kB
WbWriteBandwidth:      102400 kBps
b_dirty:                    0
b_io:                       0
b_more_io:                  0
b_dirty_time:               0
state:                      1
WbCgIno:                 4208
WbWriteback:            59808 kB
WbReclaimable:         676480 kB
WbDirtyThresh:        6004624 kB
WbDirtied:           23348192 kB
WbWritten:           22614592 kB
WbWriteBandwidth:      593204 kBps
b_dirty:                    1
b_io:                       1
b_more_io:                  0
b_dirty_time:               0
state:                      7
WbCgIno:                 4249
WbWriteback:           144256 kB
WbReclaimable:         432096 kB
WbDirtyThresh:        6004344 kB
WbDirtied:           25727744 kB
WbWritten:           25154752 kB
WbWriteBandwidth:      577904 kBps
b_dirty:                    0
b_io:                       1
b_more_io:                  0
b_dirty_time:               0
state:                      7

The wb_monitor.py script output is as follows:

./wb_monitor.py 252:16 -c
                  writeback  reclaimable     dirtied     written    avg_bw
252:16_1                  0            0           0           0    102400
252:16_4284             672       820064     9230368     8410304    685612
252:16_4325             896       819840    10491264     9671648    652348
252:16                 1568      1639904    19721632    18081952   1440360

                  writeback  reclaimable     dirtied     written    avg_bw
252:16_1                  0            0           0           0    102400
252:16_4284             672       820064     9230368     8410304    685612
252:16_4325             896       819840    10491264     9671648    652348
252:16                 1568      1639904    19721632    18081952   1440360
...

This patch (of 5): /sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information of the whole bdi, but only writeback information of the bdi in the root cgroup is collected. So writeback information in non-root cgroups is missing now. To be more specific, consider the following case:

/* create writeback cgroup */
cd /sys/fs/cgroup
echo "+memory +io" > cgroup.subtree_control
mkdir group1
cd group1
echo $$ > cgroup.procs
/* do writeback in cgroup */
fio -name test -filename=/dev/vdb ...
/* get writeback info of bdi */
cat /sys/kernel/debug/bdi/xxx/stats

The cat result unexpectedly implies that there is no writeback on the target bdi. Fix this by collecting stats of all wb in the bdi instead of only the wb in the root cgroup.

The following domain hierarchy is tested:

                global domain (320G)
                /                 \
        cgroup domain1(10G)     cgroup domain2(10G)
                |                 |
bdi            wb1               wb2

/* all writeback info of bdi is successfully collected */
cat stats
BdiWriteback:             2912 kB
BdiReclaimable:        1598464 kB
BdiDirtyThresh:      167479028 kB
DirtyThresh:         195038532 kB
BackgroundThresh:     32466728 kB
BdiDirtied:           19141696 kB
BdiWritten:           17543456 kB
BdiWriteBandwidth:     1136172 kBps
b_dirty:                     2
b_io:                        0
b_more_io:                   1
b_dirty_time:                0
bdi_list:                    1
state:                       1

Link: https://lkml.kernel.org/r/20240423034643.141219-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20240423034643.141219-2-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Brian Foster <bfoster@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05selftests/mm: soft-dirty should fail if a testcase failsRyan Roberts
Previously soft-dirty was unconditionally exiting with success, even if one of its testcases failed. Let's fix that so that failure can be reported to automated systems properly. Link: https://lkml.kernel.org/r/20240424105301.3157695-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: vmalloc: dump page owner info if page is already mappedHariom Panthi
In vmap_pte_range, BUG_ON is called when a page is already mapped, but it doesn't give enough information to debug further. Dumping page owner information along with BUG_ON will be more useful in case of multiple page mappings.

Example:

[ 14.552875] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10b923
[ 14.553440] flags: 0xbffff0000000000(node=0|zone=2|lastcpupid=0x3ffff)
[ 14.554001] page_type: 0xffffffff()
[ 14.554783] raw: 0bffff0000000000 0000000000000000 dead000000000122 0000000000000000
[ 14.555230] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
[ 14.555768] page dumped because: remapping already mapped page
[ 14.556172] page_owner tracks the page as allocated
[ 14.556482] page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL), pid 80, tgid 80 (insmod), ts 14552004992, free_ts 0
[ 14.557286] prep_new_page+0xa8/0x10c
[ 14.558052] get_page_from_freelist+0x7f8/0x1248
[ 14.558298] __alloc_pages+0x164/0x2b4
[ 14.558514] alloc_pages_mpol+0x88/0x230
[ 14.558904] alloc_pages+0x4c/0x7c
[ 14.559157] load_module+0x74/0x1af4
[ 14.559361] __do_sys_init_module+0x190/0x1fc
[ 14.559615] __arm64_sys_init_module+0x1c/0x28
[ 14.559883] invoke_syscall+0x44/0x108
[ 14.560109] el0_svc_common.constprop.0+0x40/0xe0
[ 14.560371] do_el0_svc_compat+0x1c/0x34
[ 14.560600] el0_svc_compat+0x2c/0x80
[ 14.560820] el0t_32_sync_handler+0x90/0x140
[ 14.561040] el0t_32_sync+0x194/0x198
[ 14.561329] page_owner free stack trace missing
[ 14.562049] ------------[ cut here ]------------
[ 14.562314] kernel BUG at mm/vmalloc.c:113!
Link: https://lkml.kernel.org/r/20240424111838.3782931-2-hariom1.p@samsung.com Signed-off-by: Hariom Panthi <hariom1.p@samsung.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Maninder Singh <maninder1.s@samsung.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rohit Thapliyal <r.thapliyal@samsung.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared()David Hildenbrand
We want to limit the use of page_mapcount() to places where absolutely required, to prepare for kernel configs where we won't keep track of per-page mapcounts in large folios. khugepaged is one of the remaining "more challenging" page_mapcount() users, but we might be able to move away from page_mapcount() without resulting in a significant behavior change that would warrant special-casing based on kernel configs. In 2020, we first added support to khugepaged for collapsing COW-shared pages via commit 9445689f3b61 ("khugepaged: allow to collapse a page shared across fork"), followed by support for collapsing PTE-mapped THP in commit 5503fbf2b0b8 ("khugepaged: allow to collapse PTE-mapped compound pages") and limiting the memory waste via the "page_count() > 1" check in commit 71a2c112a0f6 ("khugepaged: introduce 'max_ptes_shared' tunable"). As a default, khugepaged will allow up to half of the PTEs to map shared pages: where page_mapcount() > 1. MADV_COLLAPSE ignores the khugepaged setting. khugepaged currently does not care about swapcache page references, and does not check under folio lock: so in some corner cases the "shared vs. exclusive" detection might be a bit off, making us detect "exclusive" when it's actually "shared". Most of our anonymous folios in the system are usually exclusive. We frequently see sharing of anonymous folios for a short period of time, after which our short-lived subprocesses either quit or exec(). There are some famous examples, though, where child processes exist for a long time, and where memory is COW-shared with a lot of processes (web servers, web browsers, sshd, ...) and COW-sharing is crucial for reducing the memory footprint. We don't want to suddenly change the behavior to result in a significant increase in memory waste. Interestingly, khugepaged will only collapse an anonymous THP if at least one PTE is writable.
After fork(), that means that something (usually a page fault) populated at least a single exclusive anonymous THP in that PMD range. So ... what happens when we switch to "is this folio mapped shared" instead of "is this page mapped shared" by using folio_likely_mapped_shared()? For "not-COW-shared" folios, small folios and for THPs (large folios) that are completely mapped into at least one process, switching to folio_likely_mapped_shared() will not result in a change. We'll only see a change for COW-shared PTE-mapped THPs that are partially mapped into all involved processes. There are two cases to consider: (A) folio_likely_mapped_shared() returns "false" for a PTE-mapped THP If the folio is detected as exclusive, and it actually is exclusive, there is no change: page_mapcount() == 1. This is the common case without fork() or with short-lived child processes. folio_likely_mapped_shared() might currently still detect a folio as exclusive although it is shared (false negatives): if the first page is not mapped multiple times and if the average per-page mapcount is smaller than 1, implying that (1) the folio is partially mapped and (2) if we are responsible for many mapcounts by mapping many pages others can't ("mostly exclusive") (3) if we are not responsible for many mapcounts by mapping little pages ("mostly shared") it won't make a big impact on the end result. So while we might now detect a page as "exclusive" although it isn't, it's not expected to make a big difference in common cases. (B) folio_likely_mapped_shared() returns "true" for a PTE-mapped THP folio_likely_mapped_shared() will never detect a large anonymous folio as shared although it is exclusive: there are no false positives. If we detect a THP as shared, at least one page of the THP is mapped by another process. It could well be that some pages are actually exclusive. 
For example, our child processes could have unmapped/COW'ed some pages such that they would now be exclusive to our process, which we now would treat as still-shared. Examples: (1) Parent maps all pages of a THP, child maps some pages. We detect all pages in the parent as shared although some are actually exclusive. (2) Parent maps all but some page of a THP, child maps the remainder. We detect all pages of the THP that the parent maps as shared although they are all exclusive. In (1) we wouldn't collapse a THP right now already: no PTE is writable, because a write fault would have resulted in COW of a single page and the parent would no longer map all pages of that THP. For (2) we would have collapsed a THP in the parent so far, now we wouldn't as long as the child process is still alive: unless the child process unmaps the remaining THP pages or we decide to split that THP. Possibly, the child COW'ed many pages, meaning that it's likely that we can populate a THP for our child first, and then for our parent. For (2), we are making really bad use of the THP in the first place (not even mapped completely in at least one process). If the THP would be completely partially mapped, it would be on the deferred split queue where we would split it lazily later. For short-running child processes, we don't particularly care. For long-running processes, the expectation is that such scenarios are rather rare: further, a THP might be best placed if most data in the PMD range is actually written, implying that we'll have to COW more pages first before khugepaged would collapse it. To summarize, in the common case, this change is not expected to matter much. The more common application of khugepaged operates on exclusive pages, either before fork() or after a child quit. Can we improve (A)? Yes, if we implement more precise tracking of "mapped shared" vs. "mapped exclusively", we could get rid of the false negatives completely. Can we improve (B)?
We could count how many pages of a large folio we map inside the current page table and detect that we are responsible for most of the folio mapcount and conclude "as good as exclusive", which might help in some cases. ... but likely, some other mechanism should detect that the THP is not a good use in the scenario (not even mapped completely in a single process) and try splitting that folio lazily etc. We'll move the folio_test_anon() check before our "shared" check, so we might get more expressive results for SCAN_EXCEED_SHARED_PTE: this order of checks now matches the one in __collapse_huge_page_isolate(). Extend documentation. Link: https://lkml.kernel.org/r/20240424122630.495788-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
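The false-negative and no-false-positive properties discussed above can be explored with a toy model. This is a user-space sketch of a "first page plus average mapcount" heuristic in the spirit of the changelog, not the kernel's folio_likely_mapped_shared(). The function name and the list-of-mapcounts representation are hypothetical, and treating an average per-page mapcount above 1 as "shared" is an assumption.

```python
def likely_mapped_shared(page_mapcounts):
    """Guess whether a PTE-mapped folio is mapped by more than one process.

    `page_mapcounts` holds one mapcount per page of the folio. The guess
    only looks at the total mapcount and the first page, mirroring a
    kernel that does not want to inspect every per-page mapcount.
    """
    nr_pages = len(page_mapcounts)
    total = sum(page_mapcounts)
    if nr_pages == 1:
        return total > 1            # small folio: the mapcount is exact
    if total <= 1:
        return False                # a single mapping is exclusive
    if total > nr_pages:
        return True                 # average > 1: some page is shared
    return page_mapcounts[0] > 1    # otherwise, guess from the first page


def actually_shared(page_mapcounts):
    """Ground truth for the toy model: any page mapped more than once."""
    return any(count > 1 for count in page_mapcounts)
```

In this model, `[1, 2, 0, 0]` is a false negative: the second page is shared, but the first page is mapped once and the average mapcount is below 1, so the guess is "exclusive". A "shared" result is never wrong, because a total above the page count forces at least one page to be mapped more than once.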