Age | Commit message | Author
2024-05-08 | codetag: rcu-ify idx to codetag lookup [memalloc-prof-v7] | Kent Overstreet
This substantially reworks how we store and iterate over codetags in multiple modules. We now provide a stable index for each codetag, which doesn't change as modules are loaded and unloaded - meaning on module load, we find a free range of indices for the new module. Index-to-codetag lookup uses an eytzinger tree, and we use RCU for the eytzinger tree for lockless lookup. The eytzinger tree has one entry per module, where entries store the starting index for each module's range. After eytzinger lookup we use eytzinger_to_inorder(), as the codetag_modules are still a flat array. With the new stable indices, iteration becomes simpler; the iterator just stores the current index and peek() finds the first codetag >= the current index. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 | darray: lift from bcachefs | Kent Overstreet
Dynamic arrays, inspired by CCAN darrays, so we don't have to open code them: darray_init(), darray_exit(), darray_push(), darray_make_room(), darray_for_each(), etc. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
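For illustration, a minimal userspace sketch of the kind of API listed above; this is not the kernel's darray.h (the macro details there differ, and this sketch relies on GCC/Clang statement expressions), just the general shape:

  #include <stdio.h>
  #include <stdlib.h>

  /* toy darray: a growable array of T with the operations named above */
  #define DARRAY(T) struct { T *data; size_t nr, size; }

  #define darray_init(d) do { (d)->data = NULL; (d)->nr = (d)->size = 0; } while (0)
  #define darray_exit(d) do { free((d)->data); darray_init(d); } while (0)

  /* make sure there is room for 'more' extra elements; 0 on success, -1 on ENOMEM */
  #define darray_make_room(d, more) ({                                  \
      int _ret = 0;                                                     \
      if ((d)->nr + (more) > (d)->size) {                               \
          size_t _new = (d)->size ? (d)->size * 2 : 8;                  \
          while (_new < (d)->nr + (more))                               \
              _new *= 2;                                                \
          void *_p = realloc((d)->data, _new * sizeof(*(d)->data));     \
          if (_p) { (d)->data = _p; (d)->size = _new; } else _ret = -1; \
      }                                                                 \
      _ret; })

  #define darray_push(d, item) \
      (darray_make_room(d, 1) ?: ((d)->data[(d)->nr++] = (item), 0))

  #define darray_for_each(d, i) \
      for (i = (d).data; i < (d).data + (d).nr; i++)

  int main(void)
  {
      DARRAY(int) nums;
      int *i;

      darray_init(&nums);
      for (int v = 1; v <= 5; v++)
          if (darray_push(&nums, v * v))
              return 1;
      darray_for_each(nums, i)
          printf("%d\n", *i);    /* prints 1 4 9 16 25 */
      darray_exit(&nums);
      return 0;
  }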
2024-05-07 | eytzinger: Promote to include/linux/ | Kent Overstreet
Eytzinger trees are a faster alternative to binary search. They're a bit more expensive to set up, but lookups perform much better assuming the tree isn't entirely in cache. Binary search is a worst-case scenario for branch prediction and prefetching, but eytzinger trees have children adjacent in memory and thus we can prefetch before knowing the result of a comparison. An eytzinger tree is a binary tree laid out in an array, with the same geometry as the usual binary heap construction, but used as a search tree instead. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
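As a standalone sketch of the idea (not the kernel header, which differs in details such as the index base): with the root at index 0, the children of node i sit at 2*i+1 and 2*i+2, adjacent in memory, so both can be prefetched before the comparison result is known.

  #include <stdio.h>

  /* illustrative only: search a sorted set stored in eytzinger (BFS) order */
  static int eytzinger_search(const int *tree, int n, int key)
  {
      int i = 0;

      while (i < n) {
          if (tree[i] == key)
              return i;
          /* go left if key < tree[i], right otherwise */
          i = 2 * i + 1 + (key > tree[i]);
      }
      return -1;
  }

  int main(void)
  {
      /* the sorted set {10,20,30,40,50,60,70} laid out in eytzinger order */
      int tree[] = { 40, 20, 60, 10, 30, 50, 70 };

      printf("%d\n", eytzinger_search(tree, 7, 50));    /* prints 5 */
      return 0;
  }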
2024-05-07 | __mod_memcg_lruvec_state-enhance-diagnostics-fix | Andrew Morton
VM_WARN_ON_IRQS_ENABLED is broken Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | __mod_memcg_lruvec_state(): enhance diagnostics | Andrew Morton
syzbot reports a warning (https://lkml.kernel.org/r/0000000000007545d00615188a03@google.com). Add additional info to perhaps help with this. Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | mm: add swappiness= arg to memory.reclaim | Dan Schatzberg
Allow proactive reclaimers to submit an additional swappiness=<val> argument to memory.reclaim. This overrides the global or per-memcg swappiness setting for that reclaim attempt. For example: echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim will perform reclaim on the rootcg with a swappiness setting of 0 (no swap) regardless of the vm.swappiness sysctl setting. Userspace proactive reclaimers use the memory.reclaim interface to trigger reclaim. The memory.reclaim interface does not allow for any way to effect the balance of file vs anon during proactive reclaim. The only approach is to adjust the vm.swappiness setting. However, there are a few reasons we look to control the balance of file vs anon during proactive reclaim, separately from reactive reclaim: * Swapout should be limited to manage SSD write endurance. In near-OOM situations we are fine with lots of swap-out to avoid OOMs. As these are typically rare events, they have relatively little impact on write endurance. However, proactive reclaim runs continuously and so its impact on SSD write endurance is more significant. Therefore it is desireable to control swap-out for proactive reclaim separately from reactive reclaim * Some userspace OOM killers like systemd-oomd[1] support OOM killing on swap exhaustion. This makes sense if the swap exhaustion is triggered due to reactive reclaim but less so if it is triggered due to proactive reclaim (e.g. one could see OOMs when free memory is ample but anon is just particularly cold). Therefore, it's desireable to have proactive reclaim reduce or stop swap-out before the threshold at which OOM killing occurs. In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness before writes to memory.reclaim[2]. This has been in production for nearly two years and has addressed our needs to control proactive vs reactive reclaim behavior but is still not ideal for a number of reasons: * vm.swappiness is a global setting, adjusting it can race/interfere with other system administration that wishes to control vm.swappiness. In our case, we need to disable Senpai before adjusting vm.swappiness. * vm.swappiness is stateful - so a crash or restart of Senpai can leave a misconfigured setting. This requires some additional management to record the "desired" setting and ensure Senpai always adjusts to it. With this patch, we avoid these downsides of adjusting vm.swappiness globally. [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598 Link: https://lkml.kernel.org/r/20240103164841.2800183-3-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Suggested-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Yue Zhao <findns94@gmail.com> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | mm: add defines for min/max swappiness | Dan Schatzberg
Patch series "Add swappiness argument to memory.reclaim", v6. This patch proposes augmenting the memory.reclaim interface with a swappiness=<val> argument that overrides the swappiness value for that instance of proactive reclaim. Userspace proactive reclaimers use the memory.reclaim interface to trigger reclaim. The memory.reclaim interface does not allow for any way to effect the balance of file vs anon during proactive reclaim. The only approach is to adjust the vm.swappiness setting. However, there are a few reasons we look to control the balance of file vs anon during proactive reclaim, separately from reactive reclaim: * Swapout should be limited to manage SSD write endurance. In near-OOM situations we are fine with lots of swap-out to avoid OOMs. As these are typically rare events, they have relatively little impact on write endurance. However, proactive reclaim runs continuously and so its impact on SSD write endurance is more significant. Therefore it is desireable to control swap-out for proactive reclaim separately from reactive reclaim * Some userspace OOM killers like systemd-oomd[1] support OOM killing on swap exhaustion. This makes sense if the swap exhaustion is triggered due to reactive reclaim but less so if it is triggered due to proactive reclaim (e.g. one could see OOMs when free memory is ample but anon is just particularly cold). Therefore, it's desireable to have proactive reclaim reduce or stop swap-out before the threshold at which OOM killing occurs. In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness before writes to memory.reclaim[2]. This has been in production for nearly two years and has addressed our needs to control proactive vs reactive reclaim behavior but is still not ideal for a number of reasons: * vm.swappiness is a global setting, adjusting it can race/interfere with other system administration that wishes to control vm.swappiness. In our case, we need to disable Senpai before adjusting vm.swappiness. * vm.swappiness is stateful - so a crash or restart of Senpai can leave a misconfigured setting. This requires some additional management to record the "desired" setting and ensure Senpai always adjusts to it. With this patch, we avoid these downsides of adjusting vm.swappiness globally. Previously, this exact interface addition was proposed by Yosry[3]. In response, Roman proposed instead an interface to specify precise file/anon/slab reclaim amounts[4]. More recently Huan also proposed this as well[5] and others similarly questioned if this was the proper interface. Previous proposals sought to use this to allow proactive reclaimers to effectively perform a custom reclaim algorithm by issuing proactive reclaim with different settings to control file vs anon reclaim (e.g. to only reclaim anon from some applications). Responses argued that adjusting swappiness is a poor interface for custom reclaim. In contrast, I argue in favor of a swappiness setting not as a way to implement custom reclaim algorithms but rather to bias the balance of anon vs file due to differences of proactive vs reactive reclaim. In this context, swappiness is the existing interface for controlling this balance and this patch simply allows for it to be configured differently for proactive vs reactive reclaim. Specifying explicit amounts of anon vs file pages to reclaim feels inappropriate for this prupose. 
Proactive reclaimers are un-aware of the relative age of file vs anon for a cgroup which makes it difficult to manage proactive reclaim of different memory pools. A proactive reclaimer would need some amount of anon reclaim attempts separate from the amount of file reclaim attempts which seems brittle given that it's difficult to observe the impact. [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598 [3]https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/ [4]https://lore.kernel.org/linux-mm/YoPHtHXzpK51F%2F1Z@carbon/ [5]https://lore.kernel.org/lkml/20231108065818.19932-1-link@vivo.com/ This patch (of 2): We use the constants 0 and 200 in a few places in the mm code when referring to the min and max swappiness. This patch adds MIN_SWAPPINESS and MAX_SWAPPINESS #defines to improve clarity. There are no functional changes. Link: https://lkml.kernel.org/r/20240103164841.2800183-1-schatzberg.dan@gmail.com Link: https://lkml.kernel.org/r/20240103164841.2800183-2-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yue Zhao <findns94@gmail.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
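For reference, the defines described in this patch take roughly the following shape; the values (0 and 200) come from the commit message, while the exact header location and the validation helper below are only illustrative assumptions:

  /* values as stated in the commit message; placement is an assumption */
  #define MIN_SWAPPINESS 0
  #define MAX_SWAPPINESS 200

  /* hypothetical example of the kind of bounds check such defines clarify */
  static int swappiness_valid(int swappiness)
  {
      return swappiness >= MIN_SWAPPINESS && swappiness <= MAX_SWAPPINESS;
  }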
2024-05-07 | mm: optimization on page allocation when CMA enabled | Zhaoyang Huang
According to the current CMA utilization policy, an alloc_pages(GFP_USER) could 'steal' UNMOVABLE & RECLAIMABLE page blocks with the help of CMA (it passes zone_watermark_ok() by counting CMA in, but uses U&R in rmqueue()), which could lead to a following alloc_pages(GFP_KERNEL) failing. Solve this by introducing a second watermark check for GFP_MOVABLE, which lets the allocation use CMA when appropriate:

  -- Free_pages(30MB)
  |
  |
  -- WMARK_LOW(25MB)
  |
  -- Free_CMA(12MB)
  |
  |
  --

Link: https://lkml.kernel.org/r/20231016071245.2865233-1-zhaoyang.huang@unisoc.com Link: https://lkml.kernel.org/r/1683782550-25799-1-git-send-email-zhaoyang.huang@unisoc.com Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: ke.wang <ke.wang@unisoc.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Zhaoyang Huang <huangzhaoyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | filemap: replace pte_offset_map() with pte_offset_map_nolock() | ZhangPeng
The vmf->ptl in filemap_fault_recheck_pte_none() is still set from handle_pte_fault(). But at the same time, we did a pte_unmap(vmf->pte). After a pte_unmap(vmf->pte) unmap and rcu_read_unlock(), the page table may be racily changed and vmf->ptl maybe fails to protect the actual page table. Fix this by replacing pte_offset_map() with pte_offset_map_nolock(). Link: https://lkml.kernel.org/r/20240313012913.2395414-1-zhangpeng362@huawei.com Fixes: 58f327f2ce80 ("filemap: avoid unnecessary major faults in filemap_fault()") Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | selftests: cgroup: add tests to verify the zswap writeback path | Usama Arif
Initiate writeback with the below steps and check using memory.stat.zswpwb if zswap writeback occurred: 1. Allocate memory. 2. Reclaim memory equal to the amount that was allocated in step 1. This will move it into zswap. 3. Save current zswap usage. 4. Move the memory allocated in step 1 back in from zswap. 5. Set zswap.max to half the amount that was recorded in step 3. 6. Attempt to reclaim memory equal to the amount that was allocated; this will either trigger writeback if it's enabled, or reclamation will fail if writeback is disabled, as there isn't enough zswap space. Link: https://lkml.kernel.org/r/20240507101443.720190-1-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Suggested-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | mm: memcg: make alloc_mem_cgroup_per_node_info() return bool | Xiu Jianfeng
alloc_mem_cgroup_per_node_info() returns int that doesn't map to any errno error code. The only existing caller doesn't really need an error code so change the function to return bool (true on success) because this is slightly less confusing and more consistent with the other code. Link: https://lkml.kernel.org/r/20240507132324.1158510-1-xiujianfeng@huawei.com Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | mm/damon/core: fix return value from damos_wmark_metric_value | Alex Rusuf
damos_wmark_metric_value()'s return type is 'unsigned long', so returning -EINVAL as an 'unsigned long' turns into a value very different from the expected one (due to 2's complement) and gets treated as a usual metric value. So fix that, checking whether the returned value is not 0. Link: https://lkml.kernel.org/r/20240506180238.53842-1-sj@kernel.org Fixes: ee801b7dd782 ("mm/damon/schemes: activate schemes based on a watermarks mechanism") Signed-off-by: Alex Rusuf <yorha.op@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
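A standalone illustration of the problem (not the DAMON code itself): a negative errno returned through an unsigned type silently becomes a huge, valid-looking value, which is why callers should get a plain 0 on error instead.

  #include <stdio.h>
  #include <errno.h>

  static unsigned long metric_value_buggy(void)
  {
      return -EINVAL;    /* wraps to ULONG_MAX - 21 on 64-bit: looks like a real metric */
  }

  static unsigned long metric_value_fixed(void)
  {
      return 0;          /* callers treat 0 as "no value", not as an errno */
  }

  int main(void)
  {
      printf("buggy: %lu\nfixed: %lu\n", metric_value_buggy(), metric_value_fixed());
      return 0;
  }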
2024-05-07 | mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED | Yosry Ahmed
Previously, all NR_VM_EVENT_ITEMS stats were maintained per-memcg, although some of those fields are not exposed anywhere. Commit 14e0f6c957e39 ("memcg: reduce memory for the lruvec and memcg stats") changed this such that we only maintain the stats we actually expose per-memcg via a translation table. Additionally, commit 514462bbe927b ("memcg: warn for unexpected events and stats") added a warning if a per-memcg stat update is attempted for a stat that is not in the translation table. The warning started firing for the NR_{FILE/SHMEM}_PMDMAPPED stat updates in the rmap code. These stats are not maintained per-memcg, and hence are not in the translation table. Do not use __lruvec_stat_mod_folio() when updating NR_FILE_PMDMAPPED and NR_SHMEM_PMDMAPPED. Use __mod_node_page_state() instead, which updates the global per-node stats only. Link: https://lkml.kernel.org/r/20240506192924.271999-1-yosryahmed@google.com Fixes: 514462bbe927 ("memcg: warn for unexpected events and stats") Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: syzbot+9319a4268a640e26b72b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/0000000000001b9d500617c8b23c@google.com Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
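The substitution described above looks roughly like the following; this is a paraphrased sketch of the rmap call sites, with the page count shown only as a placeholder name:

  - __lruvec_stat_mod_folio(folio, NR_FILE_PMDMAPPED, nr_pages);
  + __mod_node_page_state(folio_pgdat(folio), NR_FILE_PMDMAPPED, nr_pages);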
2024-05-07 | selftests: cgroup: remove redundant enabling of memory controller | Usama Arif
Memory controller is already enabled in main which invokes the test, hence this does not need to be done in test_no_kmem_bypass. Link: https://lkml.kernel.org/r/20240502200529.4193651-2-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree | SeongJae Park
The document mentions that any patches for review should be based on mm-unstable instead of damon/next. That should be the recommended process, but sometimes patches based on damon/next may be posted for some reasons. Actually, the DAMON-based tiered memory management patchset[1] was written on top of the 'young page' DAMOS filter patchset, which was in the damon/next tree as of this writing. Allow such cases and just ask that they be clearly specified. [1] https://lore.kernel.org/20240405060858.2818-1-honggyu.kim@sk.com Link: https://lkml.kernel.org/r/20240503180318.72798-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT | SeongJae Park
The document says the maintainer works only in PST. The maintainer observes daylight saving time, though. Update the time zone to PT. Link: https://lkml.kernel.org/r/20240503180318.72798-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | Docs/mm/damon/design: fix build warning | SeongJae Park
Commit b7138c7d40b0 ("Docs/mm/damon/design: use a list for supported filters") of the mm-unstable tree is causing the below warning and error with 'make htmldocs'. Documentation/mm/damon/design.rst:482: ERROR: Unexpected indentation. Documentation/mm/damon/design.rst:483: WARNING: Block quote ends without a blank line; unexpected unindent. The problem is caused by wrong indentation of nested list items. Fix the wrong indentation. Link: https://lkml.kernel.org/r/20240507161747.52430-1-sj@kernel.org Fixes: b7138c7d40b0 ("Docs/mm/damon/design: use a list for supported filters") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/20240507162623.4d94d455@canb.auug.org.au Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | Docs/mm/damon/design: use a list for supported filters | SeongJae Park
The Filters section lists the currently supported filter types in a normal paragraph. Since there are more than four types, it is not easy to pick out a specific one. Use a list so that specific types are easier to find. Link: https://lkml.kernel.org/r/20240503180318.72798-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command | SeongJae Park
To update the effective size quota of DAMOS schemes via the DAMON sysfs file interface, users should write 'update_schemes_effective_quotas' to the kdamond 'state' file. But the document mistakenly gives the input string as 'update_schemes_effective_bytes'. Fix it (s/bytes/quotas/). Link: https://lkml.kernel.org/r/20240503180318.72798-8-sj@kernel.org Fixes: a6068d6dfa2f ("Docs/admin-guide/mm/damon/usage: document effective_bytes file") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.9.x] Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file | SeongJae Park
The example usage of the DAMOS filter sysfs files, specifically the part writing the 'matching' file for a memcg type filter, is wrong. The intention is to exclude pages of a memcg that is already getting enough care from a given scheme, but the example sets the filter to apply the scheme to only the pages of that memcg. Fix it. Link: https://lkml.kernel.org/r/20240503180318.72798-7-sj@kernel.org Fixes: 9b7f9322a530 ("Docs/admin-guide/mm/damon/usage: document DAMOS filters of sysfs") Closes: https://lore.kernel.org/r/20240317191358.97578-1-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.3.x] Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | selftests/damon: classify tests for functionalities and regressions | SeongJae Park
DAMON selftests can be classified into two categories: functionalities and regressions. Functionality tests are for checking if the function is working as specified, while the regression tests are basically reproducers of previously reported and fixed bugs. The tests of the categories are mixed in the selftests Makefile. Separate those for easier understanding of the types of tests. Link: https://lkml.kernel.org/r/20240503180318.72798-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' | SeongJae Park
_damon_sysfs.py is using '==' or '!=' for 'None'. Since 'None' is a singleton, using 'is' or 'is not' is more efficient. Use the more efficient one. Link: https://lkml.kernel.org/r/20240503180318.72798-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts | SeongJae Park
_damon_sysfs.py assumes sysfs is mounted at /sys. In some systems, that might not be true. Find the mount point from /proc/mounts file content. Link: https://lkml.kernel.org/r/20240503180318.72798-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | selftests/damon/_damon_sysfs: check errors from nr_schemes file reads | SeongJae Park
DAMON context staging method in _damon_sysfs.py is not checking the returned error from nr_schemes file read. Check it. Link: https://lkml.kernel.org/r/20240503180318.72798-3-sj@kernel.org Fixes: f5f0e5a2bef9 ("selftests/damon/_damon_sysfs: implement kdamonds start function") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() | SeongJae Park
Patch series "mm/damon: misc fixes and improvements". Add miscelleneous and non-urgent fixes and improvements for DAMON code, selftests, and documents. This patch (of 10): damos_quota_init_priv() function should initialize all private fields of struct damos_quota. However, it is not initializing ->esz_bp field. This could result in use of uninitialized variable from damon_feed_loop_next_input() function. There is no such issue at the moment because every caller of the function is passing damos_quota object that already having the field zero value. But we cannot guarantee the future, and the function is not doing what it is promising. A bug is a bug. This fix is for preventing possible future issues. Link: https://lkml.kernel.org/r/20240503180318.72798-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240503180318.72798-2-sj@kernel.org Fixes: 9294a037c015 ("mm/damon/core: implement goal-oriented feedback-driven quota auto-tuning") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | selftests/damon: add a test for DAMOS quota goal | SeongJae Park
Add a selftest for DAMOS quota goal. It tests the feature by setting a user_input metric-based goal, changing the current feedback, and checking whether the effective quota size is increased and decreased as expected. Link: https://lkml.kernel.org/r/20240502172718.74166-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | selftests/damon/_damon_sysfs: support quota goals | SeongJae Park
Patch series "selftests/damon: add DAMOS quota goal test". Extend DAMON selftest-purpose sysfs wrapper to support DAMOS quota goal, and implement a simple selftest for the feature using it. This patch (of 2): The DAMON sysfs test purpose wrapper, _damon_sysfs.py, is not supporting quota goals. Implement the support for testing the feature. The test will be implemented and added by the following commit. Link: https://lkml.kernel.org/r/20240502172718.74166-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240502172718.74166-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | mm/page-owner: use gfp_nested_mask() instead of open coded masking | Dave Chinner
The page-owner tracking code records stack traces during page allocation. To do this, it must do a memory allocation for the stack information from inside an existing memory allocation context. This internal allocation must obey the high level caller allocation constraints to avoid generating false positive warnings that have nothing to do with the code they are instrumenting/tracking (e.g. through lockdep reclaim state tracking) We also don't want recording stack traces to deplete emergency memory reserves - debug code is useless if it creates new issues that can't be replicated when the debug code is disabled. Switch the stack tracking allocation masking to use gfp_nested_mask() to address these issues. gfp_nested_mask() naturally strips GFP_ZONEMASK, too, which greatly simplifies this code. Link: https://lkml.kernel.org/r/20240430054604.4169568-4-david@fromorbit.com Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | stackdepot: use gfp_nested_mask() instead of open coded masking | Dave Chinner
The stackdepot code is used by KASAN and lockdep for recoding stack traces. Both of these track allocation context information, and so their internal allocations must obey the caller allocation contexts to avoid generating their own false positive warnings that have nothing to do with the code they are instrumenting/tracking. We also don't want recording stack traces to deplete emergency memory reserves - debug code is useless if it creates new issues that can't be replicated when the debug code is disabled. Switch the stackdepot allocation masking to use gfp_nested_mask() to address these issues. gfp_nested_mask() also strips GFP_ZONEMASK naturally, so that greatly simplifies this code. Link: https://lkml.kernel.org/r/20240430054604.4169568-3-david@fromorbit.com Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | mm: lift gfp_kmemleak_mask() to gfp.h | Dave Chinner
Patch series "mm: fix nested allocation context filtering". This patchset is the followup to the comment I made earlier today: https://lore.kernel.org/linux-xfs/ZjAyIWUzDipofHFJ@dread.disaster.area/ Tl;dr: Memory allocations that are done inside the public memory allocation API need to obey the reclaim recursion constraints placed on the allocation by the original caller, including the "don't track recursion for this allocation" case defined by __GFP_NOLOCKDEP. These nested allocations are generally in debug code that is tracking something about the allocation (kmemleak, KASAN, etc) and so are allocating private kernel objects that only that debug system will use. Neither the page-owner code nor the stack depot code get this right. They also also clear GFP_ZONEMASK as a separate operation, which is completely redundant because the constraint filter applied immediately after guarantees that GFP_ZONEMASK bits are cleared. kmemleak gets this filtering right. It preserves the allocation constraints for deadlock prevention and clears all other context flags whilst also ensuring that the nested allocation will fail quickly, silently and without depleting emergency kernel reserves if there is no memory available. This can be made much more robust, immune to whack-a-mole games and the code greatly simplified by lifting gfp_kmemleak_mask() to include/linux/gfp.h and using that everywhere. Also document it so that there is no excuse for not knowing about it when writing new debug code that nests allocations. Tested with lockdep, KASAN + page_owner=on and kmemleak=on over multiple fstests runs with XFS. This patch (of 3): Any "internal" nested allocation done from within an allocation context needs to obey the high level allocation gfp_mask constraints. This is necessary for debug code like KASAN, kmemleak, lockdep, etc that allocate memory for saving stack traces and other information during memory allocation. If they don't obey things like __GFP_NOLOCKDEP or __GFP_NOWARN, they produce false positive failure detections. kmemleak gets this right by using gfp_kmemleak_mask() to pass through the relevant context flags to the nested allocation to ensure that the allocation follows the constraints of the caller context. KASAN recently was foudn to be missing __GFP_NOLOCKDEP due to stack depot allocations, and even more recently the page owner tracking code was also found to be missing __GFP_NOLOCKDEP support. We also don't wan't want KASAN or lockdep to drive the system into OOM kill territory by exhausting emergency reserves. This is something that kmemleak also gets right by adding (__GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN) to the allocation mask. Hence it is clear that we need to define a common nested allocation filter mask for these sorts of third party nested allocations used in debug code. So to start this process, lift gfp_kmemleak_mask() to gfp.h and rename it to gfp_nested_mask(), and convert the kmemleak callers to use it. Link: https://lkml.kernel.org/r/20240430054604.4169568-1-david@fromorbit.com Link: https://lkml.kernel.org/r/20240430054604.4169568-2-david@fromorbit.com Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | mm-memory-failure-send-sigbus-in-the-event-of-thp-split-fail-fix | Andrew Morton
update for folio conversions Cc: Jane Chu <jane.chu@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | mm/memory-failure: send SIGBUS in the event of thp split fail | Jane Chu
When handling hwpoison on a GUP long-term pinned THP page, try_to_split_thp_page() will fail. At this point, there is little else the kernel can do except send a SIGBUS to the user process, thus giving it a chance to recover. Link: https://lkml.kernel.org/r/20240501232458.3919593-4-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | mm/madvise: add MF_ACTION_REQUIRED to madvise(MADV_HWPOISON) | Jane Chu
The soft hwpoison injector via madvise(MADV_HWPOISON) operates in a synchronous way in the sense that the injector is also a process under test, and should it have the poisoned page mapped in its address space, it should legitimately get killed just as in a real UE situation. Link: https://lkml.kernel.org/r/20240501232458.3919593-3-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | mm/memory-failure: try to send SIGBUS even if unmap failed | Jane Chu
Patch series "Enhance soft hwpoison handling and injection". This series aim at the following enhancement - 1. Let one hwpoison injector, that is, madvise(MADV_HWPOISON) to behave more like as if a real UE occurred. Because the other two injectors such as hwpoison-inject and the 'einj' on x86 can't, and it seems to me we need a better simulation to real UE scenario. 2. For years, if the kernel is unable to unmap a hwpoisoned page, it send a SIGKILL instead of SIGBUS to prevent user process from potentially accessing the page again. But in doing so, the user process also lose important information: vaddr, for recovery. Fortunately, the kernel already has code to kill process re-accessing a hwpoisoned page, so remove the '!unmap_success' check. 3. Right now, if a thp page under GUP longterm pin is hwpoisoned, and kernel cannot split the thp page, memory-failure simply ignores the UE and returns. That's not ideal, it could deliver a SIGBUS with useful information for userspace recovery. This patch (of 3): For years when it comes down to kill a process due to hwpoison, a SIGBUS is delivered only if unmap has been successful. Otherwise, a SIGKILL is delivered. And the reason for that is to prevent the involved process from accessing the hwpoisoned page again. Since then a lot has changed, a hwpoisoned page is marked and upon being re-accessed, the process will be killed immediately. So let's take out the '!unmap_success' factor and try to deliver SIGBUS if possible. Link: https://lkml.kernel.org/r/20240501232458.3919593-1-jane.chu@oracle.com Link: https://lkml.kernel.org/r/20240501232458.3919593-2-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | selftest mm/mseal: fix arm build | Jeff Xu
Add include linux/mman.h to fix arm build. Fix a typo. Link: https://lkml.kernel.org/r/20240502225331.3806279-2-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Suggested-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | selftest mm/mseal: fix compile warning | Jeff Xu
fix compile warning reported by test robot Link: https://lkml.kernel.org/r/20240420003515.345982-2-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reported-by: kernel test robot <yujie.liu@intel.com> Closes: https://lore.kernel.org/r/202404190226.OfJOewV8-lkp@intel.com/ Suggested-by: Pedro Falcato <pedro.falcato@gmail.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | selftests: mm: fix linker error for inline function | Amer Al Shanawany
add 'static' keyword to 'sys_mprotect()' to link properly, otherwise the test won't build. gcc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 /usr/bin/ld: /tmp/cc7yeABp.o: in function `test_seal_elf': seal_elf.c:(.text+0x7af): undefined reference to `sys_mprotect' /usr/bin/ld: seal_elf.c:(.text+0x82c): undefined reference to `sys_mprotect' /usr/bin/ld: seal_elf.c:(.text+0xa4e): undefined reference to `sys_mprotect' Link: https://lkml.kernel.org/r/20240420202346.546444-1-amer.shanawany@gmail.com Signed-off-by: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Jeff Xu <jeffxu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
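The usual mechanism behind such errors (general C semantics; the wrapper signature below is a hypothetical sketch, not necessarily the selftest's): a function defined as plain 'inline' without 'static' provides no out-of-line definition, so any call the compiler chooses not to inline becomes an undefined reference at link time, while 'static' keeps a local definition in the object file.

  #include <sys/syscall.h>
  #include <unistd.h>

  /* hypothetical wrapper; 'static' guarantees a definition in this unit */
  static inline int sys_mprotect(void *ptr, size_t size, unsigned long prot)
  {
      return syscall(SYS_mprotect, ptr, size, prot);
  }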
2024-05-07 | selftest mm/mseal: style change | Jeff Xu
remove "assert" from testcase. remove "return 0" Link: https://lkml.kernel.org/r/20240416220944.2481203-2-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Suggested-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | selftest mm/mseal read-only elf memory segment | Jeff Xu
Sealing read-only of elf mapping so it can't be changed by mprotect. Link: https://lkml.kernel.org/r/20240415163527.626541-6-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | mseal: add documentation | Jeff Xu
Add documentation for mseal(). Link: https://lkml.kernel.org/r/20240415163527.626541-5-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | selftest mm/mseal memory sealing | Jeff Xu
selftest for memory sealing change in mmap() and mseal(). Link: https://lkml.kernel.org/r/20240415163527.626541-4-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | mseal: add branch prediction hint | Jeff Xu
It is unlikely that application calls mm syscall, such as mprotect, on already sealed mappings, adding branch prediction hint. Link: https://lkml.kernel.org/r/20240423192825.1273679-2-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Suggested-by: Pedro Falcato <pedro.falcato@gmail.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
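For illustration, unlikely() boils down to __builtin_expect(); a standalone sketch of annotating the rare sealed-mapping path (the helper and flag below are hypothetical, not the kernel's):

  /* standalone illustration of the unlikely() idiom */
  #define unlikely(x) __builtin_expect(!!(x), 0)

  /* hypothetical seal check: sealed mappings are expected to be rare */
  static int check_not_sealed(int vma_sealed)
  {
      if (unlikely(vma_sealed))
          return -1;    /* the kernel would return -EPERM here */
      return 0;
  }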
2024-05-07 | mseal: add mseal syscall | Jeff Xu
The new mseal() is an syscall on 64 bit CPU, and with following signature: int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved. mseal() blocks following operations for the given memory range. 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Moving or expanding a different VMA into the current location, via mremap(). 3> Modifying a VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. 5> mprotect() and pkey_mprotect(). 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory. Following input during RFC are incooperated into this patch: Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Linus Torvalds: assisting in defining system call signature and scope. Liam R. Howlett: perf optimization. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD. Finally, the idea that inspired this patch comes from Stephen Röttger's work in Chrome V8 CFI. Link: https://lkml.kernel.org/r/20240415163527.626541-3-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
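A userspace usage sketch under stated assumptions: glibc provides no wrapper yet, so the raw syscall is used, and the __NR_mseal value below is an assumption taken from the 64-bit syscall table; per the description above, flags must be 0, and both mprotect() and munmap() on the sealed range are expected to fail.

  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #ifndef __NR_mseal
  #define __NR_mseal 462    /* assumed value from the 64-bit syscall table */
  #endif

  static int mseal_range(void *addr, size_t len, unsigned long flags)
  {
      return syscall(__NR_mseal, addr, len, flags);
  }

  int main(void)
  {
      size_t len = 4096;
      void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

      if (p == MAP_FAILED)
          return 1;
      if (mseal_range(p, len, 0))    /* flags are reserved, pass 0 */
          perror("mseal");
      /* on a sealed mapping, both of these are expected to fail with EPERM */
      if (mprotect(p, len, PROT_READ))
          perror("mprotect");
      if (munmap(p, len))
          perror("munmap");
      return 0;
  }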
2024-05-07 | mseal: wire up mseal syscall | Jeff Xu
Patch series "Introduce mseal", v10. This patchset proposes a new mseal() syscall for the Linux kernel. In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits. Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type. Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case. Two system calls are involved in sealing the map: mmap() and mseal(). The new mseal() is an syscall on 64 bit CPU, and with following signature: int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved. mseal() blocks following operations for the given memory range. 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Moving or expanding a different VMA into the current location, via mremap(). 3> Modifying a VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. 5> mprotect() and pkey_mprotect(). 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory. The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API. Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable, the lifetime of those mappings are tied to the lifetime of the process. Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively but write access to it is restricted using pkeys (or in the future ARM permission overlay extensions). 
The lifetime of those mappings are not tied to the lifetime of the process, therefore, while the memory is sealed, the allocators still need to free or discard the unused memory. For example, with madvise(DONTNEED). However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity. Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections. In closing, I would like to formally acknowledge the valuable contributions received during the RFC process, which were instrumental in shaping this patch: Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Liam R. Howlett: perf optimization. Linus Torvalds: assisting in defining system call signature and scope. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD. MM perf benchmarks ================== This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to check the VMAs’ sealing flag, so that no partial update can be made, when any segment within the given memory range is sealed. To measure the performance impact of this loop, two tests are developed. [8] The first is measuring the time taken for a particular system call, by using clock_gettime(CLOCK_MONOTONIC). The second is using PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have similar results. The tests have roughly below sequence: for (i = 0; i < 1000, i++) create 1000 mappings (1 page per VMA) start the sampling for (j = 0; j < 1000, j++) mprotect one mapping stop and save the sample delete 1000 mappings calculates all samples. Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz, 4G memory, Chromebook. 
Based on the latest upstream code:

The first test (measuring time)
syscall__ vmas  t  t_mseal  delta_ns  per_vma  %
munmap__  1  909  944  35  35  104%
munmap__  2  1398  1502  104  52  107%
munmap__  4  2444  2594  149  37  106%
munmap__  8  4029  4323  293  37  107%
munmap__  16  6647  6935  288  18  104%
munmap__  32  11811  12398  587  18  105%
mprotect  1  439  465  26  26  106%
mprotect  2  1659  1745  86  43  105%
mprotect  4  3747  3889  142  36  104%
mprotect  8  6755  6969  215  27  103%
mprotect  16  13748  14144  396  25  103%
mprotect  32  27827  28969  1142  36  104%
madvise_  1  240  262  22  22  109%
madvise_  2  366  442  76  38  121%
madvise_  4  623  751  128  32  121%
madvise_  8  1110  1324  215  27  119%
madvise_  16  2127  2451  324  20  115%
madvise_  32  4109  4642  534  17  113%

The second test (measuring cpu cycle)
syscall__ vmas  cpu  cmseal  delta_cpu  per_vma  %
munmap__  1  1790  1890  100  100  106%
munmap__  2  2819  3033  214  107  108%
munmap__  4  4959  5271  312  78  106%
munmap__  8  8262  8745  483  60  106%
munmap__  16  13099  14116  1017  64  108%
munmap__  32  23221  24785  1565  49  107%
mprotect  1  906  967  62  62  107%
mprotect  2  3019  3203  184  92  106%
mprotect  4  6149  6569  420  105  107%
mprotect  8  9978  10524  545  68  105%
mprotect  16  20448  21427  979  61  105%
mprotect  32  40972  42935  1963  61  105%
madvise_  1  434  497  63  63  115%
madvise_  2  752  899  147  74  120%
madvise_  4  1313  1513  200  50  115%
madvise_  8  2271  2627  356  44  116%
madvise_  16  4312  4883  571  36  113%
madvise_  32  8376  9319  943  29  111%

Based on the result, for the 6.8 kernel, the sealing check adds 20-40 nanoseconds, or around 50-100 CPU cycles, per VMA. In addition, I applied the sealing to the 5.10 kernel:

The first test (measuring time)
syscall__ vmas  t  tmseal  delta_ns  per_vma  %
munmap__  1  357  390  33  33  109%
munmap__  2  442  463  21  11  105%
munmap__  4  614  634  20  5  103%
munmap__  8  1017  1137  120  15  112%
munmap__  16  1889  2153  263  16  114%
munmap__  32  4109  4088  -21  -1  99%
mprotect  1  235  227  -7  -7  97%
mprotect  2  495  464  -30  -15  94%
mprotect  4  741  764  24  6  103%
mprotect  8  1434  1437  2  0  100%
mprotect  16  2958  2991  33  2  101%
mprotect  32  6431  6608  177  6  103%
madvise_  1  191  208  16  16  109%
madvise_  2  300  324  24  12  108%
madvise_  4  450  473  23  6  105%
madvise_  8  753  806  53  7  107%
madvise_  16  1467  1592  125  8  108%
madvise_  32  2795  3405  610  19  122%

The second test (measuring cpu cycle)
syscall__ nbr_vma  cpu  cmseal  delta_cpu  per_vma  %
munmap__  1  684  715  31  31  105%
munmap__  2  861  898  38  19  104%
munmap__  4  1183  1235  51  13  104%
munmap__  8  1999  2045  46  6  102%
munmap__  16  3839  3816  -23  -1  99%
munmap__  32  7672  7887  216  7  103%
mprotect  1  397  443  46  46  112%
mprotect  2  738  788  50  25  107%
mprotect  4  1221  1256  35  9  103%
mprotect  8  2356  2429  72  9  103%
mprotect  16  4961  4935  -26  -2  99%
mprotect  32  9882  10172  291  9  103%
madvise_  1  351  380  29  29  108%
madvise_  2  565  615  49  25  109%
madvise_  4  872  933  61  15  107%
madvise_  8  1508  1640  132  16  109%
madvise_  16  3078  3323  245  15  108%
madvise_  32  5893  6704  811  25  114%

For the 5.10 kernel, the sealing check adds 0-15 ns in time, or 10-30 CPU cycles; there is even a decrease in some cases.
It might be interesting to compare the 5.10 and 6.8 kernels:

The first test (measuring time)
syscall__ vmas  t_5_10  t_6_8  delta_ns  per_vma  %
munmap__  1  357  909  552  552  254%
munmap__  2  442  1398  956  478  316%
munmap__  4  614  2444  1830  458  398%
munmap__  8  1017  4029  3012  377  396%
munmap__  16  1889  6647  4758  297  352%
munmap__  32  4109  11811  7702  241  287%
mprotect  1  235  439  204  204  187%
mprotect  2  495  1659  1164  582  335%
mprotect  4  741  3747  3006  752  506%
mprotect  8  1434  6755  5320  665  471%
mprotect  16  2958  13748  10790  674  465%
mprotect  32  6431  27827  21397  669  433%
madvise_  1  191  240  49  49  125%
madvise_  2  300  366  67  33  122%
madvise_  4  450  623  173  43  138%
madvise_  8  753  1110  357  45  147%
madvise_  16  1467  2127  660  41  145%
madvise_  32  2795  4109  1314  41  147%

The second test (measuring cpu cycle)
syscall__ vmas  cpu_5_10  c_6_8  delta_cpu  per_vma  %
munmap__  1  684  1790  1106  1106  262%
munmap__  2  861  2819  1958  979  327%
munmap__  4  1183  4959  3776  944  419%
munmap__  8  1999  8262  6263  783  413%
munmap__  16  3839  13099  9260  579  341%
munmap__  32  7672  23221  15549  486  303%
mprotect  1  397  906  509  509  228%
mprotect  2  738  3019  2281  1140  409%
mprotect  4  1221  6149  4929  1232  504%
mprotect  8  2356  9978  7622  953  423%
mprotect  16  4961  20448  15487  968  412%
mprotect  32  9882  40972  31091  972  415%
madvise_  1  351  434  82  82  123%
madvise_  2  565  752  186  93  133%
madvise_  4  872  1313  442  110  151%
madvise_  8  1508  2271  763  95  151%
madvise_  16  3078  4312  1234  77  140%
madvise_  32  5893  8376  2483  78  142%

From 5.10 to 6.8:
munmap: added 250-550 ns in time, or 500-1100 in cpu cycles, per vma.
mprotect: added 200-750 ns in time, or 500-1200 in cpu cycles, per vma.
madvise: added 33-50 ns in time, or 70-110 in cpu cycles, per vma.

In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the increase from 5.10 to 6.8 is significantly larger, approximately ten times greater for munmap and mprotect. When I discussed the mm performance with Brian Makin, an engineer who worked on performance, it was brought to my attention that such performance benchmarks, which measure millions of mm syscalls in a tight loop, may not accurately reflect real-world scenarios, such as that of a database service. Also, this was tested using a single HW and ChromeOS; the data from another HW or distribution might be different. It might be best to take this data with a grain of salt. This patch (of 5): Wire up the mseal syscall for all architectures. Link: https://lkml.kernel.org/r/20240415163527.626541-1-jeffxu@chromium.org Link: https://lkml.kernel.org/r/20240415163527.626541-2-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> [Bug #2] Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | foo | Andrew Morton
2024-05-07 | mailmap: add entry for Barry Song | Barry Song
Include a .mailmap entry to synchronize with both my past and current emails. Among them, three business mailboxes are dead. Link: https://lkml.kernel.org/r/20240506042009.10854-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | selftests/mm: fix powerpc ARCH check | Michael Ellerman
In commit 0518dbe97fe6 ("selftests/mm: fix cross compilation with LLVM") the logic to detect the machine architecture in the Makefile was changed to use ARCH, and only fallback to uname -m if ARCH is unset. However the tests of ARCH were not updated to account for the fact that ARCH is "powerpc" for powerpc builds, not "ppc64". Fix it by changing the checks to look for "powerpc", and change the uname -m logic to convert "ppc64.*" into "powerpc". With that fixed the following tests now build for powerpc again: * protection_keys * va_high_addr_switch * virtual_address_range * write_to_hugetlbfs Link: https://lkml.kernel.org/r/20240506115825.66415-1-mpe@ellerman.id.au Fixes: 0518dbe97fe6 ("selftests/mm: fix cross compilation with LLVM") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Mark Brown <broonie@kernel.org> Cc: <stable@vger.kernel.org> [6.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | thp: remove HPAGE_PMD_ORDER minimum assertion | Matthew Wilcox (Oracle)
We now handle order-1 folios correctly, so we don't need this assertion any more. Link: https://lkml.kernel.org/r/20240429190114.3126789-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | mm/vmscan: remove ignore_references argument of reclaim_folio_list() | SeongJae Park
All reclaim_folio_list() callers are passing 'true' for 'ignore_references' parameter. In other words, the parameter is not really being used. Simplify the code by removing the parameter. Link: https://lkml.kernel.org/r/20240429224451.67081-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07 | mm/vmscan: remove ignore_references argument of reclaim_pages() | SeongJae Park
All reclaim_pages() callers are setting 'ignore_references' parameter 'true'. In other words, the parameter is not really being used. Remove the argument to make it simple. Link: https://lkml.kernel.org/r/20240429224451.67081-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>