From 1b513f613731e2afc05550e8070d79fac80c661e Mon Sep 17 00:00:00 2001 From: ChenXiaoSong Date: Tue, 9 Aug 2022 14:47:30 +0800 Subject: ntfs: fix BUG_ON in ntfs_lookup_inode_by_name() Syzkaller reported BUG_ON as follows: ------------[ cut here ]------------ kernel BUG at fs/ntfs/dir.c:86! invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 3 PID: 758 Comm: a.out Not tainted 5.19.0-next-20220808 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:ntfs_lookup_inode_by_name+0xd11/0x2d10 Code: ff e9 b9 01 00 00 e8 1e fe d6 fe 48 8b 7d 98 49 8d 5d 07 e8 91 85 29 ff 48 c7 45 98 00 00 00 00 e9 5a fb ff ff e8 ff fd d6 fe <0f> 0b e8 f8 fd d6 fe 0f 0b e8 f1 fd d6 fe 48 8b b5 50 ff ff ff 4c RSP: 0018:ffff888079607978 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000008000 RCX: 0000000000000000 RDX: ffff88807cf10000 RSI: ffffffff82a4a081 RDI: 0000000000000003 RBP: ffff888079607a70 R08: 0000000000000001 R09: ffff88807a6d01d7 R10: ffffed100f4da03a R11: 0000000000000000 R12: ffff88800f0fb110 R13: ffff88800f0ee000 R14: ffff88800f0fb000 R15: 0000000000000001 FS: 00007f33b63c7540(0000) GS:ffff888108580000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f33b635c090 CR3: 000000000f39e005 CR4: 0000000000770ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: load_system_files+0x1f7f/0x3620 ntfs_fill_super+0xa01/0x1be0 mount_bdev+0x36a/0x440 ntfs_mount+0x3a/0x50 legacy_get_tree+0xfb/0x210 vfs_get_tree+0x8f/0x2f0 do_new_mount+0x30a/0x760 path_mount+0x4de/0x1880 __x64_sys_mount+0x2b3/0x340 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f33b62ff9ea Code: 48 8b 0d a9 f4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 76 f4 0b 00 f7 d8 64 89 01 48 RSP: 002b:00007ffd0c471aa8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f33b62ff9ea RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffd0c471be0 RBP: 00007ffd0c471c60 R08: 00007ffd0c471ae0 R09: 00007ffd0c471c24 R10: 0000000000000000 R11: 0000000000000202 R12: 000055bac5afc160 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Modules linked in: ---[ end trace 0000000000000000 ]--- Fix this by adding sanity check on extended system files' directory inode to ensure that it is directory, just like ntfs_extend_init() when mounting ntfs3. Link: https://lkml.kernel.org/r/20220809064730.2316892-1-chenxiaosong2@huawei.com Signed-off-by: ChenXiaoSong Cc: Anton Altaparmakov Cc: Signed-off-by: Andrew Morton --- fs/ntfs/super.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c index 5ae8de09b271..001f4e053c85 100644 --- a/fs/ntfs/super.c +++ b/fs/ntfs/super.c @@ -2092,7 +2092,8 @@ get_ctx_vol_failed: // TODO: Initialize security. /* Get the extended system files' directory inode. */ vol->extend_ino = ntfs_iget(sb, FILE_Extend); - if (IS_ERR(vol->extend_ino) || is_bad_inode(vol->extend_ino)) { + if (IS_ERR(vol->extend_ino) || is_bad_inode(vol->extend_ino) || + !S_ISDIR(vol->extend_ino->i_mode)) { if (!IS_ERR(vol->extend_ino)) iput(vol->extend_ino); ntfs_error(sb, "Failed to load $Extend."); -- cgit v1.2.3 From 3d36424b3b5850bd92f3e89b953a430d7cfc88ef Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 24 Aug 2022 12:14:50 +0100 Subject: mm/page_alloc: fix race condition between build_all_zonelists and page allocation Patrick Daly reported the following problem; NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation [0] - ZONE_MOVABLE [1] - ZONE_NORMAL [2] - NULL For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent memory_offline operation removes the last page from ZONE_MOVABLE, build_all_zonelists() & build_zonerefs_node() will update node_zonelists as shown below. Only populated zones are added. NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation [0] - ZONE_NORMAL [1] - NULL [2] - NULL The race is simple -- page allocation could be in progress when a memory hot-remove operation triggers a zonelist rebuild that removes zones. The allocation request will still have a valid ac->preferred_zoneref that is now pointing to NULL and triggers an OOM kill. This problem probably always existed but may be slightly easier to trigger due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator") which distinguishes between zones that are completely unpopulated versus zones that have valid pages not managed by the buddy allocator (e.g. reserved, memblock, ballooning etc). Memory hotplug had multiple stages with timing considerations around managed/present page updates, the zonelist rebuild and the zone span updates. As David Hildenbrand puts it memory offlining adjusts managed+present pages of the zone essentially in one go. If after the adjustments, the zone is no longer populated (present==0), we rebuild the zone lists. Once that's done, we try shrinking the zone (start+spanned pages) -- which results in zone_start_pfn == 0 if there are no more pages. That happens *after* rebuilding the zonelists via remove_pfn_range_from_zone(). The only requirement to fix the race is that a page allocation request identifies when a zonelist rebuild has happened since the allocation request started and no page has yet been allocated. Use a seqlock_t to track zonelist updates with a lockless read-side of the zonelist and protecting the rebuild and update of the counter with a spinlock. [akpm@linux-foundation.org: make zonelist_update_seq static] Link: https://lkml.kernel.org/r/20220824110900.vh674ltxmzb3proq@techsingularity.net Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator") Signed-off-by: Mel Gorman Reported-by: Patrick Daly Acked-by: Michal Hocko Reviewed-by: David Hildenbrand Cc: [4.9+] Signed-off-by: Andrew Morton --- mm/page_alloc.c | 53 +++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 43 insertions(+), 10 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e5486d47406e..1678431fb4c4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4708,6 +4708,30 @@ void fs_reclaim_release(gfp_t gfp_mask) EXPORT_SYMBOL_GPL(fs_reclaim_release); #endif +/* + * Zonelists may change due to hotplug during allocation. Detect when zonelists + * have been rebuilt so allocation retries. Reader side does not lock and + * retries the allocation if zonelist changes. Writer side is protected by the + * embedded spin_lock. + */ +static DEFINE_SEQLOCK(zonelist_update_seq); + +static unsigned int zonelist_iter_begin(void) +{ + if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE)) + return read_seqbegin(&zonelist_update_seq); + + return 0; +} + +static unsigned int check_retry_zonelist(unsigned int seq) +{ + if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE)) + return read_seqretry(&zonelist_update_seq, seq); + + return seq; +} + /* Perform direct synchronous page reclaim */ static unsigned long __perform_reclaim(gfp_t gfp_mask, unsigned int order, @@ -5001,6 +5025,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, int compaction_retries; int no_progress_loops; unsigned int cpuset_mems_cookie; + unsigned int zonelist_iter_cookie; int reserve_flags; /* @@ -5011,11 +5036,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM))) gfp_mask &= ~__GFP_ATOMIC; -retry_cpuset: +restart: compaction_retries = 0; no_progress_loops = 0; compact_priority = DEF_COMPACT_PRIORITY; cpuset_mems_cookie = read_mems_allowed_begin(); + zonelist_iter_cookie = zonelist_iter_begin(); /* * The fast path uses conservative alloc_flags to succeed only until @@ -5187,9 +5213,13 @@ retry: goto retry; - /* Deal with possible cpuset update races before we start OOM killing */ - if (check_retry_cpuset(cpuset_mems_cookie, ac)) - goto retry_cpuset; + /* + * Deal with possible cpuset update races or zonelist updates to avoid + * a unnecessary OOM kill. + */ + if (check_retry_cpuset(cpuset_mems_cookie, ac) || + check_retry_zonelist(zonelist_iter_cookie)) + goto restart; /* Reclaim has failed us, start killing things */ page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress); @@ -5209,9 +5239,13 @@ retry: } nopage: - /* Deal with possible cpuset update races before we fail */ - if (check_retry_cpuset(cpuset_mems_cookie, ac)) - goto retry_cpuset; + /* + * Deal with possible cpuset update races or zonelist updates to avoid + * a unnecessary OOM kill. + */ + if (check_retry_cpuset(cpuset_mems_cookie, ac) || + check_retry_zonelist(zonelist_iter_cookie)) + goto restart; /* * Make sure that __GFP_NOFAIL request doesn't leak out and make sure @@ -6514,9 +6548,8 @@ static void __build_all_zonelists(void *data) int nid; int __maybe_unused cpu; pg_data_t *self = data; - static DEFINE_SPINLOCK(lock); - spin_lock(&lock); + write_seqlock(&zonelist_update_seq); #ifdef CONFIG_NUMA memset(node_load, 0, sizeof(node_load)); @@ -6553,7 +6586,7 @@ static void __build_all_zonelists(void *data) #endif } - spin_unlock(&lock); + write_sequnlock(&zonelist_update_seq); } static noinline void __init -- cgit v1.2.3 From b14d067e850c19921cec2200bd8d179edf6a1aa6 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 26 Aug 2022 10:17:54 -0700 Subject: xfs: quiet notify_failure EOPNOTSUPP cases Patch series "mm, xfs, dax: Fixes for memory_failure() handling". I failed to run the memory error injection section of the ndctl test suite on linux-next prior to the merge window and as a result some bugs were missed. While the new enabling targeted reflink enabled XFS filesystems the bugs cropped up in the surrounding cases of DAX error injection on ext4-fsdax and device-dax. One new assumption / clarification in this set is the notion that if a filesystem's ->notify_failure() handler returns -EOPNOTSUPP, then it must be the case that the fsdax usage of page->index and page->mapping are valid. I am fairly certain this is true for xfs_dax_notify_failure(), but would appreciate another set of eyes. This patch (of 4): XFS always registers dax_holder_operations regardless of whether the filesystem is capable of handling the notifications. The expectation is that if the notify_failure handler cannot run then there are no scenarios where it needs to run. In other words the expected semantic is that page->index and page->mapping are valid for memory_failure() when the conditions that cause -EOPNOTSUPP in xfs_dax_notify_failure() are present. A fallback to the generic memory_failure() path is expected so do not warn when that happens. Link: https://lkml.kernel.org/r/166153426798.2758201.15108211981034512993.stgit@dwillia2-xfh.jf.intel.com Link: https://lkml.kernel.org/r/166153427440.2758201.6709480562966161512.stgit@dwillia2-xfh.jf.intel.com Fixes: 6f643c57d57c ("xfs: implement ->notify_failure() for XFS") Signed-off-by: Dan Williams Reviewed-by: Christoph Hellwig Cc: Shiyang Ruan Cc: Darrick J. Wong Cc: Al Viro Cc: Dave Chinner Cc: Goldwyn Rodrigues Cc: Jane Chu Cc: Matthew Wilcox Cc: Miaohe Lin Cc: Naoya Horiguchi Cc: Ritesh Harjani Signed-off-by: Andrew Morton --- fs/xfs/xfs_notify_failure.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c index 69d9c83ea4b2..01e2721589c4 100644 --- a/fs/xfs/xfs_notify_failure.c +++ b/fs/xfs/xfs_notify_failure.c @@ -181,7 +181,7 @@ xfs_dax_notify_failure( } if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) { - xfs_warn(mp, + xfs_debug(mp, "notify_failure() not supported on realtime device!"); return -EOPNOTSUPP; } @@ -194,7 +194,7 @@ xfs_dax_notify_failure( } if (!xfs_has_rmapbt(mp)) { - xfs_warn(mp, "notify_failure() needs rmapbt enabled!"); + xfs_debug(mp, "notify_failure() needs rmapbt enabled!"); return -EOPNOTSUPP; } -- cgit v1.2.3 From fd63612ae81159bd7e59762de478889315463ee8 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 26 Aug 2022 10:18:01 -0700 Subject: xfs: fix SB_BORN check in xfs_dax_notify_failure() The SB_BORN flag is stored in the vfs superblock, not xfs_sb. Link: https://lkml.kernel.org/r/166153428094.2758201.7936572520826540019.stgit@dwillia2-xfh.jf.intel.com Fixes: 6f643c57d57c ("xfs: implement ->notify_failure() for XFS") Signed-off-by: Dan Williams Reviewed-by: Christoph Hellwig Cc: Shiyang Ruan Cc: Darrick J. Wong Cc: Al Viro Cc: Dave Chinner Cc: Goldwyn Rodrigues Cc: Jane Chu Cc: Matthew Wilcox Cc: Miaohe Lin Cc: Naoya Horiguchi Cc: Ritesh Harjani Signed-off-by: Andrew Morton --- fs/xfs/xfs_notify_failure.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c index 01e2721589c4..5b1f9a24ed59 100644 --- a/fs/xfs/xfs_notify_failure.c +++ b/fs/xfs/xfs_notify_failure.c @@ -175,7 +175,7 @@ xfs_dax_notify_failure( u64 ddev_start; u64 ddev_end; - if (!(mp->m_sb.sb_flags & SB_BORN)) { + if (!(mp->m_super->s_flags & SB_BORN)) { xfs_warn(mp, "filesystem is not ready for notify_failure()!"); return -EIO; } -- cgit v1.2.3 From 65d3440e8d2e9e024ef2357f8ff021611b068c99 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 26 Aug 2022 10:18:07 -0700 Subject: mm/memory-failure: fix detection of memory_failure() handlers Some pagemap types, like MEMORY_DEVICE_GENERIC (device-dax) do not even have pagemap ops which results in crash signatures like this: BUG: kernel NULL pointer dereference, address: 0000000000000010 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 8000000205073067 P4D 8000000205073067 PUD 2062b3067 PMD 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 22 PID: 4535 Comm: device-dax Tainted: G OE N 6.0.0-rc2+ #59 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:memory_failure+0x667/0xba0 [..] Call Trace: ? _printk+0x58/0x73 do_madvise.part.0.cold+0xaf/0xc5 Check for ops before checking if the ops have a memory_failure() handler. Link: https://lkml.kernel.org/r/166153428781.2758201.1990616683438224741.stgit@dwillia2-xfh.jf.intel.com Fixes: 33a8f7f2b3a3 ("pagemap,pmem: introduce ->memory_failure()") Signed-off-by: Dan Williams Acked-by: Naoya Horiguchi Reviewed-by: Miaohe Lin Reviewed-by: Christoph Hellwig Cc: Shiyang Ruan Cc: Darrick J. Wong Cc: Al Viro Cc: Dave Chinner Cc: Goldwyn Rodrigues Cc: Jane Chu Cc: Matthew Wilcox Cc: Ritesh Harjani Signed-off-by: Andrew Morton --- include/linux/memremap.h | 5 +++++ mm/memory-failure.c | 2 +- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 19010491a603..c3b4cc84877b 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -139,6 +139,11 @@ struct dev_pagemap { }; }; +static inline bool pgmap_has_memory_failure(struct dev_pagemap *pgmap) +{ + return pgmap->ops && pgmap->ops->memory_failure; +} + static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap) { if (pgmap->flags & PGMAP_ALTMAP_VALID) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 14439806b5ef..8a4294afbfa0 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1928,7 +1928,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, * Call driver's implementation to handle the memory failure, otherwise * fall back to generic handler. */ - if (pgmap->ops->memory_failure) { + if (pgmap_has_memory_failure(pgmap)) { rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags); /* * Fall back to generic handler too if operation is not -- cgit v1.2.3 From ac87ca0ea0102ec5dfd17c95eb9fa87a8bf54d61 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 26 Aug 2022 10:18:14 -0700 Subject: mm/memory-failure: fall back to vma_address() when ->notify_failure() fails In the case where a filesystem is polled to take over the memory failure and receives -EOPNOTSUPP it indicates that page->index and page->mapping are valid for reverse mapping the failure address. Introduce FSDAX_INVALID_PGOFF to distinguish when add_to_kill() is being called from mf_dax_kill_procs() by a filesytem vs the typical memory_failure() path. Otherwise, vma_pgoff_address() is called with an invalid fsdax_pgoff which then trips this failing signature: kernel BUG at mm/memory-failure.c:319! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 13 PID: 1262 Comm: dax-pmd Tainted: G OE N 6.0.0-rc2+ #62 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:add_to_kill.cold+0x19d/0x209 [..] Call Trace: collect_procs.part.0+0x2c4/0x460 memory_failure+0x71b/0xba0 ? _printk+0x58/0x73 do_madvise.part.0.cold+0xaf/0xc5 Link: https://lkml.kernel.org/r/166153429427.2758201.14605968329933175594.stgit@dwillia2-xfh.jf.intel.com Fixes: c36e20249571 ("mm: introduce mf_dax_kill_procs() for fsdax case") Signed-off-by: Dan Williams Acked-by: Naoya Horiguchi Reviewed-by: Miaohe Lin Reviewed-by: Christoph Hellwig Cc: Shiyang Ruan Cc: Darrick J. Wong Cc: Al Viro Cc: Dave Chinner Cc: Goldwyn Rodrigues Cc: Jane Chu Cc: Matthew Wilcox Cc: Ritesh Harjani Signed-off-by: Andrew Morton --- mm/memory-failure.c | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 8a4294afbfa0..e424a9dac749 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -345,13 +345,17 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma, * not much we can do. We just print a message and ignore otherwise. */ +#define FSDAX_INVALID_PGOFF ULONG_MAX + /* * Schedule a process for later kill. * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM. * - * Notice: @fsdax_pgoff is used only when @p is a fsdax page. - * In other cases, such as anonymous and file-backend page, the address to be - * killed can be caculated by @p itself. + * Note: @fsdax_pgoff is used only when @p is a fsdax page and a + * filesystem with a memory failure handler has claimed the + * memory_failure event. In all other cases, page->index and + * page->mapping are sufficient for mapping the page back to its + * corresponding user virtual address. */ static void add_to_kill(struct task_struct *tsk, struct page *p, pgoff_t fsdax_pgoff, struct vm_area_struct *vma, @@ -367,11 +371,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p, tk->addr = page_address_in_vma(p, vma); if (is_zone_device_page(p)) { - /* - * Since page->mapping is not used for fsdax, we need - * calculate the address based on the vma. - */ - if (p->pgmap->type == MEMORY_DEVICE_FS_DAX) + if (fsdax_pgoff != FSDAX_INVALID_PGOFF) tk->addr = vma_pgoff_address(fsdax_pgoff, 1, vma); tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr); } else @@ -523,7 +523,8 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill, if (!page_mapped_in_vma(page, vma)) continue; if (vma->vm_mm == t->mm) - add_to_kill(t, page, 0, vma, to_kill); + add_to_kill(t, page, FSDAX_INVALID_PGOFF, vma, + to_kill); } } read_unlock(&tasklist_lock); @@ -559,7 +560,8 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill, * to be informed of all such data corruptions. */ if (vma->vm_mm == t->mm) - add_to_kill(t, page, 0, vma, to_kill); + add_to_kill(t, page, FSDAX_INVALID_PGOFF, vma, + to_kill); } } read_unlock(&tasklist_lock); -- cgit v1.2.3 From 818c4fdaa98769ce2d0bf64bc50a6c9af89290ea Mon Sep 17 00:00:00 2001 From: Naohiro Aota Date: Wed, 24 Aug 2022 17:47:26 +0900 Subject: x86/mm: disable instrumentations of mm/pgprot.c Commit 4867fbbdd6b3 ("x86/mm: move protection_map[] inside the platform") moved accesses to protection_map[] from mem_encrypt_amd.c to pgprot.c. As a result, the accesses are now targets of KASAN (and other instrumentations), leading to the crash during the boot process. Disable the instrumentations for pgprot.c like commit 67bb8e999e0a ("x86/mm: Disable various instrumentations of mm/mem_encrypt.c and mm/tlb.c"). Before this patch, my AMD machine cannot boot since v6.0-rc1 with KASAN enabled, without anything printed. After the change, it successfully boots up. Fixes: 4867fbbdd6b3 ("x86/mm: move protection_map[] inside the platform") Link: https://lkml.kernel.org/r/20220824084726.2174758-1-naohiro.aota@wdc.com Signed-off-by: Naohiro Aota Cc: Anshuman Khandual Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Dave Hansen Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Thomas Gleixner Signed-off-by: Andrew Morton --- arch/x86/mm/Makefile | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index f8220fd2c169..829c1409ffbd 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -4,10 +4,12 @@ KCOV_INSTRUMENT_tlb.o := n KCOV_INSTRUMENT_mem_encrypt.o := n KCOV_INSTRUMENT_mem_encrypt_amd.o := n KCOV_INSTRUMENT_mem_encrypt_identity.o := n +KCOV_INSTRUMENT_pgprot.o := n KASAN_SANITIZE_mem_encrypt.o := n KASAN_SANITIZE_mem_encrypt_amd.o := n KASAN_SANITIZE_mem_encrypt_identity.o := n +KASAN_SANITIZE_pgprot.o := n # Disable KCSAN entirely, because otherwise we get warnings that some functions # reference __initdata sections. @@ -17,6 +19,7 @@ ifdef CONFIG_FUNCTION_TRACER CFLAGS_REMOVE_mem_encrypt.o = -pg CFLAGS_REMOVE_mem_encrypt_amd.o = -pg CFLAGS_REMOVE_mem_encrypt_identity.o = -pg +CFLAGS_REMOVE_pgprot.o = -pg endif obj-y := init.o init_$(BITS).o fault.o ioremap.o extable.o mmap.o \ -- cgit v1.2.3 From 60bae73708963de4a17231077285bd9ff2f41c44 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Fri, 2 Sep 2022 10:35:51 +1000 Subject: mm/migrate_device.c: flush TLB while holding PTL When clearing a PTE the TLB should be flushed whilst still holding the PTL to avoid a potential race with madvise/munmap/etc. For example consider the following sequence: CPU0 CPU1 ---- ---- migrate_vma_collect_pmd() pte_unmap_unlock() madvise(MADV_DONTNEED) -> zap_pte_range() pte_offset_map_lock() [ PTE not present, TLB not flushed ] pte_unmap_unlock() [ page is still accessible via stale TLB ] flush_tlb_range() In this case the page may still be accessed via the stale TLB entry after madvise returns. Fix this by flushing the TLB while holding the PTL. Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages") Link: https://lkml.kernel.org/r/9f801e9d8d830408f2ca27821f606e09aa856899.1662078528.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple Reported-by: Nadav Amit Reviewed-by: "Huang, Ying" Acked-by: David Hildenbrand Acked-by: Peter Xu Cc: Alex Sierra Cc: Ben Skeggs Cc: Felix Kuehling Cc: huang ying Cc: Jason Gunthorpe Cc: John Hubbard Cc: Karol Herbst Cc: Logan Gunthorpe Cc: Lyude Paul Cc: Matthew Wilcox Cc: Paul Mackerras Cc: Ralph Campbell Cc: Signed-off-by: Andrew Morton --- mm/migrate_device.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 27fb37d65476..6a5ef9f0da6a 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -254,13 +254,14 @@ next: migrate->dst[migrate->npages] = 0; migrate->src[migrate->npages++] = mpfn; } - arch_leave_lazy_mmu_mode(); - pte_unmap_unlock(ptep - 1, ptl); /* Only flush the TLB if we actually modified any entries */ if (unmapped) flush_tlb_range(walk->vma, start, end); + arch_leave_lazy_mmu_mode(); + pte_unmap_unlock(ptep - 1, ptl); + return 0; } -- cgit v1.2.3 From a3589e1d5fe39c3d9fdd291b111524b93d08bc32 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Fri, 2 Sep 2022 10:35:52 +1000 Subject: mm/migrate_device.c: add missing flush_cache_page() Currently we only call flush_cache_page() for the anon_exclusive case, however in both cases we clear the pte so should flush the cache. Link: https://lkml.kernel.org/r/5676f30436ab71d1a587ac73f835ed8bd2113ff5.1662078528.git-series.apopple@nvidia.com Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages") Signed-off-by: Alistair Popple Reviewed-by: David Hildenbrand Acked-by: Peter Xu Cc: Alex Sierra Cc: Ben Skeggs Cc: Felix Kuehling Cc: huang ying Cc: "Huang, Ying" Cc: Jason Gunthorpe Cc: John Hubbard Cc: Karol Herbst Cc: Logan Gunthorpe Cc: Lyude Paul Cc: Matthew Wilcox Cc: Nadav Amit Cc: Paul Mackerras Cc: Ralph Campbell Cc: Signed-off-by: Andrew Morton --- mm/migrate_device.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 6a5ef9f0da6a..4cc849c3b54c 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -193,9 +193,9 @@ again: bool anon_exclusive; pte_t swp_pte; + flush_cache_page(vma, addr, pte_pfn(*ptep)); anon_exclusive = PageAnon(page) && PageAnonExclusive(page); if (anon_exclusive) { - flush_cache_page(vma, addr, pte_pfn(*ptep)); ptep_clear_flush(vma, addr, ptep); if (page_try_share_anon_rmap(page)) { -- cgit v1.2.3 From fd35ca3d12cc9922d7d9a35f934e72132dbc4853 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Fri, 2 Sep 2022 10:35:53 +1000 Subject: mm/migrate_device.c: copy pte dirty bit to page migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that installs migration entries directly if it can lock the migrating page. When removing a dirty pte the dirty bit is supposed to be carried over to the underlying page to prevent it being lost. Currently migrate_vma_*() can only be used for private anonymous mappings. That means loss of the dirty bit usually doesn't result in data loss because these pages are typically not file-backed. However pages may be backed by swap storage which can result in data loss if an attempt is made to migrate a dirty page that doesn't yet have the PageDirty flag set. In this case migration will fail due to unexpected references but the dirty pte bit will be lost. If the page is subsequently reclaimed data won't be written back to swap storage as it is considered uptodate, resulting in data loss if the page is subsequently accessed. Prevent this by copying the dirty bit to the page when removing the pte to match what try_to_migrate_one() does. Link: https://lkml.kernel.org/r/dd48e4882ce859c295c1a77612f66d198b0403f9.1662078528.git-series.apopple@nvidia.com Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages") Signed-off-by: Alistair Popple Acked-by: Peter Xu Reviewed-by: "Huang, Ying" Reported-by: "Huang, Ying" Acked-by: David Hildenbrand Cc: Alex Sierra Cc: Ben Skeggs Cc: Felix Kuehling Cc: huang ying Cc: Jason Gunthorpe Cc: John Hubbard Cc: Karol Herbst Cc: Logan Gunthorpe Cc: Lyude Paul Cc: Matthew Wilcox Cc: Nadav Amit Cc: Paul Mackerras Cc: Ralph Campbell Cc: Signed-off-by: Andrew Morton --- mm/migrate_device.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 4cc849c3b54c..dbf6c7a7a7c9 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -196,7 +197,7 @@ again: flush_cache_page(vma, addr, pte_pfn(*ptep)); anon_exclusive = PageAnon(page) && PageAnonExclusive(page); if (anon_exclusive) { - ptep_clear_flush(vma, addr, ptep); + pte = ptep_clear_flush(vma, addr, ptep); if (page_try_share_anon_rmap(page)) { set_pte_at(mm, addr, ptep, pte); @@ -206,11 +207,15 @@ again: goto next; } } else { - ptep_get_and_clear(mm, addr, ptep); + pte = ptep_get_and_clear(mm, addr, ptep); } migrate->cpages++; + /* Set the dirty flag on the folio now the pte is gone. */ + if (pte_dirty(pte)) + folio_mark_dirty(page_folio(page)); + /* Setup special migration page table entry */ if (mpfn & MIGRATE_PFN_WRITE) entry = make_writable_migration_entry( -- cgit v1.2.3 From 1552fd3ef7dbe07208b8ae84a0a6566adf7dfc9d Mon Sep 17 00:00:00 2001 From: Greg Kroah-Hartman Date: Fri, 2 Sep 2022 19:11:49 +0000 Subject: mm/damon/dbgfs: fix memory leak when using debugfs_lookup() When calling debugfs_lookup() the result must have dput() called on it, otherwise the memory will leak over time. Fix this up by properly calling dput(). Link: https://lkml.kernel.org/r/20220902191149.112434-1-sj@kernel.org Fixes: 75c1c2b53c78b ("mm/damon/dbgfs: support multiple contexts") Signed-off-by: Greg Kroah-Hartman Signed-off-by: SeongJae Park Cc: Signed-off-by: Andrew Morton --- mm/damon/dbgfs.c | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index cfdf63132d5a..4e51466c4e74 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -884,6 +884,7 @@ static int dbgfs_rm_context(char *name) struct dentry *root, *dir, **new_dirs; struct damon_ctx **new_ctxs; int i, j; + int ret = 0; if (damon_nr_running_ctxs()) return -EBUSY; @@ -898,14 +899,16 @@ static int dbgfs_rm_context(char *name) new_dirs = kmalloc_array(dbgfs_nr_ctxs - 1, sizeof(*dbgfs_dirs), GFP_KERNEL); - if (!new_dirs) - return -ENOMEM; + if (!new_dirs) { + ret = -ENOMEM; + goto out_dput; + } new_ctxs = kmalloc_array(dbgfs_nr_ctxs - 1, sizeof(*dbgfs_ctxs), GFP_KERNEL); if (!new_ctxs) { - kfree(new_dirs); - return -ENOMEM; + ret = -ENOMEM; + goto out_new_dirs; } for (i = 0, j = 0; i < dbgfs_nr_ctxs; i++) { @@ -925,7 +928,13 @@ static int dbgfs_rm_context(char *name) dbgfs_ctxs = new_ctxs; dbgfs_nr_ctxs--; - return 0; + goto out_dput; + +out_new_dirs: + kfree(new_dirs); +out_dput: + dput(dir); + return ret; } static ssize_t dbgfs_rm_context_write(struct file *file, -- cgit v1.2.3 From 283c05f66dcaecb0c6d5c55dd7e4dab5c631dd92 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Fri, 2 Sep 2022 20:19:23 +0100 Subject: tools: fix compilation after gfp_types.h split When gfp_types.h was split from gfp.h, it broke the radix test suite. Fix the test suite by using gfp_types.h in the tools gfp.h header. Link: https://lkml.kernel.org/r/20220902191923.1735933-1-willy@infradead.org Fixes: cb5a065b4ea9 (headers/deps: mm: Split out of ) Signed-off-by: Matthew Wilcox (Oracle) Reported-by: Liam R. Howlett Reviewed-by: Yury Norov Cc: Ingo Molnar Signed-off-by: Andrew Morton --- tools/include/linux/gfp.h | 21 +-------------------- tools/include/linux/gfp_types.h | 1 + 2 files changed, 2 insertions(+), 20 deletions(-) create mode 100644 tools/include/linux/gfp_types.h diff --git a/tools/include/linux/gfp.h b/tools/include/linux/gfp.h index b238dbc9eb85..6a10ff5f5be9 100644 --- a/tools/include/linux/gfp.h +++ b/tools/include/linux/gfp.h @@ -3,26 +3,7 @@ #define _TOOLS_INCLUDE_LINUX_GFP_H #include - -#define __GFP_BITS_SHIFT 26 -#define __GFP_BITS_MASK ((gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) - -#define __GFP_HIGH 0x20u -#define __GFP_IO 0x40u -#define __GFP_FS 0x80u -#define __GFP_NOWARN 0x200u -#define __GFP_ZERO 0x8000u -#define __GFP_ATOMIC 0x80000u -#define __GFP_ACCOUNT 0x100000u -#define __GFP_DIRECT_RECLAIM 0x400000u -#define __GFP_KSWAPD_RECLAIM 0x2000000u - -#define __GFP_RECLAIM (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM) - -#define GFP_ZONEMASK 0x0fu -#define GFP_ATOMIC (__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM) -#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS) -#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM) +#include static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags) { diff --git a/tools/include/linux/gfp_types.h b/tools/include/linux/gfp_types.h new file mode 100644 index 000000000000..5f9f1ed190a0 --- /dev/null +++ b/tools/include/linux/gfp_types.h @@ -0,0 +1 @@ +#include "../../../include/linux/gfp_types.h" -- cgit v1.2.3 From b9eb7776e8a64f9d53e8fa1cd27acb3024c74893 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Fri, 2 Sep 2022 20:26:38 +0100 Subject: mm: fix VM_BUG_ON in __delete_from_swap_cache() Patch series "Folio fixes for 6.0". This patch (of 2): The recent folio conversion changed the VM_BUG_ON() to dump the folio we're storing instead of the entry we retrieved. This was a mistake; the entry we retrieved is the more interesting page to dump. Link: https://lkml.kernel.org/r/20220902192639.1737108-1-willy@infradead.org Link: https://lkml.kernel.org/r/20220902192639.1737108-2-willy@infradead.org Fixes: ceff9d3354e9 ("mm/swap: convert __delete_from_swap_cache() to a folio") Signed-off-by: Matthew Wilcox (Oracle) Cc: Hugh Dickins Signed-off-by: Andrew Morton --- mm/swap_state.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/swap_state.c b/mm/swap_state.c index e166051566f4..41afa6d45b23 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -151,7 +151,7 @@ void __delete_from_swap_cache(struct folio *folio, for (i = 0; i < nr; i++) { void *entry = xas_store(&xas, shadow); - VM_BUG_ON_FOLIO(entry != folio, folio); + VM_BUG_ON_PAGE(entry != folio, entry); set_page_private(folio_page(folio, i), 0); xas_next(&xas); } -- cgit v1.2.3 From 36a3b14b5febdaf0e7f70c4ca6f62c8ea75fabfe Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Fri, 2 Sep 2022 20:26:39 +0100 Subject: vmscan: check folio_test_private(), not folio_get_private() These two predicates are the same for file pages, but are not the same for anonymous pages. Link: https://lkml.kernel.org/r/20220902192639.1737108-3-willy@infradead.org Fixes: 07f67a8dedc0 ("mm/vmscan: convert shrink_active_list() to use a folio") Signed-off-by: Matthew Wilcox (Oracle) Reported-by: Hugh Dickins Signed-off-by: Andrew Morton --- mm/vmscan.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index b2b1431352dc..382dbe97329f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2550,8 +2550,8 @@ static void shrink_active_list(unsigned long nr_to_scan, } if (unlikely(buffer_heads_over_limit)) { - if (folio_get_private(folio) && folio_trylock(folio)) { - if (folio_get_private(folio)) + if (folio_test_private(folio) && folio_trylock(folio)) { + if (folio_test_private(folio)) filemap_release_folio(folio, 0); folio_unlock(folio); } -- cgit v1.2.3 From 4eb5bbde3ccb710d3b85bfb13466612e56393369 Mon Sep 17 00:00:00 2001 From: Binyi Han Date: Sun, 4 Sep 2022 00:46:47 -0700 Subject: mm: fix dereferencing possible ERR_PTR Smatch checker complains that 'secretmem_mnt' dereferencing possible ERR_PTR(). Let the function return if 'secretmem_mnt' is ERR_PTR, to avoid deferencing it. Link: https://lkml.kernel.org/r/20220904074647.GA64291@cloud-MacBookPro Fixes: 1507f51255c9f ("mm: introduce memfd_secret system call to create "secret" memory areas") Signed-off-by: Binyi Han Reviewed-by: Andrew Morton Cc: Mike Rapoport Cc: Ammar Faizi Cc: Hagen Paul Pfeifer Cc: James Bottomley Cc: Signed-off-by: Andrew Morton --- mm/secretmem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/secretmem.c b/mm/secretmem.c index e3e9590c6fb3..3f7154099795 100644 --- a/mm/secretmem.c +++ b/mm/secretmem.c @@ -285,7 +285,7 @@ static int secretmem_init(void) secretmem_mnt = kern_mount(&secretmem_fs); if (IS_ERR(secretmem_mnt)) - ret = PTR_ERR(secretmem_mnt); + return PTR_ERR(secretmem_mnt); /* prevent secretmem mappings from ever getting PROT_EXEC */ secretmem_mnt->mnt_flags |= MNT_NOEXEC; -- cgit v1.2.3 From 70cbc3cc78a997d8247b50389d37c4e1736019da Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Wed, 7 Sep 2022 11:01:43 -0700 Subject: mm: gup: fix the fast GUP race against THP collapse Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer sufficient to handle concurrent GUP-fast in all cases, it only handles traditional IPI-based GUP-fast correctly. On architectures that send an IPI broadcast on TLB flush, it works as expected. But on the architectures that do not use IPI to broadcast TLB flush, it may have the below race: CPU A CPU B THP collapse fast GUP gup_pmd_range() <-- see valid pmd gup_pte_range() <-- work on pte pmdp_collapse_flush() <-- clear pmd and flush __collapse_huge_page_isolate() check page pinned <-- before GUP bump refcount pin the page check PTE <-- no change __collapse_huge_page_copy() copy data to huge page ptep_clear() install huge pmd for the huge page return the stale page discard the stale page The race can be fixed by checking whether PMD is changed or not after taking the page pin in fast GUP, just like what it does for PTE. If the PMD is changed it means there may be parallel THP collapse, so GUP should back off. Also update the stale comment about serializing against fast GUP in khugepaged. Link: https://lkml.kernel.org/r/20220907180144.555485-1-shy828301@gmail.com Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") Acked-by: David Hildenbrand Acked-by: Peter Xu Signed-off-by: Yang Shi Reviewed-by: John Hubbard Cc: "Aneesh Kumar K.V" Cc: Hugh Dickins Cc: Jason Gunthorpe Cc: "Kirill A. Shutemov" Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: Signed-off-by: Andrew Morton --- mm/gup.c | 34 ++++++++++++++++++++++++++++------ mm/khugepaged.c | 10 ++++++---- 2 files changed, 34 insertions(+), 10 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 5abdaf487460..00926abb4426 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2345,8 +2345,28 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, } #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, - unsigned int flags, struct page **pages, int *nr) +/* + * Fast-gup relies on pte change detection to avoid concurrent pgtable + * operations. + * + * To pin the page, fast-gup needs to do below in order: + * (1) pin the page (by prefetching pte), then (2) check pte not changed. + * + * For the rest of pgtable operations where pgtable updates can be racy + * with fast-gup, we need to do (1) clear pte, then (2) check whether page + * is pinned. + * + * Above will work for all pte-level operations, including THP split. + * + * For THP collapse, it's a bit more complicated because fast-gup may be + * walking a pgtable page that is being freed (pte is still valid but pmd + * can be cleared already). To avoid race in such condition, we need to + * also check pmd here to make sure pmd doesn't change (corresponds to + * pmdp_collapse_flush() in the THP collapse code path). + */ +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, + unsigned long end, unsigned int flags, + struct page **pages, int *nr) { struct dev_pagemap *pgmap = NULL; int nr_start = *nr, ret = 0; @@ -2392,7 +2412,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, goto pte_unmap; } - if (unlikely(pte_val(pte) != pte_val(*ptep))) { + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || + unlikely(pte_val(pte) != pte_val(*ptep))) { gup_put_folio(folio, 1, flags); goto pte_unmap; } @@ -2439,8 +2460,9 @@ pte_unmap: * get_user_pages_fast_only implementation that can pin pages. Thus it's still * useful to have gup_huge_pmd even if we can't operate on ptes. */ -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, - unsigned int flags, struct page **pages, int *nr) +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, + unsigned long end, unsigned int flags, + struct page **pages, int *nr) { return 0; } @@ -2764,7 +2786,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr, PMD_SHIFT, next, flags, pages, nr)) return 0; - } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr)) + } else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr)) return 0; } while (pmdp++, addr = next, addr != end); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 01f71786d530..70b7ac66411c 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1083,10 +1083,12 @@ static void collapse_huge_page(struct mm_struct *mm, pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ /* - * After this gup_fast can't run anymore. This also removes - * any huge TLB entry from the CPU so we won't allow - * huge and small TLB entries for the same virtual address - * to avoid the risk of CPU bugs in that area. + * This removes any huge TLB entry from the CPU so we won't allow + * huge and small TLB entries for the same virtual address to + * avoid the risk of CPU bugs in that area. + * + * Parallel fast GUP is fine since fast GUP will back off when + * it detects PMD is changed. */ _pmd = pmdp_collapse_flush(vma, address, pmd); spin_unlock(pmd_ptl); -- cgit v1.2.3 From bedf03416913d88c796288f9dca109a53608c745 Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Wed, 7 Sep 2022 11:01:44 -0700 Subject: powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will move to use RCU instead of disabling local interrupts in fast-GUP. Using an IPI is the old-styled way of serializing against fast-GUP although it still works as expected now. And fast-GUP now fixed the potential race with THP collapse by checking whether PMD is changed or not. So IPI broadcast in radix pmd collapse flush is not necessary anymore. But it is still needed for hash TLB. Link: https://lkml.kernel.org/r/20220907180144.555485-2-shy828301@gmail.com Suggested-by: Aneesh Kumar K.V Signed-off-by: Yang Shi Acked-by: David Hildenbrand Acked-by: Peter Xu Cc: Christophe Leroy Cc: Hugh Dickins Cc: Jason Gunthorpe Cc: John Hubbard Cc: "Kirill A. Shutemov" Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Signed-off-by: Andrew Morton --- arch/powerpc/mm/book3s64/radix_pgtable.c | 9 --------- 1 file changed, 9 deletions(-) diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index 698274109c91..e712f80fe189 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -937,15 +937,6 @@ pmd_t radix__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addre pmd = *pmdp; pmd_clear(pmdp); - /* - * pmdp collapse_flush need to ensure that there are no parallel gup - * walk after this call. This is needed so that we can have stable - * page ref count when collapsing a page. We don't allow a collapse page - * if we have gup taken on the page. We can ensure that by sending IPI - * because gup walk happens with IRQ disabled. - */ - serialize_against_pte_lookup(vma->vm_mm); - radix__flush_tlb_collapsed_pmd(vma->vm_mm, address); return pmd; -- cgit v1.2.3 From 58d426a7ba92870d489686dfdb9d06b66815a2ab Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Thu, 8 Sep 2022 08:12:04 -0700 Subject: mm: fix madivse_pageout mishandling on non-LRU page MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MADV_PAGEOUT tries to isolate non-LRU pages and gets a warning from isolate_lru_page below. Fix it by checking PageLRU in advance. ------------[ cut here ]------------ trying to isolate tail page WARNING: CPU: 0 PID: 6175 at mm/folio-compat.c:158 isolate_lru_page+0x130/0x140 Modules linked in: CPU: 0 PID: 6175 Comm: syz-executor.0 Not tainted 5.18.12 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:isolate_lru_page+0x130/0x140 Link: https://lore.kernel.org/linux-mm/485f8c33.2471b.182d5726afb.Coremail.hantianshuo@iie.ac.cn/ Link: https://lkml.kernel.org/r/20220908151204.762596-1-minchan@kernel.org Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT") Signed-off-by: Minchan Kim Reported-by: 韩天ç`• Suggested-by: Yang Shi Acked-by: Yang Shi Cc: Signed-off-by: Andrew Morton --- mm/madvise.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index 5f0f0948a50e..9ff51650f4f0 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -451,8 +451,11 @@ regular_page: continue; } - /* Do not interfere with other mappings of this page */ - if (page_mapcount(page) != 1) + /* + * Do not interfere with other mappings of this page and + * non-LRU page. + */ + if (!PageLRU(page) || page_mapcount(page) != 1) continue; VM_BUG_ON_PAGE(PageTransCompound(page), page); -- cgit v1.2.3 From 2b7aa91ba0e86b8643f5d3c83874c80599c731d7 Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Thu, 8 Sep 2022 13:11:50 +0900 Subject: mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() NULL pointer dereference is triggered when calling thp split via debugfs on the system with offlined memory blocks. With debug option enabled, the following kernel messages are printed out: page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: unmovable page page:000000007d7ab72e is uninitialized and poisoned page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) ------------[ cut here ]------------ kernel BUG at include/linux/mm.h:1248! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 ... RIP: 0010:split_huge_pages_write+0xcf4/0xe30 This shows that page_to_nid() in page_zone() is unexpectedly called for an offlined memmap. Use pfn_to_online_page() to get struct page in PFN walker. Link: https://lkml.kernel.org/r/20220908041150.3430269-1-naoya.horiguchi@linux.dev Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Signed-off-by: Naoya Horiguchi Co-developed-by: David Hildenbrand Signed-off-by: David Hildenbrand Reviewed-by: Yang Shi Acked-by: Michal Hocko Reviewed-by: Miaohe Lin Reviewed-by: Oscar Salvador Acked-by: Kirill A. Shutemov Cc: Matthew Wilcox Cc: Muchun Song Cc: [5.10+] Signed-off-by: Andrew Morton --- mm/huge_memory.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e9414ee57c5b..f42bb51e023a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void) max_zone_pfn = zone_end_pfn(zone); for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) { int nr_pages; - if (!pfn_valid(pfn)) - continue; - page = pfn_to_page(pfn); - if (!get_page_unless_zero(page)) + page = pfn_to_online_page(pfn); + if (!page || !get_page_unless_zero(page)) continue; if (zone != page_zone(page)) -- cgit v1.2.3 From 37dcc673d065d9823576cd9f2484a72531e1cba6 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Fri, 9 Sep 2022 15:08:29 +0200 Subject: frontswap: don't call ->init if no ops are registered If no frontswap module (i.e. zswap) was registered, frontswap_ops will be NULL. In such situation, swapon crashes with the following stack trace: Unable to handle kernel access to user memory outside uaccess routines at virtual address 0000000000000000 Mem abort info: ESR = 0x0000000096000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=00000020a4fab000 [0000000000000000] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 96000004 [#1] SMP Modules linked in: zram fsl_dpaa2_eth pcs_lynx phylink ahci_qoriq crct10dif_ce ghash_ce sbsa_gwdt fsl_mc_dpio nvme lm90 nvme_core at803x xhci_plat_hcd rtc_fsl_ftm_alarm xgmac_mdio ahci_platform i2c_imx ip6_tables ip_tables fuse Unloaded tainted modules: cppc_cpufreq():1 CPU: 10 PID: 761 Comm: swapon Not tainted 6.0.0-rc2-00454-g22100432cf14 #1 Hardware name: SolidRun Ltd. SolidRun CEX7 Platform, BIOS EDK II Jun 21 2022 pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : frontswap_init+0x38/0x60 lr : __do_sys_swapon+0x8a8/0x9f4 sp : ffff80000969bcf0 x29: ffff80000969bcf0 x28: ffff37bee0d8fc00 x27: ffff80000a7f5000 x26: fffffcdefb971e80 x25: ffffaba797453b90 x24: 0000000000000064 x23: ffff37c1f209d1a8 x22: ffff37bee880e000 x21: ffffaba797748560 x20: ffff37bee0d8fce4 x19: ffffaba797748488 x18: 0000000000000014 x17: 0000000030ec029a x16: ffffaba795a479b0 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000030 x12: 0000000000000001 x11: ffff37c63c0aba18 x10: 0000000000000000 x9 : ffffaba7956b8c88 x8 : ffff80000969bcd0 x7 : 0000000000000000 x6 : 0000000000000000 x5 : 0000000000000001 x4 : 0000000000000000 x3 : ffffaba79730f000 x2 : ffff37bee0d8fc00 x1 : 0000000000000000 x0 : 0000000000000000 Call trace: frontswap_init+0x38/0x60 __do_sys_swapon+0x8a8/0x9f4 __arm64_sys_swapon+0x28/0x3c invoke_syscall+0x78/0x100 el0_svc_common.constprop.0+0xd4/0xf4 do_el0_svc+0x38/0x4c el0_svc+0x34/0x10c el0t_64_sync_handler+0x11c/0x150 el0t_64_sync+0x190/0x194 Code: d000e283 910003fd f9006c41 f946d461 (f9400021) ---[ end trace 0000000000000000 ]--- Link: https://lkml.kernel.org/r/20220909130829.3262926-1-hch@lst.de Fixes: 1da0d94a3ec8 ("frontswap: remove support for multiple ops") Reported-by: Nathan Chancellor Signed-off-by: Christoph Hellwig Signed-off-by: Liu Shixin Cc: Konrad Rzeszutek Wilk Cc: Signed-off-by: Andrew Morton --- mm/frontswap.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/frontswap.c b/mm/frontswap.c index 1a97610308cb..279e55b4ed87 100644 --- a/mm/frontswap.c +++ b/mm/frontswap.c @@ -125,6 +125,9 @@ void frontswap_init(unsigned type, unsigned long *map) * p->frontswap set to something valid to work properly. */ frontswap_map_set(sis, map); + + if (!frontswap_enabled()) + return; frontswap_ops->init(type); } -- cgit v1.2.3 From 70427f6e9ecfc8c5f977b21dd9f846b3bda02500 Mon Sep 17 00:00:00 2001 From: Sergei Antonov Date: Thu, 8 Sep 2022 23:48:09 +0300 Subject: mm: bring back update_mmu_cache() to finish_fault() Running this test program on ARMv4 a few times (sometimes just once) reproduces the bug. int main() { unsigned i; char paragon[SIZE]; void* ptr; memset(paragon, 0xAA, SIZE); ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_SHARED, -1, 0); if (ptr == MAP_FAILED) return 1; printf("ptr = %p\n", ptr); for (i=0;i<10000;i++){ memset(ptr, 0xAA, SIZE); if (memcmp(ptr, paragon, SIZE)) { printf("Unexpected bytes on iteration %u!!!\n", i); break; } } munmap(ptr, SIZE); } In the "ptr" buffer there appear runs of zero bytes which are aligned by 16 and their lengths are multiple of 16. Linux v5.11 does not have the bug, "git bisect" finds the first bad commit: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths") Before the commit update_mmu_cache() was called during a call to filemap_map_pages() as well as finish_fault(). After the commit finish_fault() lacks it. Bring back update_mmu_cache() to finish_fault() to fix the bug. Also call update_mmu_tlb() only when returning VM_FAULT_NOPAGE to more closely reproduce the code of alloc_set_pte() function that existed before the commit. On many platforms update_mmu_cache() is nop: x86, see arch/x86/include/asm/pgtable ARMv6+, see arch/arm/include/asm/tlbflush.h So, it seems, few users ran into this bug. Link: https://lkml.kernel.org/r/20220908204809.2012451-1-saproj@gmail.com Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths") Signed-off-by: Sergei Antonov Acked-by: Kirill A. Shutemov Cc: Will Deacon Cc: Signed-off-by: Andrew Morton --- mm/memory.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 4ba73f5aa8bb..a78814413ac0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4386,14 +4386,20 @@ vm_fault_t finish_fault(struct vm_fault *vmf) vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); - ret = 0; + /* Re-check under ptl */ - if (likely(!vmf_pte_changed(vmf))) + if (likely(!vmf_pte_changed(vmf))) { do_set_pte(vmf, page, vmf->address); - else + + /* no need to invalidate: a not-present page won't be cached */ + update_mmu_cache(vma, vmf->address, vmf->pte); + + ret = 0; + } else { + update_mmu_tlb(vma, vmf->address, vmf->pte); ret = VM_FAULT_NOPAGE; + } - update_mmu_tlb(vma, vmf->address, vmf->pte); pte_unmap_unlock(vmf->pte, vmf->ptl); return ret; } -- cgit v1.2.3 From dac22531bbd4af2426c4e29e05594415ccfa365d Mon Sep 17 00:00:00 2001 From: Maurizio Lombardi Date: Fri, 15 Jul 2022 14:50:13 +0200 Subject: mm: prevent page_frag_alloc() from corrupting the memory A number of drivers call page_frag_alloc() with a fragment's size > PAGE_SIZE. In low memory conditions, __page_frag_cache_refill() may fail the order 3 cache allocation and fall back to order 0; In this case, the cache will be smaller than the fragment, causing memory corruptions. Prevent this from happening by checking if the newly allocated cache is large enough for the fragment; if not, the allocation will fail and page_frag_alloc() will return NULL. Link: https://lkml.kernel.org/r/20220715125013.247085-1-mlombard@redhat.com Fixes: b63ae8ca096d ("mm/net: Rename and move page fragment handling from net/ to mm/") Signed-off-by: Maurizio Lombardi Reviewed-by: Alexander Duyck Cc: Chen Lin Cc: Jakub Kicinski Cc: Signed-off-by: Andrew Morton --- mm/page_alloc.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1678431fb4c4..d04211f0ef0b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5740,6 +5740,18 @@ refill: /* reset page count bias and offset to start of new frag */ nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1; offset = size - fragsz; + if (unlikely(offset < 0)) { + /* + * The caller is trying to allocate a fragment + * with fragsz > PAGE_SIZE but the cache isn't big + * enough to satisfy the request, this may + * happen in low memory conditions. + * We don't release the cache page because + * it could make memory pressure worse + * so we simply return NULL here. + */ + return NULL; + } } nc->pagecnt_bias--; -- cgit v1.2.3 From 317314527d173e1f139ceaf8cb87cb1746abf240 Mon Sep 17 00:00:00 2001 From: Doug Berger Date: Wed, 14 Sep 2022 12:09:17 -0700 Subject: mm/hugetlb: correct demote page offset logic With gigantic pages it may not be true that struct page structures are contiguous across the entire gigantic page. The nth_page macro is used here in place of direct pointer arithmetic to correct for this. Mike said: : This error could cause addressing exceptions. However, this is only : possible in configurations where CONFIG_SPARSEMEM && : !CONFIG_SPARSEMEM_VMEMMAP. Such a configuration option is rare and : unknown to be the default anywhere. Link: https://lkml.kernel.org/r/20220914190917.3517663-1-opendmb@gmail.com Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support") Signed-off-by: Doug Berger Reviewed-by: Mike Kravetz Reviewed-by: Oscar Salvador Reviewed-by: Anshuman Khandual Cc: Muchun Song Cc: Signed-off-by: Andrew Morton --- mm/hugetlb.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index e070b8593b37..0bdfc7e1c933 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3420,6 +3420,7 @@ static int demote_free_huge_page(struct hstate *h, struct page *page) { int i, nid = page_to_nid(page); struct hstate *target_hstate; + struct page *subpage; int rc = 0; target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order); @@ -3453,15 +3454,16 @@ static int demote_free_huge_page(struct hstate *h, struct page *page) mutex_lock(&target_hstate->resize_lock); for (i = 0; i < pages_per_huge_page(h); i += pages_per_huge_page(target_hstate)) { + subpage = nth_page(page, i); if (hstate_is_gigantic(target_hstate)) - prep_compound_gigantic_page_for_demote(page + i, + prep_compound_gigantic_page_for_demote(subpage, target_hstate->order); else - prep_compound_page(page + i, target_hstate->order); - set_page_private(page + i, 0); - set_page_refcounted(page + i); - prep_new_huge_page(target_hstate, page + i, nid); - put_page(page + i); + prep_compound_page(subpage, target_hstate->order); + set_page_private(subpage, 0); + set_page_refcounted(subpage); + prep_new_huge_page(target_hstate, subpage, nid); + put_page(subpage); } mutex_unlock(&target_hstate->resize_lock); -- cgit v1.2.3 From 77677cdbc2aa4b5d5d839562793d3d126201d18d Mon Sep 17 00:00:00 2001 From: Shuai Xue Date: Wed, 14 Sep 2022 14:49:35 +0800 Subject: mm,hwpoison: check mm when killing accessing process The GHES code calls memory_failure_queue() from IRQ context to queue work into workqueue and schedule it on the current CPU. Then the work is processed in memory_failure_work_func() by kworker and calls memory_failure(). When a page is already poisoned, commit a3f5d80ea401 ("mm,hwpoison: send SIGBUS with error virutal address") make memory_failure() call kill_accessing_process() that: - holds mmap locking of current->mm - does pagetable walk to find the error virtual address - and sends SIGBUS to the current process with error info. However, the mm of kworker is not valid, resulting in a null-pointer dereference. So check mm when killing the accessing process. [akpm@linux-foundation.org: remove unrelated whitespace alteration] Link: https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com Fixes: a3f5d80ea401 ("mm,hwpoison: send SIGBUS with error virutal address") Signed-off-by: Shuai Xue Reviewed-by: Miaohe Lin Acked-by: Naoya Horiguchi Cc: Huang Ying Cc: Baolin Wang Cc: Bixuan Cui Cc: Signed-off-by: Andrew Morton --- mm/memory-failure.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index e424a9dac749..e7ac570dda75 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -745,6 +745,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn, }; priv.tk.tsk = p; + if (!p->mm) + return -EFAULT; + mmap_read_lock(p->mm); ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwp_walk_ops, (void *)&priv); -- cgit v1.2.3 From 80e2b584f3abfc31c3fe5573007f0d1d10810fde Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Tue, 13 Sep 2022 22:39:13 -0400 Subject: mm/page_isolation: fix isolate_single_pageblock() isolation behavior set_migratetype_isolate() does not allow isolating MIGRATE_CMA pageblocks unless it is used for CMA allocation. isolate_single_pageblock() did not have the same behavior when it is used together with set_migratetype_isolate() in start_isolate_page_range(). This allows alloc_contig_range() with migratetype other than MIGRATE_CMA, like MIGRATE_MOVABLE (used by alloc_contig_pages()), to isolate first and last pageblock but fail the rest. The failure leads to changing migratetype of the first and last pageblock to MIGRATE_MOVABLE from MIGRATE_CMA, corrupting the CMA region. This can happen during gigantic page allocations. Like Doug said here: https://lore.kernel.org/linux-mm/a3363a52-883b-dcd1-b77f-f2bb378d6f2d@gmail.com/T/#u, for gigantic page allocations, the user would notice no difference, since the allocation on CMA region will fail as well as it did before. But it might hurt the performance of device drivers that use CMA, since CMA region size decreases. Fix it by passing migratetype into isolate_single_pageblock(), so that set_migratetype_isolate() used by isolate_single_pageblock() will prevent the isolation happening. Link: https://lkml.kernel.org/r/20220914023913.1855924-1-zi.yan@sent.com Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity") Signed-off-by: Zi Yan Reported-by: Doug Berger Cc: David Hildenbrand Cc: Doug Berger Cc: Mike Kravetz Cc: Signed-off-by: Andrew Morton --- mm/page_isolation.c | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 9d73dc38e3d7..eb3a68ca92ad 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -288,6 +288,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages) * @isolate_before: isolate the pageblock before the boundary_pfn * @skip_isolation: the flag to skip the pageblock isolation in second * isolate_single_pageblock() + * @migratetype: migrate type to set in error recovery. * * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one * pageblock. When not all pageblocks within a page are isolated at the same @@ -302,9 +303,9 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages) * the in-use page then splitting the free page. */ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, - gfp_t gfp_flags, bool isolate_before, bool skip_isolation) + gfp_t gfp_flags, bool isolate_before, bool skip_isolation, + int migratetype) { - unsigned char saved_mt; unsigned long start_pfn; unsigned long isolate_pageblock; unsigned long pfn; @@ -328,13 +329,13 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, start_pfn = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES), zone->zone_start_pfn); - saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock)); + if (skip_isolation) { + int mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock)); - if (skip_isolation) - VM_BUG_ON(!is_migrate_isolate(saved_mt)); - else { - ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags, - isolate_pageblock, isolate_pageblock + pageblock_nr_pages); + VM_BUG_ON(!is_migrate_isolate(mt)); + } else { + ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), migratetype, + flags, isolate_pageblock, isolate_pageblock + pageblock_nr_pages); if (ret) return ret; @@ -475,7 +476,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, failed: /* restore the original migratetype */ if (!skip_isolation) - unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt); + unset_migratetype_isolate(pfn_to_page(isolate_pageblock), migratetype); return -EBUSY; } @@ -537,7 +538,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, bool skip_isolation = false; /* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */ - ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation); + ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, + skip_isolation, migratetype); if (ret) return ret; @@ -545,7 +547,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, skip_isolation = true; /* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */ - ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation); + ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, + skip_isolation, migratetype); if (ret) { unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype); return ret; -- cgit v1.2.3 From 59298997df89e19aad426d4ae0a7e5037074da5a Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Mon, 19 Sep 2022 13:16:48 -0700 Subject: x86/uaccess: avoid check_object_size() in copy_from_user_nmi() The check_object_size() helper under CONFIG_HARDENED_USERCOPY is designed to skip any checks where the length is known at compile time as a reasonable heuristic to avoid "likely known-good" cases. However, it can only do this when the copy_*_user() helpers are, themselves, inline too. Using find_vmap_area() requires taking a spinlock. The check_object_size() helper can call find_vmap_area() when the destination is in vmap memory. If show_regs() is called in interrupt context, it will attempt a call to copy_from_user_nmi(), which may call check_object_size() and then find_vmap_area(). If something in normal context happens to be in the middle of calling find_vmap_area() (with the spinlock held), the interrupt handler will hang forever. The copy_from_user_nmi() call is actually being called with a fixed-size length, so check_object_size() should never have been called in the first place. Given the narrow constraints, just replace the __copy_from_user_inatomic() call with an open-coded version that calls only into the sanitizers and not check_object_size(), followed by a call to raw_copy_from_user(). [akpm@linux-foundation.org: no instrument_copy_from_user() in my tree...] Link: https://lkml.kernel.org/r/20220919201648.2250764-1-keescook@chromium.org Link: https://lore.kernel.org/all/CAOUHufaPshtKrTWOz7T7QFYUNVGFm0JBjvM700Nhf9qEL9b3EQ@mail.gmail.com Fixes: 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns") Signed-off-by: Kees Cook Reported-by: Yu Zhao Reported-by: Florian Lehner Suggested-by: Andrew Morton Acked-by: Peter Zijlstra (Intel) Tested-by: Florian Lehner Cc: Matthew Wilcox Cc: Josh Poimboeuf Cc: Dave Hansen Cc: Signed-off-by: Andrew Morton --- arch/x86/lib/usercopy.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c index ad0139d25401..f1bb18617156 100644 --- a/arch/x86/lib/usercopy.c +++ b/arch/x86/lib/usercopy.c @@ -44,7 +44,7 @@ copy_from_user_nmi(void *to, const void __user *from, unsigned long n) * called from other contexts. */ pagefault_disable(); - ret = __copy_from_user_inatomic(to, from, n); + ret = raw_copy_from_user(to, from, n); pagefault_enable(); return ret; -- cgit v1.2.3