Hook alloc_workqueue() and alloc_workqueue_attrs() so that allocations
are accounted to the callsite. Since we're allocating on behalf of
another subsystem, this helps when using memory allocation profiling to
check for leaks.
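The instrumentation follows the usual allocation-profiling pattern: the
real implementation gets a _noprof suffix and the public name becomes
an alloc_hooks() wrapper, so memory is charged to the caller's tag. A
rough sketch (the prototype is abbreviated):

  struct workqueue_struct *
  alloc_workqueue_noprof(const char *fmt, unsigned int flags,
			 int max_active, ...);

  /*
   * Callers now expand to an alloc_hooks() wrapper, so allocations are
   * accounted to the callsite rather than to kernel/workqueue.c.
   */
  #define alloc_workqueue(...)					\
	  alloc_hooks(alloc_workqueue_noprof(__VA_ARGS__))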
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Case folding is often applied to subtrees, not to an entire filesystem,
so disallowing layers from filesystems that support case folding is
overly limiting.
Replace the rule that case-folding-capable filesystems are not allowed
as layers with a rule that case-folded directories are not allowed in a
merged directory stack.
Should case folding be enabled on an underlying directory while
overlayfs is mounted, the outcome is generally undefined. Specifically,
in ovl_lookup() we check the underlying base directory and, if case
folding is enabled on it, fail with -ESTALE and write a warning to
kmsg.
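A minimal sketch of the check (the warning text, and 'base' as the
underlying directory dentry, are illustrative):

  /* Case folding enabled on an underlying dir after mount: bail out */
  if (IS_CASEFOLDED(d_inode(base))) {
	  pr_warn_ratelimited("overlayfs: case folding enabled on underlying dir %pd2\n",
			      base);
	  return ERR_PTR(-ESTALE);
  }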
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/linux-fsdevel/20250520051600.1903319-1-kent.overstreet@linux.dev/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
REQ_FUA means "skip the drive cache", and it can be used with reads too.
If there was a checksum error, we want to retry the whole read path, not
read it from cache again.
Suggested-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a sysfs attribute for checking whether read FUA appears to behave
properly on a device.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
FUA is also allowed with reads, not just writes.
The specified behaviour is:
- If the location being read is dirty in the drive cache, it's flushed
- The read is serviced from media, not from the cache
This is documented in the NVMe specification, and the nvme driver
already passes through REQ_FUA for reads, not just writes, so there's
no reason for the block layer to disallow it.
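For an in-kernel user, a FUA read is just an extra op flag on the bio;
a minimal sketch (bdev, sector and page are assumed to be set up):

  struct bio *bio = bio_alloc(bdev, 1, REQ_OP_READ | REQ_FUA, GFP_KERNEL);

  bio->bi_iter.bi_sector = sector;
  __bio_add_page(bio, page, PAGE_SIZE, 0);

  /* With this change the block layer no longer rejects FUA on reads */
  int ret = submit_bio_wait(bio);
  bio_put(bio);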
To validate behaviour, a simple test was run on a variety of hardware
that checks latency of repeated reads to the same location (cached
reads), random reads (uncached), and FUA reads to the same location.
Data:
- Samsung consumer SSDs
Reads appear to not be cached
- Seagate SCSI hard drives (ST20000NM002D)
Reads are cached, and FUA reads appear to work correctly
Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/
Link: https://lore.kernel.org/linux-bcachefs/26585.34711.506258.318405@quad.stoffel.home/T/#m5fffbc0e1c68cf0479c94b9f4ac1bef297333782
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This adds shrinker.to_text() methods for our shrinkers and hooks them up
to our existing to_text() functions.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Previously, we added shrinker_to_text() and hooked it up to the OOM
report - now, the same report is available via debugfs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This patch:
- Changes show_mem() to always report on slab usage
- Instead of reporting on all slabs, reports only on the top 10 slabs,
  in sorted order
- Also reports on shrinkers, with the new shrinkers_to_text()
Shrinkers need to be included in OOM/allocation failure reporting
because they're responsible for memory reclaim - if a shrinker isn't
giving up its memory, we need to know which one and why.
More OOM reporting can be moved to show_mem.c and improved; this patch
is only a start.
New example output on OOM/memory allocation failure:
00177 Mem-Info:
00177 active_anon:13706 inactive_anon:32266 isolated_anon:16
00177 active_file:1653 inactive_file:1822 isolated_file:0
00177 unevictable:0 dirty:0 writeback:0
00177 slab_reclaimable:6242 slab_unreclaimable:11168
00177 mapped:3824 shmem:3 pagetables:1266 bounce:0
00177 kernel_misc_reclaimable:0
00177 free:4362 free_pcp:35 free_cma:0
00177 Node 0 active_anon:54824kB inactive_anon:129064kB active_file:6612kB inactive_file:7288kB unevictable:0kB isolated(anon):64kB isolated(file):0kB mapped:15296kB dirty:0kB writeback:0kB shmem:12kB writeback_tmp:0kB kernel_stack:3392kB pagetables:5064kB all_unreclaimable? no
00177 DMA free:2232kB boost:0kB min:88kB low:108kB high:128kB reserved_highatomic:0KB active_anon:2924kB inactive_anon:6596kB active_file:428kB inactive_file:384kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
00177 lowmem_reserve[]: 0 426 426 426
00177 DMA32 free:15092kB boost:5836kB min:8432kB low:9080kB high:9728kB reserved_highatomic:0KB active_anon:52196kB inactive_anon:122392kB active_file:6176kB inactive_file:7068kB unevictable:0kB writepending:0kB present:507760kB managed:441816kB mlocked:0kB bounce:0kB free_pcp:72kB local_pcp:0kB free_cma:0kB
00177 lowmem_reserve[]: 0 0 0 0
00177 DMA: 284*4kB (UM) 53*8kB (UM) 21*16kB (U) 11*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2248kB
00177 DMA32: 2765*4kB (UME) 375*8kB (UME) 57*16kB (UM) 5*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15132kB
00177 4656 total pagecache pages
00177 1031 pages in swap cache
00177 Swap cache stats: add 6572399, delete 6572173, find 488603/3286476
00177 Free swap = 509112kB
00177 Total swap = 2097148kB
00177 130938 pages RAM
00177 0 pages HighMem/MovableOnly
00177 16644 pages reserved
00177 Unreclaimable slab info:
00177 9p-fcall-cache total: 8.25 MiB active: 8.25 MiB
00177 kernfs_node_cache total: 2.15 MiB active: 2.15 MiB
00177 kmalloc-64 total: 2.08 MiB active: 2.07 MiB
00177 task_struct total: 1.95 MiB active: 1.95 MiB
00177 kmalloc-4k total: 1.50 MiB active: 1.50 MiB
00177 signal_cache total: 1.34 MiB active: 1.34 MiB
00177 kmalloc-2k total: 1.16 MiB active: 1.16 MiB
00177 bch_inode_info total: 1.02 MiB active: 922 KiB
00177 perf_event total: 1.02 MiB active: 1.02 MiB
00177 biovec-max total: 992 KiB active: 960 KiB
00177 Shrinkers:
00177 super_cache_scan: objects: 127
00177 super_cache_scan: objects: 106
00177 jbd2_journal_shrink_scan: objects: 32
00177 ext4_es_scan: objects: 32
00177 bch2_btree_cache_scan: objects: 8
00177 nr nodes: 24
00177 nr dirty: 0
00177 cannibalize lock: 0000000000000000
00177
00177 super_cache_scan: objects: 8
00177 super_cache_scan: objects: 1
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: linux-mm@kvack.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a few new shrinker stats.
Number of objects requested to free, and number of objects freed:
shrinkers won't necessarily free all objects requested, for a variety
of reasons, but if the two counts are wildly different something is
likely amiss.
.scan_objects runtime:
if one shrinker is taking an excessive amount of time to free objects,
that will block kswapd from running other shrinkers.
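A rough sketch of where the accounting might hook in, around the
->scan_objects() call in do_shrink_slab() (the stat field names are
hypothetical):

  u64 start = ktime_get_ns();
  unsigned long freed = shrinker->scan_objects(shrinker, shrinkctl);

  /* Per-shrinker totals, reported alongside the shrinker's other stats */
  atomic64_add(shrinkctl->nr_to_scan, &shrinker->objects_requested);
  atomic64_add(freed, &shrinker->objects_freed);
  atomic64_add(ktime_get_ns() - start, &shrinker->scan_time_ns);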
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: linux-mm@kvack.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This adds a new callback method to shrinkers which they can use to
describe anything relevant to memory reclaim about their internal
state - for example, object dirtiness.
This patch also adds shrinkers_to_text(), which reports on the top 10
shrinkers - by object count - in sorted order, to be used in OOM
reporting.
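A minimal sketch of a .to_text() implementation, assuming the hook is
passed a seq_buf to print into (this mirrors the bch2_btree_cache_scan
output in the show_mem patch; the struct fields are illustrative):

  static void btree_cache_shrinker_to_text(struct seq_buf *out,
					   struct shrinker *shrink)
  {
	  struct btree_cache *bc = shrink->private_data;

	  seq_buf_printf(out, "nr nodes:\t%u\n", bc->nr_nodes);
	  seq_buf_printf(out, "nr dirty:\t%u\n", bc->nr_dirty);
  }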
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: linux-mm@kvack.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
From david@fromorbit.com Tue Aug 27 23:32:26 2024
> > + if (!mutex_trylock(&shrinker_mutex)) {
> > + seq_buf_puts(out, "(couldn't take shrinker lock)");
> > + return;
> > + }
>
> Please don't use the shrinker_mutex like this. There can be tens of
> thousands of entries in the shrinker list (because memcgs) and
> holding the shrinker_mutex for long running traversals like this is
> known to cause latency problems for memcg reaping. If we are at
> ENOMEM, the last thing we want to be doing is preventing memcgs from
> being reaped.
>
> > + list_for_each_entry(shrinker, &shrinker_list, list) {
> > + struct shrink_control sc = { .gfp_mask = GFP_KERNEL, };
>
> This iteration and counting setup is neither node nor memcg aware.
> For node aware shrinkers, this will only count the items freeable
> on node 0, and ignore all the other memory in the system. For memcg
> systems, it will also only scan the root memcg and so miss counting
> any memory in memcg owned caches.
>
> IOWs, the shrinker iteration mechanism needs to iterate both by NUMA
> node and by memcg. On large machines with multiple nodes and hosting
> thousands of memcgs, a total shrinker state iteration has to walk
> a -lot- of structures.
>
> An example of this is drop_slab() - called from
> /proc/sys/vm/drop_caches. It does this to iterate all the
> shrinkers for all the nodes and memcgs in the system:
>
> static unsigned long drop_slab_node(int nid)
> {
> unsigned long freed = 0;
> struct mem_cgroup *memcg = NULL;
>
> memcg = mem_cgroup_iter(NULL, NULL, NULL);
> do {
> freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
> } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
>
> return freed;
> }
>
> void drop_slab(void)
> {
> int nid;
> int shift = 0;
> unsigned long freed;
>
> do {
> freed = 0;
> for_each_online_node(nid) {
> if (fatal_signal_pending(current))
> return;
>
> freed += drop_slab_node(nid);
> }
> } while ((freed >> shift++) > 1);
> }
>
> Hence any iteration for finding the 10 largest shrinkable caches in
> the system needs to do something similar. Only, it needs to iterate
> memcgs first and then aggregate object counts across all nodes for
> shrinkers that are NUMA aware.
>
> Because it needs direct access to the shrinkers, it will need to use
> the RCU lock + refcount method of traversal because that's the only
> safe way to go from memcg to shrinker instance. IOWs, it
> needs to mirror the code in shrink_slab/shrink_slab_memcg to obtain
> a safe reference to the relevant shrinker so it can call
> ->count_objects() and store a refcounted pointer to the shrinker(s)
> that will get printed out after the scan is done....
>
> Once the shrinker iteration is sorted out, I'll look further at the
> rest of the code in this patch...
>
> -Dave.
|
|
This adds a seq_buf wrapper for string_get_size().
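Sketch of the wrapper (the function name is illustrative):

  void seq_buf_human_readable_u64(struct seq_buf *s, u64 v)
  {
	  char buf[10];

	  string_get_size(v, 1, STRING_UNITS_2, buf, sizeof(buf));
	  seq_buf_puts(s, buf);
  }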
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
For some reason arm64's Poly1305 code was changed to ignore the padbit
argument. As a result, the output is incorrect when the message length
is not a multiple of 16 (a case that isn't reached by the standard
ChaCha20Poly1305 construction, but that bcachefs can reach). Fix this.
Fixes: a59e5468a921 ("crypto: arm64/poly1305 - Add block-only interface")
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fix the error_reads mode - it was treated as incompatible with
corrupt_bio_byte, but corrupt_bio_byte is only enabled when it's
nonzero.
Cc: Benjamin Marzinski <bmarzins@redhat.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: dm-devel@lists.linux.dev
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
On filesystems that predate persistent cursors, inode_generation keys
will be present; they're no longer needed because the cursor tracks the
current generation.
But the new inode allocation code skipped allocating slots occupied by
inode_generation keys, which led to a lot of unnecessary scanning; this
patch fixes that.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Use bio_add_folio_nofail() to replace bio_add_folio() in places where
adding the folio cannot fail.
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
For kernel virtual addresses in the direct mapping, there is no need to
add pages to the bio one at a time - the loop can be replaced directly
by bio_add_virt_nofail().
For addresses in the vmalloc region, the underlying physical pages are
discontiguous and must be added to the bio page by page; the helper
bio_add_vmalloc() can be used to simplify the implementation of
bch2_bio_map().
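Combining the two cases, a sketch of the simplified bch2_bio_map()
(assuming the bio was allocated with enough bvecs for the mapping):

  void bch2_bio_map(struct bio *bio, void *base, size_t size)
  {
	  if (is_vmalloc_addr(base))
		  bio_add_vmalloc(bio, base, size);
	  else
		  bio_add_virt_nofail(bio, base, size);
  }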
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506261026.ZxtJ7yeV-lkp@intel.com/
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fix a performance bug when doing many unlinks.
The btree has optimizations to ensure we don't have too many whiteouts
to scan in peek() before we find a real key to return, but unflushed key
cache deletions break this.
To fix this, tweak the existing code for redirecting updates that create
a key to the underlying btree so that we can use it for deletions as
well.
Reported-by: John Schoenick <johns@valvesoftware.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We require that if a key exists in the key cache it also be present in
the underlying btree, for cache coherency reasons.
So checking the key cache on whiteout is unnecessary. This is part of
fixing a major performance bug when doing many unlinks all in a row -
we end up scanning through a ton of key cache whiteouts before peek()
can return a real key.
Reported-by: John Schoenick <johns@valvesoftware.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Don't delete dirents or extents if it's the wrong type of inode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
new helper
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
buf->u64s is all we used.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a new fsck flag, FSCK_ERR_SILENT, to suppress logging the error in
dmesg.
Use this for allocator async repair.
Also, make sure that we _do_ still log silent error correction in the
journal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The 'ip' parameter to bch2_trans_kmalloc() is used for bump allocator
tracing: when we exceed the bump allocator limit, it dumps a list of
allocations and which functions made them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Convert standard error codes to private error codes, and return them
with bch_err_throw(), for better debugging.
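For example (the specific private error code name is hypothetical):

  /* before: generic errno, no record of where it originated */
  return -ENOMEM;

  /* after: counted per filesystem and traceable to this callsite */
  return bch_err_throw(c, ENOMEM_btree_node_alloc);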
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Kill some temporaries on the stack that were completely unnecessary.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add an option for completely disabling casefolding on a filesystem, as a
workaround for overlayfs.
This should only be needed as a temporary workaround, until the
overlayfs fix arrives.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Reported-by: syzbot+cc7567f096079cb4146f@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Checking for invalid IDs was introduced in 9e7cfb35e266 ("bcachefs:
Check for invalid btree IDs") to prevent an invalid shift later, but
since 141526548052 ("bcachefs: Bad btree roots are now autofix"), which
made btree_root_bkey_invalid autofix, the fsck_err_on() call hasn't
done anything.
We can mark this error type (invalid_btree_id) autofix as well, so it
gets handled.
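The change itself is a one-line flag update in the superblock error
table; sketch (the error ID number is illustrative):

  -	x(invalid_btree_id,	274,	0)
  +	x(invalid_btree_id,	274,	FSCK_AUTOFIX)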
Reported-by: syzbot+029d1989099aa5ae3e89@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=029d1989099aa5ae3e89
Fixes: 141526548052 ("bcachefs: Bad btree roots are now autofix")
Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fix a 6.16 regression from the recovery pass rework, which introduced a
bug where calling bch2_run_explicit_recovery_pass() would only return
the error code to rewind recovery for the first call that scheduled that
recovery pass.
If the error code from the first call was swallowed (because it was
called by an asynchronous codepath), subsequent calls would go "ok, this
pass is already marked as needing to run" and return 0.
Fixing this ensures that check_topology bails out to run btree_node_scan
before doing any repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Previously, calling bch2_btree_has_scanned_nodes() when btree node scan
hadn't actually run would erroneously return false - causing us to
think a btree was entirely gone.
This fixes a 6.16 regression from moving the scheduling of btree node
scan out of bch2_btree_lost_data() (fixing the bug where we'd schedule
it persistently in the superblock) and only scheduling it when
check_topology() is asking for scanned btree nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Autofix is specified in btree_gc.c if it's not an important btree.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
wait_on_allocator() emits debug info when we hang trying to allocate.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Reported-by: syzbot+d540192e763531d307ff@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Before calling bch2_indirect_extent_missing_error(), we have to
calculate the missing range, which is the intersection of the reflink
pointer and the non-indirect-extent we found.
The calculation didn't take into account that the returned extent may
span the iter position, leading to an infinite loop when we
(unnecessarily) resized the extent we were returning to one that didn't
extend past the offset we were looking up.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|