Age | Commit message (Collapse) | Author |
|
Make sure we return a standard error code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
For now we only have one key type in these btrees, but forward
compatibility means we do have to check.
Reported-by: syzbot+b4cb4a6988aced0cec4b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes exceeding the bump allocator limit when the allocator finds
many buckets that need repair - they're repaired asynchronously, which
means that every error logged a message in the bump allocator, without
committing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Don't change buf->size on error - this would usually be a transaction
restart, but it could also be -ENOMEM - when we've exceeded the bump
allocator max).
Fixes: 247abee6ae6d ("bcachefs: btree_trans_subbuf")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The normal fsck code doesn't call key_visible_in_snapshot() with an
empty list of snapshot IDs seen (the current snapshot ID will always be
on the list), but str_hash_repair_key() ->
bch2_get_snapshot_overwrites() can, and that's totally fine as long as
we check for it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
check_directory_structure runs after check_dirents, so it expects that
it won't see any inodes with missing backpointers - normally.
But online fsck can't run check_dirents yet, or the user might only be
running a specific pass, so we need to be careful that this isn't an
error. If an inode is unreachable, that's handled by a separate pass.
Also, add a new 'bch2_inode_has_backpointer()' helper, since we were
doing this inconsistently.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
btree node scrub was sometimes failing to rewrite nodes with errors;
bch2_btree_node_rewrite() can return a transaction restart and we
weren't checking - the lockrestart_do() needs to wrap the entire
operation.
And there's a better helper it should've been using,
bch2_btree_node_rewrite_key(), which makes all this more convenient.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We can only pass negative error codes to bch2_err_str(); if it's a
positive integer it's not an error and we trip an assert.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
A path exists in a particular snapshot: we should do the pathwalk in the
snapshot ID of the inode we started from, _not_ change snapshot ID as we
walk inodes and dirents.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We can easily go from inode number -> path now, which makes for more
useful log messages.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Log the inode's new path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When we find a directory connectivity problem, we should do the repair
in the oldest snapshot that has the issue - so that we don't end up
duplicating work or making a real mess of things.
Oldest snapshot IDs have the highest integer value, so - just walk
inodes in reverse order.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bch_subvolume.fs_path_parent needs to be updated as well, it should
match inode.bi_parent_subvol.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The dirent will be in a different snapshot if the inode is a subvolume
root.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The bch2_subvolume_get_snapshot() call needs to happen before the dirent
lookup - the dirent is in the parent subvolume.
Also, check for loops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
kthread creation checks for pending signals, which is _very_ annoying if
we have to do a long recovery and don't go rw until we've done
significant work.
Check if we'll be going rw and pre-allocate kthreads/workqueues.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Print out more info when we find a key (extent, dirent, xattr) for a
missing inode - was there a good inode in an older snapshot, full(ish)
list of keys for that missing inode, so we can make better decisions on
how to repair.
If it looks like it should've been deleted, autofix it. If we ever hit
the non-autofix cases, we'll want to write more repair code (possibly
reconstituting the inode).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
After cd3cdb1ef706 ("Single err message for btree node reads"),
all errors caused __btree_err to return -BCH_ERR_fsck_fix no matter what
the actual error type was if the recovery pass was scanning for btree
nodes. This lead to the code continuing despite things like bad node
formats when they earlier would have caused a jump to fsck_err, because
btree_err only jumps when the return from __btree_err does not match
fsck_fix. Ultimately this lead to undefined behavior by attempting to
unpack a key based on an invalid format.
Make only errors of type -BCH_ERR_btree_node_read_err_fixable cause
__btree_err to return -BCH_ERR_fsck_fix when scanning for btree nodes.
Reported-by: syzbot+cfd994b9cdf00446fd54@syzkaller.appspotmail.com
Fixes: cd3cdb1ef706 ("bcachefs: Single err message for btree node reads")
Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
btree_interior_update_pool has not been initialized before the
filesystem becomes read-write, thus mempool_alloc in bch2_btree_update_start
will trigger pool->alloc NULL pointer dereference in mempool_alloc_noprof
Reported-by: syzbot+2f3859bd28f20fa682e6@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
In syzbot's crash, the bset's u64s is larger than the btree node.
Reported-by: syzbot+bfaeaa8e26281970158d@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Dead code cleanup.
Link: https://lore.kernel.org/linux-bcachefs/20250612224059.39fddd07@batman.local.home/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a mount option for rewinding the journal, bringing the entire
filesystem to where it was at a previous point in time.
This is for extreme disaster recovery scenarios - it's not intended as
an undelete operation.
The option takes a journal sequence number; the desired sequence number
can be determined with 'bcachefs list_journal'
Caveats:
- The 'journal_transaction_names' option must have been enabled (it's on
by default). The option controls emitting of extra debug info in the
journal, so we can see what individual transactions were doing;
It also enables journalling of keys being overwritten, which is what
we rely on here.
- A full fsck run will be automatically triggered since alloc info will
be inconsistent. Only leaf node updates to non-alloc btrees are
rewound, since rewinding interior btree updates isn't possible or
desirable.
- We can't do anything about data that was deleted and overwritten.
Lots of metadata updates after the point in time we're rewinding to
shouldn't cause a problem, since we segragate data and metadata
allocations (this is in order to make repair by btree node scan
practical on larger filesystems; there's a small 64-bit per device
bitmap in the superblock of device ranges with btree nodes, and we try
to keep this small).
However, having discards enabled will cause problems, since buckets
are discarded as soon as they become empty (this is why we don't
implement fstrim: we don't need it).
Hopefully, this feature will be a one-off thing that's never used
again: this was implemented for recovering from the "vfs i_nlink 0 ->
subvol deletion" bug, and that bug was unusually disastrous and
additional safeguards have since been implemented.
But if it does turn out that we need this more in the future, I'll
have to implement an option so that empty buckets aren't discarded
immediately - lagging by perhaps 1% of device capacity.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fix the case where we're deleting in a different snapshot and need to
emit a whiteout - that requires a regular BTREE_ITER_filter_snapshots
iterator.
Also, only delete the part of the extent that extents past i_size.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
the inode btree uses the offset field for the inum, not the inode field.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When the inode was a whiteout, we were inserting a new whiteout at the
wrong (old) snapshot.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Check against version_incompat_allowed, not version_incompat.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Prep work for journal rewind, where the seq we're replaying from may be
different than the last journal entry's last_seq.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Previously, we weren't checking the result of the skiplist walk, just
the is_ancestor bitmap.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We need to start searching from search_key - _not_ path->pos, which will
point to the key we found in the btree
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
this code is rarely invoked, so - we had a few bugs left from basing it
off of bch2_journal_keys_peek_max()...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When there is commit error that need split btree leaf, fsck might change
the value of trans->journal_entries.u64s, when retry commit, the value of
trans->journal_u64s would be incorrect, which will lead to trans->journal_res.u64s
underflow, and then out of bounds write will occur:
[ 464.496970][T11969] Call trace:
[ 464.496973][T11969] show_stack+0x3c/0x88 (C)
[ 464.496995][T11969] dump_stack_lvl+0xf8/0x178
[ 464.497014][T11969] dump_stack+0x20/0x30
[ 464.497031][T11969] __bch2_trans_log_str+0x344/0x350
[ 464.497048][T11969] bch2_trans_log_str+0x3c/0x60
[ 464.497065][T11969] __bch2_fsck_err+0x11bc/0x1390
[ 464.497083][T11969] bch2_check_discard_freespace_key+0xad4/0x10d0
[ 464.497100][T11969] bch2_bucket_alloc_freelist+0x99c/0x1130
[ 464.497117][T11969] bch2_bucket_alloc_trans+0x79c/0xcb8
[ 464.497133][T11969] bch2_bucket_alloc_set_trans+0x378/0xc20
[ 464.497151][T11969] __open_bucket_add_buckets+0x7fc/0x1c00
[ 464.497168][T11969] open_bucket_add_buckets+0x184/0x3a8
[ 464.497185][T11969] bch2_alloc_sectors_start_trans+0xa04/0x1da0
[ 464.497203][T11969] bch2_btree_reserve_get+0x6e0/0xef0
[ 464.497220][T11969] bch2_btree_update_start+0x1618/0x2600
[ 464.497239][T11969] bch2_btree_split_leaf+0xcc/0x730
[ 464.497258][T11969] bch2_trans_commit_error+0x22c/0xc30
[ 464.497276][T11969] __bch2_trans_commit+0x207c/0x4e30
[ 464.497292][T11969] bch2_journal_replay+0x9e0/0x1420
[ 464.497305][T11969] __bch2_run_recovery_passes+0x458/0xf98
[ 464.497318][T11969] bch2_run_recovery_passes+0x280/0x478
[ 464.497331][T11969] bch2_fs_recovery+0x24f0/0x3a28
[ 464.497344][T11969] bch2_fs_start+0xb80/0x1248
[ 464.497358][T11969] bch2_fs_get_tree+0xe94/0x1708
[ 464.497377][T11969] vfs_get_tree+0x84/0x2d0
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Just like the EBUG_ON in bch2_journal_add_entry().
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Now the alloc_req is allocated from the bump allocator, if there is
reallocation, the memory of alloc_req would be frees, fix by delaying the
reallocation to transaction restart, it has to restart anyway.
Reported-by: syzbot+2887a13a5c387e616a68@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Allocating new memory when mempool is exhausted is too complicated, just
return ENOMEM is fine. memcpy is not needed, since there might be
pointers point to the old memory, that's the bug.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We've been seeing some livelock-ish behavior in the index update part of
the main write path, and while we've got low level btree path
tracepoints, we've been lacking high level btree iterator tracepoints.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a tracepoint for when we insert only part of an extent, due to too
many overwrites.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Move warnings about linux/export.h from W=1 to W=2
- Fix structure type overrides in gendwarfksyms
* tag 'kbuild-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
gendwarfksyms: Fix structure type overrides
kbuild: move warnings about linux/export.h from W=1 to W=2
|
|
As we always iterate through the entire die_map when expanding
type strings, recursively processing referenced types in
type_expand_child() is not actually necessary. Furthermore,
the type_string kABI rule added in commit c9083467f7b9
("gendwarfksyms: Add a kABI rule to override type strings") can
fail to override type strings for structures due to a missing
kabi_get_type_string() check in this function.
Fix the issue by dropping the unnecessary recursion and moving
the override check to type_expand(). Note that symbol versions
are otherwise unchanged with this patch.
Fixes: c9083467f7b9 ("gendwarfksyms: Add a kABI rule to override type strings")
Reported-by: Giuliano Procida <gprocida@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
This hides excessive warnings, as nobody builds with W=2.
Fixes: a934a57a42f6 ("scripts/misc-check: check missing #include <linux/export.h> when W=1")
Fixes: 7d95680d64ac ("scripts/misc-check: check unnecessary #include <linux/export.h> when W=1")
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Pull smb client fixes from Steve French:
- SMB3.1.1 POSIX extensions fix for char remapping
- Fix for repeated directory listings when directory leases enabled
- deferred close handle reuse fix
* tag 'v6.16-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: improve directory cache reuse for readdir operations
smb: client: fix perf regression with deferred closes
smb: client: disable path remapping with POSIX extensions
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu fix from Joerg Roedel:
- Fix PTE size calculation for NVidia Tegra
* tag 'iommu-fixes-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
iommu/tegra: Fix incorrect size calculation
|
|
Pull block fixes from Jens Axboe:
- Fix for a deadlock on queue freeze with zoned writes
- Fix for zoned append emulation
- Two bio folio fixes, for sparsemem and for very large folios
- Fix for a performance regression introduced in 6.13 when plug
insertion was changed
- Fix for NVMe passthrough handling for polled IO
- Document the ublk auto registration feature
- loop lockdep warning fix
* tag 'block-6.16-20250614' of git://git.kernel.dk/linux:
nvme: always punt polled uring_cmd end_io work to task_work
Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists
block: Fix bvec_set_folio() for very large folios
bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP
block: use plug request list tail for one-shot backmerge attempt
block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work
block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion
ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
loop: move lo_set_size() out of queue freeze
|
|
Pull io_uring fixes from Jens Axboe:
- Fix for a race between SQPOLL exit and fdinfo reading.
It's slim and I was only able to reproduce this with an artificial
delay in the kernel. Followup sparse fix as well to unify the access
to ->thread.
- Fix for multiple buffer peeking, avoiding truncation if possible.
- Run local task_work for IOPOLL reaping when the ring is exiting.
This currently isn't done due to an assumption that polled IO will
never need task_work, but a fix on the block side is going to change
that.
* tag 'io_uring-6.16-20250614' of git://git.kernel.dk/linux:
io_uring: run local task_work from ring exit IOPOLL reaping
io_uring/kbuf: don't truncate end buffer for multiple buffer peeks
io_uring: consistently use rcu semantics with sqpoll thread
io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux
Pull Rust fix from Miguel Ojeda:
- 'hrtimer': fix future compile error when the 'impl_has_hr_timer!'
macro starts to get called
* tag 'rust-fixes-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
rust: time: Fix compile error in impl_has_hr_timer macro
|