summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-08-06xfs: grab active perag ref when reading AG headersDarrick J. Wong
This patch prepares scrub to deal with the possibility of tearing down entire AGs by changing the order of resource acquisition to match the rest of the XFS codebase. In other words, scrub now grabs AG resources in order of: perag structure, then AGI/AGF/AGFL buffers, then btree cursors; and releases them in reverse order. This requires us to distinguish xchk_ag_init callers -- some are responding to a user request to check AG metadata, in which case we can return ENOENT to userspace; but other callers have an ondisk reference to an AG that they're trying to cross-reference. In this second case, the lack of an AG means there's ondisk corruption, since ondisk metadata cannot point into nonexistent space. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-08-06xfs: automatic resource cleanup of for_each_perag*Darrick J. Wong
Jan Kara pointed me to a magic gcc attribute that schedules a finalizer function to run against any automatic variable that's going out of scope. Provided nobody minds compiling XFS with gnu99 instead of c89, we can use this capability to define a loop-scope iterator variable that will trigger the automatic finalizer, which means that we don't have to burden for_each_perag* callers with the necessity of manually putting the perag structures when exiting the loop body. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-08-06xfs: drop experimental warnings for bigtime and inobtcountDarrick J. Wong
These two features were merged a year ago, userspace tooling have been merged, and no serious errors have been reported by the developers. Drop the experimental tag to encourage wider testing. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2021-08-06xfs: fix silly whitespace problems with kernel libxfsDarrick J. Wong
Fix a few whitespace errors such as spaces at the end of the line, etc. This gets us back to something more closely resembling parity. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-08-06xfs: throttle inode inactivation queuing on memory reclaimdeferred-inactivation-5.15_2021-08-06Darrick J. Wong
Now that we defer inode inactivation, we've decoupled the process of unlinking or closing an inode from the process of inactivating it. In theory this should lead to better throughput since we now inactivate the queued inodes in batches instead of one at a time. Unfortunately, one of the primary risks with this decoupling is the loss of rate control feedback between the frontend and background threads. In other words, a rm -rf /* thread can run the system out of memory if it can queue inodes for inactivation and jump to a new CPU faster than the background threads can actually clear the deferred work. The workers can get scheduled off the CPU if they have to do IO, etc. To solve this problem, we configure a shrinker so that it will activate the /second/ time the shrinkers are called. The custom shrinker will queue all percpu deferred inactivation workers immediately and set a flag to force frontend callers who are releasing a vfs inode to wait for the inactivation workers. On my test VM with 560M of RAM and a 2TB filesystem, this seems to solve most of the OOMing problem when deleting 10 million inodes. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-06xfs: avoid buffer deadlocks when walking fs inodesDarrick J. Wong
When we're servicing an INUMBERS or BULKSTAT request or running quotacheck, grab an empty transaction so that we can use its inherent recursive buffer locking abilities to detect inode btree cycles without hitting ABBA buffer deadlocks. This patch requires the deferred inode inactivation patchset because xfs_irele cannot directly call xfs_inactive when the iwalk itself has an (empty) transaction. Found by fuzzing an inode btree pointer to introduce a cycle into the tree (xfs/365). Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-08-06xfs: use background worker pool when transactions can't get free spaceDarrick J. Wong
In xfs_trans_alloc, if the block reservation call returns ENOSPC, we call xfs_blockgc_free_space with a NULL icwalk structure to try to free space. Each frontend thread that encounters this situation starts its own walk of the inode cache to see if it can find anything, which is wasteful since we don't have any additional selection criteria. For this one common case, create a function that reschedules all pending background work immediately and flushes the workqueue so that the scan can run in parallel. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-06xfs: don't run speculative preallocation gc when fs is frozenDarrick J. Wong
Now that we have the infrastructure to switch background workers on and off at will, fix the block gc worker code so that we don't actually run the worker when the filesystem is frozen, same as we do for deferred inactivation. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-06xfs: flush inode inactivation work when compiling usage statisticsDarrick J. Wong
Users have come to expect that the space accounting information in statfs and getquota reports are fairly accurate. Now that we inactivate inodes from a background queue, these numbers can be thrown off by whatever resources are singly-owned by the inodes in the queue. Flush the pending inactivations when userspace asks for a space usage report. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-06xfs: inactivate inodes any time we try to free speculative preallocationsDarrick J. Wong
Other parts of XFS have learned to call xfs_blockgc_free_{space,quota} to try to free speculative preallocations when space is tight. This means that file writes, transaction reservation failures, quota limit enforcement, and the EOFBLOCKS ioctl all call this function to free space when things are tight. Since inode inactivation is now a background task, this means that the filesystem can be hanging on to unlinked but not yet freed space. Add this to the list of things that xfs_blockgc_free_* makes writer threads scan for when they cannot reserve space. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-06xfs: queue inactivation immediately when free realtime extents are tightDarrick J. Wong
Now that we have made the inactivation of unlinked inodes a background task to increase the throughput of file deletions, we need to be a little more careful about how long of a delay we can tolerate. Similar to the patch doing this for free space on the data device, if the file being inactivated is a realtime file and the realtime volume is running low on free extents, we want to run the worker ASAP so that the realtime allocator can make better decisions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-06xfs: queue inactivation immediately when quota is nearing enforcementDarrick J. Wong
Now that we have made the inactivation of unlinked inodes a background task to increase the throughput of file deletions, we need to be a little more careful about how long of a delay we can tolerate. Specifically, if the dquots attached to the inode being inactivated are nearing any kind of enforcement boundary, we want to queue that inactivation work immediately so that users don't get EDQUOT/ENOSPC errors even after they deleted a bunch of files to stay within quota. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-06xfs: queue inactivation immediately when free space is tightDarrick J. Wong
Now that we have made the inactivation of unlinked inodes a background task to increase the throughput of file deletions, we need to be a little more careful about how long of a delay we can tolerate. On a mostly empty filesystem, the risk of the allocator making poor decisions due to fragmentation of the free space on account a lengthy delay in background updates is minimal because there's plenty of space. However, if free space is tight, we want to deallocate unlinked inodes as quickly as possible to avoid fallocate ENOSPC and to give the allocator the best shot at optimal allocations for new writes. Therefore, queue the percpu worker immediately if the filesystem is more than 95% full. This follows the same principle that XFS becomes less aggressive about speculative allocations and lazy cleanup (and more precise about accounting) when nearing full. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-06xfs: per-cpu deferred inode inactivation queuesDave Chinner
Move inode inactivation to background work contexts so that it no longer runs in the context that releases the final reference to an inode. This will allow process work that ends up blocking on inactivation to continue doing work while the filesytem processes the inactivation in the background. A typical demonstration of this is unlinking an inode with lots of extents. The extents are removed during inactivation, so this blocks the process that unlinked the inode from the directory structure. By moving the inactivation to the background process, the userspace applicaiton can keep working (e.g. unlinking the next inode in the directory) while the inactivation work on the previous inode is done by a different CPU. The implementation of the queue is relatively simple. We use a per-cpu lockless linked list (llist) to queue inodes for inactivation without requiring serialisation mechanisms, and a work item to allow the queue to be processed by a CPU bound worker thread. We also keep a count of the queue depth so that we can trigger work after a number of deferred inactivations have been queued. The use of a bound workqueue with a single work depth allows the workqueue to run one work item per CPU. We queue the work item on the CPU we are currently running on, and so this essentially gives us affine per-cpu worker threads for the per-cpu queues. THis maintains the effective CPU affinity that occurs within XFS at the AG level due to all objects in a directory being local to an AG. Hence inactivation work tends to run on the same CPU that last accessed all the objects that inactivation accesses and this maintains hot CPU caches for unlink workloads. A depth of 32 inodes was chosen to match the number of inodes in an inode cluster buffer. This hopefully allows sequential allocation/unlink behaviours to defering inactivation of all the inodes in a single cluster buffer at a time, further helping maintain hot CPU and buffer cache accesses while running inactivations. A hard per-cpu queue throttle of 256 inode has been set to avoid runaway queuing when inodes that take a long to time inactivate are being processed. For example, when unlinking inodes with large numbers of extents that can take a lot of processing to free. Signed-off-by: Dave Chinner <dchinner@redhat.com> [djwong: tweak comments and tracepoints, convert opflags to state bits] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-06xfs: detach dquots from inode if we don't need to inactivate itDarrick J. Wong
If we don't need to inactivate an inode, we can detach the dquots and move on to reclamation. This isn't strictly required here; it's a preparation patch for deferred inactivation per reviewer request[1] to move the creation of xfs_inode_needs_inactivation into a separate change. Eventually this !need_inactive chunk will turn into the code path for inodes that skip xfs_inactive and go straight to memory reclaim. [1] https://lore.kernel.org/linux-xfs/20210609012838.GW2945738@locust/T/#mca6d958521cb88bbc1bfe1a30767203328d410b5 Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-06xfs: move xfs_inactive call to xfs_inode_mark_reclaimableDarrick J. Wong
Move the xfs_inactive call and all the other debugging checks and stats updates into xfs_inode_mark_reclaimable because most of that are implementation details about the inode cache. This is preparation for deferred inactivation that is coming up. We also move it around xfs_icache.c in preparation for deferred inactivation. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-06xfs: introduce all-mounts list for cpu hotplug notificationsDave Chinner
The inode inactivation and CIL tracking percpu structures are per-xfs_mount structures. That means when we get a CPU dead notification, we need to then iterate all the per-cpu structure instances to process them. Rather than keeping linked lists of per-cpu structures in each subsystem, add a list of all xfs_mounts that the generic xfs_cpu_dead() function will iterate and call into each subsystem appropriately. This allows us to handle both per-mount and global XFS percpu state from xfs_cpu_dead(), and avoids the need to link subsystem structures that can be easily found from the xfs_mount into their own global lists. Signed-off-by: Dave Chinner <dchinner@redhat.com> [djwong: expand some comments about mount list setup ordering rules] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-06xfs: introduce CPU hotplug infrastructureDave Chinner
We need to move to per-cpu state for both deferred inode inactivation and CIL tracking, but to do that we need to handle CPUs being removed from the system by the hot-plug code. Introduce generic XFS infrastructure to handle CPU hotplug events that is set up at module init time and torn down at module exit time. Initially, we only need CPU dead notifications, so we only set up a callback for these notifications. The infrastructure can be updated in future for other CPU hotplug state machine notifications easily if ever needed. Signed-off-by: Dave Chinner <dchinner@redhat.com> [djwong: rearrange some macros, fix function prototypes] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-06xfs: remove the active vs running quota differentiationremove-quotaoff-5.15_2021-08-06Christoph Hellwig
These only made a difference when quotaoff supported disabling quota accounting on a mounted file system, so we can switch everyone to use a single set of flags and helpers now. Note that the *QUOTA_ON naming for the helpers is kept as it was the much more commonly used one. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-06xfs: remove the flags argument to xfs_qm_dquot_walkChristoph Hellwig
We always purge all dquots now, so drop the argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-06xfs: remove xfs_dqrele_all_inodesChristoph Hellwig
xfs_dqrele_all_inodes is unused now, remove it and all supporting code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-06xfs: remove support for disabling quota accounting on a mounted file systemChristoph Hellwig
Disabling quota accounting is hairy, racy code with all kinds of pitfalls. And it has a very strange mind set, as quota accounting (unlike enforcement) really is a propery of the on-disk format. There is no good use case for supporting this. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-01Linux 5.14-rc4v5.14-rc4Linus Torvalds
2021-08-01Merge tag 'perf-tools-fixes-for-v5.14-2021-08-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux Pull perf tools fixes from Arnaldo Carvalho de Melo: - Revert "perf map: Fix dso->nsinfo refcounting", this makes 'perf top' abort, uncovering a design flaw on how namespace information is kept. The fix for that is more than we can do right now, leave it for the next merge window. - Split --dump-raw-trace by AUX records for ARM's CoreSight, fixing up the decoding of some records. - Fix PMU alias matching. Thanks to James Clark and John Garry for these fixes. * tag 'perf-tools-fixes-for-v5.14-2021-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: Revert "perf map: Fix dso->nsinfo refcounting" perf pmu: Fix alias matching perf cs-etm: Split --dump-raw-trace by AUX records
2021-08-01Merge tag 'powerpc-5.14-4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Don't use r30 in VDSO code, to avoid breaking existing Go lang programs. - Change an export symbol to allow non-GPL modules to use spinlocks again. Thanks to Paul Menzel, and Srikar Dronamraju. * tag 'powerpc-5.14-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/vdso: Don't use r30 to avoid breaking Go lang powerpc/pseries: Fix regression while building external modules
2021-08-01Merge tag 'xfs-5.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Darrick Wong: "This contains a bunch of bug fixes in XFS. Dave and I have been busy the last couple of weeks to find and fix as many log recovery bugs as we can find; here are the results so far. Go fstests -g recoveryloop! ;) - Fix a number of coordination bugs relating to cache flushes for metadata writeback, cache flushes for multi-buffer log writes, and FUA writes for single-buffer log writes - Fix a bug with incorrect replay of attr3 blocks - Fix unnecessary stalls when flushing logs to disk - Fix spoofing problems when recovering realtime bitmap blocks" * tag 'xfs-5.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: prevent spoofing of rtbitmap blocks when recovering buffers xfs: limit iclog tail updates xfs: need to see iclog flags in tracing xfs: Enforce attr3 buffer recovery order xfs: logging the on disk inode LSN can make it go backwards xfs: avoid unnecessary waits in xfs_log_force_lsn() xfs: log forces imply data device cache flushes xfs: factor out forced iclog flushes xfs: fix ordering violation between cache flushes and tail updates xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog xfs: external logs need to flush data device xfs: flush data dev on external log write
2021-07-31Merge tag '5.14-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "Three cifs/smb3 fixes, including two for stable, and a fix for an fallocate problem noticed by Clang" * tag '5.14-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: add missing parsing of backupuid smb3: rc uninitialized in one fallocate path SMB3: fix readpage for large swap cache
2021-07-30Merge tag 'net-5.14-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.14-rc4, including fixes from bpf, can, WiFi (mac80211) and netfilter trees. Current release - regressions: - mac80211: fix starting aggregation sessions on mesh interfaces Current release - new code bugs: - sctp: send pmtu probe only if packet loss in Search Complete state - bnxt_en: add missing periodic PHC overflow check - devlink: fix phys_port_name of virtual port and merge error - hns3: change the method of obtaining default ptp cycle - can: mcba_usb_start(): add missing urb->transfer_dma initialization Previous releases - regressions: - set true network header for ECN decapsulation - mlx5e: RX, avoid possible data corruption w/ relaxed ordering and LRO - phy: re-add check for PHY_BRCM_DIS_TXCRXC_NOENRGY on the BCM54811 PHY - sctp: fix return value check in __sctp_rcv_asconf_lookup Previous releases - always broken: - bpf: - more spectre corner case fixes, introduce a BPF nospec instruction for mitigating Spectre v4 - fix OOB read when printing XDP link fdinfo - sockmap: fix cleanup related races - mac80211: fix enabling 4-address mode on a sta vif after assoc - can: - raw: raw_setsockopt(): fix raw_rcv panic for sock UAF - j1939: j1939_session_deactivate(): clarify lifetime of session object, avoid UAF - fix number of identical memory leaks in USB drivers - tipc: - do not blindly write skb_shinfo frags when doing decryption - fix sleeping in tipc accept routine" * tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits) gve: Update MAINTAINERS list can: esd_usb2: fix memory leak can: ems_usb: fix memory leak can: usb_8dev: fix memory leak can: mcba_usb_start(): add missing urb->transfer_dma initialization can: hi311x: fix a signedness bug in hi3110_cmd() MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver bpf: Fix leakage due to insufficient speculative store bypass mitigation bpf: Introduce BPF nospec instruction for mitigating Spectre v4 sis900: Fix missing pci_disable_device() in probe and remove net: let flow have same hash in two directions nfc: nfcsim: fix use after free during module unload tulip: windbond-840: Fix missing pci_disable_device() in probe and remove sctp: fix return value check in __sctp_rcv_asconf_lookup nfc: s3fwrn5: fix undefined parameter values in dev_err() net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32 net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev() net/mlx5: Unload device upon firmware fatal error net/mlx5e: Fix page allocation failure for ptp-RQ over SF net/mlx5e: Fix page allocation failure for trap-RQ over SF ...
2021-07-30Merge tag 'acpi-5.14-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI fixes from Rafael Wysocki: "These revert a recent IRQ resources handling modification that turned out to be problematic, fix suspend-to-idle handling on AMD platforms to take upcoming systems into account properly and fix the retrieval of the DPTF attributes of the PCH FIVR. Specifics: - Revert recent change of the ACPI IRQ resources handling that attempted to improve the ACPI IRQ override selection logic, but introduced serious regressions on some systems (Hui Wang). - Fix up quirks for AMD platforms in the suspend-to-idle support code so as to take upcoming systems using uPEP HID AMDI007 into account as appropriate (Mario Limonciello). - Fix the code retrieving DPTF attributes of the PCH FIVR so that it agrees on the return data type with the ACPI control method evaluated for this purpose (Srinivas Pandruvada)" * tag 'acpi-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: DPTF: Fix reading of attributes Revert "ACPI: resources: Add checks for ACPI IRQ override" ACPI: PM: Add support for upcoming AMD uPEP HID AMDI007
2021-07-30pipe: make pipe writes always wake up readersLinus Torvalds
Since commit 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup logic") we have sanitized the pipe write logic, and would only try to wake up readers if they needed it. In particular, if the pipe already had data in it before the write, there was no point in trying to wake up a reader, since any existing readers must have been aware of the pre-existing data already. Doing extraneous wakeups will only cause potential thundering herd problems. However, it turns out that some Android libraries have misused the EPOLL interface, and expected "edge triggered" be to "any new write will trigger it". Even if there was no edge in sight. Quoting Sandeep Patil: "The commit 1b6b26ae7053 ('pipe: fix and clarify pipe write wakeup logic') changed pipe write logic to wakeup readers only if the pipe was empty at the time of write. However, there are libraries that relied upon the older behavior for notification scheme similar to what's described in [1] One such library 'realm-core'[2] is used by numerous Android applications. The library uses a similar notification mechanism as GNU Make but it never drains the pipe until it is full. When Android moved to v5.10 kernel, all applications using this library stopped working. The library has since been fixed[3] but it will be a while before all applications incorporate the updated library" Our regression rule for the kernel is that if applications break from new behavior, it's a regression, even if it was because the application did something patently wrong. Also note the original report [4] by Michal Kerrisk about a test for this epoll behavior - but at that point we didn't know of any actual broken use case. So add the extraneous wakeup, to approximate the old behavior. [ I say "approximate", because the exact old behavior was to do a wakeup not for each write(), but for each pipe buffer chunk that was filled in. The behavior introduced by this change is not that - this is just "every write will cause a wakeup, whether necessary or not", which seems to be sufficient for the broken library use. ] It's worth noting that this adds the extraneous wakeup only for the write side, while the read side still considers the "edge" to be purely about reading enough from the pipe to allow further writes. See commit f467a6a66419 ("pipe: fix and clarify pipe read wakeup logic") for the pipe read case, which remains that "only wake up if the pipe was full, and we read something from it". Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1] Link: https://github.com/realm/realm-core [2] Link: https://github.com/realm/realm-core/issues/4666 [3] Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4] Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/ Reported-by: Sandeep Patil <sspatil@android.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-30Revert "perf map: Fix dso->nsinfo refcounting"Arnaldo Carvalho de Melo
This makes 'perf top' abort in some cases, and the right fix will involve surgery that is too much to do at this stage, so revert for now and fix it in the next merge window. This reverts commit 2d6b74baa7147251c30a46c4996e8cc224aa2dc5. Cc: Riccardo Mancini <rickyman7@gmail.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Krister Johansen <kjlx@templeofstupid.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2021-07-30Merge branches 'acpi-resources' and 'acpi-dptf'Rafael J. Wysocki
* acpi-resources: Revert "ACPI: resources: Add checks for ACPI IRQ override" * acpi-dptf: ACPI: DPTF: Fix reading of attributes
2021-07-30Merge tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - gendisk freeing fix (Christoph) - blk-iocost wake ordering fix (Tejun) - tag allocation error handling fix (John) - loop locking fix. While this isn't the prettiest fix in the world, nobody has any good alternatives for 5.14. Something to likely revisit for 5.15. (Tetsuo) * tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block: block: delay freeing the gendisk blk-iocost: fix operation ordering in iocg_wake_fn() blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handling loop: reintroduce global lock for safe loop_validate_file() traversal
2021-07-30Merge tag 'io_uring-5.14-2021-07-30' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: - A fix for block backed reissue (me) - Reissue context hardening (me) - Async link locking fix (Pavel) * tag 'io_uring-5.14-2021-07-30' of git://git.kernel.dk/linux-block: io_uring: fix poll requests leaking second poll entries io_uring: don't block level reissue off completion path io_uring: always reissue from task_work context io_uring: fix race in unified task_work running io_uring: fix io_prep_async_link locking
2021-07-30Merge tag 'libata-5.14-2021-07-30' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull libata fixlets from Jens Axboe: - A fix for PIO highmem (Christoph) - Kill HAVE_IDE as it's now unused (Lukas) * tag 'libata-5.14-2021-07-30' of git://git.kernel.dk/linux-block: arch: Kconfig: clean up obsolete use of HAVE_IDE libata: fix ata_pio_sector for CONFIG_HIGHMEM
2021-07-30Merge tag 'for-5.14-rc3-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix -Warray-bounds warning, to help external patchset to make it default treewide - fix writeable device accounting (syzbot report) - fix fsync and log replay after a rename and inode eviction - fix potentially lost error code when submitting multiple bios for compressed range * tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: calculate number of eb pages properly in csum_tree_block btrfs: fix rw device counting in __btrfs_free_extra_devids btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction btrfs: mark compressed range uptodate only if all bio succeed
2021-07-30Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid Pull HID fixes from Jiri Kosina: - resume timing fix for intel-ish driver (Ye Xiang) - fix for using incorrect MMIO register in amd_sfh driver (Dylan MacKenzie) - Cintiq 24HDT / 27QHDT regression fix and touch processing fix for Wacom driver (Jason Gerecke) - device removal bugfix for ft260 driver (Michael Zaidman) - other small assorted fixes * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: HID: ft260: fix device removal due to USB disconnect HID: wacom: Skip processing of touches with negative slot values HID: wacom: Re-enable touch by default for Cintiq 24HDT / 27QHDT HID: Kconfig: Fix spelling mistake "Uninterruptable" -> "Uninterruptible" HID: apple: Add support for Keychron K1 wireless keyboard HID: fix typo in Kconfig HID: ft260: fix format type warning in ft260_word_show() HID: amd_sfh: Use correct MMIO register for DMA address HID: asus: Remove check for same LED brightness on set HID: intel-ish-hid: use async resume function
2021-07-30Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "7 patches. Subsystems affected by this patch series: lib, ocfs2, and mm (slub, migration, and memcg)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook() slub: fix unreclaimable slab stat for bulk free mm/migrate: fix NR_ISOLATED corruption on 64-bit mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code ocfs2: issue zeroout to EOF blocks ocfs2: fix zero out valid data lib/test_string.c: move string selftest in the Runtime Testing menu
2021-07-30Merge tag 'linux-can-fixes-for-5.14-20210730' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2021-07-30 The first patch is by me and adds Yasushi SHOJI as a reviewer for the Microchip CAN BUS Analyzer Tool driver. Dan Carpenter's patch fixes a signedness bug in the hi311x driver. Pavel Skripkin provides 4 patches, the first targets the mcba_usb driver by adding the missing urb->transfer_dma initialization, which was broken in a previous commit. The last 3 patches fix a memory leak in the usb_8dev, ems_usb and esd_usb2 driver. * tag 'linux-can-fixes-for-5.14-20210730' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can: can: esd_usb2: fix memory leak can: ems_usb: fix memory leak can: usb_8dev: fix memory leak can: mcba_usb_start(): add missing urb->transfer_dma initialization can: hi311x: fix a signedness bug in hi3110_cmd() MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver ==================== Link: https://lore.kernel.org/r/20210730070526.1699867-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-30mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()Wang Hai
When I use kfree_rcu() to free a large memory allocated by kmalloc_node(), the following dump occurs. BUG: kernel NULL pointer dereference, address: 0000000000000020 [...] Oops: 0000 [#1] SMP [...] Workqueue: events kfree_rcu_work RIP: 0010:__obj_to_index include/linux/slub_def.h:182 [inline] RIP: 0010:obj_to_index include/linux/slub_def.h:191 [inline] RIP: 0010:memcg_slab_free_hook+0x120/0x260 mm/slab.h:363 [...] Call Trace: kmem_cache_free_bulk+0x58/0x630 mm/slub.c:3293 kfree_bulk include/linux/slab.h:413 [inline] kfree_rcu_work+0x1ab/0x200 kernel/rcu/tree.c:3300 process_one_work+0x207/0x530 kernel/workqueue.c:2276 worker_thread+0x320/0x610 kernel/workqueue.c:2422 kthread+0x13d/0x160 kernel/kthread.c:313 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 When kmalloc_node() a large memory, page is allocated, not slab, so when freeing memory via kfree_rcu(), this large memory should not be used by memcg_slab_free_hook(), because memcg_slab_free_hook() is is used for slab. Using page_objcgs_check() instead of page_objcgs() in memcg_slab_free_hook() to fix this bug. Link: https://lkml.kernel.org/r/20210728145655.274476-1-wanghai38@huawei.com Fixes: 270c6a71460e ("mm: memcontrol/slab: Use helpers to access slab page's memcg_data") Signed-off-by: Wang Hai <wanghai38@huawei.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-30slub: fix unreclaimable slab stat for bulk freeShakeel Butt
SLUB uses page allocator for higher order allocations and update unreclaimable slab stat for such allocations. At the moment, the bulk free for SLUB does not share code with normal free code path for these type of allocations and have missed the stat update. So, fix the stat update by common code. The user visible impact of the bug is the potential of inconsistent unreclaimable slab stat visible through meminfo and vmstat. Link: https://lkml.kernel.org/r/20210728155354.3440560-1-shakeelb@google.com Fixes: 6a486c0ad4dc ("mm, sl[ou]b: improve memory accounting") Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-30mm/migrate: fix NR_ISOLATED corruption on 64-bitAneesh Kumar K.V
Similar to commit 2da9f6305f30 ("mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit") avoid using unsigned int for nr_pages. With unsigned int type the large unsigned int converts to a large positive signed long. Symptoms include CMA allocations hanging forever due to alloc_contig_range->...->isolate_migratepages_block waiting forever in "while (unlikely(too_many_isolated(pgdat)))". Link: https://lkml.kernel.org/r/20210728042531.359409-1-aneesh.kumar@linux.ibm.com Fixes: c5fc5c3ae0c8 ("mm: migrate: account THP NUMA migration counters correctly") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reported-by: Michael Ellerman <mpe@ellerman.id.au> Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-30mm: memcontrol: fix blocking rstat function called from atomic cgroup1 ↵Johannes Weiner
thresholding code Dan Carpenter reports: The patch 2d146aa3aa84: "mm: memcontrol: switch to rstat" from Apr 29, 2021, leads to the following static checker warning: kernel/cgroup/rstat.c:200 cgroup_rstat_flush() warn: sleeping in atomic context mm/memcontrol.c 3572 static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) 3573 { 3574 unsigned long val; 3575 3576 if (mem_cgroup_is_root(memcg)) { 3577 cgroup_rstat_flush(memcg->css.cgroup); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This is from static analysis and potentially a false positive. The problem is that mem_cgroup_usage() is called from __mem_cgroup_threshold() which holds an rcu_read_lock(). And the cgroup_rstat_flush() function can sleep. 3578 val = memcg_page_state(memcg, NR_FILE_PAGES) + 3579 memcg_page_state(memcg, NR_ANON_MAPPED); 3580 if (swap) 3581 val += memcg_page_state(memcg, MEMCG_SWAP); 3582 } else { 3583 if (!swap) 3584 val = page_counter_read(&memcg->memory); 3585 else 3586 val = page_counter_read(&memcg->memsw); 3587 } 3588 return val; 3589 } __mem_cgroup_threshold() indeed holds the rcu lock. In addition, the thresholding code is invoked during stat changes, and those contexts have irqs disabled as well. If the lock breaking occurs inside the flush function, it will result in a sleep from an atomic context. Use the irqsafe flushing variant in mem_cgroup_usage() to fix this. Link: https://lkml.kernel.org/r/20210726150019.251820-1-hannes@cmpxchg.org Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-30ocfs2: issue zeroout to EOF blocksJunxiao Bi
For punch holes in EOF blocks, fallocate used buffer write to zero the EOF blocks in last cluster. But since ->writepage will ignore EOF pages, those zeros will not be flushed. This "looks" ok as commit 6bba4471f0cc ("ocfs2: fix data corruption by fallocate") will zero the EOF blocks when extend the file size, but it isn't. The problem happened on those EOF pages, before writeback, those pages had DIRTY flag set and all buffer_head in them also had DIRTY flag set, when writeback run by write_cache_pages(), DIRTY flag on the page was cleared, but DIRTY flag on the buffer_head not. When next write happened to those EOF pages, since buffer_head already had DIRTY flag set, it would not mark page DIRTY again. That made writeback ignore them forever. That will cause data corruption. Even directio write can't work because it will fail when trying to drop pages caches before direct io, as it found the buffer_head for those pages still had DIRTY flag set, then it will fall back to buffer io mode. To make a summary of the issue, as writeback ingores EOF pages, once any EOF page is generated, any write to it will only go to the page cache, it will never be flushed to disk even file size extends and that page is not EOF page any more. The fix is to avoid zero EOF blocks with buffer write. The following code snippet from qemu-img could trigger the corruption. 656 open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11 ... 660 fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...> 660 fallocate(11, 0, 2275868672, 327680) = 0 658 pwrite64(11, " Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-30ocfs2: fix zero out valid dataJunxiao Bi
If append-dio feature is enabled, direct-io write and fallocate could run in parallel to extend file size, fallocate used "orig_isize" to record i_size before taking "ip_alloc_sem", when ocfs2_zeroout_partial_cluster() zeroout EOF blocks, i_size maybe already extended by ocfs2_dio_end_io_write(), that will cause valid data zeroed out. Link: https://lkml.kernel.org/r/20210722054923.24389-1-junxiao.bi@oracle.com Fixes: 6bba4471f0cc ("ocfs2: fix data corruption by fallocate") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-30lib/test_string.c: move string selftest in the Runtime Testing menuMatteo Croce
STRING_SELFTEST is presented in the "Library routines" menu. Move it in Kernel hacking > Kernel Testing and Coverage > Runtime Testing together with other similar tests found in lib/ --- Runtime Testing <*> Test functions located in the hexdump module at runtime <*> Test string functions (NEW) <*> Test functions located in the string_helpers module at runtime <*> Test strscpy*() family of functions at runtime <*> Test kstrto*() family of functions at runtime <*> Test printf() family of functions at runtime <*> Test scanf() family of functions at runtime Link: https://lkml.kernel.org/r/20210719185158.190371-1-mcroce@linux.microsoft.com Signed-off-by: Matteo Croce <mcroce@microsoft.com> Cc: Peter Rosin <peda@axentia.se> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-30gve: Update MAINTAINERS listCatherine Sullivan
The team maintaining the gve driver has undergone some changes, this updates the MAINTAINERS file accordingly. Signed-off-by: Catherine Sullivan <csully@google.com> Signed-off-by: Jon Olson <jonolson@google.com> Signed-off-by: David Awogbemila <awogbemila@google.com> Signed-off-by: Jeroen de Borst <jeroendb@google.com> Link: https://lore.kernel.org/r/20210729155258.442650-1-csully@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-30arch: Kconfig: clean up obsolete use of HAVE_IDElibata-5.14-2021-07-30Lukas Bulwahn
The arch-specific Kconfig files use HAVE_IDE to indicate if IDE is supported. As IDE support and the HAVE_IDE config vanishes with commit b7fb14d3ac63 ("ide: remove the legacy ide driver"), there is no need to mention HAVE_IDE in all those arch-specific Kconfig files. The issue was identified with ./scripts/checkkconfigsymbols.py. Fixes: b7fb14d3ac63 ("ide: remove the legacy ide driver") Suggested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20210728182115.4401-1-lukas.bulwahn@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-30can: esd_usb2: fix memory leakPavel Skripkin
In esd_usb2_setup_rx_urbs() MAX_RX_URBS coherent buffers are allocated and there is nothing, that frees them: 1) In callback function the urb is resubmitted and that's all 2) In disconnect function urbs are simply killed, but URB_FREE_BUFFER is not set (see esd_usb2_setup_rx_urbs) and this flag cannot be used with coherent buffers. So, all allocated buffers should be freed with usb_free_coherent() explicitly. Side note: This code looks like a copy-paste of other can drivers. The same patch was applied to mcba_usb driver and it works nice with real hardware. There is no change in functionality, only clean-up code for coherent buffers. Fixes: 96d8e90382dc ("can: Add driver for esd CAN-USB/2 device") Link: https://lore.kernel.org/r/b31b096926dcb35998ad0271aac4b51770ca7cc8.1627404470.git.paskripkin@gmail.com Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-07-30can: ems_usb: fix memory leakPavel Skripkin
In ems_usb_start() MAX_RX_URBS coherent buffers are allocated and there is nothing, that frees them: 1) In callback function the urb is resubmitted and that's all 2) In disconnect function urbs are simply killed, but URB_FREE_BUFFER is not set (see ems_usb_start) and this flag cannot be used with coherent buffers. So, all allocated buffers should be freed with usb_free_coherent() explicitly. Side note: This code looks like a copy-paste of other can drivers. The same patch was applied to mcba_usb driver and it works nice with real hardware. There is no change in functionality, only clean-up code for coherent buffers. Fixes: 702171adeed3 ("ems_usb: Added support for EMS CPC-USB/ARM7 CAN/USB interface") Link: https://lore.kernel.org/r/59aa9fbc9a8cbf9af2bbd2f61a659c480b415800.1627404470.git.paskripkin@gmail.com Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>