path: root/kernel
Age / Commit message / Author
2023-11-19Merge tag 'timers_urgent_for_v6.7_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Borislav Petkov: - Do the push of pending hrtimers away from a CPU which is being offlined earlier in the offlining process in order to prevent a deadlock * tag 'timers_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimers: Push pending hrtimers away from outgoing CPU earlier
2023-11-19Merge tag 'sched_urgent_for_v6.7_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Fix virtual runtime calculation when recomputing a sched entity's weights - Fix wrongly rejected unprivileged poll requests to the cgroup psi pressure files - Make sure the load balancing is done by only one CPU * tag 'sched_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix the decision for load balance sched: psi: fix unprivileged polling against cgroups sched/eevdf: Fix vruntime adjustment on reweight
2023-11-19Merge tag 'locking_urgent_for_v6.7_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Borislav Petkov: - Fix a hardcoded futex flags case which led to one robust futex test failure * tag 'locking_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Fix hardcoded flags
2023-11-19Merge tag 'perf_urgent_for_v6.7_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Borislav Petkov: - Make sure the context refcount is transferred too when migrating perf events * tag 'perf_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix cpuctx refcounting
2023-11-18Merge tag 'parisc-for-6.7-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull parisc fixes from Helge Deller: "On parisc we still sometimes need writeable stacks, e.g. if programs aren't compiled with gcc-14. To avoid issues with the upcoming systemd-254 we therefore have to disable prctl(PR_SET_MDWE) for now (for parisc only). The other two patches are minor: a bugfix for the soft power-off on qemu with 64-bit kernel and prefer strscpy() over strlcpy(): - Fix power soft-off on qemu - Disable prctl(PR_SET_MDWE) since parisc sometimes still needs writeable stacks - Use strscpy instead of strlcpy in show_cpuinfo()" * tag 'parisc-for-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: prctl: Disable prctl(PR_SET_MDWE) on parisc parisc/power: Fix power soft-off when running on qemu parisc: Replace strlcpy() with strscpy()
2023-11-18prctl: Disable prctl(PR_SET_MDWE) on pariscHelge Deller
systemd-254 tries to use prctl(PR_SET_MDWE) for its MemoryDenyWriteExecute functionality, but fails on parisc which still needs executable stacks in certain combinations of gcc/glibc/kernel. Disable prctl(PR_SET_MDWE) by returning -EINVAL for now on parisc, until userspace has caught up. Signed-off-by: Helge Deller <deller@gmx.de> Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Sam James <sam@gentoo.org> Closes: https://github.com/systemd/systemd/issues/29775 Tested-by: Sam James <sam@gentoo.org> Link: https://lore.kernel.org/all/875y2jro9a.fsf@gentoo.org/ Cc: <stable@vger.kernel.org> # v6.3+
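From the userspace side, the behaviour described above means a caller of prctl(PR_SET_MDWE) should be prepared for -EINVAL. The following is a minimal sketch, not systemd's actual code: the helper names try_enable_mdwe() and harden_or_continue() are hypothetical, and the fallback #defines are only there for older userspace headers.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers (values as in linux/prctl.h). */
#ifndef PR_SET_MDWE
#define PR_SET_MDWE 65
#endif
#ifndef PR_MDWE_REFUSE_EXEC_GAIN
#define PR_MDWE_REFUSE_EXEC_GAIN 1
#endif

/*
 * Hypothetical helper: try to enable MDWE for the calling process.
 * Returns 0 on success or a negative errno value on failure.
 */
static int try_enable_mdwe(void)
{
	if (prctl(PR_SET_MDWE, PR_MDWE_REFUSE_EXEC_GAIN, 0L, 0L, 0L) == 0)
		return 0;
	return -errno;
}

/* Hypothetical caller: treat -EINVAL as "not available here" and carry on. */
static int harden_or_continue(void)
{
	int ret = try_enable_mdwe();

	if (ret == -EINVAL) {
		fprintf(stderr, "MDWE unavailable, continuing without it\n");
		return 0;
	}
	if (ret < 0)
		fprintf(stderr, "PR_SET_MDWE failed: %s\n", strerror(-ret));
	return ret;
}
```

Note that MDWE cannot be switched off again once it has been enabled for a process, so a sketch like this is best tried in a throwaway process.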
2023-11-17Merge tag 'audit-pr-20231116' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fix from Paul Moore: "One small audit patch to convert a WARN_ON_ONCE() into a normal conditional to avoid scary looking console warnings when eBPF code generates audit records from unexpected places" * tag 'audit-pr-20231116' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare()
2023-11-16Merge tag 'net-6.7-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from BPF and netfilter. Current release - regressions: - core: fix undefined behavior in netdev name allocation - bpf: do not allocate percpu memory at init stage - netfilter: nf_tables: split async and sync catchall in two functions - mptcp: fix possible NULL pointer dereference on close Current release - new code bugs: - eth: ice: dpll: fix initial lock status of dpll Previous releases - regressions: - bpf: fix precision backtracking instruction iteration - af_unix: fix use-after-free in unix_stream_read_actor() - tipc: fix kernel-infoleak due to uninitialized TLV value - eth: bonding: stop the device in bond_setup_by_slave() - eth: mlx5: - fix double free of encap_header - avoid referencing skb after free-ing in drop path - eth: hns3: fix VF reset - eth: mvneta: fix calls to page_pool_get_stats Previous releases - always broken: - core: set SOCK_RCU_FREE before inserting socket into hashtable - bpf: fix control-flow graph checking in privileged mode - eth: ppp: limit MRU to 64K - eth: stmmac: avoid rx queue overrun - eth: icssg-prueth: fix error cleanup on failing initialization - eth: hns3: fix out-of-bounds access may occur when coalesce info is read via debugfs - eth: cortina: handle large frames Misc: - selftests: gso: support CONFIG_MAX_SKB_FRAGS up to 45" * tag 'net-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (78 commits) macvlan: Don't propagate promisc change to lower dev in passthru net: sched: do not offload flows with a helper in act_ct net/mlx5e: Check return value of snprintf writing to fw_version buffer for representors net/mlx5e: Check return value of snprintf writing to fw_version buffer net/mlx5e: Reduce the size of icosq_str net/mlx5: Increase size of irq name buffer net/mlx5e: Update doorbell for port timestamping CQ before the software counter net/mlx5e: Track xmit submission to PTP WQ after populating 
metadata map net/mlx5e: Avoid referencing skb after free-ing in drop path of mlx5e_sq_xmit_wqe net/mlx5e: Don't modify the peer sent-to-vport rules for IPSec offload net/mlx5e: Fix pedit endianness net/mlx5e: fix double free of encap_header in update funcs net/mlx5e: fix double free of encap_header net/mlx5: Decouple PHC .adjtime and .adjphase implementations net/mlx5: DR, Allow old devices to use multi destination FTE net/mlx5: Free used cpus mask when an IRQ is released Revert "net/mlx5: DR, Supporting inline WQE when possible" bpf: Do not allocate percpu memory at init stage net: Fix undefined behavior in netdev name allocation dt-bindings: net: ethernet-controller: Fix formatting error ...
2023-11-15bpf: Do not allocate percpu memory at init stageYonghong Song
Kirill Shutemov reported significant percpu memory consumption increase after booting in 288-cpu VM ([1]) due to commit 41a5db8d8161 ("bpf: Add support for non-fix-size percpu mem allocation"). The percpu memory consumption is increased from 111MB to 969MB. The number is from /proc/meminfo. I tried to reproduce the issue with my local VM which at most supports up to 255 cpus. With 252 cpus, without the above commit, the percpu memory consumption immediately after boot is 57MB while with the above commit the percpu memory consumption is 231MB. This is not good since so far percpu memory from bpf memory allocator is not widely used yet. Let us change pre-allocation in init stage to on-demand allocation when verifier detects there is a need of percpu memory for bpf program. With this change, percpu memory consumption after boot can be reduced significantly. [1] https://lore.kernel.org/lkml/20231109154934.4saimljtqx625l3v@box.shutemov.name/ Fixes: 41a5db8d8161 ("bpf: Add support for non-fix-size percpu mem allocation") Reported-and-tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231111013928.948838-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15perf/core: Fix cpuctx refcountingPeter Zijlstra
Audit of the refcounting turned up that perf_pmu_migrate_context() fails to migrate the ctx refcount. Fixes: bd2756811766 ("perf: Rewrite core context handling") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20230612093539.085862001@infradead.org Cc: <stable@vger.kernel.org>
2023-11-15futex: Fix hardcoded flagsPeter Zijlstra
Xi reported that commit 5694289ce183 ("futex: Flag conversion") broke glibc's robust futex tests. This was narrowed down to the change of FLAGS_SHARED from 0x01 to 0x10, at which point Florian noted that handle_futex_death() has a hardcoded flags argument of 1. Change this to: FLAGS_SIZE_32 | FLAGS_SHARED, matching how futex_to_flags() unconditionally sets FLAGS_SIZE_32 for all legacy futex ops. Reported-by: Xi Ruoyao <xry111@xry111.site> Reported-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20231114201402.GA25315@noisy.programming.kicks-ass.net Fixes: 5694289ce183 ("futex: Flag conversion") Cc: <stable@vger.kernel.org>
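Why a hardcoded 1 stopped meaning "shared" can be shown with a few lines of C. This is a sketch only: FLAGS_SHARED and its old value 0x01 come from the commit message above, while the FLAGS_SIZE_32 and FLAGS_SIZE_MASK values and both helper functions are assumptions made for illustration.

```c
#include <stdbool.h>

/*
 * Illustrative flag layout. The old and new FLAGS_SHARED values are taken
 * from the commit message; the FLAGS_SIZE_* values are assumptions.
 */
#define OLD_FLAGS_SHARED 0x01
#define FLAGS_SIZE_32    0x02
#define FLAGS_SIZE_MASK  0x03
#define FLAGS_SHARED     0x10

/* True if a flags word marks the futex as shared under the new layout. */
static bool flags_shared(unsigned int flags)
{
	return (flags & FLAGS_SHARED) != 0;
}

/* True if a flags word encodes a 32-bit futex under the new layout. */
static bool flags_size_is_32(unsigned int flags)
{
	return (flags & FLAGS_SIZE_MASK) == FLAGS_SIZE_32;
}
```

Under these assumed values, the literal 1 that used to equal FLAGS_SHARED is neither shared nor 32-bit in the new layout, whereas FLAGS_SIZE_32 | FLAGS_SHARED is both.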
2023-11-14audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare()Paul Moore
eBPF can end up calling into the audit code from some odd places, and some of these places don't have @current set properly so we end up tripping the `WARN_ON_ONCE(!current->mm)` near the top of `audit_exe_compare()`. While the basic `!current->mm` check is good, the `WARN_ON_ONCE()` results in some scary console messages so let's drop that and just do the regular `!current->mm` check to avoid problems. Cc: <stable@vger.kernel.org> Fixes: 47846d51348d ("audit: don't take task_lock() in audit_exe_compare() code path") Reported-by: Artem Savkov <asavkov@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-11-14sched/fair: Fix the decision for load balanceKeisuke Nishimura
should_we_balance is called for the decision to do load-balancing. When sched ticks invoke this function, only one CPU should return true. However, in the current code, two CPUs can return true. The following situation, where b means busy and i means idle, is an example, because CPU 0 and CPU 2 return true. [0, 1] [2, 3] b b i b This fix checks if there exists an idle CPU with busy sibling(s) after looking for a CPU on an idle core. If some idle CPUs with busy siblings are found, just the first one should do load-balancing. Fixes: b1bfeab9b002 ("sched/fair: Consider the idle state of the whole core for load balance") Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20231031133821.1570861-1-keisuke.nishimura@inria.fr
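The [0, 1] [2, 3] example can be replayed with a small userspace model. This is a simplified sketch, not the kernel's should_we_balance(): SMT siblings are assumed to be the pairs (0,1) and (2,3), the fallback balancer is assumed to be CPU 0, and the "unfixed" branch is only my reading of the behaviour the commit message describes. All names are hypothetical.

```c
#include <stdbool.h>

#define NR_CPUS 4

/* The situation from the commit message: b b i b */
static const bool busy[NR_CPUS] = { true, true, false, true };

/* Assumed topology: CPUs 2n and 2n+1 are SMT siblings. */
static int sibling(int cpu)
{
	return cpu ^ 1;
}

static bool core_idle(int cpu)
{
	return !busy[cpu] && !busy[sibling(cpu)];
}

/* Returns true if dst_cpu decides that it should do the load balancing. */
static bool should_balance(int dst_cpu, bool fixed)
{
	int idle_smt = -1;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (busy[cpu])
			continue;
		if (!core_idle(cpu)) {
			/* Remember the first idle CPU with a busy sibling. */
			if (idle_smt == -1)
				idle_smt = cpu;
			continue;
		}
		/* First idle CPU on a fully idle core wins. */
		return cpu == dst_cpu;
	}

	if (fixed) {
		/* With the fix: if such a CPU exists, only it may balance. */
		if (idle_smt != -1)
			return idle_smt == dst_cpu;
	} else if (idle_smt == dst_cpu) {
		return true;
	}

	/* Otherwise fall back to the first CPU of the group. */
	return dst_cpu == 0;
}

/* Counts how many CPUs would claim the balancing role. */
static int count_balancers(bool fixed)
{
	int n = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		n += should_balance(cpu, fixed);
	return n;
}
```

In this model count_balancers(false) counts CPU 0 and CPU 2, while count_balancers(true) counts only CPU 2.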
2023-11-14sched: psi: fix unprivileged polling against cgroupsJohannes Weiner
519fabc7aaba ("psi: remove 500ms min window size limitation for triggers") breaks unprivileged psi polling on cgroups. Historically, we had a privilege check for polling in the open() of a pressure file in /proc, but were erroneously missing it for the open() of cgroup pressure files. When unprivileged polling was introduced in d82caa273565 ("sched/psi: Allow unprivileged polling of N*2s period"), it needed to filter privileges depending on the exact polling parameters, and as such moved the CAP_SYS_RESOURCE check from the proc open() callback to psi_trigger_create(). Both the proc files as well as cgroup files go through this during write(). This implicitly added the missing check for privileges required for HT polling for cgroups. When 519fabc7aaba ("psi: remove 500ms min window size limitation for triggers") followed right after to remove further restrictions on the RT polling window, it incorrectly assumed the cgroup privilege check was still missing and added it to the cgroup open(), mirroring what we used to do for proc files in the past. As a result, unprivileged poll requests that would be supported now get rejected when opening the cgroup pressure file for writing. Remove the cgroup open() check. psi_trigger_create() handles it. Fixes: 519fabc7aaba ("psi: remove 500ms min window size limitation for triggers") Reported-by: Luca Boccassi <bluca@debian.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Luca Boccassi <bluca@debian.org> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: stable@vger.kernel.org # 6.5+ Link: https://lore.kernel.org/r/20231026164114.2488682-1-hannes@cmpxchg.org
2023-11-14sched/eevdf: Fix vruntime adjustment on reweightAbel Wu
vruntime of the (on_rq && !0-lag) entity needs to be adjusted when it gets re-weighted, and the calculations can be simplified based on the fact that re-weight won't change the w-average of all the entities. Please check the proofs in comments. But adjusting vruntime can also cause position change in RB-tree hence require re-queue to fix up which might be costly. This might be avoided by deferring adjustment to the time the entity actually leaves tree (dequeue/pick), but that will negatively affect task selection and probably not good enough either. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.abel@bytedance.com
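As a rough illustration of the adjustment, the sketch below keeps the weighted lag w * (V - v) unchanged across a reweight, which gives v' = V - (V - v) * w / w'. This formula is my reading of the proofs the message refers to and should be treated as an assumption; the function name and plain 64-bit arithmetic are hypothetical simplifications, not the kernel implementation.

```c
#include <stdint.h>

/*
 * Sketch: V is the weighted average vruntime of the queue, v the entity's
 * vruntime, w its old weight and w' its new weight (assumed non-zero).
 * The adjusted vruntime is v' = V - (V - v) * w / w'.
 */
static int64_t reweight_vruntime(int64_t avg_vruntime, int64_t vruntime,
				 int64_t old_weight, int64_t new_weight)
{
	int64_t vlag = avg_vruntime - vruntime;	/* V - v */

	return avg_vruntime - vlag * old_weight / new_weight;
}
```

For example, with V = 100, v = 90, w = 1024 and w' = 2048 the sketch yields v' = 95, so the weighted lag stays at 1024 * 10 = 2048 * 5. An entity at the 0-lag point (v = V) is left where it is.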
2023-11-11hrtimers: Push pending hrtimers away from outgoing CPU earlierThomas Gleixner
2b8272ff4a70 ("cpu/hotplug: Prevent self deadlock on CPU hot-unplug") solved the straightforward CPU hotplug deadlock vs. the scheduler bandwidth timer. Yu discovered a more involved variant where a task which has a bandwidth timer started on the outgoing CPU holds a lock and then gets throttled. If the lock is required by one of the CPU hotplug callbacks the hotplug operation deadlocks because the unthrottling timer event is not handled on the dying CPU and can only be recovered once the control CPU reaches the hotplug state which pulls the pending hrtimers from the dead CPU. Solve this by pushing the hrtimers away from the dying CPU in the dying callbacks. Nothing can queue a hrtimer on the dying CPU at that point because all other CPUs spin in stop_machine() with interrupts disabled and once the operation is finished the CPU is marked offline. Reported-by: Yu Liao <liaoyu15@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Liu Tie <liutie4@huawei.com> Link: https://lore.kernel.org/r/87a5rphara.ffs@tglx
2023-11-10Merge tag 'probes-fixes-v6.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - Documentation update: Add a note that argument and return value fetching is best effort because it depends on the type. - objpool: Fix to make internal global variables static in test_objpool.c. - kprobes: Unify kprobes_exceptions_nofify() prototypes. Some architectures declare the same prototype in asm/kprobes.h, but others are missing it, which causes a warning. So move the prototype into linux/kprobes.h. - tracing: Fix to check the tracepoint event and return event at parsing stage. The tracepoint event doesn't support %return but if $retval exists, it will be converted to %return silently. This finds that case and rejects it. - tracing: Fix the order of the descriptions about the parameters of __kprobe_event_gen_cmd_start() to be consistent with the argument list of the function. * tag 'probes-fixes-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/kprobes: Fix the order of argument descriptions tracing: fprobe-event: Fix to check tracepoint event and return kprobes: unify kprobes_exceptions_nofify() prototypes lib: test_objpool: make global variables static Documentation: tracing: Add a note about argument and retval access
2023-11-11tracing/kprobes: Fix the order of argument descriptionsYujie Liu
The order of descriptions should be consistent with the argument list of the function, so "kretprobe" should be the second one. int __kprobe_event_gen_cmd_start(struct dynevent_cmd *cmd, bool kretprobe, const char *name, const char *loc, ...) Link: https://lore.kernel.org/all/20231031041305.3363712-1-yujie.liu@intel.com/ Fixes: 2a588dd1d5d6 ("tracing: Add kprobe event command generation functions") Suggested-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Yujie Liu <yujie.liu@intel.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-11-10Merge tag 'dma-mapping-6.7-2023-11-10' of ↵Linus Torvalds
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fixes from Christoph Hellwig: - don't leave pages decrypted for DMA in encrypted memory setups linger around on failure (Petr Tesarik) - fix an out of bounds access in the new dynamic swiotlb code (Petr Tesarik) - fix dma_addressing_limited for systems with weird physical memory layouts (Jia He) * tag 'dma-mapping-6.7-2023-11-10' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: fix out-of-bounds TLB allocations with CONFIG_SWIOTLB_DYNAMIC dma-mapping: fix dma_addressing_limited() if dma_range_map can't cover all system RAM dma-mapping: move dma_addressing_limited() out of line swiotlb: do not free decrypted pages if dynamic
2023-11-10tracing: fprobe-event: Fix to check tracepoint event and returnMasami Hiramatsu (Google)
Fix to check the tracepoint event is not valid with $retval. The commit 08c9306fc2e3 ("tracing/fprobe-event: Assume fprobe is a return event by $retval") introduced automatic return probe conversion with $retval. But since tracepoint event does not support return probe, $retval is not acceptable. Without this fix, ftracetest, tprobe_syntax_errors.tc fails; [22] Tracepoint probe event parser error log check [FAIL] ---- # tail 22-tprobe_syntax_errors.tc-log.mRKroL + ftrace_errlog_check trace_fprobe t kfree ^$retval dynamic_events + printf %s t kfree + wc -c + pos=8 + printf %s t kfree ^$retval + tr -d ^ + command=t kfree $retval + echo Test command: t kfree $retval Test command: t kfree $retval + echo ---- So 't kfree $retval' should fail (tracepoint doesn't support return probe) but passed it. Link: https://lore.kernel.org/all/169944555933.45057.12831706585287704173.stgit@devnote2/ Fixes: 08c9306fc2e3 ("tracing/fprobe-event: Assume fprobe is a return event by $retval") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-11-09bpf: fix control-flow graph checking in privileged modeAndrii Nakryiko
When BPF program is verified in privileged mode, BPF verifier allows bounded loops. This means that from CFG point of view there are definitely some back-edges. Original commit adjusted check_cfg() logic to not detect back-edges in control flow graph if they are resulting from conditional jumps, with the idea that subsequent full BPF verification process will determine whether such loops are bounded or not, and either accept or reject the BPF program. At least that's my reading of the intent. Unfortunately, the implementation of this idea doesn't work correctly in all possible situations. Conditional jump might not result in immediate back-edge, but just a few unconditional instructions later we can arrive at back-edge. In such situations check_cfg() would reject BPF program even in privileged mode, even though it might be a bounded loop. Next patch adds one simple program demonstrating such scenario. To keep things simple, instead of trying to detect back edges in privileged mode, just assume every back edge is valid and let subsequent BPF verification prove or reject bounded loops. Note a few test changes. For unknown reason, we have a few tests that are specified to detect a back-edge in a privileged mode, but looking at their code it seems like the right outcome is passing check_cfg() and letting subsequent verification make a decision about bounded or not bounded looping. Bounded recursion case is also interesting. The example should pass, as recursion is limited to just a few levels and so we never reach maximum number of nested frames and never exhaust maximum stack depth. But the way that max stack depth logic works today it falsely detects this as exceeding max nested frame count. This patch series doesn't attempt to fix this orthogonal problem, so we just adjust expected verifier failure. 
Suggested-by: Alexei Starovoitov <ast@kernel.org> Fixes: 2589726d12a1 ("bpf: introduce bounded loops") Reported-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231110061412.2995786-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
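The policy described above (reject back edges when unprivileged, accept them all when privileged and leave the bounded-loop proof to later verification) can be sketched with a toy depth-first search. This is not the kernel's check_cfg(): the struct, the recursive walk and all names are hypothetical, and the later "prove the loop is bounded" stage is left out entirely.

```c
#include <stdbool.h>

#define MAX_INSNS 8

enum { UNVISITED = 0, DISCOVERED, EXPLORED };

/* Hypothetical toy CFG: each instruction has up to two successors (-1 = none). */
struct toy_cfg {
	int succ[MAX_INSNS][2];
};

static bool visit(const struct toy_cfg *cfg, int *state, int insn,
		  bool privileged)
{
	state[insn] = DISCOVERED;

	for (int i = 0; i < 2; i++) {
		int next = cfg->succ[insn][i];

		if (next < 0)
			continue;
		if (state[next] == DISCOVERED) {
			/* Back edge: the target is still on the DFS stack. */
			if (!privileged)
				return false;
			continue;	/* privileged: assume it is valid */
		}
		if (state[next] == UNVISITED &&
		    !visit(cfg, state, next, privileged))
			return false;
	}

	state[insn] = EXPLORED;
	return true;
}

/* Returns true if the toy CFG is accepted, starting from instruction 0. */
static bool toy_check_cfg(const struct toy_cfg *cfg, bool privileged)
{
	int state[MAX_INSNS] = { UNVISITED };

	return visit(cfg, state, 0, privileged);
}
```

A graph such as 0 -> 1, 1 -> {2, 3}, 2 -> 0 models the scenario from the message: the conditional jump at 1 does not itself jump backwards, but the unconditional jump at 2 closes the loop. The sketch rejects it when unprivileged and accepts it when privileged.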
2023-11-09bpf: fix precision backtracking instruction iterationAndrii Nakryiko
Fix an edge case in __mark_chain_precision() which prematurely stops backtracking instructions in a state if it happens that state's first and last instruction indexes are the same. This situation doesn't necessarily mean that there were no instructions simulated in a state, but rather that we started from the instruction, jumped around a bit, and then ended up at the same instruction before checkpointing or marking precision. To distinguish between these two possible situations, we need to consult jump history. If it's empty or contains a single record "bridging" parent state and first instruction of processed state, then we indeed backtracked all instructions in this state. But if history is not empty, we are definitely not done yet. Move this logic inside get_prev_insn_idx() to contain it more nicely. Use -ENOENT return code to denote "we are out of instructions" situation. This bug was exposed by verifier_loop1.c's bounded_recursion subtest, once the next fix in this patch set is applied. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231110002638.4168352-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: handle ldimm64 properly in check_cfg()Andrii Nakryiko
ldimm64 instructions are 16-byte long, and so have to be handled appropriately in check_cfg(), just like the rest of BPF verifier does. This has implications in three places: - when determining next instruction for non-jump instructions; - when determining next instruction for callback address ldimm64 instructions (in visit_func_call_insn()); - when checking for unreachable instructions, where second half of ldimm64 is expected to be unreachable; We take this also as an opportunity to report jump into the middle of ldimm64. And adjust few test_verifier tests accordingly. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reported-by: Hao Sun <sunhao.th@gmail.com> Fixes: 475fb78fbf48 ("bpf: verifier (add branch/goto checks)") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231110002638.4168352-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
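The "next instruction" rule for ldimm64 can be sketched as follows. BPF instruction slots are 8 bytes each, so a 16-byte ldimm64 occupies two slots and the walker has to skip the second one. The struct and function names below are hypothetical stand-ins, not the verifier's own; 0x18 is the BPF_LD | BPF_IMM | BPF_DW opcode.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for one 8-byte BPF instruction slot. */
struct toy_insn {
	uint8_t code;	/* opcode */
	uint8_t regs;	/* dst_reg and src_reg nibbles */
	int16_t off;
	int32_t imm;
};

/* BPF_LD | BPF_IMM | BPF_DW: the opcode of the two-slot ldimm64. */
#define TOY_LDIMM64 0x18

/* Index of the instruction that follows slot i for non-jump instructions. */
static size_t next_insn(const struct toy_insn *insns, size_t i)
{
	return i + (insns[i].code == TOY_LDIMM64 ? 2 : 1);
}

/* Counts logical instructions in a program of nr_slots 8-byte slots. */
static size_t count_insns(const struct toy_insn *insns, size_t nr_slots)
{
	size_t n = 0;

	for (size_t i = 0; i < nr_slots; i = next_insn(insns, i))
		n++;
	return n;
}
```

A four-slot program consisting of a mov, an ldimm64 and an exit therefore counts as three instructions; the slot holding the second half of the ldimm64 is never visited, which matches the expectation that it is unreachable.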
2023-11-09Merge tag 'net-6.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter and bpf. Current release - regressions: - sched: fix SKB_NOT_DROPPED_YET splat under debug config Current release - new code bugs: - tcp: - fix usec timestamps with TCP fastopen - fix possible out-of-bounds reads in tcp_hash_fail() - fix SYN option room calculation for TCP-AO - tcp_sigpool: fix some off by one bugs - bpf: fix compilation error without CGROUPS - ptp: - ptp_read() should not release queue - fix tsevqs corruption Previous releases - regressions: - llc: verify mac len before reading mac header Previous releases - always broken: - bpf: - fix check_stack_write_fixed_off() to correctly spill imm - fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_END - check map->usercnt after timer->timer is assigned - dsa: lan9303: consequently nested-lock physical MDIO - dccp/tcp: call security_inet_conn_request() after setting IP addr - tg3: fix the TX ring stall due to incorrect full ring handling - phylink: initialize carrier state at creation - ice: fix direction of VF rules in switchdev mode Misc: - fill in a bunch of missing MODULE_DESCRIPTION()s, more to come" * tag 'net-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits) net: ti: icss-iep: fix setting counter value ptp: fix corrupted list in ptp_open ptp: ptp_read should not release queue net_sched: sch_fq: better validate TCA_FQ_WEIGHTS and TCA_FQ_PRIOMAP net: kcm: fill in MODULE_DESCRIPTION() net/sched: act_ct: Always fill offloading tuple iifidx netfilter: nat: fix ipv6 nat redirect with mapped and scoped addresses netfilter: xt_recent: fix (increase) ipv6 literal buffer length ipvs: add missing module descriptions netfilter: nf_tables: remove catchall element in GC sync path netfilter: add missing module descriptions drivers/net/ppp: use standard array-copy-function net: enetc: shorten enetc_setup_xdp_prog() error message to fit 
NETLINK_MAX_FMTMSG_LEN virtio/vsock: Fix uninit-value in virtio_transport_recv_pkt() r8169: respect userspace disabling IFF_MULTICAST selftests/bpf: get trusted cgrp from bpf_iter__cgroup directly bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_reg net: phylink: initialize carrier state at creation test/vsock: add dobule bind connect test test/vsock: refactor vsock_accept ...
2023-11-08Merge tag 'rcu-fixes-v6.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull RCU fixes from Frederic Weisbecker: - Fix a lock inversion between scheduler and RCU introduced in v6.2-rc4. The scenario could trigger on any user of RCU_NOCB (mostly Android but also nohz_full) - Fix PF_IDLE semantic changes introduced in v6.6-rc3 breaking some RCU-Tasks and RCU-Tasks-Trace expectations as to what exactly is an idle task. This resulted in potential spurious stalls and warnings. * tag 'rcu-fixes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: rcu/tasks-trace: Handle new PF_IDLE semantics rcu/tasks: Handle new PF_IDLE semantics rcu: Introduce rcu_cpu_online() rcu: Break rcu_node_0 --> &rq->__lock order
2023-11-08Merge tag 'kgdb-6.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux Pull kgdb updates from Daniel Thompson: "Just two patches for you this time! - During a panic, flush the console before entering kgdb. This makes things a little easier to comprehend, especially if an NMI backtrace was triggered on all CPUs just before we enter the panic routines - Correcting a couple of misleading (a.k.a. plain wrong) comments" * tag 'kgdb-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux: kdb: Corrects comment for kdballocenv kgdb: Flush console before entering kgdb on panic
2023-11-08swiotlb: fix out-of-bounds TLB allocations with CONFIG_SWIOTLB_DYNAMICPetr Tesarik
Limit the free list length to the size of the IO TLB. Transient pool can be smaller than IO_TLB_SEGSIZE, but the free list is initialized with the assumption that the total number of slots is a multiple of IO_TLB_SEGSIZE. As a result, swiotlb_area_find_slots() may allocate slots past the end of a transient IO TLB buffer. Reported-by: Niklas Schnelle <schnelle@linux.ibm.com> Closes: https://lore.kernel.org/linux-iommu/104a8c8fedffd1ff8a2890983e2ec1c26bff6810.camel@linux.ibm.com/ Fixes: 79636caad361 ("swiotlb: if swiotlb is full, fall back to a transient memory pool") Cc: stable@vger.kernel.org Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
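The effect of limiting the free list length can be sketched with plain arithmetic. This is an illustration under assumptions, not the swiotlb code: the segment size of 128 slots and both helper names are hypothetical, and "list length" here means the number of contiguous free slots recorded for slot i at initialization.

```c
/* Assumed segment size in slots; named TOY_ to mark it as illustrative. */
#define TOY_IO_TLB_SEGSIZE 128u

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Free-run length for slot i when the pool is assumed to be a whole
 * number of segments: everything up to the end of the segment is free.
 */
static unsigned int list_len_unbounded(unsigned int i)
{
	return TOY_IO_TLB_SEGSIZE - (i % TOY_IO_TLB_SEGSIZE);
}

/* Same, but never claiming slots past the end of an nslabs-sized pool. */
static unsigned int list_len_bounded(unsigned int i, unsigned int nslabs)
{
	return min_uint(list_len_unbounded(i), nslabs - i);
}
```

For a transient pool of 16 slots, the unbounded variant advertises 128 free slots starting at slot 0, which is how an allocation could run past the end of the buffer; the bounded variant advertises 16.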
2023-11-07bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_regChuyi Zhou
BTF_TYPE_SAFE_TRUSTED(struct bpf_iter__task) in verifier.c wanted to teach BPF verifier that bpf_iter__task -> task is a trusted ptr. But it doesn't work well. The reason is, bpf_iter__task -> task would go through btf_ctx_access() which enforces that the reg_type of 'task' is ctx_arg_info->reg_type, and in task_iter.c, we actually explicitly declare that the ctx_arg_info->reg_type is PTR_TO_BTF_ID_OR_NULL. Actually we have a previous case like this[1] where PTR_TRUSTED is added to the arg flag for map_iter. This patch sets ctx_arg_info->reg_type to PTR_TO_BTF_ID_OR_NULL | PTR_TRUSTED in task_reg_info. Similarly, bpf_cgroup_reg_info -> cgroup is also PTR_TRUSTED since we are under the protection of cgroup_mutex and we would check cgroup_is_dead() in __cgroup_iter_seq_show(). This patch is to improve the user experience of the newly introduced bpf_iter_css_task kfunc before hitting the mainline. The Fixes tag is pointing to the commit that introduced the bpf_iter_css_task kfunc. Link[1]:https://lore.kernel.org/all/20230706133932.45883-3-aspsk@isovalent.com/ Fixes: 9c66dc94b62a ("bpf: Introduce css_task open-coded iterator kfuncs") Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231107132204.912120-2-zhouchuyi@bytedance.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-06kdb: Corrects comment for kdballocenvYuran Pereira
This patch corrects the comment for the kdballocenv function. The previous comment incorrectly described the function's parameters and return values. Signed-off-by: Yuran Pereira <yuran.pereira@hotmail.com> Link: https://lore.kernel.org/r/DB3PR10MB6835B383B596133EDECEA98AE8ABA@DB3PR10MB6835.EURPRD10.PROD.OUTLOOK.COM [daniel.thompson@linaro.org: fixed whitespace alignment in new lines] Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2023-11-06dma-mapping: fix dma_addressing_limited() if dma_range_map can't cover all ↵Jia He
system RAM There is an unusual case that the range map covers right up to the top of system RAM, but leaves a hole somewhere lower down. Then it prevents the nvme device dma mapping in the checking path of phys_to_dma() and causes the hangs at boot. E.g. On an Armv8 Ampere server, the dsdt ACPI table is: Method (_DMA, 0, Serialized) // _DMA: Direct Memory Access { Name (RBUF, ResourceTemplate () { QWordMemory (ResourceConsumer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite, 0x0000000000000000, // Granularity 0x0000000000000000, // Range Minimum 0x00000000FFFFFFFF, // Range Maximum 0x0000000000000000, // Translation Offset 0x0000000100000000, // Length ,, , AddressRangeMemory, TypeStatic) QWordMemory (ResourceConsumer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite, 0x0000000000000000, // Granularity 0x0000006010200000, // Range Minimum 0x000000602FFFFFFF, // Range Maximum 0x0000000000000000, // Translation Offset 0x000000001FE00000, // Length ,, , AddressRangeMemory, TypeStatic) QWordMemory (ResourceConsumer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite, 0x0000000000000000, // Granularity 0x00000060F0000000, // Range Minimum 0x00000060FFFFFFFF, // Range Maximum 0x0000000000000000, // Translation Offset 0x0000000010000000, // Length ,, , AddressRangeMemory, TypeStatic) QWordMemory (ResourceConsumer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite, 0x0000000000000000, // Granularity 0x0000007000000000, // Range Minimum 0x000003FFFFFFFFFF, // Range Maximum 0x0000000000000000, // Translation Offset 0x0000039000000000, // Length ,, , AddressRangeMemory, TypeStatic) }) But the System RAM ranges are: cat /proc/iomem |grep -i ram 90000000-91ffffff : System RAM 92900000-fffbffff : System RAM 880000000-fffffffff : System RAM 8800000000-bff5990fff : System RAM bff59d0000-bff5a4ffff : System RAM bff8000000-bfffffffff : System RAM So some RAM ranges are out of dma_range_map. 
Fix it by checking whether each of the system RAM resources can be properly encompassed within the dma_range_map. Signed-off-by: Jia He <justin.he@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
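The coverage check described in this commit message can be illustrated with a small userspace model. This is a Python sketch, not the kernel implementation: the helper names `ram_covered` and `uncovered_ram` are hypothetical, and the windows and RAM ranges are copied from the _DMA table and the /proc/iomem output quoted above.

```python
# Userspace model of "is every system RAM range inside the DMA range map?".
# Not kernel code; helper names are hypothetical.
# Every range is an inclusive (start, end) pair of CPU physical addresses.

DMA_WINDOWS = [
    (0x0000000000000000, 0x00000000FFFFFFFF),
    (0x0000006010200000, 0x000000602FFFFFFF),
    (0x00000060F0000000, 0x00000060FFFFFFFF),
    (0x0000007000000000, 0x000003FFFFFFFFFF),
]

SYSTEM_RAM = [
    (0x90000000, 0x91ffffff),
    (0x92900000, 0xfffbffff),
    (0x880000000, 0xfffffffff),
    (0x8800000000, 0xbff5990fff),
    (0xbff59d0000, 0xbff5a4ffff),
    (0xbff8000000, 0xbfffffffff),
]


def ram_covered(start, end, windows):
    """Return True if [start, end] lies entirely inside the union of windows."""
    pos = start
    while pos <= end:
        # Find a window that contains the current position.
        hit = next(((s, e) for s, e in windows if s <= pos <= e), None)
        if hit is None:
            return False  # pos falls into a hole of the range map
        pos = hit[1] + 1  # continue right after that window
    return True


def uncovered_ram(ram, windows):
    """List the RAM ranges that the range map cannot fully address."""
    return [(s, e) for s, e in ram if not ram_covered(s, e, windows)]


for s, e in uncovered_ram(SYSTEM_RAM, DMA_WINDOWS):
    print(f"{s:x}-{e:x} : System RAM outside dma_range_map")
```

With the quoted data, the only range reported is 880000000-fffffffff, which sits in the hole between the first and second windows even though the map reaches past the top of RAM.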
2023-11-06dma-mapping: move dma_addressing_limited() out of lineJia He
This patch moves dma_addressing_limited() out of line, serving as a preliminary step to prevent the introduction of a new publicly accessible low-level helper when validating whether all system RAM is mapped within the DMA mapping range. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jia He <justin.he@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-11-03Merge tag 'tty-6.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty and serial updates from Greg KH: "Here is the big set of tty/serial driver changes for 6.7-rc1. Included in here are: - console/vgacon cleanups and removals from Arnd - tty core and n_tty cleanups from Jiri - lots of 8250 driver updates and cleanups - sc16is7xx serial driver updates - dt binding updates - first set of port lock wrappers from Thomas for the printk fixes coming in future releases - other small serial and tty core cleanups and updates All of these have been in linux-next for a while with no reported issues" * tag 'tty-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (193 commits) serdev: Replace custom code with device_match_acpi_handle() serdev: Simplify devm_serdev_device_open() function serdev: Make use of device_set_node() tty: n_gsm: add copyright Siemens Mobility GmbH tty: n_gsm: fix race condition in status line change on dead connections serial: core: Fix runtime PM handling for pending tx vgacon: fix mips/sibyte build regression dt-bindings: serial: drop unsupported samsung bindings tty: serial: samsung: drop earlycon support for unsupported platforms tty: 8250: Add note for PX-835 tty: 8250: Fix IS-200 PCI ID comment tty: 8250: Add Brainboxes Oxford Semiconductor-based quirks tty: 8250: Add support for Intashield IX cards tty: 8250: Add support for additional Brainboxes PX cards tty: 8250: Fix up PX-803/PX-857 tty: 8250: Fix port count of PX-257 tty: 8250: Add support for Intashield IS-100 tty: 8250: Add support for Brainboxes UP cards tty: 8250: Add support for additional Brainboxes UC cards tty: 8250: Remove UC-257 and UC-431 ...
2023-11-03Merge tag 'driver-core-6.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the set of driver core updates for 6.7-rc1. Nothing major in here at all, just a small number of changes including: - minor cleanups and updates from Andy Shevchenko - __counted_by addition - firmware_loader update for aborting loads cleaner - other minor changes, details in the shortlog - documentation update All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (21 commits) firmware_loader: Abort all upcoming firmware load request once reboot triggered firmware_loader: Refactor kill_pending_fw_fallback_reqs() Documentation: security-bugs.rst: linux-distros relaxed their rules driver core: Release all resources during unbind before updating device links driver core: class: remove boilerplate code driver core: platform: Annotate struct irq_affinity_devres with __counted_by resource: Constify resource crosscheck APIs resource: Unify next_resource() and next_resource_skip_children() resource: Reuse for_each_resource() macro PCI: Implement custom llseek for sysfs resource entries kernfs: sysfs: support custom llseek method for sysfs entries debugfs: Fix __rcu type comparison warning device property: Replace custom implementation of COUNT_ARGS() drivers: base: test: Make property entry API test modular driver core: Add missing parameter description to __fwnode_link_add() device property: Clarify usage scope of some struct fwnode_handle members devres: rename the first parameter of devm_add_action(_or_reset) driver core: platform: Unify the firmware node type check driver core: platform: Use temporary variable in platform_device_add() driver core: platform: Refactor error path in a couple places ...
2023-11-03Merge tag 'trace-v6.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: - Remove eventfs_file descriptor This is the biggest change, and the second part of making eventfs create its files dynamically. In 6.6 the first part was added, and that maintained a one to one mapping between eventfs meta descriptors and the directories and file inodes and dentries that were dynamically created. The directories were represented by an eventfs_inode and the files were represented by an eventfs_file. In v6.7 the eventfs_file is removed. As all events have the same directory make up (sched_switch has an "enable", "id", "format", etc files), the handling of what files are underneath each leaf eventfs directory is moved back to the tracing subsystem via a callback. When an event is added to the eventfs, it registers an array of eventfs_entry's. These hold the names of the files and the callbacks to call when the file is referenced. The callback gets the name so that the same callback may be used by multiple files. The callback then supplies the filesystem_operations structure needed to create this file. This has brought the memory footprint of creating multiple eventfs instances down by 2 megs each! - User events now has persistent events that are not associated to a single process. These are privileged events that hang around even if no process is attached to them - Clean up of seq_buf There's talk about using seq_buf more to replace strscpy() and friends. But this also requires some minor modifications of seq_buf to be able to do this - Expand instance ring buffers individually Currently if boot up creates an instance, and a trace event is enabled on that instance, the ring buffer for that instance and the top level ring buffer are expanded (1.4 MB per CPU). 
This wastes memory as this happens when nothing is using the top level instance - Other minor clean ups and fixes * tag 'trace-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (34 commits) seq_buf: Export seq_buf_puts() seq_buf: Export seq_buf_putc() eventfs: Use simple_recursive_removal() to clean up dentries eventfs: Remove special processing of dput() of events directory eventfs: Delete eventfs_inode when the last dentry is freed eventfs: Hold eventfs_mutex when calling callback functions eventfs: Save ownership and mode eventfs: Test for ei->is_freed when accessing ei->dentry eventfs: Have a free_ei() that just frees the eventfs_inode eventfs: Remove "is_freed" union with rcu head eventfs: Fix kerneldoc of eventfs_remove_rec() tracing: Have the user copy of synthetic event address use correct context eventfs: Remove extra dget() in eventfs_create_events_dir() tracing: Have trace_event_file have ref counters seq_buf: Introduce DECLARE_SEQ_BUF and seq_buf_str() eventfs: Fix typo in eventfs_inode union comment eventfs: Fix WARN_ON() in create_file_dentry() powerpc: Remove initialisation of readpos tracing/histograms: Simplify last_cmd_set() seq_buf: fix a misleading comment ...
2023-11-03Merge tag 'printk-for-6.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Another preparation step for introducing printk kthreads. The main piece is a per-console lock with several features: - Support three priorities: normal, emergency, and panic. They will be defined by a context where the lock is taken. A context with a higher priority is allowed to take over the lock from a context with a lower one. The plan is to use the emergency context for Oops and WARN() messages, and also by watchdogs. The panic() context will be used on panic CPU. - The owner might enter/exit regions where it is not safe to take over the lock. This allows taking over the lock in a safe way in the middle of a message. For example, serial drivers emit characters one by one. And the serial port is in a safe state in between. Only the final console_flush_in_panic() will be allowed to take over the lock even in the unsafe state (last chance, pray, and hope). - A higher priority context might busy wait with a timeout. The current owner is informed about the waiter and releases the lock on exit from the unsafe state. - The new lock is safe even in atomic contexts, including NMI. Another change is a safe manipulation of per-console sequence number counter under the new lock. - simple_strntoull() micro-optimization - Reduce pr_flush() polling time. - Calm down false warning about possible buffer invalid access to console buffers when CONFIG_PRINTK is disabled. [ .. 
and Thomas Gleixner wants to point out that while several of the commits are attributed to him, he only authored the early versions of said commits, and that John Ogness and Petr Mladek have been the ones who sorted out the details and really should be those who get the credit - Linus ] * tag 'printk-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: vsprintf: uninline simple_strntoull(), reorder arguments printk: printk: Remove unnecessary statements'len = 0;' printk: Reduce pr_flush() pooling time printk: fix illegal pbufs access for !CONFIG_PRINTK printk: nbcon: Allow drivers to mark unsafe regions and check state printk: nbcon: Add emit function and callback function for atomic printing printk: nbcon: Add sequence handling printk: nbcon: Add ownership state functions printk: nbcon: Add buffer management printk: Make static printk buffers available to nbcon printk: nbcon: Add acquire/release logic printk: Add non-BKL (nbcon) console basic infrastructure
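The takeover rules listed in this pull message (three priorities, unsafe regions, a waiter served on exit from the unsafe region, and a last-chance takeover for the panic flush) can be illustrated with a toy state machine. This is a heavily simplified Python model and an assumption-laden sketch, not the nbcon implementation: the class `ConsoleLock`, its methods, and the `allow_unsafe_takeover` flag are all hypothetical names, and real busy-waiting, timeouts and atomic state updates are not modeled.

```python
from enum import IntEnum


class Prio(IntEnum):
    NONE = 0
    NORMAL = 1
    EMERGENCY = 2
    PANIC = 3


class ConsoleLock:
    """Toy model of a per-console lock with priority takeover (hypothetical API)."""

    def __init__(self):
        self.owner_prio = Prio.NONE   # NONE means the lock is free
        self.unsafe = False           # owner is inside an unsafe region
        self.waiter_prio = Prio.NONE  # highest-priority context waiting

    def try_acquire(self, prio, allow_unsafe_takeover=False):
        if self.owner_prio == Prio.NONE:
            self.owner_prio = prio
            return True
        if prio <= self.owner_prio:
            return False  # only a higher priority may take over
        if self.unsafe and not allow_unsafe_takeover:
            # Not safe to take over now: register as waiter instead.
            self.waiter_prio = max(self.waiter_prio, prio)
            return False
        # Safe takeover, or the last-chance takeover in the unsafe state.
        self.owner_prio = prio
        self.waiter_prio = Prio.NONE
        return True

    def enter_unsafe(self):
        self.unsafe = True

    def exit_unsafe(self):
        """Leave the unsafe region; return False if the lock was handed over."""
        self.unsafe = False
        if self.waiter_prio > self.owner_prio:
            self.owner_prio = self.waiter_prio
            self.waiter_prio = Prio.NONE
            return False
        return True


lock = ConsoleLock()
lock.try_acquire(Prio.NORMAL)
lock.enter_unsafe()
print(lock.try_acquire(Prio.EMERGENCY))  # refused while the owner is unsafe
print(lock.exit_unsafe())                # owner learns it lost the lock
print(lock.owner_prio.name)
```

In this model the emergency context is refused while the normal owner is in an unsafe region and receives the lock when that region ends. Only a caller passing `allow_unsafe_takeover=True`, standing in for the final panic flush, may take the lock in the unsafe state.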
2023-11-03Merge tag 'livepatching-for-6.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching Pull livepatching update from Petr Mladek: - Add missing newline character to avoid waiting for a continuous message * tag 'livepatching-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching: livepatch: Fix missing newline character in klp_resolve_symbols()
2023-11-03swiotlb: do not free decrypted pages if dynamicPetr Tesarik
Fix these two error paths: 1. When set_memory_decrypted() fails, pages may be left fully or partially decrypted. 2. Decrypted pages may be freed if swiotlb_alloc_tlb() determines that the physical address is too high. To fix the first issue, call set_memory_encrypted() on the allocated region after a failed decryption attempt. If that also fails, leak the pages. To fix the second issue, check that the TLB physical address is below the requested limit before decrypting. Let the caller differentiate between unsuitable physical address (=> retry from a lower zone) and allocation failures (=> no point in retrying). Cc: stable@vger.kernel.org Fixes: 79636caad361 ("swiotlb: if swiotlb is full, fall back to a transient memory pool") Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-11-02Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "As usual, lots of singleton and doubleton patches all over the tree and there's little I can say which isn't in the individual changelogs. The lengthier patch series are - 'kdump: use generic functions to simplify crashkernel reservation in arch', from Baoquan He. This is mainly cleanups and consolidation of the 'crashkernel=' kernel parameter handling - After much discussion, David Laight's 'minmax: Relax type checks in min() and max()' is here. Hopefully reduces some typecasting and the use of min_t() and max_t() - A group of patches from Oleg Nesterov which clean up and slightly fix our handling of reads from /proc/PID/task/... and which remove task_struct.thread_group" * tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits) scripts/gdb/vmalloc: disable on no-MMU scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n .mailmap: add address mapping for Tomeu Vizoso mailmap: update email address for Claudiu Beznea tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions .mailmap: map Benjamin Poirier's address scripts/gdb: add lx_current support for riscv ocfs2: fix a spelling typo in comment proc: test ProtectionKey in proc-empty-vm test proc: fix proc-empty-vm test with vsyscall fs/proc/base.c: remove unneeded semicolon do_io_accounting: use sig->stats_lock do_io_accounting: use __for_each_thread() ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error() ocfs2: fix a typo in a comment scripts/show_delta: add __main__ judgement before main code treewide: mark stuff as __ro_after_init fs: ocfs2: check status values proc: test /proc/${pid}/statm compiler.h: move __is_constexpr() to compiler.h ...
2023-11-02Merge tag 'mm-stable-2023-11-01-14-33' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compaction maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park in the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature, to increase the feature's checking coverage: 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. 
Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - The series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmemmap optimization code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes it possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). 
The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In the series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and 
split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. 
The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
2023-11-02Merge tag 'v6.7-p1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Add virtual-address based lskcipher interface - Optimise ahash/shash performance in light of costly indirect calls - Remove ahash alignmask attribute Algorithms: - Improve AES/XTS performance of 6-way unrolling for ppc - Remove some uses of obsolete algorithms (md4, md5, sha1) - Add FIPS 202 SHA-3 support in pkcs1pad - Add fast path for single-page messages in adiantum - Remove zlib-deflate Drivers: - Add support for S4 in meson RNG driver - Add STM32MP13x support in stm32 - Add hwrng interface support in qcom-rng - Add support for deflate algorithm in hisilicon/zip" * tag 'v6.7-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (283 commits) crypto: adiantum - flush destination page before unmapping crypto: testmgr - move pkcs1pad(rsa,sha3-*) to correct place Documentation/module-signing.txt: bring up to date module: enable automatic module signing with FIPS 202 SHA-3 crypto: asymmetric_keys - allow FIPS 202 SHA-3 signatures crypto: rsa-pkcs1pad - Add FIPS 202 SHA-3 support crypto: FIPS 202 SHA-3 register in hash info for IMA x509: Add OIDs for FIPS 202 SHA-3 hash and signatures crypto: ahash - optimize performance when wrapping shash crypto: ahash - check for shash type instead of not ahash type crypto: hash - move "ahash wrapping shash" functions to ahash.c crypto: talitos - stop using crypto_ahash::init crypto: chelsio - stop using crypto_ahash::init crypto: ahash - improve file comment crypto: ahash - remove struct ahash_request_priv crypto: ahash - remove crypto_ahash_alignmask crypto: gcm - stop using alignmask of ahash crypto: chacha20poly1305 - stop using alignmask of ahash crypto: ccm - stop using alignmask of ahash net: ipv6: stop checking crypto_ahash_alignmask ...
2023-11-02Merge tag 'mfd-next-6.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd Pull MFD updates from Lee Jones: "Core Frameworks: - Allow all MFD Cell properties to be filled in dynamically at runtime - Skip disabled device nodes and continue to look for subsequent devices New Device Support: - Add support for Lunar Lake-M PCI to Intel LPSS PCI - Add support for Denverton to Intel ICH LPC New Functionality: - Add support for Clocks to Texas Instruments TWL* Core - Add support for Interrupts to STMicroelectronics STM32 Timers Fix-ups: - Convert to new devm-* (managed) power-off API - Remove superfluous code - Bunch of Device Tree additions, conversions and adaptions - Simplify obtaining resources (memory, device data) using unified API helpers - Trivial coding-style / spelling type clean-ups - Constify / staticify changes - Expand or edit on existing documentation - Convert some Regmap configurations to use the Maple Tree cache - Apply new __counted_by() annotation to several data structures containing flexible arrays - Replace strncpy() with strscpy() Bug Fixes: - Remove double put creating reference imbalances - Ensure headphone/lineout detection gets set when booting with ACPI" * tag 'mfd-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (73 commits) mfd: lpc_ich: Mark *_gpio_offsets data with const spmi: rename spmi device lookup helper spmi: document spmi_device_from_of() refcounting dt-bindings: mfd: armltd: Move Arm board syscon's to separate schema mfd: rk8xx: Add support for RK806 power off mfd: rk8xx: Add support for standard system-power-controller property dt-bindings: mfd: rk806: Allow system-power-controller property dt-bindings: mfd: rk8xx: Deprecate rockchip,system-power-controller dt-bindings: mfd: max8925: Convert to DT schema format mfd: Use i2c_get_match_data() in a selection of drivers mfd: Use device_get_match_data() in a bunch of drivers mfd: mc13xxx-spi/wm831x-spi: Use spi_get_device_match_data() mfd: motorola-cpcap: Drop unnecessary 
of_match_device() call mfd: arizona-spi: Set pdata.hpdet_channel for ACPI enumerated devs mfd: qcom-spmi-pmic: Switch to EXPORT_SYMBOL_GPL() mfd: qcom-spmi-pmic: Fix revid implementation mfd: qcom-spmi-pmic: Fix reference leaks in revid helper mfd: intel-m10-bmc: Change contact for ABI docs mfd: max8907: Convert to use maple tree register cache mfd: max77686: Convert to use maple tree register cache ...
2023-11-02module/decompress: use kvmalloc() consistentlyAndrea Righi
We consistently switched from kmalloc() to vmalloc() in module decompression to prevent potential memory allocation failures with large modules, however vmalloc() is not as memory-efficient and fast as kmalloc(). Since we don't know in general the size of the workspace required by the decompression algorithm, it is more reasonable to use kvmalloc() consistently, also considering that we don't have special memory requirements here. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Tested-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-11-02Merge branch 'rework/nbcon-base' into for-linusPetr Mladek
2023-11-02Merge branch 'rework/misc-cleanups' into for-linusPetr Mladek
2023-11-02Merge branch 'for-6.7' into for-linusPetr Mladek
2023-11-01Merge tag 'modules-6.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull modules updates from Luis Chamberlain: "The only thing worth highlighting is that gzip moves to use vmalloc() instead of kmalloc just as we had a fix for this for zstd on v6.6-rc1. The rest is regular housekeeping, keeping things neat, tidy, and boring" [ The kmalloc -> vmalloc conversion is not the right approach. Unless you know you need huge areas or know you need to use virtual mappings for some reason (playing with protection bits or whatever), you should use kvmalloc()/kvfree, which automatically picks the right allocation model - Linus ] * tag 'modules-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: module: Annotate struct module_notes_attrs with __counted_by module: Fix comment typo module: Make is_valid_name() return bool module: Make is_mapping_symbol() return bool module/decompress: use vmalloc() for gzip decompression workspace MAINTAINERS: add include/linux/module*.h to modules module: Clarify documentation of module_param_call()
2023-11-01Merge tag 'sysctl-6.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "To help make the move of sysctls out of kernel/sysctl.c not incur a size penalty sysctl has been changed to allow us to not require the sentinel, the final empty element on the sysctl array. Joel Granados has been doing all this work. On the v6.6 kernel we got the major infrastructure changes required to support this. For v6.7-rc1 we have all arch/ and drivers/ modified to remove the sentinel. Both arch and driver changes have been on linux-next for a bit less than a month. It is worth re-iterating the value: - this helps reduce the overall build time size of the kernel and run time memory consumed by the kernel by about 64 bytes per array - the extra 64-byte penalty is no longer incurred now when we move sysctls out from kernel/sysctl.c to their own files For v6.8-rc1 expect removal of all the sentinels and also then the unneeded check for procname == NULL. The last two patches are fixes recently merged by Krister Johansen which allow us again to use softlockup_panic early on boot. This used to work but the alias work broke it. This is useful for folks who want to detect softlockups super early rather than wait and spend money on cloud solutions with nothing but an eventual hung kernel. 
Although this hadn't gone through linux-next it's also a stable fix, so we might as well roll through the fixes now" * tag 'sysctl-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (23 commits) watchdog: move softlockup_panic back to early_param proc: sysctl: prevent aliased sysctls from getting passed to init intel drm: Remove now superfluous sentinel element from ctl_table array Drivers: hv: Remove now superfluous sentinel element from ctl_table array raid: Remove now superfluous sentinel element from ctl_table array fw loader: Remove the now superfluous sentinel element from ctl_table array sgi-xp: Remove the now superfluous sentinel element from ctl_table array vrf: Remove the now superfluous sentinel element from ctl_table array char-misc: Remove the now superfluous sentinel element from ctl_table array infiniband: Remove the now superfluous sentinel element from ctl_table array macintosh: Remove the now superfluous sentinel element from ctl_table array parport: Remove the now superfluous sentinel element from ctl_table array scsi: Remove now superfluous sentinel element from ctl_table array tty: Remove now superfluous sentinel element from ctl_table array xen: Remove now superfluous sentinel element from ctl_table array hpet: Remove now superfluous sentinel element from ctl_table array c-sky: Remove now superfluous sentinel element from ctl_talbe array powerpc: Remove now superfluous sentinel element from ctl_table arrays riscv: Remove now superfluous sentinel element from ctl_table array x86/vdso: Remove now superfluous sentinel element from ctl_table array ...
2023-11-01bpf: Fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_ENDShung-Hsi Yu
BPF_END and BPF_NEG have a different specification for the source bit in the opcode compared to other ALU/ALU64 instructions: it is either reserved or used to specify the byte swap endianness. In both cases the source bit does not encode the source operand location, and src_reg is a reserved field. backtrack_insn() currently does not differentiate BPF_END and BPF_NEG from other ALU/ALU64 instructions, which leads to r0 being incorrectly marked as precise when processing BPF_ALU | BPF_TO_BE | BPF_END instructions. This commit teaches backtrack_insn() to correctly mark precision for such cases. While precision tracking of BPF_NEG and other BPF_END instructions is correct and does not need fixing, this commit opts to process all BPF_NEG and BPF_END instructions within the same if-clause to better align with the current convention used in the verifier (e.g. check_alu_op). Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking") Cc: stable@vger.kernel.org Reported-by: Mohamed Mahmoud <mmahmoud@redhat.com> Closes: https://lore.kernel.org/r/87jzrrwptf.fsf@toke.dk Tested-by: Toke Høiland-Jørgensen <toke@redhat.com> Tested-by: Tao Lyu <tao.lyu@epfl.ch> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20231102053913.12004-2-shung-hsi.yu@suse.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
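The opcode layout behind this bug can be shown with a short decoding sketch. This is a Python illustration, not verifier code: the constants are the standard BPF opcode encodings, while the helper names `naive_reads_src_reg` and `reads_src_reg` are hypothetical and only mirror the distinction the commit describes.

```python
# Standard BPF opcode field encodings (low 3 bits: class, bit 3: source,
# high 4 bits: operation).
BPF_ALU, BPF_ALU64 = 0x04, 0x07
BPF_K, BPF_X = 0x00, 0x08          # source bit for ordinary ALU ops
BPF_TO_LE, BPF_TO_BE = 0x00, 0x08  # same bit, reused by BPF_END
BPF_ADD, BPF_NEG, BPF_MOV, BPF_END = 0x00, 0x80, 0xb0, 0xd0


def bpf_class(code):
    return code & 0x07


def bpf_src(code):
    return code & 0x08


def bpf_op(code):
    return code & 0xf0


def naive_reads_src_reg(code):
    """Treats bit 3 as 'operand comes from src_reg' for every ALU op."""
    return bpf_src(code) == BPF_X


def reads_src_reg(code):
    """Same question, but BPF_NEG and BPF_END never read src_reg."""
    if bpf_class(code) not in (BPF_ALU, BPF_ALU64):
        raise ValueError("not an ALU/ALU64 opcode")
    if bpf_op(code) in (BPF_NEG, BPF_END):
        return False  # bit 3 is reserved (NEG) or selects endianness (END)
    return bpf_src(code) == BPF_X


to_be = BPF_ALU | BPF_TO_BE | BPF_END
print(hex(to_be), naive_reads_src_reg(to_be), reads_src_reg(to_be))
```

Because BPF_TO_BE shares its bit value with BPF_X, the naive check treats a byte swap to big endian as reading a source register, and src_reg is a reserved (zero) field there, which is consistent with r0 being the register wrongly marked precise.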
2023-11-01bpf: Relax allowlist for css_task iterChuyi Zhou
The newly added open-coded css_task iter tries to take the global css_set_lock in bpf_iter_css_task_new, so the bpf side has to be careful about where it allows this iter to be used. The main concern is deadlocking on css_set_lock. check_css_task_iter_allowlist() in the verifier enforced that css_task can only be used in bpf_lsm hooks and sleepable bpf_iter. This patch relaxes the allowlist for css_task iter. Any lsm, any iter (even non-sleepable), and any sleepable program are safe, since they do not hold css_set_lock before entering BPF program context. This patch also fixes the misused BPF_TRACE_ITER in check_css_task_iter_allowlist, which compared bpf_prog_type with bpf_attach_type. Fixes: 9c66dc94b62ae ("bpf: Introduce css_task open-coded iterator kfuncs") Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231031050438.93297-2-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
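The relaxed rule (any lsm, any iter, any sleepable program) can be modeled in a few lines of userspace C. Everything in this sketch is an assumption made for illustration: the model_* enums, their numeric values, struct model_prog, and css_task_iter_allowed() are hypothetical stand-ins and do not reproduce the kernel's types or the actual check_css_task_iter_allowlist() body. The point is only that the program type and the attach type live in separate enums and must be compared against constants from their own enum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins; the values are NOT the kernel's. */
enum model_prog_type {
	MODEL_PROG_TYPE_KPROBE,
	MODEL_PROG_TYPE_TRACING,
	MODEL_PROG_TYPE_LSM,
};

enum model_attach_type {
	MODEL_TRACE_FENTRY,
	MODEL_TRACE_ITER,
	MODEL_LSM_MAC,
};

struct model_prog {
	enum model_prog_type type;
	enum model_attach_type expected_attach_type;
	bool sleepable;
};

/* Allow the iterator for any lsm program, any iter program (even a
 * non-sleepable one), and any sleepable program. */
static bool css_task_iter_allowed(const struct model_prog *prog)
{
	assert(prog != NULL);

	switch (prog->type) {
	case MODEL_PROG_TYPE_LSM:
		return true;
	case MODEL_PROG_TYPE_TRACING:
		/* Compare the attach type with an attach-type constant. */
		if (prog->expected_attach_type == MODEL_TRACE_ITER)
			return true;
		/* fall through */
	default:
		return prog->sleepable;
	}
}
```

In this model a non-sleepable fentry program is rejected, while a non-sleepable iter program is accepted, which matches the relaxation the commit message describes.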
2023-11-01bpf: Check map->usercnt after timer->timer is assignedHou Tao
When there are concurrent uref release and bpf timer init operations, the following sequence is possible. It breaks the guarantee provided by bpf_timer: the bpf_timer will still be alive after the userspace application releases or unpins the map. It will also lead to kmemleak on old kernel versions which don't release bpf_timer when the map is released.

bpf program X:

bpf_timer_init()
  lock timer->lock
    read timer->timer as NULL
    read map->usercnt != 0

                process Y:

                close(map_fd)
                  // put last uref
                  bpf_map_put_uref()
                    atomic_dec_and_test(map->usercnt)
                      array_map_free_timers()
                        bpf_timer_cancel_and_free()
                          // just return
                          read timer->timer is NULL

    t = bpf_map_kmalloc_node()
    timer->timer = t
  unlock timer->lock

Fix the problem by checking map->usercnt after timer->timer is assigned, so that when uref release and bpf timer init run concurrently, either bpf_timer_cancel_and_free() from the uref release reads a non-NULL timer or the newly-added atomic64_read() returns a zero usercnt. Because atomic_dec_and_test(map->usercnt) and READ_ONCE(timer->timer) in bpf_timer_cancel_and_free() are not protected by a lock, add a memory barrier to guarantee the order between map->usercnt and timer->timer. Also use WRITE_ONCE(timer->timer, x) to match the lockless read of timer->timer in bpf_timer_cancel_and_free(). Reported-by: Hsin-Wei Hung <hsinweih@uci.edu> Closes: https://lore.kernel.org/bpf/CABcoxUaT2k9hWsS1tNgXyoU3E-=PuOgMn737qK984fbFmfYixQ@mail.gmail.com Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231030063616.1653024-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
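The "publish first, then re-check the counter" ordering described in this commit message can be sketched in userspace with C11 atomics. This is a simplified model under stated assumptions, not the kernel code: struct model_map, struct model_timer, model_timer_init(), model_put_uref(), and the return codes are all hypothetical; timer->lock is omitted; atomic_thread_fence() stands in for the kernel memory barrier; and an atomic exchange is used so that exactly one side ever frees the allocation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct model_map   { atomic_long usercnt; };      /* hypothetical */
struct model_timer { _Atomic(void *) timer; };    /* hypothetical */

/* Returns 0 on success, -1 if the map has no user references left,
 * -2 if the slot is already initialised, -3 on allocation failure. */
static int model_timer_init(struct model_map *map, struct model_timer *slot)
{
	void *t;

	assert(map != NULL && slot != NULL);
	if (atomic_load_explicit(&slot->timer, memory_order_relaxed) != NULL)
		return -2;

	t = malloc(16);
	if (t == NULL)
		return -3;

	/* Publish the timer first ... */
	atomic_store_explicit(&slot->timer, t, memory_order_relaxed);
	/* ... then order the store before the usercnt load below. */
	atomic_thread_fence(memory_order_seq_cst);

	if (atomic_load_explicit(&map->usercnt, memory_order_relaxed) == 0) {
		/* The last uref is gone: undo. Whoever swaps the pointer
		 * out owns it, so a concurrent release cannot double free. */
		free(atomic_exchange(&slot->timer, NULL));
		return -1;
	}
	return 0;
}

/* Model of the release path: dropping the last uref frees the timer. */
static void model_put_uref(struct model_map *map, struct model_timer *slot)
{
	assert(map != NULL && slot != NULL);
	/* Sequentially consistent decrement, then look at the slot. */
	if (atomic_fetch_sub(&map->usercnt, 1) == 1)
		free(atomic_exchange(&slot->timer, NULL));
}
```

With this ordering, either the release path observes the published pointer and frees it, or the init path observes a zero usercnt and undoes its own allocation. The interleaving in which both sides miss each other, as in the sequence shown in the commit message, is ruled out in this model.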