summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2020-11-01Merge tag 'timers-urgent-2020-11-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "A few fixes for timers/timekeeping: - Prevent undefined behaviour in the timespec64_to_ns() conversion which is used for converting user supplied time input to nanoseconds. It lacked overflow protection. - Mark sched_clock_read_begin/retry() to prevent recursion in the tracer - Remove unused debug functions in the hrtimer and timerlist code" * tag 'timers-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time: Prevent undefined behaviour in timespec64_to_ns() timers: Remove unused inline funtion debug_timer_free() hrtimer: Remove unused inline function debug_hrtimer_free() time/sched_clock: Mark sched_clock_read_begin/retry() as notrace
2020-11-01Merge tag 'smp-urgent-2020-11-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull smp fix from Thomas Gleixner: "A single fix for stop machine. Mark functions no trace to prevent a crash caused by recursion when enabling or disabling a tracer on RISC-V (probably all architectures which patch through stop machine)" * tag 'smp-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: stop_machine, rcu: Mark functions as notrace
2020-11-01Merge tag 'locking-urgent-2020-11-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Thomas Gleixner: "A couple of locking fixes: - Fix incorrect failure injection handling in the fuxtex code - Prevent a preemption warning in lockdep when tracking local_irq_enable() and interrupts are already enabled - Remove more raw_cpu_read() usage from lockdep which causes state corruption on !X86 architectures. - Make the nr_unused_locks accounting in lockdep correct again" * tag 'locking-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: lockdep: Fix nr_unused_locks accounting locking/lockdep: Remove more raw_cpu_read() usage futex: Fix incorrect should_fail_futex() handling lockdep: Fix preemption WARN for spurious IRQ-enable
2020-10-31Merge tag 'flexible-array-conversions-5.10-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull more flexible-array member conversions from Gustavo A. R. Silva: "Replace zero-length arrays with flexible-array members" * tag 'flexible-array-conversions-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: printk: ringbuffer: Replace zero-length array with flexible-array member net/smc: Replace zero-length array with flexible-array member net/mlx5: Replace zero-length array with flexible-array member mei: hw: Replace zero-length array with flexible-array member gve: Replace zero-length array with flexible-array member Bluetooth: btintel: Replace zero-length array with flexible-array member scsi: target: tcmu: Replace zero-length array with flexible-array member ima: Replace zero-length array with flexible-array member enetc: Replace zero-length array with flexible-array member fs: Replace zero-length array with flexible-array member Bluetooth: Replace zero-length array with flexible-array member params: Replace zero-length array with flexible-array member tracepoint: Replace zero-length array with flexible-array member platform/chrome: cros_ec_proto: Replace zero-length array with flexible-array member platform/chrome: cros_ec_commands: Replace zero-length array with flexible-array member mailbox: zynqmp-ipi-message: Replace zero-length array with flexible-array member dmaengine: ti-cppi5: Replace zero-length array with flexible-array member
2020-10-30printk: ringbuffer: Replace zero-length array with flexible-array memberGustavo A. R. Silva
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.9/process/deprecated.html#zero-length-and-one-element-arrays Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-10-30Merge tag 'pm-5.10-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix a few issues related to running intel_pstate in the passive mode with HWP enabled, correct the handling of the max_cstate module parameter in intel_idle and make a few janitorial changes. Specifics: - Modify Kconfig to prevent configuring either the "conservative" or the "ondemand" governor as the default cpufreq governor if intel_pstate is selected, in which case "schedutil" is the default choice for the default governor setting (Rafael Wysocki). - Modify the cpufreq core, intel_pstate and the schedutil governor to avoid missing updates of the HWP max limit when intel_pstate operates in the passive mode with HWP enabled (Rafael Wysocki). - Fix max_cstate module parameter handling in intel_idle for processor models with C-state tables coming from ACPI (Chen Yu). - Clean up assorted pieces of power management code (Jackie Zamow, Tom Rix, Zhang Qilong)" * tag 'pm-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: schedutil: Always call driver if CPUFREQ_NEED_UPDATE_LIMITS is set cpufreq: Introduce cpufreq_driver_test_flags() cpufreq: speedstep: remove unneeded semicolon PM: sleep: fix typo in kernel/power/process.c intel_idle: Fix max_cstate for processor models without C-state tables cpufreq: intel_pstate: Avoid missing HWP max updates in passive mode cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS driver flag cpufreq: Avoid configuring old governors as default with intel_pstate cpufreq: e_powersaver: remove unreachable break
2020-10-30lockdep: Fix nr_unused_locks accountinglocking-urgent-2020-11-01Peter Zijlstra
Chris reported that commit 24d5a3bffef1 ("lockdep: Fix usage_traceoverflow") breaks the nr_unused_locks validation code triggered by /proc/lockdep_stats. By fully splitting LOCK_USED and LOCK_USED_READ it becomes a bad indicator for accounting nr_unused_locks; simplyfy by using any first bit. Fixes: 24d5a3bffef1 ("lockdep: Fix usage_traceoverflow") Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Chris Wilson <chris@chris-wilson.co.uk> Link: https://lkml.kernel.org/r/20201027124834.GL2628@hirez.programming.kicks-ass.net
2020-10-30locking/lockdep: Remove more raw_cpu_read() usagePeter Zijlstra
I initially thought raw_cpu_read() was OK, since if it is !0 we have IRQs disabled and can't get migrated, so if we get migrated both CPUs must have 0 and it doesn't matter which 0 we read. And while that is true; it isn't the whole store, on pretty much all architectures (except x86) this can result in computing the address for one CPU, getting migrated, the old CPU continuing execution with another task (possibly setting recursion) and then the new CPU reading the value of the old CPU, which is no longer 0. Similer to: baffd723e44d ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201026152256.GB2651@hirez.programming.kicks-ass.net
2020-10-30Merge branches 'pm-cpuidle' and 'pm-sleep'Rafael J. Wysocki
* pm-cpuidle: intel_idle: Fix max_cstate for processor models without C-state tables * pm-sleep: PM: sleep: fix typo in kernel/power/process.c
2020-10-29params: Replace zero-length array with flexible-array memberGustavo A. R. Silva
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-10-29tracepoint: Replace zero-length array with flexible-array memberGustavo A. R. Silva
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-10-29cpufreq: schedutil: Always call driver if CPUFREQ_NEED_UPDATE_LIMITS is setRafael J. Wysocki
Because sugov_update_next_freq() may skip a frequency update even if the need_freq_update flag has been set for the policy at hand, policy limits updates may not take effect as expected. For example, if the intel_pstate driver operates in the passive mode with HWP enabled, it needs to update the HWP min and max limits when the policy min and max limits change, respectively, but that may not happen if the target frequency does not change along with the limit at hand. In particular, if the policy min is changed first, causing the target frequency to be adjusted to it, and the policy max limit is changed later to the same value, the HWP max limit will not be updated to follow it as expected, because the target frequency is still equal to the policy min limit and it will not change until that limit is updated. To address this issue, modify get_next_freq() to let the driver callback run if the CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag is set regardless of whether or not the new frequency to set is equal to the previous one. Fixes: f6ebbcf08f37 ("cpufreq: intel_pstate: Implement passive mode with HWP enabled") Reported-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: 1c534352f47f cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS ... Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: a62f68f5ca53 cpufreq: Introduce cpufreq_driver_test_flags() Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-10-28Merge tag 'trace-v5.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Fix synthetic event "strcat" overrun New synthetic event code used strcat() and miscalculated the ending, causing the concatenation to write beyond the allocated memory. Instead of using strncat(), the code is switched over to seq_buf which has all the mechanisms in place to protect against writing more than what is allocated, and cleans up the code a bit" * tag 'trace-v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing, synthetic events: Replace buggy strcat() with seq_buf operations
2020-10-28futex: Fix incorrect should_fail_futex() handlingMateusz Nosek
If should_futex_fail() returns true in futex_wake_pi(), then the 'ret' variable is set to -EFAULT and then immediately overwritten. So the failure injection is non-functional. Fix it by actually leaving the function and returning -EFAULT. The Fixes tag is kinda blury because the initial commit which introduced failure injection was already sloppy, but the below mentioned commit broke it completely. [ tglx: Massaged changelog ] Fixes: 6b4f4bc9cb22 ("locking/futex: Allow low-level atomic operations to return -EAGAIN") Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200927000858.24219-1-mateusznosek0@gmail.com
2020-10-27PM: sleep: fix typo in kernel/power/process.cJackie Zamow
Fix a typo in a comment in freeze_processes(). Signed-off-by: Jackie Zamow <jackie.zamow@gmail.com> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-10-27tracing, synthetic events: Replace buggy strcat() with seq_buf operationsSteven Rostedt (VMware)
There was a memory corruption bug happening while running the synthetic event selftests: kmemleak: Cannot insert 0xffff8c196fa2afe5 into the object search tree (overlaps existing) CPU: 5 PID: 6866 Comm: ftracetest Tainted: G W 5.9.0-rc5-test+ #577 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 Call Trace: dump_stack+0x8d/0xc0 create_object.cold+0x3b/0x60 slab_post_alloc_hook+0x57/0x510 ? tracing_map_init+0x178/0x340 __kmalloc+0x1b1/0x390 tracing_map_init+0x178/0x340 event_hist_trigger_func+0x523/0xa40 trigger_process_regex+0xc5/0x110 event_trigger_write+0x71/0xd0 vfs_write+0xca/0x210 ksys_write+0x70/0xf0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fef0a63a487 Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007fff76f18398 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000039 RCX: 00007fef0a63a487 RDX: 0000000000000039 RSI: 000055eb3b26d690 RDI: 0000000000000001 RBP: 000055eb3b26d690 R08: 000000000000000a R09: 0000000000000038 R10: 000055eb3b2cdb80 R11: 0000000000000246 R12: 0000000000000039 R13: 00007fef0a70b500 R14: 0000000000000039 R15: 00007fef0a70b700 kmemleak: Kernel memory leak detector disabled kmemleak: Object 0xffff8c196fa2afe0 (size 8): kmemleak: comm "ftracetest", pid 6866, jiffies 4295082531 kmemleak: min_count = 1 kmemleak: count = 0 kmemleak: flags = 0x1 kmemleak: checksum = 0 kmemleak: backtrace: __kmalloc+0x1b1/0x390 tracing_map_init+0x1be/0x340 event_hist_trigger_func+0x523/0xa40 trigger_process_regex+0xc5/0x110 event_trigger_write+0x71/0xd0 vfs_write+0xca/0x210 ksys_write+0x70/0xf0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The cause came down to a use of strcat() that was adding an string that was shorten, but the strcat() did not take that into account. strcat() is extremely dangerous as it does not care how big the buffer is. Replace it with seq_buf operations that prevent the buffer from being overwritten if what is being written is bigger than the buffer. Fixes: 10819e25799a ("tracing: Handle synthetic event array field type checking correctly") Reviewed-by: Tom Zanussi <zanussi@kernel.org> Tested-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-26stop_machine, rcu: Mark functions as notracesmp-urgent-2020-11-01Zong Li
Some architectures assume that the stopped CPUs don't make function calls to traceable functions when they are in the stopped state. See also commit cb9d7fd51d9f ("watchdog: Mark watchdog touch functions as notrace"). Violating this assumption causes kernel crashes when switching tracer on RISC-V. Mark rcu_momentary_dyntick_idle() and stop_machine_yield() notrace to prevent this. Fixes: 4ecf0a43e729 ("processor: get rid of cpu_relax_yield") Fixes: 366237e7b083 ("stop_machine: Provide RCU quiescent state in multi_cpu_stop()") Signed-off-by: Zong Li <zong.li@sifive.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Atish Patra <atish.patra@wdc.com> Tested-by: Colin Ian King <colin.king@canonical.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20201021073839.43935-1-zong.li@sifive.com
2020-10-26time: Prevent undefined behaviour in timespec64_to_ns()timers-urgent-2020-11-01Zeng Tao
UBSAN reports: Undefined behaviour in ./include/linux/time64.h:127:27 signed integer overflow: 17179869187 * 1000000000 cannot be represented in type 'long long int' Call Trace: timespec64_to_ns include/linux/time64.h:127 [inline] set_cpu_itimer+0x65c/0x880 kernel/time/itimer.c:180 do_setitimer+0x8e/0x740 kernel/time/itimer.c:245 __x64_sys_setitimer+0x14c/0x2c0 kernel/time/itimer.c:336 do_syscall_64+0xa1/0x540 arch/x86/entry/common.c:295 Commit bd40a175769d ("y2038: itimer: change implementation to timespec64") replaced the original conversion which handled time clamping correctly with timespec64_to_ns() which has no overflow protection. Fix it in timespec64_to_ns() as this is not necessarily limited to the usage in itimers. [ tglx: Added comment and adjusted the fixes tag ] Fixes: 361a3bf00582 ("time64: Add time64.h header and define struct timespec64") Signed-off-by: Zeng Tao <prime.zeng@hisilicon.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/1598952616-6416-1-git-send-email-prime.zeng@hisilicon.com
2020-10-26timers: Remove unused inline funtion debug_timer_free()YueHaibing
There is no caller in tree, remove it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200909134749.32300-1-yuehaibing@huawei.com
2020-10-26hrtimer: Remove unused inline function debug_hrtimer_free()YueHaibing
There is no caller in tree, remove it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200909134850.21940-1-yuehaibing@huawei.com
2020-10-26time/sched_clock: Mark sched_clock_read_begin/retry() as notraceQuanyang Wang
Since sched_clock_read_begin() and sched_clock_read_retry() are called by notrace function sched_clock(), they shouldn't be traceable either, or else ftrace_graph_caller will run into a dead loop on the path as below (arm for instance): ftrace_graph_caller() prepare_ftrace_return() function_graph_enter() ftrace_push_return_trace() trace_clock_local() sched_clock() sched_clock_read_begin/retry() Fixes: 1b86abc1c645 ("sched_clock: Expose struct clock_read_data") Signed-off-by: Quanyang Wang <quanyang.wang@windriver.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200929082027.16787-1-quanyang.wang@windriver.com
2020-10-25treewide: Convert macro and uses of __section(foo) to __section("foo")Joe Perches
Use a more generic form for __section that requires quotes to avoid complications with clang and gcc differences. Remove the quote operator # from compiler_attributes.h __section macro. Convert all unquoted __section(foo) uses to quoted __section("foo"). Also convert __attribute__((section("foo"))) uses to __section("foo") even if the __attribute__ has multiple list entry forms. Conversion done using the script at: https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Nick Desaulniers <ndesaulniers@gooogle.com> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-25kernel/sys.c: fix prototype of prctl_get_tid_address()Rasmus Villemoes
tid_addr is not a "pointer to (pointer to int in userspace)"; it is in fact a "pointer to (pointer to int in userspace) in userspace". So sparse rightfully complains about passing a kernel pointer to put_user(). Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-25Merge tag 'timers-urgent-2020-10-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "A time namespace fix and a matching selftest. The futex absolute timeouts which are based on CLOCK_MONOTONIC require time namespace corrected. This was missed in the original time namesapce support" * tag 'timers-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: selftests/timens: Add a test for futex() futex: Adjust absolute futex timeouts with per time namespace offset
2020-10-25Merge tag 'sched-urgent-2020-10-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "Two scheduler fixes: - A trivial build fix for sched_feat() to compile correctly with CONFIG_JUMP_LABEL=n - Replace a zero lenght array with a flexible array" * tag 'sched-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/features: Fix !CONFIG_JUMP_LABEL case sched: Replace zero-length array with flexible-array
2020-10-25Merge tag 'safesetid-5.10' of git://github.com/micah-morton/linuxLinus Torvalds
Pull SafeSetID updates from Micah Morton: "The changes are mostly contained to within the SafeSetID LSM, with the exception of a few 1-line changes to change some ns_capable() calls to ns_capable_setid() -- causing a flag (CAP_OPT_INSETID) to be set that is examined by SafeSetID code and nothing else in the kernel. The changes to SafeSetID internally allow for setting up GID transition security policies, as already existed for UIDs" * tag 'safesetid-5.10' of git://github.com/micah-morton/linux: LSM: SafeSetID: Fix warnings reported by test bot LSM: SafeSetID: Add GID security policy handling LSM: Signal to SafeSetID when setting group IDs
2020-10-25Merge tag '20201024-v4-5.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/prandom Pull random32 updates from Willy Tarreau: "Make prandom_u32() less predictable. This is the cleanup of the latest series of prandom_u32 experimentations consisting in using SipHash instead of Tausworthe to produce the randoms used by the network stack. The changes to the files were kept minimal, and the controversial commit that used to take noise from the fast_pool (f227e3ec3b5c) was reverted. Instead, a dedicated "net_rand_noise" per_cpu variable is fed from various sources of activities (networking, scheduling) to perturb the SipHash state using fast, non-trivially predictable data, instead of keeping it fully deterministic. The goal is essentially to make any occasional memory leakage or brute-force attempt useless. The resulting code was verified to be very slightly faster on x86_64 than what is was with the controversial commit above, though this remains barely above measurement noise. It was also tested on i386 and arm, and build- tested only on arm64" Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/ * tag '20201024-v4-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/prandom: random32: add a selftest for the prandom32 code random32: add noise from network and scheduling activity random32: make prandom_u32() output unpredictable
2020-10-24Merge tag 'dma-mapping-5.10-1' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds
Pull dma-mapping fixes from Christoph Hellwig: - document the new dma_{alloc,free}_pages() API - two fixups for the dma-mapping.h split * tag 'dma-mapping-5.10-1' of git://git.infradead.org/users/hch/dma-mapping: dma-mapping: document dma_{alloc,free}_pages dma-mapping: move more functions to dma-map-ops.h ARM/sa1111: add a missing include of dma-map-ops.h
2020-10-24random32: add noise from network and scheduling activityWilly Tarreau
With the removal of the interrupt perturbations in previous random32 change (random32: make prandom_u32() output unpredictable), the PRNG has become 100% deterministic again. While SipHash is expected to be way more robust against brute force than the previous Tausworthe LFSR, there's still the risk that whoever has even one temporary access to the PRNG's internal state is able to predict all subsequent draws till the next reseed (roughly every minute). This may happen through a side channel attack or any data leak. This patch restores the spirit of commit f227e3ec3b5c ("random32: update the net random state on interrupt and activity") in that it will perturb the internal PRNG's statee using externally collected noise, except that it will not pick that noise from the random pool's bits nor upon interrupt, but will rather combine a few elements along the Tx path that are collectively hard to predict, such as dev, skb and txq pointers, packet length and jiffies values. These ones are combined using a single round of SipHash into a single long variable that is mixed with the net_rand_state upon each invocation. The operation was inlined because it produces very small and efficient code, typically 3 xor, 2 add and 2 rol. The performance was measured to be the same (even very slightly better) than before the switch to SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC (i40e), the connection rate dropped from 556k/s to 555k/s while the SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps. Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/ Cc: George Spelvin <lkml@sdf.org> Cc: Amit Klein <aksecurity@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: tytso@mit.edu Cc: Florian Westphal <fw@strlen.de> Cc: Marc Plumb <lkml.mplumb@gmail.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2020-10-24random32: make prandom_u32() output unpredictableGeorge Spelvin
Non-cryptographic PRNGs may have great statistical properties, but are usually trivially predictable to someone who knows the algorithm, given a small sample of their output. An LFSR like prandom_u32() is particularly simple, even if the sample is widely scattered bits. It turns out the network stack uses prandom_u32() for some things like random port numbers which it would prefer are *not* trivially predictable. Predictability led to a practical DNS spoofing attack. Oops. This patch replaces the LFSR with a homebrew cryptographic PRNG based on the SipHash round function, which is in turn seeded with 128 bits of strong random key. (The authors of SipHash have *not* been consulted about this abuse of their algorithm.) Speed is prioritized over security; attacks are rare, while performance is always wanted. Replacing all callers of prandom_u32() is the quick fix. Whether to reinstate a weaker PRNG for uses which can tolerate it is an open question. Commit f227e3ec3b5c ("random32: update the net random state on interrupt and activity") was an earlier attempt at a solution. This patch replaces it. Reported-by: Amit Klein <aksecurity@gmail.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Eric Dumazet <edumazet@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: tytso@mit.edu Cc: Florian Westphal <fw@strlen.de> Cc: Marc Plumb <lkml.mplumb@gmail.com> Fixes: f227e3ec3b5c ("random32: update the net random state on interrupt and activity") Signed-off-by: George Spelvin <lkml@sdf.org> Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/ [ willy: partial reversal of f227e3ec3b5c; moved SIPROUND definitions to prandom.h for later use; merged George's prandom_seed() proposal; inlined siprand_u32(); replaced the net_rand_state[] array with 4 members to fix a build issue; cosmetic cleanups to make checkpatch happy; fixed RANDOM32_SELFTEST build ] Signed-off-by: Willy Tarreau <w@1wt.eu>
2020-10-23Merge tag 'trace-v5.10-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing ring-buffer fix from Steven Rostedt: "The success return value of ring_buffer_resize() is stated to be zero and checked that way. But it was incorrectly returning the size allocated. Also, a fix to a comment" * tag 'trace-v5.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ring-buffer: Update the description for ring_buffer_wait ring-buffer: Return 0 on success from ring_buffer_resize()
2020-10-23Merge tag 'pm-5.10-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "First of all, the adaptive voltage scaling (AVS) drivers go to new platform-specific locations as planned (this part was reported to have merge conflicts against the new arm-soc updates in linux-next). In addition to that, there are some fixes (intel_idle, intel_pstate, RAPL, acpi_cpufreq), the addition of on/off notifiers and idle state accounting support to the generic power domains (genpd) code and some janitorial changes all over. Specifics: - Move the AVS drivers to new platform-specific locations and get rid of the drivers/power/avs directory (Ulf Hansson). - Add on/off notifiers and idle state accounting support to the generic power domains (genpd) framework (Ulf Hansson, Lina Iyer). - Ulf will maintain the PM domain part of cpuidle-psci (Ulf Hansson). - Make intel_idle disregard ACPI _CST if it cannot use the data returned by that method (Mel Gorman). - Modify intel_pstate to avoid leaving useless sysfs directory structure behind if it cannot be registered (Chen Yu). - Fix domain detection in the RAPL power capping driver and prevent it from failing to enumerate the Psys RAPL domain (Zhang Rui). - Allow acpi-cpufreq to use ACPI _PSD information with Family 19 and later AMD chips (Wei Huang). - Update the driver assumptions comment in intel_idle and fix a kerneldoc comment in the runtime PM framework (Alexander Monakov, Bean Huo). - Avoid unnecessary resets of the cached frequency in the schedutil cpufreq governor to reduce overhead (Wei Wang). - Clean up the cpufreq core a bit (Viresh Kumar). - Make assorted minor janitorial changes (Daniel Lezcano, Geert Uytterhoeven, Hubert Jasudowicz, Tom Rix). - Clean up and optimize the cpupower utility somewhat (Colin Ian King, Martin Kaistra)" * tag 'pm-5.10-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits) PM: sleep: remove unreachable break PM: AVS: Drop the avs directory and the corresponding Kconfig PM: AVS: qcom-cpr: Move the driver to the qcom specific drivers PM: runtime: Fix typo in pm_runtime_set_active() helper comment PM: domains: Fix build error for genpd notifiers powercap: Fix typo in Kconfig "Plance" -> "Plane" cpufreq: schedutil: restore cached freq when next_f is not changed acpi-cpufreq: Honor _PSD table setting on new AMD CPUs PM: AVS: smartreflex Move driver to soc specific drivers PM: AVS: rockchip-io: Move the driver to the rockchip specific drivers PM: domains: enable domain idle state accounting PM: domains: Add curly braces to delimit comment + statement block PM: domains: Add support for PM domain on/off notifiers for genpd powercap/intel_rapl: enumerate Psys RAPL domain together with package RAPL domain powercap/intel_rapl: Fix domain detection intel_idle: Ignore _CST if control cannot be taken from the platform cpuidle: Remove pointless stub intel_idle: mention assumption that WBINVD is not needed MAINTAINERS: Add section for cpuidle-psci PM domain cpufreq: intel_pstate: Delete intel_pstate sysfs if failed to register the driver ...
2020-10-23Merge tag 'net-5.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Cross-tree/merge window issues: - rtl8150: don't incorrectly assign random MAC addresses; fix late in the 5.9 cycle started depending on a return code from a function which changed with the 5.10 PR from the usb subsystem Current release regressions: - Revert "virtio-net: ethtool configurable RXCSUM", it was causing crashes at probe when control vq was not negotiated/available Previous release regressions: - ixgbe: fix probing of multi-port 10 Gigabit Intel NICs with an MDIO bus, only first device would be probed correctly - nexthop: Fix performance regression in nexthop deletion by effectively switching from recently added synchronize_rcu() to synchronize_rcu_expedited() - netsec: ignore 'phy-mode' device property on ACPI systems; the property is not populated correctly by the firmware, but firmware configures the PHY so just keep boot settings Previous releases - always broken: - tcp: fix to update snd_wl1 in bulk receiver fast path, addressing bulk transfers getting "stuck" - icmp: randomize the global rate limiter to prevent attackers from getting useful signal - r8169: fix operation under forced interrupt threading, make the driver always use hard irqs, even on RT, given the handler is light and only wants to schedule napi (and do so through a _irqoff() variant, preferably) - bpf: Enforce pointer id generation for all may-be-null register type to avoid pointers erroneously getting marked as null-checked - tipc: re-configure queue limit for broadcast link - net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels - fix various issues in chelsio inline tls driver Misc: - bpf: improve just-added bpf_redirect_neigh() helper api to support supplying nexthop by the caller - in case BPF program has already done a lookup we can avoid doing another one - remove unnecessary break statements - make MCTCP not select IPV6, but rather depend on it" * tag 'net-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits) tcp: fix to update snd_wl1 in bulk receiver fast path net: Properly typecast int values to set sk_max_pacing_rate netfilter: nf_fwd_netdev: clear timestamp in forwarding path ibmvnic: save changed mac address to adapter->mac_addr selftests: mptcp: depends on built-in IPv6 Revert "virtio-net: ethtool configurable RXCSUM" rtnetlink: fix data overflow in rtnl_calcit() net: ethernet: mtk-star-emac: select REGMAP_MMIO net: hdlc_raw_eth: Clear the IFF_TX_SKB_SHARING flag after calling ether_setup net: hdlc: In hdlc_rcv, check to make sure dev is an HDLC device bpf, libbpf: Guard bpf inline asm from bpf_tail_call_static bpf, selftests: Extend test_tc_redirect to use modified bpf_redirect_neigh() bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop mptcp: depends on IPV6 but not as a module sfc: move initialisation of efx->filter_sem to efx_init_struct() mpls: load mpls_gso after mpls_iptunnel net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels net/sched: act_gate: Unlock ->tcfa_lock in tc_setup_flow_action() net: dsa: bcm_sf2: make const array static, makes object smaller mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it ...
2020-10-23Merge tag 'arch-cleanup-2020-10-22' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull arch task_work cleanups from Jens Axboe: "Two cleanups that don't fit other categories: - Finally get the task_work_add() cleanup done properly, so we don't have random 0/1/false/true/TWA_SIGNAL confusing use cases. Updates all callers, and also fixes up the documentation for task_work_add(). - While working on some TIF related changes for 5.11, this TIF_NOTIFY_RESUME cleanup fell out of that. Remove some arch duplication for how that is handled" * tag 'arch-cleanup-2020-10-22' of git://git.kernel.dk/linux-block: task_work: cleanup notification modes tracehook: clear TIF_NOTIFY_RESUME in tracehook_notify_resume()
2020-10-22Merge tag 'kbuild-v5.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Support 'make compile_commands.json' to generate the compilation database more easily, avoiding stale entries - Support 'make clang-analyzer' and 'make clang-tidy' for static checks using clang-tidy - Preprocess scripts/modules.lds.S to allow CONFIG options in the module linker script - Drop cc-option tests from compiler flags supported by our minimal GCC/Clang versions - Use always 12-digits commit hash for CONFIG_LOCALVERSION_AUTO=y - Use sha1 build id for both BFD linker and LLD - Improve deb-pkg for reproducible builds and rootless builds - Remove stale, useless scripts/namespace.pl - Turn -Wreturn-type warning into error - Fix build error of deb-pkg when CONFIG_MODULES=n - Replace 'hostname' command with more portable 'uname -n' - Various Makefile cleanups * tag 'kbuild-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (34 commits) kbuild: Use uname for LINUX_COMPILE_HOST detection kbuild: Only add -fno-var-tracking-assignments for old GCC versions kbuild: remove leftover comment for filechk utility treewide: remove DISABLE_LTO kbuild: deb-pkg: clean up package name variables kbuild: deb-pkg: do not build linux-headers package if CONFIG_MODULES=n kbuild: enforce -Werror=return-type scripts: remove namespace.pl builddeb: Add support for all required debian/rules targets builddeb: Enable rootless builds builddeb: Pass -n to gzip for reproducible packages kbuild: split the build log of kallsyms kbuild: explicitly specify the build id style scripts/setlocalversion: make git describe output more reliable kbuild: remove cc-option test of -Werror=date-time kbuild: remove cc-option test of -fno-stack-check kbuild: remove cc-option test of -fno-strict-overflow kbuild: move CFLAGS_{KASAN,UBSAN,KCSAN} exports to relevant Makefiles kbuild: remove redundant CONFIG_KASAN check from scripts/Makefile.kasan kbuild: do not create built-in objects for external module builds ...
2020-10-22Merge tag 'modules-for-v5.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull modules updates from Jessica Yu: "Code cleanups: more informative error messages and statically initialize init_free_wq to avoid a workqueue warning" * tag 'modules-for-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: module: statically initialize init section freeing data module: Add more error message for failed kernel module loading
2020-10-22Merge branch 'work.set_fs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull initial set_fs() removal from Al Viro: "Christoph's set_fs base series + fixups" * 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: Allow a NULL pos pointer to __kernel_read fs: Allow a NULL pos pointer to __kernel_write powerpc: remove address space overrides using set_fs() powerpc: use non-set_fs based maccess routines x86: remove address space overrides using set_fs() x86: make TASK_SIZE_MAX usable from assembly code x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h lkdtm: remove set_fs-based tests test_bitmap: remove user bitmap tests uaccess: add infrastructure for kernel builds with set_fs() fs: don't allow splice read/write without explicit ops fs: don't allow kernel reads and writes without iter ops sysctl: Convert to iter interfaces proc: add a read_iter method to proc proc_ops proc: cleanup the compat vs no compat file ops proc: remove a level of indentation in proc_get_inode
2020-10-22ring-buffer: Update the description for ring_buffer_waitQiujun Huang
The function changed at some point, but the description was not updated. Link: https://lkml.kernel.org/r/20201017095246.5170-1-hqjagain@gmail.com Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-22ring-buffer: Return 0 on success from ring_buffer_resize()Qiujun Huang
We don't need to check the new buffer size, and the return value had confused resize_buffer_duplicate_size(). ... ret = ring_buffer_resize(trace_buf->buffer, per_cpu_ptr(size_buf->data,cpu_id)->entries, cpu_id); if (ret == 0) per_cpu_ptr(trace_buf->data, cpu_id)->entries = per_cpu_ptr(size_buf->data, cpu_id)->entries; ... Link: https://lkml.kernel.org/r/20201019142242.11560-1-hqjagain@gmail.com Cc: stable@vger.kernel.org Fixes: d60da506cbeb3 ("tracing: Add a resize function to make one buffer equivalent to another buffer") Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-22lockdep: Fix preemption WARN for spurious IRQ-enablePeter Zijlstra
It is valid (albeit uncommon) to call local_irq_enable() without first having called local_irq_disable(). In this case we enter lockdep_hardirqs_on*() with IRQs enabled and trip a preemption warning for using __this_cpu_read(). Use this_cpu_read() instead to avoid the warning. Fixes: 4d004099a6 ("lockdep: Fix lockdep recursion") Reported-by: syzbot+53f8ce8bbc07924b6417@syzkaller.appspotmail.com Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-10-21treewide: remove DISABLE_LTOSami Tolvanen
This change removes all instances of DISABLE_LTO from Makefiles, as they are currently unused, and the preferred method of disabling LTO is to filter out the flags instead. Note added by Masahiro Yamada: DISABLE_LTO was added as preparation for GCC LTO, but GCC LTO was not pulled into the mainline. (https://lkml.org/lkml/2014/4/8/272) Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-10-20futex: Adjust absolute futex timeouts with per time namespace offsetAndrei Vagin
For all commands except FUTEX_WAIT, the timeout is interpreted as an absolute value. This absolute value is inside the task's time namespace and has to be converted to the host's time. Fixes: 5a590f35add9 ("posix-clocks: Wire up clock_gettime() with timens offsets") Reported-by: Hans van der Laan <j.h.vanderlaan@student.utwente.nl> Signed-off-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20201015160020.293748-1-avagin@gmail.com
2020-10-20dma-mapping: move more functions to dma-map-ops.hChristoph Hellwig
Due to a mismerge a bunch of prototypes that should have moved to dma-map-ops.h are still in dma-mapping.h, fix that up. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-19bpf: Enforce id generation for all may-be-null register typeMartin KaFai Lau
The commit af7ec1383361 ("bpf: Add bpf_skc_to_tcp6_sock() helper") introduces RET_PTR_TO_BTF_ID_OR_NULL and the commit eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()") introduces RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL. Note that for RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL, the reg0->type could become PTR_TO_MEM_OR_NULL which is not covered by BPF_PROBE_MEM. The BPF_REG_0 will then hold a _OR_NULL pointer type. This _OR_NULL pointer type requires the bpf program to explicitly do a NULL check first. After NULL check, the verifier will mark all registers having the same reg->id as safe to use. However, the reg->id is not set for those new _OR_NULL return types. One of the ways that may be wrong is, checking NULL for one btf_id typed pointer will end up validating all other btf_id typed pointers because all of them have id == 0. The later tests will exercise this path. To fix it and also avoid similar issue in the future, this patch moves the id generation logic out of each individual RET type test in check_helper_call(). Instead, it does one reg_type_may_be_null() test and then do the id generation if needed. This patch also adds a WARN_ON_ONCE in mark_ptr_or_null_reg() to catch future breakage. The _OR_NULL pointer usage in the bpf_iter_reg.ctx_arg_info is fine because it just happens that the existing id generation after check_ctx_access() has covered it. It is also using the reg_type_may_be_null() to decide if id generation is needed or not. Fixes: af7ec1383361 ("bpf: Add bpf_skc_to_tcp6_sock() helper") Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201019194212.1050855-1-kafai@fb.com
2020-10-19bpf: Remove unneeded breakTom Rix
A break is not needed if it is preceded by a return. Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201019173846.1021-1-trix@redhat.com
2020-10-19cpufreq: schedutil: restore cached freq when next_f is not changedWei Wang
We have the raw cached freq to reduce the chance in calling cpufreq driver where it could be costly in some arch/SoC. Currently, the raw cached freq is reset in sugov_update_single() when it avoids frequency reduction (which is not desirable sometimes), but it is better to restore the previous value of it in that case, because it may not change in the next cycle and it is not necessary to change the CPU frequency then. Adapted from https://android-review.googlesource.com/1352810/ Signed-off-by: Wei Wang <wvw@google.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> [ rjw: Subject edit and changelog rewrite ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-10-18Merge tag 'core-rcu-2020-10-12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU changes from Ingo Molnar: - Debugging for smp_call_function() - RT raw/non-raw lock ordering fixes - Strict grace periods for KASAN - New smp_call_function() torture test - Torture-test updates - Documentation updates - Miscellaneous fixes [ This doesn't actually pull the tag - I've dropped the last merge from the RCU branch due to questions about the series. - Linus ] * tag 'core-rcu-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits) smp: Make symbol 'csd_bug_count' static kernel/smp: Provide CSD lock timeout diagnostics smp: Add source and destination CPUs to __call_single_data rcu: Shrink each possible cpu krcp rcu/segcblist: Prevent useless GP start if no CBs to accelerate torture: Add gdb support rcutorture: Allow pointer leaks to test diagnostic code rcutorture: Hoist OOM registry up one level refperf: Avoid null pointer dereference when buf fails to allocate rcutorture: Properly synchronize with OOM notifier rcutorture: Properly set rcu_fwds for OOM handling torture: Add kvm.sh --help and update help message rcutorture: Add CONFIG_PROVE_RCU_LIST to TREE05 torture: Update initrd documentation rcutorture: Replace HTTP links with HTTPS ones locktorture: Make function torture_percpu_rwsem_init() static torture: document --allcpus argument added to the kvm.sh script rcutorture: Output number of elapsed grace periods rcutorture: Remove KCSAN stubs rcu: Remove unused "cpu" parameter from rcu_report_qs_rdp() ...
2020-10-18mm/madvise: introduce process_madvise() syscall: an external memory hinting APIMinchan Kim
There is usecase that System Management Software(SMS) want to give a memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the case of Android, it is the ActivityManagerService. The information required to make the reclaim decision is not known to the app. Instead, it is known to the centralized userspace daemon(ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the issue, this patch introduces a new syscall process_madvise(2). It uses pidfd of an external process to give the hint. It also supports vector address range because Android app has thousands of vmas due to zygote so it's totally waste of CPU and power if we should call the syscall one by one for each vma.(With testing 2000-vma syscall vs 1-vector syscall, it showed 15% performance improvement. I think it would be bigger in real practice because the testing ran very cache friendly environment). Another potential use case for the vector range is to amortize the cost ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could benefit users like TCP receive zerocopy and malloc implementations. In future, we could find more usecases for other advises so let's make it happens as API since we introduce a new syscall at this moment. With that, existing madvise(2) user could replace it with process_madvise(2) with their own pid if they want to have batch address ranges support feature. ince it could affect other process's address range, only privileged process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same UID) gives it the right to ptrace the process could use it successfully. The flag argument is reserved for future use if we need to extend the API. I think supporting all hints madvise has/will supported/support to process_madvise is rather risky. Because we are not sure all hints make sense from external process and implementation for the hint may rely on the caller being in the current context so it could be error-prone. Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. If someone want to add other hints, we could hear the usecase and review it for each hint. It's safer for maintenance rather than introducing a buggy syscall but hard to fix it later. So finally, the API is as follows, ssize_t process_madvise(int pidfd, const struct iovec *iovec, unsigned long vlen, int advice, unsigned int flags); DESCRIPTION The process_madvise() system call is used to give advice or directions to the kernel about the address ranges from external process as well as local process. It provides the advice to address ranges of process described by iovec and vlen. The goal of such advice is to improve system or application performance. The pidfd selects the process referred to by the PID file descriptor specified in pidfd. (See pidofd_open(2) for further information) The pointer iovec points to an array of iovec structures, defined in <sys/uio.h> as: struct iovec { void *iov_base; /* starting address */ size_t iov_len; /* number of bytes to be advised */ }; The iovec describes address ranges beginning at address(iov_base) and with size length of bytes(iov_len). The vlen represents the number of elements in iovec. The advice is indicated in the advice argument, which is one of the following at this moment if the target process specified by pidfd is external. MADV_COLD MADV_PAGEOUT Permission to provide a hint to external process is governed by a ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2). The process_madvise supports every advice madvise(2) has if target process is in same thread group with calling process so user could use process_madvise(2) to extend existing madvise(2) to support vector address ranges. RETURN VALUE On success, process_madvise() returns the number of bytes advised. This return value may be less than the total number of requested bytes, if an error occurred. The caller should check return value to determine whether a partial advice occurred. FAQ: Q.1 - Why does any external entity have better knowledge? Quote from Sandeep "For Android, every application (including the special SystemServer) are forked from Zygote. The reason of course is to share as many libraries and classes between the two as possible to benefit from the preloading during boot. After applications start, (almost) all of the APIs end up calling into this SystemServer process over IPC (binder) and back to the application. In a fully running system, the SystemServer monitors every single process periodically to calculate their PSS / RSS and also decides which process is "important" to the user for interactivity. So, because of how these processes start _and_ the fact that the SystemServer is looping to monitor each process, it does tend to *know* which address range of the application is not used / useful. Besides, we can never rely on applications to clean things up themselves. We've had the "hey app1, the system is low on memory, please trim your memory usage down" notifications for a long time[1]. They rely on applications honoring the broadcasts and very few do. So, if we want to avoid the inevitable killing of the application and restarting it, some way to be able to tell the OS about unimportant memory in these applications will be useful. - ssp Q.2 - How to guarantee the race(i.e., object validation) between when giving a hint from an external process and get the hint from the target process? process_madvise operates on the target process's address space as it exists at the instant that process_madvise is called. If the space target process can run between the time the process_madvise process inspects the target process address space and the time that process_madvise is actually called, process_madvise may operate on memory regions that the calling process does not expect. It's the responsibility of the process calling process_madvise to close this race condition. For example, the calling process can suspend the target process with ptrace, SIGSTOP, or the freezer cgroup so that it doesn't have an opportunity to change its own address space before process_madvise is called. Another option is to operate on memory regions that the caller knows a priori will be unchanged in the target process. Yet another option is to accept the race for certain process_madvise calls after reasoning that mistargeting will do no harm. The suggested API itself does not provide synchronization. It also apply other APIs like move_pages, process_vm_write. The race isn't really a problem though. Why is it so wrong to require that callers do their own synchronization in some manner? Nobody objects to write(2) merely because it's possible for two processes to open the same file and clobber each other's writes --- instead, we tell people to use flock or something. Think about mmap. It never guarantees newly allocated address space is still valid when the user tries to access it because other threads could unmap the memory right before. That's where we need synchronization by using other API or design from userside. It shouldn't be part of API itself. If someone needs more fine-grained synchronization rather than process level, there were two ideas suggested - cookie[2] and anon-fd[3]. Both are applicable via using last reserved argument of the API but I don't think it's necessary right now since we have already ways to prevent the race so don't want to add additional complexity with more fine-grained optimization model. To make the API extend, it reserved an unsigned long as last argument so we could support it in future if someone really needs it. Q.3 - Why doesn't ptrace work? Injecting an madvise in the target process using ptrace would not work for us because such injected madvise would have to be executed by the target process, which means that process would have to be runnable and that creates the risk of the abovementioned race and hinting a wrong VMA. Furthermore, we want to act the hint in caller's context, not the callee's, because the callee is usually limited in cpuset/cgroups or even freezed state so they can't act by themselves quick enough, which causes more thrashing/kill. It doesn't work if the target process are ptraced(e.g., strace, debugger, minidump) because a process can have at most one ptracer. [1] https://developer.android.com/topic/performance/memory" [2] process_getinfo for getting the cookie which is updated whenever vma of process address layout are changed - Daniel Colascione - https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224 [3] anonymous fd which is used for the object(i.e., address range) validation - Michal Hocko - https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/ [minchan@kernel.org: fix process_madvise build break for arm64] Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com [minchan@kernel.org: fix build error for mips of process_madvise] Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com [akpm@linux-foundation.org: fix patch ordering issue] [akpm@linux-foundation.org: fix arm64 whoops] [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian] [akpm@linux-foundation.org: fix i386 build] [sfr@canb.auug.org.au: fix syscall numbering] Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au [sfr@canb.auug.org.au: madvise.c needs compat.h] Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au [minchan@kernel.org: fix mips build] Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com [yuehaibing@huawei.com: remove duplicate header which is included twice] Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com [minchan@kernel.org: do not use helper functions for process_madvise] Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com [akpm@linux-foundation.org: pidfd_get_pid() gained an argument] [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"] Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <christian@brauner.io> Cc: Daniel Colascione <dancol@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Dias <joaodias@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Sandeep Patil <sspatil@google.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18pid: move pidfd_get_pid() to pid.cMinchan Kim
process_madvise syscall needs pidfd_get_pid function to translate pidfd to pid so this patch move the function to kernel/pid.c. Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jann Horn <jannh@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Daniel Colascione <dancol@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Dias <joaodias@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Sandeep Patil <sspatil@google.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-3-minchan@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-17task_work: cleanup notification modesarch-cleanup-2020-10-22Jens Axboe
A previous commit changed the notification mode from true/false to an int, allowing notify-no, notify-yes, or signal-notify. This was backwards compatible in the sense that any existing true/false user would translate to either 0 (on notification sent) or 1, the latter which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2. Clean this up properly, and define a proper enum for the notification mode. Now we have: - TWA_NONE. This is 0, same as before the original change, meaning no notification requested. - TWA_RESUME. This is 1, same as before the original change, meaning that we use TIF_NOTIFY_RESUME. - TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the notification. Clean up all the callers, switching their 0/1/false/true to using the appropriate TWA_* mode for notifications. Fixes: e91b48162332 ("task_work: teach task_work_add() to do signal_wake_up()") Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>