summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2019-12-05xdp: fix cpumap redirect SKB creation bugJesper Dangaard Brouer
[ Upstream commit 676e4a6fe703f2dae699ee9d56f14516f9ada4ea ] We want to avoid leaking pointer info from xdp_frame (that is placed in top of frame) like commit 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page reuse"), and followup commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp headroom") that reserve this headroom. These changes also affected how cpumap constructed SKBs, as xdpf->headroom size changed, the skb data starting point were in-effect shifted with 32 bytes (sizeof xdp_frame). This was still okay, as the cpumap frame_size calculation also included xdpf->headroom which were reduced by same amount. A bug was introduced in commit 77ea5f4cbe20 ("bpf/cpumap: make sure frame_size for build_skb is aligned if headroom isn't"), where the xdpf->headroom became part of the SKB_DATA_ALIGN rounding up. This round-up to find the frame_size is in principle still correct as it does not exceed the 2048 bytes frame_size (which is max for ixgbe and i40e), but the 32 bytes offset of pkt_data_start puts this over the 2048 bytes limit. This cause skb_shared_info to spill into next frame. It is a little hard to trigger, as the SKB need to use above 15 skb_shinfo->frags[] as far as I calculate. This does happen in practise for TCP streams when skb_try_coalesce() kicks in. KASAN can be used to detect these wrong memory accesses, I've seen: BUG: KASAN: use-after-free in skb_try_coalesce+0x3cb/0x760 BUG: KASAN: wild-memory-access in skb_release_data+0xe2/0x250 Driver veth also construct a SKB from xdp_frame in this way, but is not affected, as it doesn't reserve/deduct the room (used by xdp_frame) from the SKB headroom. Instead is clears the pointers via xdp_scrub_frame(), and allows SKB to use this area. The fix in this patch is to do like veth and instead allow SKB to (re)use the area occupied by xdp_frame, by clearing via xdp_scrub_frame(). (This does kill the idea of the SKB being able to access (mem) info from this area, but I guess it was a bad idea anyhow, and it was already killed by the veth changes.) Fixes: 77ea5f4cbe20 ("bpf/cpumap: make sure frame_size for build_skb is aligned if headroom isn't") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05bpf: drop refcount if bpf_map_new_fd() fails in map_create()Peng Sun
[ Upstream commit 352d20d611414715353ee65fc206ee57ab1a6984 ] In bpf/syscall.c, map_create() first set map->usercnt to 1, a file descriptor is supposed to return to userspace. When bpf_map_new_fd() fails, drop the refcount. Fixes: bd5f5f4ecb78 ("bpf: Add BPF_MAP_GET_FD_BY_ID") Signed-off-by: Peng Sun <sironhide0null@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05bpf: decrease usercnt if bpf_map_new_fd() fails in bpf_map_get_fd_by_id()Peng Sun
[ Upstream commit 781e62823cb81b972dc8652c1827205cda2ac9ac ] In bpf/syscall.c, bpf_map_get_fd_by_id() use bpf_map_inc_not_zero() to increase the refcount, both map->refcnt and map->usercnt. Then, if bpf_map_new_fd() fails, should handle map->usercnt too. Fixes: bd5f5f4ecb78 ("bpf: Add BPF_MAP_GET_FD_BY_ID") Signed-off-by: Peng Sun <sironhide0null@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05fork: fix some -Wmissing-prototypes warningsYi Wang
[ Upstream commit fb5bf31722d0805a3f394f7d59f2e8cd07acccb7 ] We get a warning when building kernel with W=1: kernel/fork.c:167:13: warning: no previous prototype for `arch_release_thread_stack' [-Wmissing-prototypes] kernel/fork.c:779:13: warning: no previous prototype for `fork_init' [-Wmissing-prototypes] Add the missing declaration in head file to fix this. Also, remove arch_release_thread_stack() completely because no arch seems to implement it since bb9d81264 (arch: remove tile port). Link: http://lkml.kernel.org/r/1542170087-23645-1-git-send-email-wang.yi59@zte.com.cn Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05bpf/cpumap: make sure frame_size for build_skb is aligned if headroom isn'tJesper Dangaard Brouer
[ Upstream commit 77ea5f4cbe2084db9ab021ba73fb7eadf1610884 ] The frame_size passed to build_skb must be aligned, else it is possible that the embedded struct skb_shared_info gets unaligned. For correctness make sure that xdpf->headroom in included in the alignment. No upstream drivers can hit this, as all XDP drivers provide an aligned headroom. This was discovered when playing with implementing XDP support for mvneta, which have a 2 bytes DSA header, and this Marvell ARM64 platform didn't like doing atomic operations on an unaligned skb_shinfo(skb)->dataref addresses. Fixes: 1c601d829ab0 ("bpf: cpumap xdp_buff to skb conversion and allocation") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05kprobes: Blacklist symbols in arch-defined prohibited areaMasami Hiramatsu
[ Upstream commit fb1a59fae8baa3f3c69b72a87ff94fc4fa5683ec ] Blacklist symbols in arch-defined probe-prohibited areas. With this change, user can see all symbols which are prohibited to probe in debugfs. All archtectures which have custom prohibit areas should define its own arch_populate_kprobe_blacklist() function, but unless that, all symbols marked __kprobes are blacklisted. Reported-by: Andrea Righi <righi.andrea@gmail.com> Tested-by: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: David S. Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yonghong Song <yhs@fb.com> Link: http://lkml.kernel.org/r/154503485491.26176.15823229545155174796.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05tracing: Lock event_mutex before synth_event_mutexMasami Hiramatsu
[ Upstream commit fc800a10be26017f8f338bc8e500d48e3e6429d9 ] synthetic event is using synth_event_mutex for protecting synth_event_list, and event_trigger_write() path acquires locks as below order. event_trigger_write(event_mutex) ->trigger_process_regex(trigger_cmd_mutex) ->event_hist_trigger_func(synth_event_mutex) On the other hand, synthetic event creation and deletion paths call trace_add_event_call() and trace_remove_event_call() which acquires event_mutex. In that case, if we keep the synth_event_mutex locked while registering/unregistering synthetic events, its dependency will be inversed. To avoid this issue, current synthetic event is using a 2 phase process to create/delete events. For example, it searches existing events under synth_event_mutex to check for event-name conflicts, and unlocks synth_event_mutex, then registers a new event under event_mutex locked. Finally, it locks synth_event_mutex and tries to add the new event to the list. But it can introduce complexity and a chance for name conflicts. To solve this simpler, this introduces trace_add_event_call_nolock() and trace_remove_event_call_nolock() which don't acquire event_mutex inside. synthetic event can lock event_mutex before synth_event_mutex to solve the lock dependency issue simpler. Link: http://lkml.kernel.org/r/154140844377.17322.13781091165954002713.stgit@devbox Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com> Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01futex: Prevent robust futex exit raceYang Tao
commit ca16d5bee59807bf04deaab0a8eccecd5061528c upstream. Robust futexes utilize the robust_list mechanism to allow the kernel to release futexes which are held when a task exits. The exit can be voluntary or caused by a signal or fault. This prevents that waiters block forever. The futex operations in user space store a pointer to the futex they are either locking or unlocking in the op_pending member of the per task robust list. After a lock operation has succeeded the futex is queued in the robust list linked list and the op_pending pointer is cleared. After an unlock operation has succeeded the futex is removed from the robust list linked list and the op_pending pointer is cleared. The robust list exit code checks for the pending operation and any futex which is queued in the linked list. It carefully checks whether the futex value is the TID of the exiting task. If so, it sets the OWNER_DIED bit and tries to wake up a potential waiter. This is race free for the lock operation but unlock has two race scenarios where waiters might not be woken up. These issues can be observed with regular robust pthread mutexes. PI aware pthread mutexes are not affected. (1) Unlocking task is killed after unlocking the futex value in user space before being able to wake a waiter. pthread_mutex_unlock() | V atomic_exchange_rel (&mutex->__data.__lock, 0) <------------------------killed lll_futex_wake () | | |(__lock = 0) |(enter kernel) | V do_exit() exit_mm() mm_release() exit_robust_list() handle_futex_death() | |(__lock = 0) |(uval = 0) | V if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr)) return 0; The sanity check which ensures that the user space futex is owned by the exiting task prevents the wakeup of waiters which in consequence block infinitely. (2) Waiting task is killed after a wakeup and before it can acquire the futex in user space. OWNER WAITER futex_wait() pthread_mutex_unlock() | | | |(__lock = 0) | | | V | futex_wake() ------------> wakeup() | |(return to userspace) |(__lock = 0) | V oldval = mutex->__data.__lock <-----------------killed atomic_compare_and_exchange_val_acq (&mutex->__data.__lock, | id | assume_other_futex_waiters, 0) | | | (enter kernel)| | V do_exit() | | V handle_futex_death() | |(__lock = 0) |(uval = 0) | V if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr)) return 0; The sanity check which ensures that the user space futex is owned by the exiting task prevents the wakeup of waiters, which seems to be correct as the exiting task does not own the futex value, but the consequence is that other waiters wont be woken up and block infinitely. In both scenarios the following conditions are true: - task->robust_list->list_op_pending != NULL - user space futex value == 0 - Regular futex (not PI) If these conditions are met then it is reasonably safe to wake up a potential waiter in order to prevent the above problems. As this might be a false positive it can cause spurious wakeups, but the waiter side has to handle other types of unrelated wakeups, e.g. signals gracefully anyway. So such a spurious wakeup will not affect the correctness of these operations. This workaround must not touch the user space futex value and cannot set the OWNER_DIED bit because the lock value is 0, i.e. uncontended. Setting OWNER_DIED in this case would result in inconsistent state and subsequently in malfunction of the owner died handling in user space. The rest of the user space state is still consistent as no other task can observe the list_op_pending entry in the exiting tasks robust list. The eventually woken up waiter will observe the uncontended lock value and take it over. [ tglx: Massaged changelog and comment. Made the return explicit and not depend on the subsequent check and added constants to hand into handle_futex_death() instead of plain numbers. Fixed a few coding style issues. ] Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core") Signed-off-by: Yang Tao <yang.tao172@zte.com.cn> Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1573010582-35297-1-git-send-email-wang.yi59@zte.com.cn Link: https://lkml.kernel.org/r/20191106224555.943191378@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-01y2038: futex: Move compat implementation into futex.cArnd Bergmann
commit 04e7712f4460585e5eed5b853fd8b82a9943958f upstream. We are going to share the compat_sys_futex() handler between 64-bit architectures and 32-bit architectures that need to deal with both 32-bit and 64-bit time_t, and this is easier if both entry points are in the same file. In fact, most other system call handlers do the same thing these days, so let's follow the trend here and merge all of futex_compat.c into futex.c. In the process, a few minor changes have to be done to make sure everything still makes sense: handle_futex_death() and futex_cmpxchg_enabled() become local symbol, and the compat version of the fetch_robust_entry() function gets renamed to compat_fetch_robust_entry() to avoid a symbol clash. This is intended as a purely cosmetic patch, no behavior should change. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-01audit: print empty EXECVE argsRichard Guy Briggs
[ Upstream commit ea956d8be91edc702a98b7fe1f9463e7ca8c42ab ] Empty executable arguments were being skipped when printing out the list of arguments in an EXECVE record, making it appear they were somehow lost. Include empty arguments as an itemized empty string. Reproducer: autrace /bin/ls "" "/etc" ausearch --start recent -m execve -i | grep EXECVE type=EXECVE msg=audit(10/03/2018 13:04:03.208:1391) : argc=3 a0=/bin/ls a2=/etc With fix: type=EXECVE msg=audit(10/03/2018 21:51:38.290:194) : argc=3 a0=/bin/ls a1= a2=/etc type=EXECVE msg=audit(1538617898.290:194): argc=3 a0="/bin/ls" a1="" a2="/etc" Passes audit-testsuite. GH issue tracker at https://github.com/linux-audit/audit-kernel/issues/99 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: cleaned up the commit metadata] Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01sched/fair: Don't increase sd->balance_interval on newidle balanceValentin Schneider
[ Upstream commit 3f130a37c442d5c4d66531b240ebe9abfef426b5 ] When load_balance() fails to move some load because of task affinity, we end up increasing sd->balance_interval to delay the next periodic balance in the hopes that next time we look, that annoying pinned task(s) will be gone. However, idle_balance() pays no attention to sd->balance_interval, yet it will still lead to an increase in balance_interval in case of pinned tasks. If we're going through several newidle balances (e.g. we have a periodic task), this can lead to a huge increase of the balance_interval in a very small amount of time. To prevent that, don't increase the balance interval when going through a newidle balance. This is a similar approach to what is done in commit 58b26c4c0257 ("sched: Increment cache_nice_tries only on periodic lb"), where we disregard newidle balance and rely on periodic balance for more stable results. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar.Eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: patrick.bellasi@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1537974727-30788-2-git-send-email-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01sched/topology: Fix off by one bugPeter Zijlstra
[ Upstream commit 993f0b0510dad98b4e6e39506834dab0d13fd539 ] With the addition of the NUMA identity level, we increased @level by one and will run off the end of the array in the distance sort loop. Fixed: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01irq/matrix: Fix memory overallocationMichael Kelley
[ Upstream commit 57f01796f14fecf00d330fe39c8d2477ced9cd79 ] IRQ_MATRIX_SIZE is the number of longs needed for a bitmap, multiplied by the size of a long, yielding a byte count. But it is used to size an array of longs, which is way more memory than is needed. Change IRQ_MATRIX_SIZE so it is just the number of longs needed and the arrays come out the correct size. Fixes: 2f75d9e1c905 ("genirq: Implement bitmap matrix allocator") Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: KY Srinivasan <kys@microsoft.com> Link: https://lkml.kernel.org/r/1541032428-10392-1-git-send-email-mikelley@microsoft.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01kernel/panic.c: do not append newline to the stack protector panic stringBorislav Petkov
[ Upstream commit 95c4fb78fb23081472465ca20d5d31c4b780ed82 ] ... because panic() itself already does this. Otherwise you have line-broken trailer: [ 1.836965] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: pgd_alloc+0x29e/0x2a0 [ 1.836965] ]--- Link: http://lkml.kernel.org/r/20181008202901.7894-1-bp@alien8.de Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Kees Cook <keescook@chromium.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01bpf, btf: fix a missing check bug in btf_parseMartin Lau
[ Upstream commit 4a6998aff82a20a1aece86a186d8e5263f8b2315 ] Wenwen Wang reported: In btf_parse(), the header of the user-space btf data 'btf_data' is firstly parsed and verified through btf_parse_hdr(). In btf_parse_hdr(), the header is copied from user-space 'btf_data' to kernel-space 'btf->hdr' and then verified. If no error happens during the verification process, the whole data of 'btf_data', including the header, is then copied to 'data' in btf_parse(). It is obvious that the header is copied twice here. More importantly, no check is enforced after the second copy to make sure the headers obtained in these two copies are same. Given that 'btf_data' resides in the user space, a malicious user can race to modify the header between these two copies. By doing so, the user can inject inconsistent data, which can cause undefined behavior of the kernel and introduce potential security risk. This issue is similar to the one fixed in commit 8af03d1ae2e1 ("bpf: btf: Fix a missing check bug"). To fix it, this patch copies the user 'btf_data' *before* parsing / verifying the BTF header. Fixes: 69b693f0aefa ("bpf: btf: Introduce BPF Type Format (BTF)") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Co-developed-by: Wenwen Wang <wang6495@umn.edu> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01bpf: devmap: fix wrong interface selection in notifier_callTaehee Yoo
[ Upstream commit f592f804831f1cf9d1f9966f58c80f150e6829b5 ] The dev_map_notification() removes interface in devmap if unregistering interface's ifindex is same. But only checking ifindex is not enough because other netns can have same ifindex. so that wrong interface selection could occurred. Hence netdev pointer comparison code is added. v2: compare netdev pointer instead of using net_eq() (Daniel Borkmann) v1: Initial patch Fixes: 2ddf71e23cc2 ("net: add notifier hooks for devmap bpf map") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01swiotlb: do not panic on mapping failuresChristoph Hellwig
[ Upstream commit 8088546832aa2c0d8f99dd56edf6384f8a9b63b3 ] All properly written drivers now have error handling in the dma_map_single / dma_map_page callers. As swiotlb_tbl_map_single already prints a useful warning when running out of swiotlb pool space we can also remove swiotlb_full entirely as it serves no purpose now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01printk: fix integer overflow in setup_log_buf()Sergey Senozhatsky
[ Upstream commit d2130e82e9454304e9b91ba9da551b5989af8c27 ] The way we calculate logbuf free space percentage overflows signed integer: int free; free = __LOG_BUF_LEN - log_next_idx; pr_info("early log buf free: %u(%u%%)\n", free, (free * 100) / __LOG_BUF_LEN); We support LOG_BUF_LEN of up to 1<<25 bytes. Since setup_log_buf() is called during early init, logbuf is mostly empty, so __LOG_BUF_LEN - log_next_idx is close to 1<<25. Thus when we multiply it by 100, we overflow signed integer value range: 100 is 2^6 + 2^5 + 2^2. Example, booting with LOG_BUF_LEN 1<<25 and log_buf_len=2G boot param: [ 0.075317] log_buf_len: -2147483648 bytes [ 0.075319] early log buf free: 33549896(-28%) Make "free" unsigned integer and use appropriate printk() specifier. Link: http://lkml.kernel.org/r/20181010113308.9337-1-sergey.senozhatsky@gmail.com To: Steven Rostedt <rostedt@goodmis.org> Cc: linux-kernel@vger.kernel.org Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01printk: lock/unlock console only for new logbuf entriesSergey Senozhatsky
[ Upstream commit 3ac37a93fa9217e576bebfd4ba3e80edaaeb2289 ] Prior to commit 5c2992ee7fd8a29 ("printk: remove console flushing special cases for partial buffered lines") we would do console_cont_flush() for each pr_cont() to print cont fragments, so console_unlock() would actually print data: pr_cont(); console_lock(); console_unlock() console_cont_flush(); // print cont fragment ... pr_cont(); console_lock(); console_unlock() console_cont_flush(); // print cont fragment We don't do console_cont_flush() anymore, so when we do pr_cont() console_unlock() does nothing (unless we flushed the cont buffer): pr_cont(); console_lock(); console_unlock(); // noop ... pr_cont(); console_lock(); console_unlock(); // noop ... pr_cont(); cont_flush(); console_lock(); console_unlock(); // print data We also wakeup klogd purposelessly for pr_cont() output - un-flushed cont buffer is not stored in log_buf; there is nothing to pull. Thus we can console_lock()/console_unlock()/wake_up_klogd() only when we know that we log_store()-ed a message and there is something to print to the consoles/syslog. Link: http://lkml.kernel.org/r/20181002023836.4487-3-sergey.senozhatsky@gmail.com To: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: LKML <linux-kernel@vger.kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-24bpf: btf: Fix a missing check bugWenwen Wang
[ Upstream commit 8af03d1ae2e154a8be3631e8694b87007e1bdbc2 ] In btf_parse_hdr(), the length of the btf data header is firstly copied from the user space to 'hdr_len' and checked to see whether it is larger than 'btf_data_size'. If yes, an error code EINVAL is returned. Otherwise, the whole header is copied again from the user space to 'btf->hdr'. However, after the second copy, there is no check between 'btf->hdr->hdr_len' and 'hdr_len' to confirm that the two copies get the same value. Given that the btf data is in the user space, a malicious user can race to change the data between the two copies. By doing so, the user can provide malicious data to the kernel and cause undefined behavior. This patch adds a necessary check after the second copy, to make sure 'btf->hdr->hdr_len' has the same value as 'hdr_len'. Otherwise, an error code EINVAL will be returned. Signed-off-by: Wenwen Wang <wang6495@umn.edu> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-24printk: Give error on attempt to set log buffer length to over 2GHe Zhe
[ Upstream commit e6fe3e5b7d16e8f146a4ae7fe481bc6e97acde1e ] The current printk() is ready to handle log buffer size up to 2G. Give an explicit error for users who want to use larger log buffer. Also fix printk formatting to show the 2G as a positive number. Link: http://lkml.kernel.org/r/20181008135916.gg4kkmoki5bgtco5@pathway.suse.cz Cc: rostedt@goodmis.org Cc: linux-kernel@vger.kernel.org Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: He Zhe <zhe.he@windriver.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> [pmladek: Fixed to the really safe limit 2GB.] Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-24kexec: Allocate decrypted control pages for kdump if SME is enabledLianbo Jiang
[ Upstream commit 9cf38d5559e813cccdba8b44c82cc46ba48d0896 ] When SME is enabled in the first kernel, it needs to allocate decrypted pages for kdump because when the kdump kernel boots, these pages need to be accessed decrypted in the initial boot stage, before SME is enabled. [ bp: clean up text. ] Signed-off-by: Lianbo Jiang <lijiang@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: kexec@lists.infradead.org Cc: tglx@linutronix.de Cc: mingo@redhat.com Cc: hpa@zytor.com Cc: akpm@linux-foundation.org Cc: dan.j.williams@intel.com Cc: bhelgaas@google.com Cc: baiyaowei@cmss.chinamobile.com Cc: tiwai@suse.de Cc: brijesh.singh@amd.com Cc: dyoung@redhat.com Cc: bhe@redhat.com Cc: jroedel@suse.de Link: https://lkml.kernel.org/r/20180930031033.22110-3-lijiang@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-24printk: Correct wrong castingHe Zhe
[ Upstream commit 51a72ab7372d85c96104e58036f1b49ba11e5d2b ] log_first_seq and console_seq are 64-bit unsigned integers. Correct a wrong casting that might cut off the output. Link: http://lkml.kernel.org/r/1538239553-81805-2-git-send-email-zhe.he@windriver.com Cc: rostedt@goodmis.org Cc: linux-kernel@vger.kernel.org Signed-off-by: He Zhe <zhe.he@windriver.com> [sergey.senozhatsky@gmail.com: More descriptive commit message] Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-24cpu/SMT: State SMT is disabled even with nosmt and without "=force"Borislav Petkov
[ Upstream commit d0e7d14455d41163126afecd0fcce935463cc512 ] When booting with "nosmt=force" a message is issued into dmesg to confirm that SMT has been force-disabled but such a message is not issued when only "nosmt" is on the kernel command line. Fix that. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20181004172227.10094-1-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-24printk: CON_PRINTBUFFER console registration is a bit racySergey Senozhatsky
[ Upstream commit 884e370ea88c109a3b982f4eb9ecd82510a3a1fe ] CON_PRINTBUFFER console registration requires us to do several preparation steps: - Rollback console_seq to replay logbuf messages which were already seen on other consoles; - Set exclusive_console flag so console_unlock() will ->write() logbuf messages only to the exclusive_console driver. The way we do it, however, is a bit racy logbuf_lock_irqsave(flags); console_seq = syslog_seq; console_idx = syslog_idx; logbuf_unlock_irqrestore(flags); << preemption enabled << irqs enabled exclusive_console = newcon; console_unlock(); We rollback console_seq under logbuf_lock with IRQs disabled, but we set exclusive_console with local IRQs enabled and logbuf unlocked. If the system oops-es or panic-s before we set exclusive_console - and given that we have IRQs and preemption enabled there is such a possibility - we will re-play all logbuf messages to every registered console, which may be a bit annoying and time consuming. Move exclusive_console assignment to the same IRQs-disabled and logbuf_lock-protected section where we rollback console_seq. Link: http://lkml.kernel.org/r/20180928095304.9972-1-sergey.senozhatsky@gmail.com To: Steven Rostedt <rostedt@goodmis.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-24printk: Do not miss new messages when replaying the logPetr Mladek
[ Upstream commit f92b070f2dc89a8ff1a0cc8b608e20abef894c7d ] The variable "exclusive_console" is used to reply all existing messages on a newly registered console. It is cleared when all messages are out. The problem is that new messages might appear in the meantime. These are then visible only on the exclusive console. The obvious solution is to clear "exclusive_console" after we replay all messages that were already proceed before we started the reply. Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Link: http://lkml.kernel.org/r/20180913123406.14378-1-pmladek@suse.com To: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: linux-kernel@vger.kernel.org Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-20kprobes: Don't call BUG_ON() if there is a kprobe in use on free listMasami Hiramatsu
[ Upstream commit cbdd96f5586151e48317d90a403941ec23f12660 ] Instead of calling BUG_ON(), if we find a kprobe in use on free kprobe list, just remove it from the list and keep it on kprobe hash list as same as other in-use kprobes. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S . Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/153666126882.21306.10738207224288507996.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-20signal: Properly deliver SIGILL from uprobesEric W. Biederman
[ Upstream commit 55a3235fc71bf34303e34a95eeee235b2d2a35dd ] For userspace to tell the difference between a random signal and an exception, the exception must include siginfo information. Using SEND_SIG_FORCED for SIGILL is thus wrong, and it will result in userspace seeing si_code == SI_USER (like a random signal) instead of si_code == SI_KERNEL or a more specific si_code as all exceptions deliver. Therefore replace force_sig_info(SIGILL, SEND_SIG_FORCE, current) with force_sig(SIG_ILL, current) which gets this right and is shorter and easier to type. Fixes: 014940bad8e4 ("uprobes/x86: Send SIGILL if arch_uprobe_post_xol() fails") Fixes: 0b5256c7f173 ("uprobes: Send SIGILL if handle_trampoline() fails") Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-20signal: Always ignore SIGKILL and SIGSTOP sent to the global initEric W. Biederman
[ Upstream commit 86989c41b5ea08776c450cb759592532314a4ed6 ] If the first process started (aka /sbin/init) receives a SIGKILL it will panic the system if it is delivered. Making the system unusable and undebugable. It isn't much better if the first process started receives SIGSTOP. So always ignore SIGSTOP and SIGKILL sent to init. This is done in a separate clause in sig_task_ignored as force_sig_info can clear SIG_UNKILLABLE and this protection should work even then. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-20sched/debug: Explicitly cast sched_feat() to boolPeter Zijlstra
[ Upstream commit 7e6f4c5d600c1c8e2a1d900e65cab319d9b6782e ] LLVM has a warning that tags expressions like: if (foo && non-bool-const) This pattern triggers for CONFIG_SCHED_DEBUG=n where sched_feat() ends up being whatever bit we select. Avoid the warning with an explicit cast to bool. Reported-by: Philipp Klocke <philipp97kl@gmail.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-20y2038: make do_gettimeofday() and get_seconds() inlineArnd Bergmann
[ Upstream commit 33e26418193f58d1895f2f968e1953b1caf8deb7 ] get_seconds() and do_gettimeofday() are only used by a few modules now any more (waiting for the respective patches to get accepted), and they are among the last holdouts of code that is not y2038 safe in the core kernel. Move the implementation into the timekeeping32.h header to clean up the core kernel and isolate the old interfaces further. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-12cpu/speculation: Uninline and export CPU mitigations helpersTyler Hicks
commit 731dc9df975a5da21237a18c3384f811a7a41cc6 upstream. A kernel module may need to check the value of the "mitigations=" kernel command line parameter as part of its setup when the module needs to perform software mitigations for a CPU flaw. Uninline and export the helper functions surrounding the cpu_mitigations enum to allow for their usage from a module. Lastly, privatize the enum and cpu_mitigations variable since the value of cpu_mitigations can be checked with the exported helper functions. Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12sched/fair: Fix -Wunused-but-set-variable warningsQian Cai
commit 763a9ec06c409dcde2a761aac4bb83ff3938e0b3 upstream. Commit: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices") introduced a few compilation warnings: kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime': kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable] kernel/sched/fair.c: In function 'start_cfs_bandwidth': kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable] Also, __refill_cfs_bandwidth_runtime() does no longer update the expiration time, so fix the comments accordingly. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Reviewed-by: Dave Chiluk <chiluk+linux@indeed.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: pauld@redhat.com Fixes: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices") Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12sched/fair: Fix low cpu usage with high throttling by removing expiration of ↵Dave Chiluk
cpu-local slices commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream. It has been observed, that highly-threaded, non-cpu-bound applications running under cpu.cfs_quota_us constraints can hit a high percentage of periods throttled while simultaneously not consuming the allocated amount of quota. This use case is typical of user-interactive non-cpu bound applications, such as those running in kubernetes or mesos when run on multiple cpu cores. This has been root caused to cpu-local run queue being allocated per cpu bandwidth slices, and then not fully using that slice within the period. At which point the slice and quota expires. This expiration of unused slice results in applications not being able to utilize the quota for which they are allocated. The non-expiration of per-cpu slices was recently fixed by 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")'. Prior to that it appears that this had been broken since at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That added the following conditional which resulted in slices never being expired. if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { /* extend local deadline, drift is bounded above by 2 ticks */ cfs_rq->runtime_expires += TICK_NSEC; Because this was broken for nearly 5 years, and has recently been fixed and is now being noticed by many users running kubernetes (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion that the mechanisms around expiring runtime should be removed altogether. This allows quota already allocated to per-cpu run-queues to live longer than the period boundary. This allows threads on runqueues that do not use much CPU to continue to use their remaining slice over a longer period of time than cpu.cfs_period_us. However, this helps prevent the above condition of hitting throttling while also not fully utilizing your cpu quota. This theoretically allows a machine to use slightly more than its allotted quota in some periods. This overflow would be bounded by the remaining quota left on each per-cpu runqueueu. This is typically no more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will change nothing, as they should theoretically fully utilize all of their quota in each period. For user-interactive tasks as described above this provides a much better user/application experience as their cpu utilization will more closely match the amount they requested when they hit throttling. This means that cpu limits no longer strictly apply per period for non-cpu bound applications, but that they are still accurate over longer timeframes. This greatly improves performance of high-thread-count, non-cpu bound applications with low cfs_quota_us allocation on high-core-count machines. In the case of an artificial testcase (10ms/100ms of quota on 80 CPU machine), this commit resulted in almost 30x performance improvement, while still maintaining correct cpu quota restrictions. That testcase is available at https://github.com/indeedeng/fibtest. Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Ben Segall <bsegall@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: John Hammond <jhammond@indeed.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kyle Anderson <kwa@yelp.com> Cc: Gabriel Munos <gmunoz@netflix.com> Cc: Peter Oskolkov <posk@posk.io> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Brendan Gregg <bgregg@netflix.com> Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10tracing: Fix "gfp_t" format for synthetic eventsZhengjun Xing
[ Upstream commit 9fa8c9c647be624e91b09ecffa7cd97ee0600b40 ] In the format of synthetic events, the "gfp_t" is shown as "signed:1", but in fact the "gfp_t" is "unsigned", should be shown as "signed:0". The issue can be reproduced by the following commands: echo 'memlatency u64 lat; unsigned int order; gfp_t gfp_flags; int migratetype' > /sys/kernel/debug/tracing/synthetic_events cat /sys/kernel/debug/tracing/events/synthetic/memlatency/format name: memlatency ID: 2233 format: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1; signed:0; field:int common_pid; offset:4; size:4; signed:1; field:u64 lat; offset:8; size:8; signed:0; field:unsigned int order; offset:16; size:4; signed:0; field:gfp_t gfp_flags; offset:24; size:4; signed:1; field:int migratetype; offset:32; size:4; signed:1; print fmt: "lat=%llu, order=%u, gfp_flags=%x, migratetype=%d", REC->lat, REC->order, REC->gfp_flags, REC->migratetype Link: http://lkml.kernel.org/r/20191018012034.6404-1-zhengjun.xing@linux.intel.com Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Zhengjun Xing <zhengjun.xing@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-06tracing: Initialize iter->seq after zeroing in tracing_read_pipe()Petr Mladek
[ Upstream commit d303de1fcf344ff7c15ed64c3f48a991c9958775 ] A customer reported the following softlockup: [899688.160002] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [test.sh:16464] [899688.160002] CPU: 0 PID: 16464 Comm: test.sh Not tainted 4.12.14-6.23-azure #1 SLE12-SP4 [899688.160002] RIP: 0010:up_write+0x1a/0x30 [899688.160002] Kernel panic - not syncing: softlockup: hung tasks [899688.160002] RIP: 0010:up_write+0x1a/0x30 [899688.160002] RSP: 0018:ffffa86784d4fde8 EFLAGS: 00000257 ORIG_RAX: ffffffffffffff12 [899688.160002] RAX: ffffffff970fea00 RBX: 0000000000000001 RCX: 0000000000000000 [899688.160002] RDX: ffffffff00000001 RSI: 0000000000000080 RDI: ffffffff970fea00 [899688.160002] RBP: ffffffffffffffff R08: ffffffffffffffff R09: 0000000000000000 [899688.160002] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b59014720d8 [899688.160002] R13: ffff8b59014720c0 R14: ffff8b5901471090 R15: ffff8b5901470000 [899688.160002] tracing_read_pipe+0x336/0x3c0 [899688.160002] __vfs_read+0x26/0x140 [899688.160002] vfs_read+0x87/0x130 [899688.160002] SyS_read+0x42/0x90 [899688.160002] do_syscall_64+0x74/0x160 It caught the process in the middle of trace_access_unlock(). There is no loop. So, it must be looping in the caller tracing_read_pipe() via the "waitagain" label. Crashdump analyze uncovered that iter->seq was completely zeroed at this point, including iter->seq.seq.size. It means that print_trace_line() was never able to print anything and there was no forward progress. The culprit seems to be in the code: /* reset all but tr, trace, and overruns */ memset(&iter->seq, 0, sizeof(struct trace_iterator) - offsetof(struct trace_iterator, seq)); It was added by the commit 53d0aa773053ab182877 ("ftrace: add logic to record overruns"). It was v2.6.27-rc1. It was the time when iter->seq looked like: struct trace_seq { unsigned char buffer[PAGE_SIZE]; unsigned int len; }; There was no "size" variable and zeroing was perfectly fine. The solution is to reinitialize the structure after or without zeroing. Link: http://lkml.kernel.org/r/20191011142134.11997-1-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-06sched/vtime: Fix guest/system mis-accounting on task switchFrederic Weisbecker
[ Upstream commit 68e7a4d66b0ce04bf18ff2ffded5596ab3618585 ] vtime_account_system() assumes that the target task to account cputime to is always the current task. This is most often true indeed except on task switch where we call: vtime_common_task_switch(prev) vtime_account_system(prev) Here prev is the scheduling-out task where we account the cputime to. It doesn't match current that is already the scheduling-in task at this stage of the context switch. So we end up checking the wrong task flags to determine if we are accounting guest or system time to the previous task. As a result the wrong task is used to check if the target is running in guest mode. We may then spuriously account or leak either system or guest time on task switch. Fix this assumption and also turn vtime_guest_enter/exit() to use the task passed in parameter as well to avoid future similar issues. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpengli@tencent.com> Fixes: 2a42eb9594a1 ("sched/cputime: Accumulate vtime on top of nsec clocksource") Link: https://lkml.kernel.org/r/20190925214242.21873-1-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-29tracing: Fix race in perf_trace_buf initializationPrateek Sood
commit 6b1340cc00edeadd52ebd8a45171f38c8de2a387 upstream. A race condition exists while initialiazing perf_trace_buf from perf_trace_init() and perf_kprobe_init(). CPU0 CPU1 perf_trace_init() mutex_lock(&event_mutex) perf_trace_event_init() perf_trace_event_reg() total_ref_count == 0 buf = alloc_percpu() perf_trace_buf[i] = buf tp_event->class->reg() //fails perf_kprobe_init() goto fail perf_trace_event_init() perf_trace_event_reg() fail: total_ref_count == 0 total_ref_count == 0 buf = alloc_percpu() perf_trace_buf[i] = buf tp_event->class->reg() total_ref_count++ free_percpu(perf_trace_buf[i]) perf_trace_buf[i] = NULL Any subsequent call to perf_trace_event_reg() will observe total_ref_count > 0, causing the perf_trace_buf to be always NULL. This can result in perf_trace_buf getting accessed from perf_trace_buf_alloc() without being initialized. Acquiring event_mutex in perf_kprobe_init() before calling perf_trace_event_init() should fix this race. The race caused the following bug: Unable to handle kernel paging request at virtual address 0000003106f2003c Mem abort info: ESR = 0x96000045 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000045 CM = 0, WnR = 1 user pgtable: 4k pages, 39-bit VAs, pgdp = ffffffc034b9b000 [0000003106f2003c] pgd=0000000000000000, pud=0000000000000000 Internal error: Oops: 96000045 [#1] PREEMPT SMP Process syz-executor (pid: 18393, stack limit = 0xffffffc093190000) pstate: 80400005 (Nzcv daif +PAN -UAO) pc : __memset+0x20/0x1ac lr : memset+0x3c/0x50 sp : ffffffc09319fc50 __memset+0x20/0x1ac perf_trace_buf_alloc+0x140/0x1a0 perf_trace_sys_enter+0x158/0x310 syscall_trace_enter+0x348/0x7c0 el0_svc_common+0x11c/0x368 el0_svc_handler+0x12c/0x198 el0_svc+0x8/0xc Ramdumps showed the following: total_ref_count = 3 perf_trace_buf = ( 0x0 -> NULL, 0x0 -> NULL, 0x0 -> NULL, 0x0 -> NULL) Link: http://lkml.kernel.org/r/1571120245-4186-1-git-send-email-prsood@codeaurora.org Cc: stable@vger.kernel.org Fixes: e12f03d7031a9 ("perf/core: Implement the 'perf_kprobe' PMU") Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Prateek Sood <prsood@codeaurora.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-29perf/aux: Fix AUX output stoppingAlexander Shishkin
commit f3a519e4add93b7b31a6616f0b09635ff2e6a159 upstream. Commit: 8a58ddae2379 ("perf/core: Fix exclusive events' grouping") allows CAP_EXCLUSIVE events to be grouped with other events. Since all of those also happen to be AUX events (which is not the case the other way around, because arch/s390), this changes the rules for stopping the output: the AUX event may not be on its PMU's context any more, if it's grouped with a HW event, in which case it will be on that HW event's context instead. If that's the case, munmap() of the AUX buffer can't find and stop the AUX event, potentially leaving the last reference with the atomic context, which will then end up freeing the AUX buffer. This will then trip warnings: Fix this by using the context's PMU context when looking for events to stop, instead of the event's PMU context. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20191022073940.61814-1-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-17perf/hw_breakpoint: Fix arch_hw_breakpoint use-before-initializationMark-PK Tsai
commit 310aa0a25b338b3100c94880c9a69bec8ce8c3ae upstream. If we disable the compiler's auto-initialization feature, if -fplugin-arg-structleak_plugin-byref or -ftrivial-auto-var-init=pattern are disabled, arch_hw_breakpoint may be used before initialization after: 9a4903dde2c86 ("perf/hw_breakpoint: Split attribute parse and commit") On our ARM platform, the struct step_ctrl in arch_hw_breakpoint, which used to be zero-initialized by kzalloc(), may be used in arch_install_hw_breakpoint() without initialization. Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alix Wu <alix.wu@mediatek.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: YJ Chiang <yj.chiang@mediatek.com> Link: https://lkml.kernel.org/r/20190906060115.9460-1-mark-pk.tsai@mediatek.com [ Minor edits. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Doug Anderson <dianders@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-17tracing: Get trace_array reference for available_tracers filesSteven Rostedt (VMware)
commit 194c2c74f5532e62c218adeb8e2b683119503907 upstream. As instances may have different tracers available, we need to look at the trace_array descriptor that shows the list of the available tracers for the instance. But there's a race between opening the file and an admin deleting the instance. The trace_array_get() needs to be called before accessing the trace_array. Cc: stable@vger.kernel.org Fixes: 607e2ea167e56 ("tracing: Set up infrastructure to allow tracers for instances") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-17ftrace: Get a reference counter for the trace_array on filter filesSteven Rostedt (VMware)
commit 9ef16693aff8137faa21d16ffe65bb9832d24d71 upstream. The ftrace set_ftrace_filter and set_ftrace_notrace files are specific for an instance now. They need to take a reference to the instance otherwise there could be a race between accessing the files and deleting the instance. It wasn't until the :mod: caching where these file operations started referencing the trace_array directly. Cc: stable@vger.kernel.org Fixes: 673feb9d76ab3 ("ftrace: Add :mod: caching infrastructure to trace_array") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-17tracing/hwlat: Don't ignore outer-loop duration when calculating max_latencySrivatsa S. Bhat (VMware)
commit fc64e4ad80d4b72efce116f87b3174f0b7196f8e upstream. max_latency is intended to record the maximum ever observed hardware latency, which may occur in either part of the loop (inner/outer). So we need to also consider the outer-loop sample when updating max_latency. Link: http://lkml.kernel.org/r/157073345463.17189.18124025522664682811.stgit@srivatsa-ubuntu Fixes: e7c15cd8a113 ("tracing: Added hardware latency tracer") Cc: stable@vger.kernel.org Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-17tracing/hwlat: Report total time spent in all NMIs during the sampleSrivatsa S. Bhat (VMware)
commit 98dc19c11470ee6048aba723d77079ad2cda8a52 upstream. nmi_total_ts is supposed to record the total time spent in *all* NMIs that occur on the given CPU during the (active portion of the) sampling window. However, the code seems to be overwriting this variable for each NMI, thereby only recording the time spent in the most recent NMI. Fix it by accumulating the duration instead. Link: http://lkml.kernel.org/r/157073343544.17189.13911783866738671133.stgit@srivatsa-ubuntu Fixes: 7b2c86250122 ("tracing: Add NMI tracing in hwlat detector") Cc: stable@vger.kernel.org Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-17kernel/sysctl.c: do not override max_threads provided by userspaceMichal Hocko
commit b0f53dbc4bc4c371f38b14c391095a3bb8a0bb40 upstream. Partially revert 16db3d3f1170 ("kernel/sysctl.c: threads-max observe limits") because the patch is causing a regression to any workload which needs to override the auto-tuning of the limit provided by kernel. set_max_threads is implementing a boot time guesstimate to provide a sensible limit of the concurrently running threads so that runaways will not deplete all the memory. This is a good thing in general but there are workloads which might need to increase this limit for an application to run (reportedly WebSpher MQ is affected) and that is simply not possible after the mentioned change. It is also very dubious to override an admin decision by an estimation that doesn't have any direct relation to correctness of the kernel operation. Fix this by dropping set_max_threads from sysctl_max_threads so any value is accepted as long as it fits into MAX_THREADS which is important to check because allowing more threads could break internal robust futex restriction. While at it, do not use MIN_THREADS as the lower boundary because it is also only a heuristic for automatic estimation and admin might have a good reason to stop new threads to be created even when below this limit. This became more severe when we switched x86 from 4k to 8k kernel stacks. Starting since 6538b8ea886e ("x86_64: expand kernel stack to 16K") (3.16) we use THREAD_SIZE_ORDER = 2 and that halved the auto-tuned value. In the particular case 3.12 kernel.threads-max = 515561 4.4 kernel.threads-max = 200000 Neither of the two values is really insane on 32GB machine. I am not sure we want/need to tune the max_thread value further. If anything the tuning should be removed altogether if proven not useful in general. But we definitely need a way to override this auto-tuning. Link: http://lkml.kernel.org/r/20190922065801.GB18814@dhcp22.suse.cz Fixes: 16db3d3f1170 ("kernel/sysctl.c: threads-max observe limits") Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-17panic: ensure preemption is disabled during panic()Will Deacon
commit 20bb759a66be52cf4a9ddd17fddaf509e11490cd upstream. Calling 'panic()' on a kernel with CONFIG_PREEMPT=y can leave the calling CPU in an infinite loop, but with interrupts and preemption enabled. From this state, userspace can continue to be scheduled, despite the system being "dead" as far as the kernel is concerned. This is easily reproducible on arm64 when booting with "nosmp" on the command line; a couple of shell scripts print out a periodic "Ping" message whilst another triggers a crash by writing to /proc/sysrq-trigger: | sysrq: Trigger a crash | Kernel panic - not syncing: sysrq triggered crash | CPU: 0 PID: 1 Comm: init Not tainted 5.2.15 #1 | Hardware name: linux,dummy-virt (DT) | Call trace: | dump_backtrace+0x0/0x148 | show_stack+0x14/0x20 | dump_stack+0xa0/0xc4 | panic+0x140/0x32c | sysrq_handle_reboot+0x0/0x20 | __handle_sysrq+0x124/0x190 | write_sysrq_trigger+0x64/0x88 | proc_reg_write+0x60/0xa8 | __vfs_write+0x18/0x40 | vfs_write+0xa4/0x1b8 | ksys_write+0x64/0xf0 | __arm64_sys_write+0x14/0x20 | el0_svc_common.constprop.0+0xb0/0x168 | el0_svc_handler+0x28/0x78 | el0_svc+0x8/0xc | Kernel Offset: disabled | CPU features: 0x0002,24002004 | Memory Limit: none | ---[ end Kernel panic - not syncing: sysrq triggered crash ]--- | Ping 2! | Ping 1! | Ping 1! | Ping 2! The issue can also be triggered on x86 kernels if CONFIG_SMP=n, otherwise local interrupts are disabled in 'smp_send_stop()'. Disable preemption in 'panic()' before re-enabling interrupts. Link: http://lkml.kernel.org/r/20191002123538.22609-1-will@kernel.org Link: https://lore.kernel.org/r/BX1W47JXPMR8.58IYW53H6M5N@dragonstone Signed-off-by: Will Deacon <will@kernel.org> Reported-by: Xogium <contact@xogium.me> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Feng Tang <feng.tang@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-11tick: broadcast-hrtimer: Fix a race in bc_set_nextBalasubramani Vivekanandan
[ Upstream commit b9023b91dd020ad7e093baa5122b6968c48cc9e0 ] When a cpu requests broadcasting, before starting the tick broadcast hrtimer, bc_set_next() checks if the timer callback (bc_handler) is active using hrtimer_try_to_cancel(). But hrtimer_try_to_cancel() does not provide the required synchronization when the callback is active on other core. The callback could have already executed tick_handle_oneshot_broadcast() and could have also returned. But still there is a small time window where the hrtimer_try_to_cancel() returns -1. In that case bc_set_next() returns without doing anything, but the next_event of the tick broadcast clock device is already set to a timeout value. In the race condition diagram below, CPU #1 is running the timer callback and CPU #2 is entering idle state and so calls bc_set_next(). In the worst case, the next_event will contain an expiry time, but the hrtimer will not be started which happens when the racing callback returns HRTIMER_NORESTART. The hrtimer might never recover if all further requests from the CPUs to subscribe to tick broadcast have timeout greater than the next_event of tick broadcast clock device. This leads to cascading of failures and finally noticed as rcu stall warnings Here is a depiction of the race condition CPU #1 (Running timer callback) CPU #2 (Enter idle and subscribe to tick broadcast) --------------------- --------------------- __run_hrtimer() tick_broadcast_enter() bc_handler() __tick_broadcast_oneshot_control() tick_handle_oneshot_broadcast() raw_spin_lock(&tick_broadcast_lock); dev->next_event = KTIME_MAX; //wait for tick_broadcast_lock //next_event for tick broadcast clock set to KTIME_MAX since no other cores subscribed to tick broadcasting raw_spin_unlock(&tick_broadcast_lock); if (dev->next_event == KTIME_MAX) return HRTIMER_NORESTART // callback function exits without restarting the hrtimer //tick_broadcast_lock acquired raw_spin_lock(&tick_broadcast_lock); tick_broadcast_set_event() clockevents_program_event() dev->next_event = expires; bc_set_next() hrtimer_try_to_cancel() //returns -1 since the timer callback is active. Exits without restarting the timer cpu_base->running = NULL; The comment that hrtimer cannot be armed from within the callback is wrong. It is fine to start the hrtimer from within the callback. Also it is safe to start the hrtimer from the enter/exit idle code while the broadcast handler is active. The enter/exit idle code and the broadcast handler are synchronized using tick_broadcast_lock. So there is no need for the existing try to cancel logic. All this can be removed which will eliminate the race condition as well. Fixes: 5d1638acb9f6 ("tick: Introduce hrtimer based broadcast") Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190926135101.12102-2-balasubramani_vivekanandan@mentor.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-11kernel/elfcore.c: include proper prototypesValdis Kletnieks
[ Upstream commit 0f74914071ab7e7b78731ed62bf350e3a344e0a5 ] When building with W=1, gcc properly complains that there's no prototypes: CC kernel/elfcore.o kernel/elfcore.c:7:17: warning: no previous prototype for 'elf_core_extra_phdrs' [-Wmissing-prototypes] 7 | Elf_Half __weak elf_core_extra_phdrs(void) | ^~~~~~~~~~~~~~~~~~~~ kernel/elfcore.c:12:12: warning: no previous prototype for 'elf_core_write_extra_phdrs' [-Wmissing-prototypes] 12 | int __weak elf_core_write_extra_phdrs(struct coredump_params *cprm, loff_t offset) | ^~~~~~~~~~~~~~~~~~~~~~~~~~ kernel/elfcore.c:17:12: warning: no previous prototype for 'elf_core_write_extra_data' [-Wmissing-prototypes] 17 | int __weak elf_core_write_extra_data(struct coredump_params *cprm) | ^~~~~~~~~~~~~~~~~~~~~~~~~ kernel/elfcore.c:22:15: warning: no previous prototype for 'elf_core_extra_data_size' [-Wmissing-prototypes] 22 | size_t __weak elf_core_extra_data_size(void) | ^~~~~~~~~~~~~~~~~~~~~~~~ Provide the include file so gcc is happy, and we don't have potential code drift Link: http://lkml.kernel.org/r/29875.1565224705@turing-police Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-11sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()KeMeng Shi
[ Upstream commit 714e501e16cd473538b609b3e351b2cc9f7f09ed ] An oops can be triggered in the scheduler when running qemu on arm64: Unable to handle kernel paging request at virtual address ffff000008effe40 Internal error: Oops: 96000007 [#1] SMP Process migration/0 (pid: 12, stack limit = 0x00000000084e3736) pstate: 20000085 (nzCv daIf -PAN -UAO) pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20 lr : move_queued_task.isra.21+0x124/0x298 ... Call trace: __ll_sc___cmpxchg_case_acq_4+0x4/0x20 __migrate_task+0xc8/0xe0 migration_cpu_stop+0x170/0x180 cpu_stopper_thread+0xec/0x178 smpboot_thread_fn+0x1ac/0x1e8 kthread+0x134/0x138 ret_from_fork+0x10/0x18 __set_cpus_allowed_ptr() will choose an active dest_cpu in affinity mask to migrage the process if process is not currently running on any one of the CPUs specified in affinity mask. __set_cpus_allowed_ptr() will choose an invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if CPUS in an affinity mask are deactived by cpu_down after cpumask_intersects check. cpumask_test_cpu() of dest_cpu afterwards is overflown and may pass if corresponding bit is coincidentally set. As a consequence, kernel will access an invalid rq address associate with the invalid CPU in migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs. The reproduce the crash: 1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling sched_setaffinity. 2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online" and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn. 3) Oops appears if the invalid CPU is set in memory after tested cpumask. Signed-off-by: KeMeng Shi <shikemeng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikemeng@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-11sched/membarrier: Fix private expedited registration checkMathieu Desnoyers
[ Upstream commit fc0d77387cb5ae883fd774fc559e056a8dde024c ] Fix a logic flaw in the way membarrier_register_private_expedited() handles ready state checks for private expedited sync core and private expedited registrations. If a private expedited membarrier registration is first performed, and then a private expedited sync_core registration is performed, the ready state check will skip the second registration when it really should not. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190919173705.2181-2-mathieu.desnoyers@efficios.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>