path: root/kernel
Age    Commit message    Author
2016-11-20  tracing: Move mutex to protect against resetting of seq data  [Steven Rostedt (Red Hat)]
commit 1245800c0f96eb6ebb368593e251d66c01e61022 upstream. The iter->seq can be reset outside the protection of the mutex. So can reading of user data. Move the mutex up to the beginning of the function. Fixes: d7350c3f45694 ("tracing/core: make the read callbacks reentrants") Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-11-20  sched/core: Fix a race between try_to_wake_up() and a woken up task  [Balbir Singh]
commit 135e8c9250dd5c8c9aae5984fde6f230d0cbfeaf upstream. The origin of the issue I've seen is related to a missing memory barrier between the check for task->state and the check for task->on_rq. The task being woken up is already awake from a schedule() and is doing the following:

  do {
      schedule()
      set_current_state(TASK_(UN)INTERRUPTIBLE);
  } while (!cond);

The waker actually gets stuck doing the following in try_to_wake_up():

  while (p->on_cpu)
      cpu_relax();

Analysis: The instance I've seen involves the following race:

  CPU1                                      CPU2
  while () {
    if (cond)
      break;
    do {
      schedule();
      set_current_state(TASK_UN..)
    } while (!cond);
                                            wakeup_routine()
                                              spin_lock_irqsave(wait_lock)
    raw_spin_lock_irqsave(wait_lock)          wake_up_process()
  }                                           try_to_wake_up()
  set_current_state(TASK_RUNNING);              ..
  list_del(&waiter.list);

CPU2 wakes up CPU1, but before it can get the wait_lock and set current state to TASK_RUNNING the following occurs:

  CPU3
  wakeup_routine()
  raw_spin_lock_irqsave(wait_lock)
  if (!list_empty)
    wake_up_process()
    try_to_wake_up()
    raw_spin_lock_irqsave(p->pi_lock)
    ..
    if (p->on_rq && ttwu_wakeup())
    ..
  while (p->on_cpu)
    cpu_relax()
  ..

CPU3 tries to wake up the task on CPU1 again since it finds it on the wait_queue, CPU1 is spinning on wait_lock, but immediately after CPU2, CPU3 got it. CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and the task is spinning on the wait_lock. Interestingly since p->on_rq is checked under pi_lock, I've noticed that try_to_wake_up() finds p->on_rq to be 0. This was the most confusing bit of the analysis, but p->on_rq is changed under the runqueue lock, rq_lock; the p->on_rq check is not reliable without this fix IMHO. The race is visible (based on the analysis) only when ttwu_queue() does a remote wakeup via ttwu_queue_remote. In which case the p->on_rq change is not done under the pi_lock. The result is that after a while the entire system locks up on raw_spin_lock_irqsave(wait_lock) and the holder spins infinitely. Reproduction of the issue: The issue can be reproduced after a long run on my system with 80 threads, having to tweak available memory to very low and running the memory stress-ng mmapfork test. It usually takes a long time to reproduce. I am trying to work on a test case that can reproduce the issue faster, but that's work in progress. I am still testing the changes on my system in a loop and the tests seem OK thus far. Big thanks to Benjamin and Nick for helping debug this as well. Ben helped catch the missing barrier, Nick caught every missing bit in my theory. Signed-off-by: Balbir Singh <bsingharora@gmail.com> [ Updated comment to clarify matching barriers. Many architectures do not have a full barrier in switch_to() so that cannot be relied upon. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nicholas Piggin <nicholas.piggin@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-11-20  sched/cputime: Fix prev steal time accounting during CPU hotplug  [Wanpeng Li]
commit 3d89e5478bf550a50c99e93adf659369798263b0 upstream. Commit: e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug") ... set rq->prev_* to 0 after a CPU hotplug comes back, in order to fix the case where (after CPU hotplug) steal time is smaller than rq->prev_steal_time. However, this should never happen. Steal time was only smaller because of the KVM-specific bug fixed by the previous patch. Worse, the previous patch triggers a bug on CPU hot-unplug/plug operation: because rq->prev_steal_time is cleared, all of the CPU's past steal time will be accounted again on hot-plug. Since the root cause has been fixed, we can just revert commit e9532e69b8d1. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 'commit e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")' Link: http://lkml.kernel.org/r/1465813966-3116-3-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-08-22  audit: fix a double fetch in audit_log_single_execve_arg()  [Paul Moore]
commit 43761473c254b45883a64441dd0bc85a42f3645c upstream. There is a double fetch problem in audit_log_single_execve_arg() where we first check the execve(2) arguments for any "bad" characters which would require hex encoding and then re-fetch the arguments for logging in the audit record[1]. Of course this leaves a window of opportunity for an unsavory application to munge with the data. This patch reworks things by only fetching the argument data once[2] into a buffer where it is scanned and logged into the audit record(s). In addition to fixing the double fetch, this patch improves on the original code in a few other ways: better handling of large arguments which require encoding, stricter record length checking, and some performance improvements (completely unverified, but we got rid of some strlen() calls, that's got to be a good thing). As part of the development of this patch, I've also created a basic regression test for the audit-testsuite, the test can be tracked on GitHub at the following link: * https://github.com/linux-audit/audit-testsuite/issues/25 [1] If you pay careful attention, there is actually a triple fetch problem due to a strnlen_user() call at the top of the function. [2] This is a tiny white lie, we do make a call to strnlen_user() prior to fetching the argument data. I don't like it, but due to the way the audit record is structured we really have no choice unless we copy the entire argument at once (which would require a rather wasteful allocation). The good news is that with this patch the kernel no longer relies on this strnlen_user() value for anything beyond recording it in the log, we also update it with a trustworthy value whenever possible. Reported-by: Pengfei Wang <wpengfeinudt@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com> [bwh: Backported to 3.2: - In audit_log_execve_info() various information is retrieved via the extra parameter struct audit_aux_data_execve *axi - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
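For readers unfamiliar with the bug class, here is a minimal userspace sketch of the single-fetch pattern the commit moves to; it is illustrative only and not the audit code, copy_in() stands in for copy_from_user(), and all names are invented for the example:

  #include <stdio.h>
  #include <string.h>

  /* Stand-in for copy_from_user(): in the kernel the source lives in user
   * memory that another thread may rewrite between two separate fetches. */
  static void copy_in(char *dst, const char *src, size_t len)
  {
      memcpy(dst, src, len);
  }

  static int needs_hex_encoding(const char *s)
  {
      for (; *s; s++)
          if (*s == '"' || (unsigned char)*s < 0x21 || (unsigned char)*s > 0x7e)
              return 1;
      return 0;
  }

  int main(void)
  {
      char user_arg[] = "argv1";        /* attacker-controlled in the real case */
      char kbuf[sizeof(user_arg)];

      /* Single fetch: the scan and the logging operate on the same private
       * snapshot, so the data that was checked is the data that gets logged. */
      copy_in(kbuf, user_arg, sizeof(kbuf));
      if (needs_hex_encoding(kbuf))
          printf("would hex-encode before logging\n");
      else
          printf("logged verbatim: %s\n", kbuf);
      return 0;
  }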
2016-08-22  kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w  [Andrey Ryabinin]
commit 57675cb976eff977aefb428e68e4e0236d48a9ff upstream. Lengthy output of sysrq-w may take a lot of time on a slow serial console. Currently we reset the NMI-watchdog on the current CPU to avoid spurious lockup messages. Sometimes this doesn't work since the softlockup watchdog might trigger on another CPU which is waiting for an IPI to proceed. We reset softlockup watchdogs on all CPUs, but we do this only after listing all tasks, and this may be too late on a busy system. So, reset the watchdogs on all CPUs earlier, in the for_each_process_thread() loop. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-08-22  wait/ptrace: assume __WALL if the child is traced  [Oleg Nesterov]
commit bf959931ddb88c4e4366e96dd22e68fa0db9527c upstream. The following program (a simplified version of one generated by syzkaller)

  #include <pthread.h>
  #include <unistd.h>
  #include <sys/ptrace.h>
  #include <stdio.h>
  #include <signal.h>

  void *thread_func(void *arg)
  {
      ptrace(PTRACE_TRACEME, 0, 0, 0);
      return 0;
  }

  int main(void)
  {
      pthread_t thread;

      if (fork())
          return 0;

      while (getppid() != 1)
          ;

      pthread_create(&thread, NULL, thread_func, NULL);
      pthread_join(thread, NULL);
      return 0;
  }

creates an unreapable zombie if /sbin/init doesn't use __WALL. This is not a kernel bug, at least in the sense that everything works as expected: the debugger should reap a traced sub-thread before it can reap the leader, but without __WALL/__WCLONE do_wait() ignores sub-threads. Unfortunately, it seems that /sbin/init in most (all?) distributions doesn't use it and we have to change the kernel to avoid the problem. Note also that most init's use sys_waitid() which doesn't allow __WALL, so the necessary user-space fix is not that trivial. This patch just adds the "ptrace" check into eligible_child(). To some degree this matches the "tsk->ptrace" in exit_notify(), ->exit_signal is mostly ignored when the tracee reports to the debugger. Or WSTOPPED, the tracer doesn't need to set this flag to wait for the stopped tracee. This obviously means a user-visible change: __WCLONE and __WALL no longer have any meaning for the debugger. And I can only hope that this won't break something, but at least strace/gdb won't suffer. We could make a more conservative change. Say, we can take __WCLONE into account, or !thread_group_leader(). But it would be nice to not complicate these historical/confusing checks. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: Pedro Alves <palves@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: <syzkaller@googlegroups.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-08-22  sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systems  [Vik Heyndrickx]
commit 20878232c52329f92423d27a60e48b6a6389e0dd upstream. Systems show a minimal load average of 0.00, 0.01, 0.05 even when they have no load at all. Uptime and /proc/loadavg on all systems with kernels released during the last five years up until kernel version 4.6-rc5, show a 5- and 15-minute minimum loadavg of 0.01 and 0.05 respectively. This should be 0.00 on idle systems, but the way the kernel calculates this value prevents it from getting lower than the mentioned values. Likewise but not as obviously noticeable, a fully loaded system with no processes waiting, shows a maximum 1/5/15 loadavg of 1.00, 0.99, 0.95 (multiplied by number of cores). Once the (old) load becomes 93 or higher, it mathematically can never get lower than 93, even when the active (load) remains 0 forever. This results in the strange 0.00, 0.01, 0.05 uptime values on idle systems. Note: 93/2048 = 0.0454..., which rounds up to 0.05. It is not correct to add a 0.5 rounding (=1024/2048) here, since the result from this function is fed back into the next iteration again, so the result of that +0.5 rounding value then gets multiplied by (2048-2037), and then rounded again, so there is a virtual "ghost" load created, next to the old and active load terms. By changing the way the internally kept value is rounded, that internal value equivalent now can reach 0.00 on idle, and 1.00 on full load. Upon increasing load, the internally kept load value is rounded up, when the load is decreasing, the load value is rounded down. The modified code was tested on nohz=off and nohz kernels. It was tested on vanilla kernel 4.6-rc5 and on centos 7.1 kernel 3.10.0-327. It was tested on single, dual, and octal cores system. It was tested on virtual hosts and bare hardware. No unwanted effects have been observed, and the problems that the patch intended to fix were indeed gone. Tested-by: Damien Wyart <damien.wyart@free.fr> Signed-off-by: Vik Heyndrickx <vik.heyndrickx@veribox.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Doug Smythies <dsmythies@telus.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 0f004f5a696a ("sched: Cure more NO_HZ load average woes") Link: http://lkml.kernel.org/r/e8d32bff-d544-7748-72b5-3c86cc71f09f@veribox.net Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
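A userspace sketch of the arithmetic described above (not the kernel code itself): it applies the pre-fix decay step with the +0.5 (1024/2048) rounding the message identifies, using the kernel's fixed-point constants, and shows the internal value bottoming out at 6, 30 and 93, i.e. the visible 0.00 / 0.01 / 0.05 floor:

  #include <stdio.h>

  #define FSHIFT  11
  #define FIXED_1 (1UL << FSHIFT)

  /* Decay step in the pre-fix form described above: the rounding term is fed
   * back into the next iteration, creating the "ghost" load. */
  static unsigned long old_calc_load(unsigned long load, unsigned long exp,
                                     unsigned long active)
  {
      load *= exp;
      load += active * (FIXED_1 - exp);
      load += 1UL << (FSHIFT - 1);
      return load >> FSHIFT;
  }

  int main(void)
  {
      unsigned long exps[3] = { 1884, 2014, 2037 };   /* 1-, 5-, 15-minute factors */

      for (int i = 0; i < 3; i++) {
          unsigned long load = FIXED_1;               /* start at 1.00, then stay idle */

          for (int t = 0; t < 100000; t++)
              load = old_calc_load(load, exps[i], 0);
          printf("floor %lu/2048 = %.2f\n", load, (double)load / FIXED_1);
      }
      return 0;                                       /* prints 0.00, 0.01 and 0.05 */
  }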
2016-05-01  fs/coredump: prevent fsuid=0 dumps into user-controlled directories  [Jann Horn]
commit 378c6520e7d29280f400ef2ceaf155c86f05a71a upstream. This commit fixes the following security hole affecting systems where all of the following conditions are fulfilled: - The fs.suid_dumpable sysctl is set to 2. - The kernel.core_pattern sysctl's value starts with "/". (Systems where kernel.core_pattern starts with "|/" are not affected.) - Unprivileged user namespace creation is permitted. (This is true on Linux >=3.8, but some distributions disallow it by default using a distro patch.) Under these conditions, if a program executes under secure exec rules, causing it to run with the SUID_DUMP_ROOT flag, then unshares its user namespace, changes its root directory and crashes, the coredump will be written using fsuid=0 and a path derived from kernel.core_pattern - but this path is interpreted relative to the root directory of the process, allowing the attacker to control where a coredump will be written with root privileges. To fix the security issue, always interpret core_pattern for dumps that are written under SUID_DUMP_ROOT relative to the root directory of init. Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-05-01  tracing: Fix crash from reading trace_pipe with sendfile  [Steven Rostedt (Red Hat)]
commit a29054d9478d0435ab01b7544da4f674ab13f533 upstream. If tracing contains data and the trace_pipe file is read with sendfile(), then it can trigger a NULL pointer dereference and various BUG_ON within the VM code. There's a patch to fix this in the splice_to_pipe() code, but it's also a good idea to not let that happen from trace_pipe either. Link: http://lkml.kernel.org/r/1457641146-9068-1-git-send-email-rabin@rab.in Reported-by: Rabin Vincent <rabin.vincent@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-05-01  tracing: Have preempt(irqs)off trace preempt disabled functions  [Steven Rostedt (Red Hat)]
commit cb86e05390debcc084cfdb0a71ed4c5dbbec517d upstream. Joel Fernandes reported that the function tracing of preempt disabled sections was not being reported when running either the preemptirqsoff or preemptoff tracers. This was due to the fact that the function tracer callback for those tracers checked if irqs were disabled before tracing. But this fails when we want to trace preempt off locations as well. Joel explained that he wanted to see functions where interrupts are enabled but preemption was disabled. The expected output he wanted:

  <...>-2265    1d.h1 3419us : preempt_count_sub <-irq_exit
  <...>-2265    1d..1 3419us : __do_softirq <-irq_exit
  <...>-2265    1d..1 3419us : msecs_to_jiffies <-__do_softirq
  <...>-2265    1d..1 3420us : irqtime_account_irq <-__do_softirq
  <...>-2265    1d..1 3420us : __local_bh_disable_ip <-__do_softirq
  <...>-2265    1..s1 3421us : run_timer_softirq <-__do_softirq
  <...>-2265    1..s1 3421us : hrtimer_run_pending <-run_timer_softirq
  <...>-2265    1..s1 3421us : _raw_spin_lock_irq <-run_timer_softirq
  <...>-2265    1d.s1 3422us : preempt_count_add <-_raw_spin_lock_irq
  <...>-2265    1d.s2 3422us : _raw_spin_unlock_irq <-run_timer_softirq
  <...>-2265    1..s2 3422us : preempt_count_sub <-_raw_spin_unlock_irq
  <...>-2265    1..s1 3423us : rcu_bh_qs <-__do_softirq
  <...>-2265    1d.s1 3423us : irqtime_account_irq <-__do_softirq
  <...>-2265    1d.s1 3423us : __local_bh_enable <-__do_softirq

There's a comment saying that the irq disabled check is because there's a possible race that tracing_cpu may be set when the function is executed. But I don't remember that race. For now, I added a check for preemption being enabled too to not record the function, as there would be no race if that was the case. I need to re-investigate this, as I'm now thinking that the tracing_cpu will always be correct. But no harm in keeping the check for now, except for the slight performance hit. Link: http://lkml.kernel.org/r/1457770386-88717-1-git-send-email-agnel.joel@gmail.com Fixes: 5e6d2b9cfa3a "tracing: Use one prologue for the preempt irqs off tracer function tracers" Cc: stable@vger.kernel.org # 2.6.37+ Reported-by: Joel Fernandes <agnel.joel@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-05-01  sched/cputime: Fix steal time accounting vs. CPU hotplug  [Thomas Gleixner]
commit e9532e69b8d1d1284e8ecf8d2586de34aec61244 upstream. On CPU hotplug the steal time accounting can keep a stale rq->prev_steal_time value over CPU down and up. So after the CPU comes up again the delta calculation in steal_account_process_tick() wreckages itself due to the unsigned math: u64 steal = paravirt_steal_clock(smp_processor_id()); steal -= this_rq()->prev_steal_time; So if steal is smaller than rq->prev_steal_time we end up with an insane large value which then gets added to rq->prev_steal_time, resulting in a permanent wreckage of the accounting. As a consequence the per CPU stats in /proc/stat become stale. Nice trick to tell the world how idle the system is (100%) while the CPU is 100% busy running tasks. Though we prefer realistic numbers. None of the accounting values which use a previous value to account for fractions is reset at CPU hotplug time. update_rq_clock_task() has a sanity check for prev_irq_time and prev_steal_time_rq, but that sanity check solely deals with clock warps and limits the /proc/stat visible wreckage. The prev_time values are still wrong. Solution is simple: Reset rq->prev_*_time when the CPU is plugged in again. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Fixes: commit 095c0aa83e52 "sched: adjust scheduler cpu power for stolen time" Fixes: commit aa483808516c "sched: Remove irq time from available CPU power" Fixes: commit e6e6685accfa "KVM guest: Steal time accounting" Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603041539490.3686@nanos Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: adjust filenames] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
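The unsigned-math wreckage is easy to reproduce in isolation; the numbers in this sketch are made up, only the wraparound matters:

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
      uint64_t prev_steal_time = 5000000000ULL;  /* stale value surviving CPU down/up */
      uint64_t steal = 100000ULL;                /* fresh steal clock after re-plug */

      steal -= prev_steal_time;                  /* u64 cannot go negative: it wraps */
      printf("accounted steal delta: %llu ns\n", (unsigned long long)steal);
      return 0;                                  /* prints an insanely large value */
  }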
2016-04-01  kernel/resource.c: fix muxed resource handling in __request_region()  [Simon Guinot]
commit 59ceeaaf355fa0fb16558ef7c24413c804932ada upstream. In __request_region, if a conflict with a BUSY and MUXED resource is detected, then the caller goes to sleep and waits for the resource to be released. A pointer on the conflicting resource is kept. At wake-up this pointer is used as a parent to retry to request the region. A first problem is that this pointer might well be invalid (if for example the conflicting resource have already been freed). Another problem is that the next call to __request_region() fails to detect a remaining conflict. The previously conflicting resource is passed as a parameter and __request_region() will look for a conflict among the children of this resource and not at the resource itself. It is likely to succeed anyway, even if there is still a conflict. Instead, the parent of the conflicting resource should be passed to __request_region(). As a fix, this patch doesn't update the parent resource pointer in the case we have to wait for a muxed region right after. Reported-and-tested-by: Vincent Pelletier <plr.vincent@gmail.com> Signed-off-by: Simon Guinot <simon.guinot@sequanux.org> Tested-by: Vincent Donnefort <vdonnefort@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-02-27  sched: fix __sched_setscheduler() vs load balancing race  [Mike Galbraith]
__sched_setscheduler() may release rq->lock in pull_rt_task() as a task is being changed rt -> fair class. load balancing may sneak in, move the task behind __sched_setscheduler()'s back, which explodes in switched_to_fair() when the passed but no longer valid rq is used. Tell can_migrate_task() to say no if ->pi_lock is held. @stable: Kernels that predate SCHED_DEADLINE can use this simple (and tested) check in lieu of backport of the full 18 patch mainline treatment. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> [bwh: Backported to 3.2: - Adjust numbering in the comment - Adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Willy Tarreau <w@1wt.eu>
2016-02-27  pipe: limit the per-user amount of pages allocated in pipes  [Willy Tarreau]
commit 759c01142a5d0f364a462346168a56de28a80f52 upstream. On not-so-small systems, it is possible for a single process to cause an OOM condition by filling large pipes with data that are never read. A typical process filling 4000 pipes with 1 MB of data will use 4 GB of memory. On small systems it may be tricky to set the pipe max size to prevent this from happening. This patch makes it possible to enforce a per-user soft limit above which new pipes will be limited to a single page, effectively limiting them to 4 kB each, as well as a hard limit above which no new pipes may be created for this user. This has the effect of protecting the system against memory abuse without hurting other users, and still allowing pipes to work correctly though with less data at once. The limits are controlled by two new sysctls: pipe-user-pages-soft, and pipe-user-pages-hard. Both may be disabled by setting them to zero. The default soft limit allows the default number of FDs per process (1024) to create pipes of the default size (64kB), thus reaching a limit of 64MB before starting to create only smaller pipes. With 256 processes limited to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB = 1084 MB of memory allocated for a user. The hard limit is disabled by default to avoid breaking existing applications that make intensive use of pipes (eg: for splicing). Reported-by: socketpair@gmail.com Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Mitigates: CVE-2013-4312 (Linux 2.0+) Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
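A quick check of the soft-limit arithmetic quoted above (sizes in kB, figures taken straight from the message):

  #include <stdio.h>

  int main(void)
  {
      unsigned long within_soft_kb = 1024UL * 64;                /* 1024 fds at 64 kB */
      unsigned long capped_kb      = (256UL * 1024 - 1024) * 4;  /* remaining pipes at 4 kB */

      printf("%lu MB\n", (within_soft_kb + capped_kb) / 1024);   /* prints 1084 */
      return 0;
  }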
2016-02-27  itimers: Handle relative timers with CONFIG_TIME_LOW_RES proper  [Thomas Gleixner]
commit 51cbb5242a41700a3f250ecfb48dcfb7e4375ea4 upstream. As Helge reported for timerfd we have the same issue in itimers. We return remaining time larger than the programmed relative time to user space in case of CONFIG_TIME_LOW_RES=y. Use the proper function to adjust the extra time added in hrtimer_start_range_ns(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Helge Deller <deller@gmx.de> Cc: John Stultz <john.stultz@linaro.org> Cc: linux-m68k@lists.linux-m68k.org Cc: dhowells@redhat.com Link: http://lkml.kernel.org/r/20160114164159.528222587@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-02-27  posix-timers: Handle relative timers with CONFIG_TIME_LOW_RES proper  [Thomas Gleixner]
commit 572c39172684c3711e4a03c9a7380067e2b0661c upstream. As Helge reported for timerfd we have the same issue in posix timers. We return remaining time larger than the programmed relative time to user space in case of CONFIG_TIME_LOW_RES=y. Use the proper function to adjust the extra time added in hrtimer_start_range_ns(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Helge Deller <deller@gmx.de> Cc: John Stultz <john.stultz@linaro.org> Cc: linux-m68k@lists.linux-m68k.org Cc: dhowells@redhat.com Link: http://lkml.kernel.org/r/20160114164159.450510905@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-02-27  hrtimer: Handle remaining time proper for TIME_LOW_RES  [Thomas Gleixner]
commit 203cbf77de59fc8f13502dcfd11350c6d4a5c95f upstream. If CONFIG_TIME_LOW_RES is enabled we add a jiffie to the relative timeout to prevent short sleeps, but we do not account for that in interfaces which retrieve the remaining time. Helge observed that timerfd can return a remaining time larger than the relative timeout. That's not expected and breaks userland test programs. Store the information that the timer was armed relative and provide functions to adjust the remaining time. To avoid bloating the hrtimer struct make state a u8, which as a bonus results in better code on x86 at least. Reported-and-tested-by: Helge Deller <deller@gmx.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: John Stultz <john.stultz@linaro.org> Cc: linux-m68k@lists.linux-m68k.org Cc: dhowells@redhat.com Link: http://lkml.kernel.org/r/20160114164159.273328486@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bwh: Backported to 3.2: - Use #ifdef instead of IS_ENABLED() as that doesn't work for config symbols that don't exist on the current architecture - Use KTIME_LOW_RES directly instead of hrtimer_resolution - Use ktime_sub() instead of modifying ktime::tv64 directly - Adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-02-13  posix-clock: Fix return code on the poll method's error path  [Richard Cochran]
commit 1b9f23727abb92c5e58f139e7d180befcaa06fe0 upstream. The posix_clock_poll function is supposed to return a bit mask of POLLxxx values. However, in case the hardware has disappeared (due to hot plugging for example) this code returns -ENODEV in a futile attempt to throw an error at the file descriptor level. The kernel's file_operations interface does not accept such error codes from the poll method. Instead, this function ought to return POLLERR. The value -ENODEV does, in fact, contain the POLLERR bit (and almost all the other POLLxxx bits as well), but only by chance. This patch fixes the code to return a proper bit mask. Credit goes to Markus Elfring for pointing out the suspicious signed/unsigned mismatch. Reported-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Richard Cochran <richardcochran@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Julia Lawall <julia.lawall@lip6.fr> Link: http://lkml.kernel.org/r/1450819198-17420-1-git-send-email-richardcochran@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
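Why -ENODEV "contains" POLLERR only by chance: interpreted as a poll mask, a negative errno is mostly one-bits, so nearly every POLL* flag tests true. A small userspace demonstration:

  #include <stdio.h>
  #include <poll.h>
  #include <errno.h>

  int main(void)
  {
      unsigned int bogus = (unsigned int)-ENODEV;   /* what the buggy path returned */

      printf("bogus mask %#x: POLLERR=%d POLLIN=%d POLLOUT=%d\n",
             bogus, !!(bogus & POLLERR), !!(bogus & POLLIN), !!(bogus & POLLOUT));
      printf("proper return: %#x (POLLERR)\n", (unsigned int)POLLERR);
      return 0;
  }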
2016-02-13  futex: Drop refcount if requeue_pi() acquired the rtmutex  [Thomas Gleixner]
commit fb75a4282d0d9a3c7c44d940582c2d226cf3acfb upstream. If the proxy lock in the requeue loop acquires the rtmutex for a waiter then it also acquires a refcount on the pi_state related to the futex, but the waiter side does not drop the reference count. Add the missing free_pi_state() call. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <darren@dvhart.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Bhuvanesh_Surachari@mentor.com Cc: Andy Lowe <Andy_Lowe@mentor.com> Link: http://lkml.kernel.org/r/20151219200607.178132067@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-01-22  genirq: Prevent chip buslock deadlock  [Thomas Gleixner]
commit abc7e40c81d113ef4bacb556f0a77ca63ac81d85 upstream. If an interrupt chip utilizes chip->buslock then free_irq() can deadlock in the following way:

  CPU0                            CPU1
                                  interrupt(X) (Shared or spurious)
  free_irq(X)                     interrupt_thread(X)
  chip_bus_lock(X)                irq_finalize_oneshot(X)
                                    chip_bus_lock(X)
  synchronize_irq(X)

synchronize_irq() waits for the interrupt thread to complete, i.e. forever. Solution is simple: Drop chip_bus_lock() before calling synchronize_irq() as we do with the irq_desc lock. There is nothing to be protected after the point where the irq_desc lock has been released. This adds chip_bus_lock/unlock() to the remove_irq() code path, but that's actually correct in the case where remove_irq() is called on such an interrupt. The current users of remove_irq() are not affected as none of those interrupts is on a chip which requires buslock. Reported-by: Fredrik Markström <fredrik.markstrom@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-12-30  sched/core: Clear the root_domain cpumasks in init_rootdomain()  [Xunlei Pang]
commit 8295c69925ad53ec32ca54ac9fc194ff21bc40e2 upstream. root_domain::rto_mask allocated through alloc_cpumask_var() contains garbage data, and this may cause problems. For instance, when doing pull_rt_task(), it may do useless iterations if rto_mask retains some extra garbage bits. Worse still, this violates the isolated domain rule for clustered scheduling using cpuset, because the tasks (with all the cpus allowed) belonging to one root domain can be pulled away into another root domain. The patch cleans the garbage by using zalloc_cpumask_var() instead of alloc_cpumask_var() for root_domain::rto_mask allocation, thereby addressing the issues. Do the same thing for root_domain's other cpumask members: dlo_mask, span, and online. Signed-off-by: Xunlei Pang <xlpang@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: - There's no dlo_mask to initialise - Adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-12-30  sched/core: Remove false-positive warning from wake_up_process()  [Sasha Levin]
commit 119d6f6a3be8b424b200dcee56e74484d5445f7e upstream. Because wakeups can (fundamentally) be late, a task might not be in the expected state. Therefore testing against a task's state is racy, and can yield false positives. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: oleg@redhat.com Fixes: 9067ac85d533 ("wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task") Link: http://lkml.kernel.org/r/1448933660-23082-1-git-send-email-sasha.levin@oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-12-30  ring-buffer: Update read stamp with first real commit on page  [Steven Rostedt (Red Hat)]
commit b81f472a208d3e2b4392faa6d17037a89442f4ce upstream. Do not update the read stamp after swapping out the reader page from the write buffer. If the reader page is swapped out of the buffer before an event is written to it, then the read_stamp may get an out of date timestamp, as the page timestamp is updated on the first commit to that page. rb_get_reader_page() only returns a page if it has an event on it, otherwise it will return NULL. At that point, check if the page being returned has events and has not been read yet. Then at that point update the read_stamp to match the time stamp of the reader page. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-11-27  perf: Fix inherited events vs. tracepoint filters  [Peter Zijlstra]
commit b71b437eedaed985062492565d9d421d975ae845 upstream. Arnaldo reported that tracepoint filters seem to misbehave (ie. not apply) on inherited events. The fix is obvious; filters are only set on the actual (parent) event, use the normal pattern of using this parent event for filters. This is safe because each child event has a reference to it. Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frédéric Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/20151102095051.GN17308@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-11-17  sched/core: Fix TASK_DEAD race in finish_task_switch()  [Peter Zijlstra]
commit 95913d97914f44db2b81271c2e2ebd4d2ac2df83 upstream. So the problem this patch is trying to address is as follows: CPU0 CPU1 context_switch(A, B) ttwu(A) LOCK A->pi_lock A->on_cpu == 0 finish_task_switch(A) prev_state = A->state <-. WMB | A->on_cpu = 0; | UNLOCK rq0->lock | | context_switch(C, A) `-- A->state = TASK_DEAD prev_state == TASK_DEAD put_task_struct(A) context_switch(A, C) finish_task_switch(A) A->state == TASK_DEAD put_task_struct(A) The argument being that the WMB will allow the load of A->state on CPU0 to cross over and observe CPU1's store of A->state, which will then result in a double-drop and use-after-free. Now the comment states (and this was true once upon a long time ago) that we need to observe A->state while holding rq->lock because that will order us against the wakeup; however the wakeup will not in fact acquire (that) rq->lock; it takes A->pi_lock these days. We can obviously fix this by upgrading the WMB to an MB, but that is expensive, so we'd rather avoid that. The alternative this patch takes is: smp_store_release(&A->on_cpu, 0), which avoids the MB on some archs, but not important ones like ARM. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: manfred@colorfullife.com Cc: will.deacon@arm.com Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()") Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: - Adjust filename - As smp_store_release() is not defined, use smp_mb()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-11-17  clocksource: Fix abs() usage w/ 64bit values  [John Stultz]
commit 67dfae0cd72fec5cd158b6e5fb1647b7dbe0834c upstream. This patch fixes one cases where abs() was being used with 64-bit nanosecond values, where the result may be capped at 32-bits. This potentially could cause watchdog false negatives on 32-bit systems, so this patch addresses the issue by using abs64(). Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/1442279124-7309-2-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
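A userspace illustration of the truncation being fixed: C's abs() takes an int, so a 64-bit nanosecond delta loses its high bits, while the kernel's abs64() (llabs() in this sketch) keeps them:

  #include <stdio.h>
  #include <stdlib.h>
  #include <stdint.h>
  #include <inttypes.h>

  int main(void)
  {
      int64_t delta_ns = -4300000000LL;   /* ~4.3 s of clock skew, needs > 32 bits */

      /* The (int32_t) cast mirrors what handing an s64 to a 32-bit abs() does on
       * the usual two's-complement systems: only the low 32 bits survive. */
      printf("abs()  : %d\n", abs((int32_t)delta_ns));
      printf("llabs(): %" PRId64 "\n", llabs(delta_ns));
      return 0;
  }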
2015-11-17  genirq: Fix race in register_irq_proc()  [Ben Hutchings]
commit 95c2b17534654829db428f11bcf4297c059a2a7e upstream. Per-IRQ directories in procfs are created only when a handler is first added to the irqdesc, not when the irqdesc is created. In the case of a shared IRQ, multiple tasks can race to create a directory. This race condition seems to have been present forever, but is easier to hit with async probing. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Link: http://lkml.kernel.org/r/1443266636.2004.2.camel@decadent.org.uk Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-11-17  module: Fix locking in symbol_put_addr()  [Peter Zijlstra]
commit 275d7d44d802ef271a42dc87ac091a495ba72fc5 upstream. Poma (on the way to another bug) reported an assertion triggering: [<ffffffff81150529>] module_assert_mutex_or_preempt+0x49/0x90 [<ffffffff81150822>] __module_address+0x32/0x150 [<ffffffff81150956>] __module_text_address+0x16/0x70 [<ffffffff81150f19>] symbol_put_addr+0x29/0x40 [<ffffffffa04b77ad>] dvb_frontend_detach+0x7d/0x90 [dvb_core] Laura Abbott <labbott@redhat.com> produced a patch which lead us to inspect symbol_put_addr(). This function has a comment claiming it doesn't need to disable preemption around the module lookup because it holds a reference to the module it wants to find, which therefore cannot go away. This is wrong (and a false optimization too, preempt_disable() is really rather cheap, and I doubt any of this is on uber critical paths, otherwise it would've retained a pointer to the actual module anyway and avoided the second lookup). While its true that the module cannot go away while we hold a reference on it, the data structure we do the lookup in very much _CAN_ change while we do the lookup. Therefore fix the comment and add the required preempt_disable(). Reported-by: poma <pomidorabelisima@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Fixes: a6e6abd575fc ("module: remove module_text_address()") Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-10-13  fs: create and use seq_show_option for escaping  [Kees Cook]
commit a068acf2ee77693e0bf39d6e07139ba704f461c3 upstream. Many file systems that implement the show_options hook fail to correctly escape their output which could lead to unescaped characters (e.g. new lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This could lead to confusion, spoofed entries (resulting in things like systemd issuing false d-bus "mount" notifications), and who knows what else. This looks like it would only be the root user stepping on themselves, but it's possible weird things could happen in containers or in other situations with delegated mount privileges. Here's an example using overlay with setuid fusermount trusting the contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use of "sudo" is something more sneaky:

  $ BASE="ovl"
  $ MNT="$BASE/mnt"
  $ LOW="$BASE/lower"
  $ UP="$BASE/upper"
  $ WORK="$BASE/work/ 0 0 none /proc fuse.pwn user_id=1000"
  $ mkdir -p "$LOW" "$UP" "$WORK"
  $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
  $ cat /proc/mounts
  none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0 none /proc fuse.pwn user_id=1000 0 0
  $ fusermount -u /proc
  $ cat /proc/mounts
  cat: /proc/mounts: No such file or directory

This fixes the problem by adding new seq_show_option and seq_show_option_n helpers, and updating the vulnerable show_option handlers to use them as needed. Some, like SELinux, need to be open coded due to unusual existing escape mechanisms. [akpm@linux-foundation.org: add lost chunk, per Kees] [keescook@chromium.org: seq_show_option should be using const parameters] Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Jan Kara <jack@suse.com> Acked-by: Paul Moore <paul@paul-moore.com> Cc: J. R. Okajima <hooanon05g@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: - Drop changes to overlayfs, reiserfs - Drop vers option from cifs - ceph changes are all in one file - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
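A userspace sketch of the kind of escaping the new helpers provide; the escape set here is illustrative rather than the exact one the in-kernel helpers use, the point being that whitespace in an option value can no longer masquerade as field separators in /proc/mounts:

  #include <stdio.h>

  static void show_option(const char *name, const char *value)
  {
      printf(",%s=", name);
      for (const char *p = value; *p; p++) {
          switch (*p) {
          case ' ':  fputs("\\040", stdout); break;  /* octal escapes, mtab style */
          case '\t': fputs("\\011", stdout); break;
          case '\n': fputs("\\012", stdout); break;
          case '\\': fputs("\\134", stdout); break;
          default:   fputc(*p, stdout);
          }
      }
  }

  int main(void)
  {
      /* The workdir from the exploit above: once escaped it cannot be read as
       * the end of one mount entry and the start of a forged second one. */
      show_option("workdir", "ovl/work/ 0 0 none /proc fuse.pwn user_id=1000");
      putchar('\n');
      return 0;
  }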
2015-10-13  perf: Fix fasync handling on inherited events  [Peter Zijlstra]
commit fed66e2cdd4f127a43fd11b8d92a99bdd429528c upstream. Vince reported that the fasync signal stuff doesn't work proper for inherited events. So fix that. Installing fasync allocates memory and sets filp->f_flags |= FASYNC, which upon the demise of the file descriptor ensures the allocation is freed and state is updated. Now for perf, we can have the events stick around for a while after the original FD is dead because of references from child events. So we cannot copy the fasync pointer around. We can however consistently use the parent's fasync, as that will be updated. Reported-and-Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho deMelo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: eranian@google.com Link: http://lkml.kernel.org/r/1434011521.1495.71.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-08-12  tracing/filter: Do not allow infix to exceed end of string  [Steven Rostedt (Red Hat)]
commit 6b88f44e161b9ee2a803e5b2b1fbcf4e20e8b980 upstream. While debugging a WARN_ON() for filtering, I found that it is possible for the filter string to be referenced after its end. With the filter:

  # echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter

The filter_parse() function can call infix_get_op() which calls infix_advance() that updates the infix filter pointers for the cnt and tail without checking if the filter is already at the end, which will put the cnt to zero and the tail beyond the end. The loop then calls infix_next() that has

  ps->infix.cnt--;
  return ps->infix.string[ps->infix.tail++];

The cnt will now be below zero, and the tail that is returned is already past the end of the filter string. So far the allocation of the filter string usually has some buffer that is zeroed out, but if the filter string is of the exact size of the allocated buffer there's no guarantee that the character after the nul terminating character will be zero. Luckily, only root can write to the filter. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
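A minimal userspace sketch of the cursor logic quoted above; the names mirror the message but the structure is invented for illustration, and the point is that unless cnt is checked before every advance, tail walks past the terminating nul:

  #include <stdio.h>
  #include <string.h>

  struct infix { const char *string; int cnt, tail; };

  /* Same shape as the snippet above: no bounds check of its own. */
  static int infix_next(struct infix *in)
  {
      in->cnt--;
      return in->string[in->tail++];
  }

  int main(void)
  {
      const char filter[] = ">";
      struct infix in = { filter, (int)strlen(filter), 0 };

      while (in.cnt > 0)                    /* the guard the fix enforces */
          printf("token byte: '%c'\n", infix_next(&in));

      /* One more unchecked call would return filter[1] (the nul) and leave
       * tail pointing past the allocation on a tightly sized buffer. */
      return 0;
  }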
2015-08-12  tracing/filter: Do not WARN on operand count going below zero  [Steven Rostedt (Red Hat)]
commit b4875bbe7e68f139bd3383828ae8e994a0df6d28 upstream. When testing the fix for the trace filter, I could not come up with a scenario where the operand count goes below zero, so I added a WARN_ON_ONCE(cnt < 0) to the logic. But there is a legitimate case where it can happen (although the filter would be wrong).

  # echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter

That is, a single operation without any operands will hit the path where the WARN_ON_ONCE() can trigger. Although this is harmless, the filter is reported as an error. But instead of spitting out a warning to the kernel dmesg, just fail nicely and report it via the proper channels. Link: http://lkml.kernel.org/r/558C6082.90608@oracle.com Reported-by: Vince Weaver <vincent.weaver@maine.edu> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-08-12  rcu: Correctly handle non-empty Tiny RCU callback list with none ready  [Paul E. McKenney]
commit 6e91f8cb138625be96070b778d9ba71ce520ea7e upstream. If, at the time __rcu_process_callbacks() is invoked, there are callbacks in Tiny RCU's callback list, but none of them are ready to be invoked, the current list-management code will knit the non-ready callbacks out of the list. This can result in hangs and possibly worse. This commit therefore inserts a check for there being no callbacks that can be invoked immediately. This bug is unlikely to occur -- you have to get a new callback between the time rcu_sched_qs() or rcu_bh_qs() was called, but before we get to __rcu_process_callbacks(). It was detected by the addition of RCU-bh testing to rcutorture, which in turn was instigated by Iftekhar Ahmed's mutation testing. Although this bug was made much more likely by 915e8a4fe45e (rcu: Remove fastpath from __rcu_process_callbacks()), this did not cause the bug, but rather made it much more probable. That said, it takes more than 40 hours of rcutorture testing, on average, for this bug to appear, so this fix cannot be considered an emergency. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> [bwh: Backported to 3.2: adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-08-12  hrtimer: Allow concurrent hrtimer_start() for self restarting timers  [Peter Zijlstra]
commit 5de2755c8c8b3a6b8414870e2c284914a2b42e4d upstream. Because we drop cpu_base->lock around calling hrtimer::function, it is possible for hrtimer_start() to come in between and enqueue the timer. If hrtimer::function then returns HRTIMER_RESTART we'll hit the BUG_ON because HRTIMER_STATE_ENQUEUED will be set. Since the above is a perfectly valid scenario, remove the BUG_ON and make the enqueue_hrtimer() call conditional on the timer not being enqueued already. NOTE: in that concurrent scenario its entirely common for both sites to want to modify the hrtimer, since hrtimers don't provide serialization themselves be sure to provide some such that the hrtimer::function and the hrtimer_start() caller don't both try and fudge the expiration state at the same time. To that effect, add a WARN when someone tries to forward an already enqueued timer, the most common way to change the expiry of self restarting timers. Ideally we'd put the WARN in everything modifying the expiry but most of that is inlines and we don't need the bloat. Fixes: 2d44ae4d7135 ("hrtimer: clean up cpu->base locking tricks") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ben Segall <bsegall@google.com> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Paul Turner <pjt@google.com> Link: http://lkml.kernel.org/r/20150415113105.GT5029@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bwh: Backported to 3.2: adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-08-07  Fix lockup related to stop_machine being stuck in __do_softirq.  [Ben Greear]
commit 34376a50fb1fa095b9d0636fa41ed2e73125f214 upstream. The stop machine logic can lock up if all but one of the migration threads make it through the disable-irq step and the one remaining thread gets stuck in __do_softirq. The reason __do_softirq can hang is that it has a bail-out based on jiffies timeout, but in the lockup case, jiffies itself is not incremented. To work around this, re-add the max_restart counter in __do_irq and stop processing irqs after 10 restarts. Thanks to Tejun Heo and Rusty Russell and others for helping me track this down. This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce latencies"). It may be worth looking into ath9k to see if it has issues with its irq handler at a later date. The hang stack traces look something like this: ------------[ cut here ]------------ WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7() Watchdog detected hard LOCKUP on cpu 2 Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] Pid: 23, comm: migration/2 Tainted: G C 3.9.4+ #11 Call Trace: <NMI> warn_slowpath_common+0x85/0x9f warn_slowpath_fmt+0x46/0x48 watchdog_overflow_callback+0x9c/0xa7 __perf_event_overflow+0x137/0x1cb perf_event_overflow+0x14/0x16 intel_pmu_handle_irq+0x2dc/0x359 perf_event_nmi_handler+0x19/0x1b nmi_handle+0x7f/0xc2 do_nmi+0xbc/0x304 end_repeat_nmi+0x1e/0x2e <<EOE>> cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 ---[ end trace 4947dfa9b0a4cec3 ]--- BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17] Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc] irq event stamp: 835637905 hardirqs last enabled at (835637904): __do_softirq+0x9f/0x257 hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80 softirqs last enabled at (5654720): __do_softirq+0x1ff/0x257 softirqs last disabled at (5654725): irq_exit+0x5f/0xbb CPU 1 Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: tasklet_hi_action+0xf0/0xf0 Process migration/1 Call Trace: <IRQ> __do_softirq+0x117/0x257 irq_exit+0x5f/0xbb smp_apic_timer_interrupt+0x8a/0x98 apic_timer_interrupt+0x72/0x80 <EOI> printk+0x4d/0x4f stop_machine_cpu_stop+0x22c/0x274 cpu_stopper_thread+0xae/0x162 smpboot_thread_fn+0x258/0x260 kthread+0xc7/0xcf ret_from_fork+0x7c/0xb0 Signed-off-by: Ben Greear <greearb@candelatech.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Pekka Riikonen <priikone@iki.fi> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Rui Xiang <rui.xiang@huawei.com>
2015-08-07  softirq: reduce latencies  [Eric Dumazet]
commit c10d73671ad30f54692f7f69f0e09e75d3a8926a upstream. In various network workloads, __do_softirq() latencies can be up to 20 ms if HZ=1000, and 200 ms if HZ=100. This is because we iterate 10 times in the softirq dispatcher, and some actions can consume a lot of cycles. This patch changes the fallback to ksoftirqd condition to : - A time limit of 2 ms. - need_resched() being set on current task When one of this condition is met, we wakeup ksoftirqd for further softirq processing if we still have pending softirqs. Using need_resched() as the only condition can trigger RCU stalls, as we can keep BH disabled for too long. I ran several benchmarks and got no significant difference in throughput, but a very significant reduction of latencies (one order of magnitude) : In following bench, 200 antagonist "netperf -t TCP_RR" are started in background, using all available cpus. Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC IRQ (hard+soft) Before patch : # netperf -H 7.7.7.84 -t TCP_RR -T2,2 -- -k RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=550110.424 MIN_LATENCY=146858 MAX_LATENCY=997109 P50_LATENCY=305000 P90_LATENCY=550000 P99_LATENCY=710000 MEAN_LATENCY=376989.12 STDDEV_LATENCY=184046.92 After patch : # netperf -H 7.7.7.84 -t TCP_RR -T2,2 -- -k RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind RT_LATENCY=40545.492 MIN_LATENCY=9834 MAX_LATENCY=78366 P50_LATENCY=33583 P90_LATENCY=59000 P99_LATENCY=69000 MEAN_LATENCY=38364.67 STDDEV_LATENCY=12865.26 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Miller <davem@davemloft.net> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Rui Xiang <rui.xiang@huawei.com>
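A userspace sketch of the new fallback rule (a 2 ms budget or a pending reschedule), with need_resched() and the work queue mocked out; only the loop-exit condition is the point here:

  #include <stdio.h>
  #include <stdbool.h>
  #include <time.h>

  #define MAX_SOFTIRQ_TIME_NS (2 * 1000 * 1000)        /* the 2 ms budget */

  static long long now_ns(void)
  {
      struct timespec ts;

      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec * 1000000000LL + ts.tv_nsec;
  }

  static bool need_resched(void) { return false; }     /* mocked */
  static bool work_pending(void) { static long n = 1000000; return n-- > 0; }
  static void handle_one(void)   { }                   /* one softirq action */

  int main(void)
  {
      long long deadline = now_ns() + MAX_SOFTIRQ_TIME_NS;
      long long handled = 0;

      while (work_pending()) {
          handle_one();
          handled++;
          if (now_ns() > deadline || need_resched()) {
              printf("deferring the rest to the helper thread\n");
              break;
          }
      }
      printf("handled %lld items inline\n", handled);
      return 0;
  }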
2015-08-07  __ptrace_may_access() should not deny sub-threads  [Mark Grondona]
commit 73af963f9f3036dffed55c3a2898598186db1045 upstream. __ptrace_may_access() checks get_dumpable/ptrace_has_cap/etc if task != current; this can lead to surprising results. For example, a sub-thread can't readlink("/proc/self/exe") if the executable is not readable. setup_new_exec()->would_dump() notices that inode_permission(MAY_READ) fails and then it does set_dumpable(suid_dumpable). After that get_dumpable() fails. (It is not clear why proc_pid_readlink() checks get_dumpable(), perhaps we could add PTRACE_MODE_NODUMPABLE) Change __ptrace_may_access() to use same_thread_group() instead of "task == current". Any security check is pointless when the tasks share the same ->mm. Signed-off-by: Mark Grondona <mgrondona@llnl.gov> Signed-off-by: Ben Woodard <woodard@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Sheng Yong <shengyong1@huawei.com>
2015-08-07  tracing: Have filter check for balanced ops  [Steven Rostedt]
commit 2cf30dc180cea808077f003c5116388183e54f9e upstream. When the following filter is used it causes a warning to trigger: # cd /sys/kernel/debug/tracing # echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter -bash: echo: write error: Invalid argument # cat events/ext4/ext4_truncate_exit/filter ((dev==1)blocks==2) ^ parse_error: No error ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1223 at kernel/trace/trace_events_filter.c:1640 replace_preds+0x3c5/0x990() Modules linked in: bnep lockd grace bluetooth ... CPU: 3 PID: 1223 Comm: bash Tainted: G W 4.1.0-rc3-test+ #450 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012 0000000000000668 ffff8800c106bc98 ffffffff816ed4f9 ffff88011ead0cf0 0000000000000000 ffff8800c106bcd8 ffffffff8107fb07 ffffffff8136b46c ffff8800c7d81d48 ffff8800d4c2bc00 ffff8800d4d4f920 00000000ffffffea Call Trace: [<ffffffff816ed4f9>] dump_stack+0x4c/0x6e [<ffffffff8107fb07>] warn_slowpath_common+0x97/0xe0 [<ffffffff8136b46c>] ? _kstrtoull+0x2c/0x80 [<ffffffff8107fb6a>] warn_slowpath_null+0x1a/0x20 [<ffffffff81159065>] replace_preds+0x3c5/0x990 [<ffffffff811596b2>] create_filter+0x82/0xb0 [<ffffffff81159944>] apply_event_filter+0xd4/0x180 [<ffffffff81152bbf>] event_filter_write+0x8f/0x120 [<ffffffff811db2a8>] __vfs_write+0x28/0xe0 [<ffffffff811dda43>] ? __sb_start_write+0x53/0xf0 [<ffffffff812e51e0>] ? security_file_permission+0x30/0xc0 [<ffffffff811dc408>] vfs_write+0xb8/0x1b0 [<ffffffff811dc72f>] SyS_write+0x4f/0xb0 [<ffffffff816f5217>] system_call_fastpath+0x12/0x6a ---[ end trace e11028bd95818dcd ]--- Worse yet, reading the error message (the filter again) it says that there was no error, when there clearly was. The issue is that the code that checks the input does not check for balanced ops. That is, having an op between a closed parenthesis and the next token. This would only cause a warning, and fail out before doing any real harm, but it should still not caues a warning, and the error reported should work: # cd /sys/kernel/debug/tracing # echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter -bash: echo: write error: Invalid argument # cat events/ext4/ext4_truncate_exit/filter ((dev==1)blocks==2) ^ parse_error: Meaningless filter expression And give no kernel warning. Link: http://lkml.kernel.org/r/20150615175025.7e809215@gandalf.local.home Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> [bwh: Backported to 3.2: drop the check for OP_NOT, which we don't have] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-08-07ring-buffer-benchmark: Fix the wrong sched_priority of producerWang Long
commit 108029323910c5dd1ef8fa2d10da1ce5fbce6e12 upstream. The producer should use producer_fifo as its sched_priority, so correct it. Link: http://lkml.kernel.org/r/1433923957-67842-1-git-send-email-long.wanglong@huawei.com Signed-off-by: Wang Long <long.wanglong@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-08-07ptrace: fix race between ptrace_resume() and wait_task_stopped()Oleg Nesterov
commit b72c186999e689cb0b055ab1c7b3cd8fffbeb5ed upstream. ptrace_resume() is called when the tracee is still __TASK_TRACED. We set tracee->exit_code and then wake_up_state() changes tracee->state. If the tracer's sub-thread does wait() in between, task_stopped_code(ptrace => T) wrongly looks like another report from the tracee. This confuses the debugger, and since wait_task_stopped() clears ->exit_code the tracee can miss a signal. Test-case: #include <stdio.h> #include <unistd.h> #include <signal.h> #include <sys/wait.h> #include <sys/ptrace.h> #include <pthread.h> #include <assert.h> int pid; void *waiter(void *arg) { int stat; for (;;) { assert(pid == wait(&stat)); assert(WIFSTOPPED(stat)); if (WSTOPSIG(stat) == SIGHUP) continue; assert(WSTOPSIG(stat) == SIGCONT); printf("ERR! extra/wrong report:%x\n", stat); } } int main(void) { pthread_t thread; pid = fork(); if (!pid) { assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0); for (;;) kill(getpid(), SIGHUP); } assert(pthread_create(&thread, NULL, waiter, NULL) == 0); for (;;) ptrace(PTRACE_CONT, pid, 0, SIGCONT); return 0; } Note for stable: the bug is very old, but without 9899d11f6544 "ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL" the fix should use lock_task_sighand(child). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Pavel Labath <labath@google.com> Tested-by: Pavel Labath <labath@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-05-09sched: Fix RLIMIT_RTTIME when PI-boosting to RTBrian Silverman
commit 746db9443ea57fd9c059f62c4bfbf41cf224fe13 upstream. When non-realtime tasks get priority-inheritance boosted to a realtime scheduling class, RLIMIT_RTTIME starts to apply to them. However, the counter used for checking this (the same one used for SCHED_RR timeslices) was not getting reset. This meant that tasks running with a non-realtime scheduling class which are repeatedly boosted to a realtime one, but never block while they are running realtime, eventually hit the timeout without ever running for a time over the limit. This patch resets the realtime timeslice counter when un-PI-boosting from an RT to a non-RT scheduling class. I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT mutex which induces priority boosting and spins while boosted that gets killed by a SIGXCPU on non-fixed kernels but doesn't with this patch applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and does happen eventually with PREEMPT_VOLUNTARY kernels. Signed-off-by: Brian Silverman <brian@peloton-tech.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: austin@peloton-tech.com Link: http://lkml.kernel.org/r/1424305436-6716-1-git-send-email-brian@peloton-tech.com Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
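A rough, runnable sketch of that kind of reproducer (assumptions: run as root, a single shared PTHREAD_PRIO_INHERIT mutex, an arbitrary 50 ms RLIMIT_RTTIME and 10 ms boosted spin; the actual test code is not part of the patch):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/resource.h>

static pthread_mutex_t lock;

static void spin_for_ms(long ms)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000L +
		 (now.tv_nsec - start.tv_nsec) / 1000000L < ms);
}

static void *rt_waiter(void *arg)
{
	struct sched_param sp = { .sched_priority = 10 };

	pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
	for (;;) {
		pthread_mutex_lock(&lock);	/* blocking here PI-boosts the holder */
		pthread_mutex_unlock(&lock);
		usleep(1000);			/* let the non-RT thread retake the lock */
	}
	return NULL;
}

int main(void)
{
	struct rlimit rl = { .rlim_cur = 50000, .rlim_max = 50000 };	/* 50 ms, in us */
	pthread_mutexattr_t attr;
	pthread_t t;

	setrlimit(RLIMIT_RTTIME, &rl);

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&lock, &attr);

	pthread_create(&t, NULL, rt_waiter, NULL);

	for (;;) {			/* non-RT holder, repeatedly boosted */
		pthread_mutex_lock(&lock);
		spin_for_ms(10);	/* each boosted stretch stays well under the limit */
		pthread_mutex_unlock(&lock);
	}
	return 0;
}

Build with gcc -pthread (older glibc also needs -lrt). On kernels without the fix the boosted CPU time accumulates across boosts and the process is eventually killed by SIGXCPU; with the fix it keeps running.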
2015-05-09perf: Fix irq_work 'tail' recursionPeter Zijlstra
commit d525211f9d1be8b523ec7633f080f2116f5ea536 upstream. Vince reported a watchdog lockup like: [<ffffffff8115e114>] perf_tp_event+0xc4/0x210 [<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160 [<ffffffff810b7f10>] lock_release+0x130/0x260 [<ffffffff816c7474>] _raw_spin_unlock_irqrestore+0x24/0x40 [<ffffffff8107bb4d>] do_send_sig_info+0x5d/0x80 [<ffffffff811f69df>] send_sigio_to_task+0x12f/0x1a0 [<ffffffff811f71ce>] send_sigio+0xae/0x100 [<ffffffff811f72b7>] kill_fasync+0x97/0xf0 [<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0 [<ffffffff8115d103>] perf_pending_event+0x33/0x60 [<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80 [<ffffffff8114e448>] irq_work_run+0x18/0x40 [<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0 [<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80 Which is caused by an irq_work generating new irq_work and therefore not allowing forward progress. This happens because processing the perf irq_work triggers another perf event (tracepoint stuff) which in turn generates an irq_work ad infinitum. Avoid this by raising the recursion counter in the irq_work -- which effectively disables all software events (including tracepoints) from actually triggering again. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20150219170311.GH21418@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
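The shape of the fix, as described above, is to raise the software-event recursion context around the irq_work callback; a hedged kernel-style sketch (not the exact upstream diff, surrounding code elided):

static void perf_pending_event(struct irq_work *entry)
{
	struct perf_event *event = container_of(entry, struct perf_event, pending);
	int rctx;

	/* Mark software events (including tracepoints) as recursing so the
	 * wakeup/signal path below cannot queue yet another irq_work. */
	rctx = perf_swevent_get_recursion_context();

	/* ... existing pending_disable / pending_wakeup handling ... */

	if (rctx >= 0)
		perf_swevent_put_recursion_context(rctx);
}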
2015-05-09ftrace: Fix ftrace enable ordering of sysctl ftrace_enabledSteven Rostedt (Red Hat)
commit 524a38682573b2e15ab6317ccfe50280441514be upstream. Some archs (specifically PowerPC) are sensitive to the ordering of enabling the calls to function tracing and setting which function is to be traced. That is, update_ftrace_function() sets what function the ftrace_caller trampoline should call. Some archs require this to be set before calling ftrace_run_update_code(). Another bug was discovered: ftrace_startup_sysctl() called ftrace_run_update_code() directly. If the function the ftrace_caller trampoline calls changes, then it will not be updated. Instead, ftrace_startup_enable() should be called, because it tests to see if the callback changed since the code was disabled, and will tell the arch to update appropriately. Most archs do not need this notification, but PowerPC does. The problem could be seen by the following commands: # echo 0 > /proc/sys/kernel/ftrace_enabled # echo function > /sys/kernel/debug/tracing/current_tracer # echo 1 > /proc/sys/kernel/ftrace_enabled # cat /sys/kernel/debug/tracing/trace The trace will show that function tracing was not active. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-05-09ftrace: Fix en(dis)able graph caller when en(dis)abling record via sysctlPratyush Anand
commit 1619dc3f8f555ee1cdd3c75db3885d5715442b12 upstream. When ftrace is enabled globally through the proc interface, we must check if ftrace_graph_active is set. If it is set, then we should also pass the FTRACE_START_FUNC_RET command to ftrace_run_update_code(). Similarly, when ftrace is disabled globally through the proc interface, we must check if ftrace_graph_active is set. If it is set, then we should also pass the FTRACE_STOP_FUNC_RET command to ftrace_run_update_code(). Consider the following situation. # echo 0 > /proc/sys/kernel/ftrace_enabled After this ftrace_enabled = 0. # echo function_graph > /sys/kernel/debug/tracing/current_tracer Since ftrace_enabled = 0, ftrace_enable_ftrace_graph_caller() is never called. # echo 1 > /proc/sys/kernel/ftrace_enabled Now ftrace_enabled will be set to true, but still ftrace_enable_ftrace_graph_caller() will not be called, which is not desired. Further if we execute the following after this: # echo nop > /sys/kernel/debug/tracing/current_tracer Now since ftrace_enabled is set it will call ftrace_disable_ftrace_graph_caller(), which causes a kernel warning on the ARM platform. On the ARM platform, when ftrace_enable_ftrace_graph_caller() is called, it checks whether the old instruction is a nop or not. If it's not a nop, then it returns an error. If it is a nop then it replaces instruction at that address with a branch to ftrace_graph_caller. ftrace_disable_ftrace_graph_caller() behaves just the opposite. Therefore, if generic ftrace code ever calls either ftrace_enable_ftrace_graph_caller() or ftrace_disable_ftrace_graph_caller() consecutively two times in a row, then it will return an error, which will cause the generic ftrace code to raise a warning. Note, x86 does not have an issue with this because the architecture specific code for ftrace_enable_ftrace_graph_caller() and ftrace_disable_ftrace_graph_caller() does not check the previous state, and calling either of these functions twice in a row has no ill effect. Link: http://lkml.kernel.org/r/e4fbe64cdac0dd0e86a3bf914b0f83c0b419f146.1425666454.git.panand@redhat.com Signed-off-by: Pratyush Anand <panand@redhat.com> [ removed extra if (ftrace_start_up) and defined ftrace_graph_active as 0 if CONFIG_FUNCTION_GRAPH_TRACER is not set. ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-05-09console: Fix console name size mismatchPeter Hurley
commit 30a22c215a0007603ffc08021f2e8b64018517dd upstream. commit 6ae9200f2cab7 ("enlarge console.name") increased the storage for the console name to 16 bytes, but not the corresponding struct console_cmdline::name storage. Console names longer than 8 bytes cause a read beyond the end of the string and a failure to match the console; I'm not sure if there are other unexpected consequences. Signed-off-by: Peter Hurley <peter@hurleysoftware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [bwh: Backported to 3.2: - Adjust filename - Use console_cmdline[i] instead of *c] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
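The mismatch boils down to two differently sized name fields; an illustrative reduction (only the relevant members, sizes as described above):

struct console {
	char name[16];		/* enlarged by commit 6ae9200f2cab7 */
	/* ... */
};

struct console_cmdline {
	char name[8];		/* still 8 bytes: comparing against a longer
				 * console->name reads past the end and the
				 * console never matches */
	/* ... */
};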
2015-05-09kdb: fix incorrect counts in KDB summary command outputJay Lan
commit 146755923262037fc4c54abc28c04b1103f3cc51 upstream. The output of the KDB 'summary' command should report MemTotal, MemFree and Buffers in kB. The current code reports them in units of pages. A define of K(x) exists in the code, but is not used. This patch applies the define to convert the values to kB. Please include me on Cc on replies. I do not subscribe to linux-kernel. Signed-off-by: Jay Lan <jlan@sgi.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
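The conversion itself is just pages shifted into kB (the kernel's K(x) macro is essentially x << (PAGE_SHIFT - 10)); a tiny userspace illustration with a made-up page count:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	unsigned long page_kb = sysconf(_SC_PAGESIZE) / 1024;	/* usually 4 */
	unsigned long memtotal_pages = 2037430;			/* example value */

	printf("MemTotal: %lu pages = %lu kB\n",
	       memtotal_pages, memtotal_pages * page_kb);
	return 0;
}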
2015-05-09sched/autogroup: Fix failure to set cpu.rt_runtime_usPeter Zijlstra
commit 1fe89e1b6d270aa0d3452c60d38461ea589594e3 upstream. Because task_group() uses a cache of autogroup_task_group(), whose output depends on sched_class, switching classes can generate problems. In particular, when started as fair, the cache points to the autogroup, so when switching to RT the tg_rt_schedulable() test fails for every cpu.rt_{runtime,period}_us change because now the autogroup has tasks and no runtime. Furthermore, going back to the previous semantics of varying task_group() with sched_class has the down-side that the sched_debug output varies as well, even though the task really is in the autogroup. Therefore add an autogroup exception to tg_has_rt_tasks() -- such that both (all) task_group() usages in sched/core now have one. And remove all the remnants of the variable task_group() output. Reported-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Stefan Bader <stefan.bader@canonical.com> Fixes: 8323f26ce342 ("sched: Fix race in task_group()") Link: http://lkml.kernel.org/r/20150209112237.GR5029@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: adjust filenames, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
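A hedged sketch of the exception (placement and wording may differ from the upstream patch; the task walk that already existed is elided):

static int tg_has_rt_tasks(struct task_group *tg)
{
	/* Autogroups never carry RT tasks, so don't let a cached
	 * autogroup pointer veto a cpu.rt_runtime_us change. */
	if (task_group_is_autogroup(tg))
		return 0;

	/* ... existing walk over the tasks attached to tg ... */
	return 0;
}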
2015-03-06ntp: Fixup adjtimex freq validation on 32-bit systemsJohn Stultz
commit 29183a70b0b828500816bd794b3fe192fce89f73 upstream. Additional validation of adjtimex freq values to avoid potential multiplication overflows was added in commit 5e5aeb4367b (time: adjtimex: Validate the ADJ_FREQUENCY values). Unfortunately the patch used LONG_MAX/MIN instead of LLONG_MAX/MIN, which was fine on 64-bit systems, but since these limits are much smaller on 32-bit systems it caused false positives, resulting in most direct frequency adjustments failing with EINVAL. ntpd only does direct frequency adjustments at startup, so the issue was not as easily observed there, but other time sync applications like ptpd and chrony were more affected by the bug. See bugs: https://bugzilla.kernel.org/show_bug.cgi?id=92481 https://bugzilla.redhat.com/show_bug.cgi?id=1188074 This patch changes the checks to use LLONG_MAX for clarity, and additionally the checks are disabled on 32-bit systems since LLONG_MAX/PPM_SCALE is always larger than the 32-bit long freq value, so multiplication overflows aren't possible there. Reported-by: Josh Boyer <jwboyer@fedoraproject.org> Reported-by: George Joseph <george.joseph@fairview5.com> Tested-by: George Joseph <george.joseph@fairview5.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sasha Levin <sasha.levin@oracle.com> Link: http://lkml.kernel.org/r/1423553436-29747-1-git-send-email-john.stultz@linaro.org [ Prettified the changelog and the comments a bit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
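A direct frequency adjustment like the ones that spuriously failed can be exercised from userspace; a minimal sketch (freq is in units of 2^-16 ppm, and +10 ppm here is an arbitrary, clearly valid value; applying it needs root/CAP_SYS_TIME):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/timex.h>

int main(void)
{
	struct timex tx;

	memset(&tx, 0, sizeof(tx));
	tx.modes = ADJ_FREQUENCY;
	tx.freq  = 10 << 16;		/* +10 ppm in scaled-ppm units */

	if (adjtimex(&tx) < 0)
		printf("adjtimex(ADJ_FREQUENCY) failed: %s\n", strerror(errno));
	else
		printf("applied, kernel now reports freq=%ld\n", tx.freq);
	return 0;
}

On affected 32-bit kernels this call returned EINVAL even for such small values; with the fix it succeeds.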
2015-03-06time: adjtimex: Validate the ADJ_FREQUENCY valuesSasha Levin
commit 5e5aeb4367b450a28f447f6d5ab57d8f2ab16a5f upstream. Verify that the frequency value from userspace is valid and makes sense. Unverified values can cause overflows later on. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> [jstultz: Fix up bug for negative values and drop redundant cap check] Signed-off-by: John Stultz <john.stultz@linaro.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-03-06sched/rt: Reduce rq lock contention by eliminating locking of non-feasible targetTim Chen
commit 80e3d87b2c5582db0ab5e39610ce3707d97ba409 upstream. This patch adds checks that prevent futile attempts to move rt tasks to a CPU with active tasks of equal or higher priority. This reduces run queue lock contention and improves the performance of a well-known OLTP benchmark by 0.7%. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shawn Bohrer <sbohrer@rgmadvisors.com> Cc: Suruchi Kadu <suruchi.a.kadu@intel.com> Cc: Doug Nelson <doug.nelson@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1421430374.2399.27.camel@schen9-desk2.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>