From c2bc11113c50449f23c40b724fe410fc2380a8e9 Mon Sep 17 00:00:00 2001 From: John Stultz Date: Thu, 27 Oct 2011 18:12:42 -0700 Subject: time: Improve documentation of timekeeeping_adjust() After getting a number of questions in private emails about the math around admittedly very complex timekeeping_adjust() and timekeeping_big_adjust(), I figure the code needs some better comments. Hopefully the explanations are clear enough and don't muddy the water any worse. Still needs documentation for ntp_error, but I couldn't recall exactly the full explanation behind the code that's there (although I do recall once working it out when Roman first proposed it). Given a bit more time I can probably work it out, but I don't want to hold back this documentation until then. Signed-off-by: John Stultz Cc: Chen Jie Cc: Steven Rostedt Link: http://lkml.kernel.org/r/1319764362-32367-1-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar --- kernel/time/timekeeping.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 80 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 2b021b0e8507..025e136f3881 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -802,14 +802,44 @@ static void timekeeping_adjust(s64 offset) s64 error, interval = timekeeper.cycle_interval; int adj; + /* + * The point of this is to check if the error is greater then half + * an interval. + * + * First we shift it down from NTP_SHIFT to clocksource->shifted nsecs. + * + * Note we subtract one in the shift, so that error is really error*2. + * This "saves" dividing(shifting) intererval twice, but keeps the + * (error > interval) comparision as still measuring if error is + * larger then half an interval. + * + * Note: It does not "save" on aggrivation when reading the code. + */ error = timekeeper.ntp_error >> (timekeeper.ntp_error_shift - 1); if (error > interval) { + /* + * We now divide error by 4(via shift), which checks if + * the error is greater then twice the interval. + * If it is greater, we need a bigadjust, if its smaller, + * we can adjust by 1. + */ error >>= 2; + /* + * XXX - In update_wall_time, we round up to the next + * nanosecond, and store the amount rounded up into + * the error. This causes the likely below to be unlikely. + * + * The properfix is to avoid rounding up by using + * the high precision timekeeper.xtime_nsec instead of + * xtime.tv_nsec everywhere. Fixing this will take some + * time. + */ if (likely(error <= interval)) adj = 1; else adj = timekeeping_bigadjust(error, &interval, &offset); } else if (error < -interval) { + /* See comment above, this is just switched for the negative */ error >>= 2; if (likely(error >= -interval)) { adj = -1; @@ -817,9 +847,58 @@ static void timekeeping_adjust(s64 offset) offset = -offset; } else adj = timekeeping_bigadjust(error, &interval, &offset); - } else + } else /* No adjustment needed */ return; + /* + * So the following can be confusing. + * + * To keep things simple, lets assume adj == 1 for now. + * + * When adj != 1, remember that the interval and offset values + * have been appropriately scaled so the math is the same. + * + * The basic idea here is that we're increasing the multiplier + * by one, this causes the xtime_interval to be incremented by + * one cycle_interval. This is because: + * xtime_interval = cycle_interval * mult + * So if mult is being incremented by one: + * xtime_interval = cycle_interval * (mult + 1) + * Its the same as: + * xtime_interval = (cycle_interval * mult) + cycle_interval + * Which can be shortened to: + * xtime_interval += cycle_interval + * + * So offset stores the non-accumulated cycles. Thus the current + * time (in shifted nanoseconds) is: + * now = (offset * adj) + xtime_nsec + * Now, even though we're adjusting the clock frequency, we have + * to keep time consistent. In other words, we can't jump back + * in time, and we also want to avoid jumping forward in time. + * + * So given the same offset value, we need the time to be the same + * both before and after the freq adjustment. + * now = (offset * adj_1) + xtime_nsec_1 + * now = (offset * adj_2) + xtime_nsec_2 + * So: + * (offset * adj_1) + xtime_nsec_1 = + * (offset * adj_2) + xtime_nsec_2 + * And we know: + * adj_2 = adj_1 + 1 + * So: + * (offset * adj_1) + xtime_nsec_1 = + * (offset * (adj_1+1)) + xtime_nsec_2 + * (offset * adj_1) + xtime_nsec_1 = + * (offset * adj_1) + offset + xtime_nsec_2 + * Canceling the sides: + * xtime_nsec_1 = offset + xtime_nsec_2 + * Which gives us: + * xtime_nsec_2 = xtime_nsec_1 - offset + * Which simplfies to: + * xtime_nsec -= offset + * + * XXX - TODO: Doc ntp_error calculation. + */ timekeeper.mult += adj; timekeeper.xtime_interval += interval; timekeeper.xtime_nsec -= offset; -- cgit v1.2.3 From d65670a78cdbfae94f20a9e05ec705871d7cdf2b Mon Sep 17 00:00:00 2001 From: John Stultz Date: Mon, 31 Oct 2011 17:06:35 -0400 Subject: clocksource: Avoid selecting mult values that might overflow when adjusted For some frequencies, the clocks_calc_mult_shift() function will unfortunately select mult values very close to 0xffffffff. This has the potential to overflow when NTP adjusts the clock, adding to the mult value. This patch adds a clocksource.maxadj value, which provides an approximation of an 11% adjustment(NTP limits adjustments to 500ppm and the tick adjustment is limited to 10%), which could be made to the clocksource.mult value. This is then used to both check that the current mult value won't overflow/underflow, as well as warning us if the timekeeping_adjust() code pushes over that 11% boundary. v2: Fix max_adjustment calculation, and improve WARN_ONCE messages. v3: Don't warn before maxadj has actually been set CC: Yong Zhang CC: David Daney CC: Thomas Gleixner CC: Chen Jie CC: zhangfx CC: stable@kernel.org Reported-by: Chen Jie Reported-by: zhangfx Tested-by: Yong Zhang Signed-off-by: John Stultz --- include/linux/clocksource.h | 3 ++- kernel/time/clocksource.c | 58 +++++++++++++++++++++++++++++++++++++-------- kernel/time/timekeeping.c | 7 ++++++ 3 files changed, 57 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index 139c4db55f17..c86c940d1de3 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -156,6 +156,7 @@ extern u64 timecounter_cyc2time(struct timecounter *tc, * @mult: cycle to nanosecond multiplier * @shift: cycle to nanosecond divisor (power of two) * @max_idle_ns: max idle time permitted by the clocksource (nsecs) + * @maxadj maximum adjustment value to mult (~11%) * @flags: flags describing special properties * @archdata: arch-specific data * @suspend: suspend function for the clocksource, if necessary @@ -172,7 +173,7 @@ struct clocksource { u32 mult; u32 shift; u64 max_idle_ns; - + u32 maxadj; #ifdef CONFIG_ARCH_CLOCKSOURCE_DATA struct arch_clocksource_data archdata; #endif diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index cf52fda2e096..cfc65e1eb9fb 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -491,6 +491,22 @@ void clocksource_touch_watchdog(void) clocksource_resume_watchdog(); } +/** + * clocksource_max_adjustment- Returns max adjustment amount + * @cs: Pointer to clocksource + * + */ +static u32 clocksource_max_adjustment(struct clocksource *cs) +{ + u64 ret; + /* + * We won't try to correct for more then 11% adjustments (110,000 ppm), + */ + ret = (u64)cs->mult * 11; + do_div(ret,100); + return (u32)ret; +} + /** * clocksource_max_deferment - Returns max time the clocksource can be deferred * @cs: Pointer to clocksource @@ -503,25 +519,28 @@ static u64 clocksource_max_deferment(struct clocksource *cs) /* * Calculate the maximum number of cycles that we can pass to the * cyc2ns function without overflowing a 64-bit signed result. The - * maximum number of cycles is equal to ULLONG_MAX/cs->mult which - * is equivalent to the below. - * max_cycles < (2^63)/cs->mult - * max_cycles < 2^(log2((2^63)/cs->mult)) - * max_cycles < 2^(log2(2^63) - log2(cs->mult)) - * max_cycles < 2^(63 - log2(cs->mult)) - * max_cycles < 1 << (63 - log2(cs->mult)) + * maximum number of cycles is equal to ULLONG_MAX/(cs->mult+cs->maxadj) + * which is equivalent to the below. + * max_cycles < (2^63)/(cs->mult + cs->maxadj) + * max_cycles < 2^(log2((2^63)/(cs->mult + cs->maxadj))) + * max_cycles < 2^(log2(2^63) - log2(cs->mult + cs->maxadj)) + * max_cycles < 2^(63 - log2(cs->mult + cs->maxadj)) + * max_cycles < 1 << (63 - log2(cs->mult + cs->maxadj)) * Please note that we add 1 to the result of the log2 to account for * any rounding errors, ensure the above inequality is satisfied and * no overflow will occur. */ - max_cycles = 1ULL << (63 - (ilog2(cs->mult) + 1)); + max_cycles = 1ULL << (63 - (ilog2(cs->mult + cs->maxadj) + 1)); /* * The actual maximum number of cycles we can defer the clocksource is * determined by the minimum of max_cycles and cs->mask. + * Note: Here we subtract the maxadj to make sure we don't sleep for + * too long if there's a large negative adjustment. */ max_cycles = min_t(u64, max_cycles, (u64) cs->mask); - max_nsecs = clocksource_cyc2ns(max_cycles, cs->mult, cs->shift); + max_nsecs = clocksource_cyc2ns(max_cycles, cs->mult - cs->maxadj, + cs->shift); /* * To ensure that the clocksource does not wrap whilst we are idle, @@ -640,7 +659,6 @@ static void clocksource_enqueue(struct clocksource *cs) void __clocksource_updatefreq_scale(struct clocksource *cs, u32 scale, u32 freq) { u64 sec; - /* * Calc the maximum number of seconds which we can run before * wrapping around. For clocksources which have a mask > 32bit @@ -661,6 +679,20 @@ void __clocksource_updatefreq_scale(struct clocksource *cs, u32 scale, u32 freq) clocks_calc_mult_shift(&cs->mult, &cs->shift, freq, NSEC_PER_SEC / scale, sec * scale); + + /* + * for clocksources that have large mults, to avoid overflow. + * Since mult may be adjusted by ntp, add an safety extra margin + * + */ + cs->maxadj = clocksource_max_adjustment(cs); + while ((cs->mult + cs->maxadj < cs->mult) + || (cs->mult - cs->maxadj > cs->mult)) { + cs->mult >>= 1; + cs->shift--; + cs->maxadj = clocksource_max_adjustment(cs); + } + cs->max_idle_ns = clocksource_max_deferment(cs); } EXPORT_SYMBOL_GPL(__clocksource_updatefreq_scale); @@ -701,6 +733,12 @@ EXPORT_SYMBOL_GPL(__clocksource_register_scale); */ int clocksource_register(struct clocksource *cs) { + /* calculate max adjustment for given mult/shift */ + cs->maxadj = clocksource_max_adjustment(cs); + WARN_ONCE(cs->mult + cs->maxadj < cs->mult, + "Clocksource %s might overflow on 11%% adjustment\n", + cs->name); + /* calculate max idle time permitted for this clocksource */ cs->max_idle_ns = clocksource_max_deferment(cs); diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 2b021b0e8507..e65ff3171102 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -820,6 +820,13 @@ static void timekeeping_adjust(s64 offset) } else return; + WARN_ONCE(timekeeper.clock->maxadj && + (timekeeper.mult + adj > timekeeper.clock->mult + + timekeeper.clock->maxadj), + "Adjusting %s more then 11%% (%ld vs %ld)\n", + timekeeper.clock->name, (long)timekeeper.mult + adj, + (long)timekeeper.clock->mult + + timekeeper.clock->maxadj); timekeeper.mult += adj; timekeeper.xtime_interval += interval; timekeeper.xtime_nsec -= offset; -- cgit v1.2.3 From cf5f0acf3935c91379e709a71ecf68805d366659 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 13 Oct 2011 16:52:28 +0200 Subject: sched: Add a comment to effective_load() since it's a pain Every time I have to stare at this function I need to completely reverse engineer its workings, about time I write a comment explaining the thing. Collected bits and pieces from previous changelogs, mostly: 4be9daaa1b33701f011f4117f22dc1e45a3e6e34 83378269a5fad98f562ebc0f09c349575e6cbfe1 Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1318518057.27731.2.camel@twins Signed-off-by: Ingo Molnar --- kernel/sched_fair.c | 113 +++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 95 insertions(+), 18 deletions(-) (limited to 'kernel') diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index 5c9e67923b7c..aba20f495188 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -772,19 +772,32 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update) list_del_leaf_cfs_rq(cfs_rq); } +static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq) +{ + long tg_weight; + + /* + * Use this CPU's actual weight instead of the last load_contribution + * to gain a more accurate current total weight. See + * update_cfs_rq_load_contribution(). + */ + tg_weight = atomic_read(&tg->load_weight); + tg_weight -= cfs_rq->load_contribution; + tg_weight += cfs_rq->load.weight; + + return tg_weight; +} + static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg) { - long load_weight, load, shares; + long tg_weight, load, shares; + tg_weight = calc_tg_weight(tg, cfs_rq); load = cfs_rq->load.weight; - load_weight = atomic_read(&tg->load_weight); - load_weight += load; - load_weight -= cfs_rq->load_contribution; - shares = (tg->shares * load); - if (load_weight) - shares /= load_weight; + if (tg_weight) + shares /= tg_weight; if (shares < MIN_SHARES) shares = MIN_SHARES; @@ -2036,36 +2049,100 @@ static void task_waking_fair(struct task_struct *p) * Adding load to a group doesn't make a group heavier, but can cause movement * of group shares between cpus. Assuming the shares were perfectly aligned one * can calculate the shift in shares. + * + * Calculate the effective load difference if @wl is added (subtracted) to @tg + * on this @cpu and results in a total addition (subtraction) of @wg to the + * total group weight. + * + * Given a runqueue weight distribution (rw_i) we can compute a shares + * distribution (s_i) using: + * + * s_i = rw_i / \Sum rw_j (1) + * + * Suppose we have 4 CPUs and our @tg is a direct child of the root group and + * has 7 equal weight tasks, distributed as below (rw_i), with the resulting + * shares distribution (s_i): + * + * rw_i = { 2, 4, 1, 0 } + * s_i = { 2/7, 4/7, 1/7, 0 } + * + * As per wake_affine() we're interested in the load of two CPUs (the CPU the + * task used to run on and the CPU the waker is running on), we need to + * compute the effect of waking a task on either CPU and, in case of a sync + * wakeup, compute the effect of the current task going to sleep. + * + * So for a change of @wl to the local @cpu with an overall group weight change + * of @wl we can compute the new shares distribution (s'_i) using: + * + * s'_i = (rw_i + @wl) / (@wg + \Sum rw_j) (2) + * + * Suppose we're interested in CPUs 0 and 1, and want to compute the load + * differences in waking a task to CPU 0. The additional task changes the + * weight and shares distributions like: + * + * rw'_i = { 3, 4, 1, 0 } + * s'_i = { 3/8, 4/8, 1/8, 0 } + * + * We can then compute the difference in effective weight by using: + * + * dw_i = S * (s'_i - s_i) (3) + * + * Where 'S' is the group weight as seen by its parent. + * + * Therefore the effective change in loads on CPU 0 would be 5/56 (3/8 - 2/7) + * times the weight of the group. The effect on CPU 1 would be -4/56 (4/8 - + * 4/7) times the weight of the group. */ static long effective_load(struct task_group *tg, int cpu, long wl, long wg) { struct sched_entity *se = tg->se[cpu]; - if (!tg->parent) + if (!tg->parent) /* the trivial, non-cgroup case */ return wl; for_each_sched_entity(se) { - long lw, w; + long w, W; tg = se->my_q->tg; - w = se->my_q->load.weight; - /* use this cpu's instantaneous contribution */ - lw = atomic_read(&tg->load_weight); - lw -= se->my_q->load_contribution; - lw += w + wg; + /* + * W = @wg + \Sum rw_j + */ + W = wg + calc_tg_weight(tg, se->my_q); - wl += w; + /* + * w = rw_i + @wl + */ + w = se->my_q->load.weight + wl; - if (lw > 0 && wl < lw) - wl = (wl * tg->shares) / lw; + /* + * wl = S * s'_i; see (2) + */ + if (W > 0 && w < W) + wl = (w * tg->shares) / W; else wl = tg->shares; - /* zero point is MIN_SHARES */ + /* + * Per the above, wl is the new se->load.weight value; since + * those are clipped to [MIN_SHARES, ...) do so now. See + * calc_cfs_shares(). + */ if (wl < MIN_SHARES) wl = MIN_SHARES; + + /* + * wl = dw_i = S * (s'_i - s_i); see (3) + */ wl -= se->load.weight; + + /* + * Recursively apply this logic to all parent groups to compute + * the final effective load change on the root group. Since + * only the @tg group gets extra weight, all parent groups can + * only redistribute existing shares. @wl is the shift in shares + * resulting from this level per the above. + */ wg = 0; } -- cgit v1.2.3 From 461819ac8ee950ce027c72a066156a3df9e60c7e Mon Sep 17 00:00:00 2001 From: Hui Kang Date: Tue, 11 Oct 2011 23:00:59 -0400 Subject: sched_fair: Fix a typo in the comment describing update_sd_lb_stats Signed-off-by: Hui Kang Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1318388459-4427-1-git-send-email-hkang.sunysb@gmail.com Signed-off-by: Ingo Molnar --- kernel/sched_fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index aba20f495188..7e51b5bb27cc 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -3588,7 +3588,7 @@ static bool update_sd_pick_busiest(struct sched_domain *sd, } /** - * update_sd_lb_stats - Update sched_group's statistics for load balancing. + * update_sd_lb_stats - Update sched_domain's statistics for load balancing. * @sd: sched_domain whose statistics are to be updated. * @this_cpu: Cpu for which load balance is currently performed. * @idle: Idle status of this_cpu -- cgit v1.2.3 From c6dc7f055d333ef35f397b8d7c3abcd1918bf8cb Mon Sep 17 00:00:00 2001 From: "J. Bruce Fields" Date: Thu, 6 Oct 2011 15:22:46 -0400 Subject: sched: Document wait_for_completion_*() return values The return-value convention for these functions varies depending on whether they're interruptible or can timeout. It can be a little confusing--document it. Signed-off-by: J. Bruce Fields Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/20111006192246.GB28026@fieldses.org Signed-off-by: Ingo Molnar --- kernel/sched.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 0e9344a71be3..3d2c436959a1 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4810,6 +4810,9 @@ EXPORT_SYMBOL(wait_for_completion); * This waits for either a completion of a specific task to be signaled or for a * specified timeout to expire. The timeout is in jiffies. It is not * interruptible. + * + * The return value is 0 if timed out, and positive (at least 1, or number of + * jiffies left till timeout) if completed. */ unsigned long __sched wait_for_completion_timeout(struct completion *x, unsigned long timeout) @@ -4824,6 +4827,8 @@ EXPORT_SYMBOL(wait_for_completion_timeout); * * This waits for completion of a specific task to be signaled. It is * interruptible. + * + * The return value is -ERESTARTSYS if interrupted, 0 if completed. */ int __sched wait_for_completion_interruptible(struct completion *x) { @@ -4841,6 +4846,9 @@ EXPORT_SYMBOL(wait_for_completion_interruptible); * * This waits for either a completion of a specific task to be signaled or for a * specified timeout to expire. It is interruptible. The timeout is in jiffies. + * + * The return value is -ERESTARTSYS if interrupted, 0 if timed out, + * positive (at least 1, or number of jiffies left till timeout) if completed. */ long __sched wait_for_completion_interruptible_timeout(struct completion *x, @@ -4856,6 +4864,8 @@ EXPORT_SYMBOL(wait_for_completion_interruptible_timeout); * * This waits to be signaled for completion of a specific task. It can be * interrupted by a kill signal. + * + * The return value is -ERESTARTSYS if interrupted, 0 if completed. */ int __sched wait_for_completion_killable(struct completion *x) { @@ -4874,6 +4884,9 @@ EXPORT_SYMBOL(wait_for_completion_killable); * This waits for either a completion of a specific task to be * signaled or for a specified timeout to expire. It can be * interrupted by a kill signal. The timeout is in jiffies. + * + * The return value is -ERESTARTSYS if interrupted, 0 if timed out, + * positive (at least 1, or number of jiffies left till timeout) if completed. */ long __sched wait_for_completion_killable_timeout(struct completion *x, -- cgit v1.2.3 From 4a6184ce7a48c478dee0d8a9ed74c1fa35161858 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 6 Oct 2011 22:39:14 +0200 Subject: sched, rt: Provide means of disabling cross-cpu bandwidth sharing Normally the RT bandwidth scheme will share bandwidth across the entire root_domain. However sometimes its convenient to disable this sharing for debug purposes. Provide a simple feature switch to this end. Signed-off-by: Peter Zijlstra Signed-off-by: Ingo Molnar --- kernel/sched_features.h | 1 + kernel/sched_rt.c | 3 +++ 2 files changed, 4 insertions(+) (limited to 'kernel') diff --git a/kernel/sched_features.h b/kernel/sched_features.h index efa0a7b75dde..84802245abd2 100644 --- a/kernel/sched_features.h +++ b/kernel/sched_features.h @@ -67,3 +67,4 @@ SCHED_FEAT(NONTASK_POWER, 1) SCHED_FEAT(TTWU_QUEUE, 1) SCHED_FEAT(FORCE_SD_OVERLAP, 0) +SCHED_FEAT(RT_RUNTIME_SHARE, 1) diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c index 056cbd2e2a27..583a1368afe6 100644 --- a/kernel/sched_rt.c +++ b/kernel/sched_rt.c @@ -560,6 +560,9 @@ static int balance_runtime(struct rt_rq *rt_rq) { int more = 0; + if (!sched_feat(RT_RUNTIME_SHARE)) + return more; + if (rt_rq->rt_time > rt_rq->rt_runtime) { raw_spin_unlock(&rt_rq->rt_runtime_lock); more = do_balance_runtime(rt_rq); -- cgit v1.2.3 From f1c6f1a7eed963ed233ba4c8b6fa8addb86c6ddc Mon Sep 17 00:00:00 2001 From: Carsten Emde Date: Wed, 26 Oct 2011 23:14:16 +0200 Subject: sched: Set the command name of the idle tasks in SMP kernels In UP systems, the idle task is initialized using the init_task structure from which the command name is taken (currently "swapper"). In SMP systems, one idle task per CPU is forked by the worker thread from which the task structure is copied. The command name is, therefore, "kworker/0:0" or "kworker/0:1", if not updated. Since such update was lacking, all idle tasks in SMP systems were incorrectly named. This longtime bug was not discovered immediately, because there is no /proc/0 entry - the bug only becomes apparent when tracing is enabled. This patch sets the command name of the idle tasks in SMP systems to the name that is used in the INIT_TASK structure suffixed by a slash and the number of the CPU. Signed-off-by: Carsten Emde Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/20111026211708.768925506@osadl.org Signed-off-by: Ingo Molnar --- include/linux/init_task.h | 4 +++- kernel/sched.c | 4 ++++ 2 files changed, 7 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 08ffab01e76c..b6e5b8b000e0 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -126,6 +126,8 @@ extern struct cred init_cred; # define INIT_PERF_EVENTS(tsk) #endif +#define INIT_TASK_COMM "swapper" + /* * INIT_TASK is used to set up the first task table, touch at * your own risk!. Base=0, limit=0x1fffff (=2MB) @@ -162,7 +164,7 @@ extern struct cred init_cred; .group_leader = &tsk, \ RCU_INIT_POINTER(.real_cred, &init_cred), \ RCU_INIT_POINTER(.cred, &init_cred), \ - .comm = "swapper", \ + .comm = INIT_TASK_COMM, \ .thread = INIT_THREAD, \ .fs = &init_fs, \ .files = &init_files, \ diff --git a/kernel/sched.c b/kernel/sched.c index 3d2c436959a1..d6b149ccf925 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -71,6 +71,7 @@ #include #include #include +#include #include #include @@ -6112,6 +6113,9 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu) */ idle->sched_class = &idle_sched_class; ftrace_graph_init_idle_task(idle, cpu); +#if defined(CONFIG_SMP) + sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu); +#endif } /* -- cgit v1.2.3 From 1d5f003f5a964711853514b04ddc872eec0fdc7b Mon Sep 17 00:00:00 2001 From: Gleb Natapov Date: Sun, 23 Oct 2011 19:10:33 +0200 Subject: perf: Do not set task_ctx pointer in cpuctx if there are no events in the context Do not set task_ctx pointer during sched_in if there are no events associated with the context. Otherwise if during task execution total number of events in the system will become zero perf_event_context_sched_out() will not be called and cpuctx->task_ctx will be left with a stale value. Signed-off-by: Gleb Natapov Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/20111023171033.GI17571@redhat.com Signed-off-by: Ingo Molnar --- kernel/events/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 0e8457da6f95..b0c1186fd97b 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -2173,7 +2173,8 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx, perf_event_sched_in(cpuctx, ctx, task); - cpuctx->task_ctx = ctx; + if (ctx->nr_events) + cpuctx->task_ctx = ctx; perf_pmu_enable(ctx->pmu); perf_ctx_unlock(cpuctx, ctx); -- cgit v1.2.3 From 4dcfe1025b513c2c1da5bf5586adb0e80148f612 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 10 Nov 2011 13:01:10 +0100 Subject: sched: Avoid SMT siblings in select_idle_sibling() if possible Avoid select_idle_sibling() from picking a sibling thread if there's an idle core that shares cache. This fixes SMT balancing in the increasingly common case where there's a shared cache core available to balance to. Tested-by: Mike Galbraith Signed-off-by: Peter Zijlstra Cc: Suresh Siddha Link: http://lkml.kernel.org/r/1321350377.1421.55.camel@twins Signed-off-by: Ingo Molnar --- kernel/sched_fair.c | 42 ++++++++++++++++++++++++++++-------------- 1 file changed, 28 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index 7e51b5bb27cc..ba0e1f49a22f 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -2326,7 +2326,8 @@ static int select_idle_sibling(struct task_struct *p, int target) int cpu = smp_processor_id(); int prev_cpu = task_cpu(p); struct sched_domain *sd; - int i; + struct sched_group *sg; + int i, smt = 0; /* * If the task is going to be woken-up on this cpu and if it is @@ -2346,25 +2347,38 @@ static int select_idle_sibling(struct task_struct *p, int target) * Otherwise, iterate the domains and find an elegible idle cpu. */ rcu_read_lock(); +again: for_each_domain(target, sd) { - if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) - break; + if (!smt && (sd->flags & SD_SHARE_CPUPOWER)) + continue; - for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) { - if (idle_cpu(i)) { - target = i; - break; + if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) { + if (!smt) { + smt = 1; + goto again; } + break; } - /* - * Lets stop looking for an idle sibling when we reached - * the domain that spans the current cpu and prev_cpu. - */ - if (cpumask_test_cpu(cpu, sched_domain_span(sd)) && - cpumask_test_cpu(prev_cpu, sched_domain_span(sd))) - break; + sg = sd->groups; + do { + if (!cpumask_intersects(sched_group_cpus(sg), + tsk_cpus_allowed(p))) + goto next; + + for_each_cpu(i, sched_group_cpus(sg)) { + if (!idle_cpu(i)) + goto next; + } + + target = cpumask_first_and(sched_group_cpus(sg), + tsk_cpus_allowed(p)); + goto done; +next: + sg = sg->next; + } while (sg != sd->groups); } +done: rcu_read_unlock(); return target; -- cgit v1.2.3 From fccfdc6f0d8c83c854eeb6d93aa158f0e551bd49 Mon Sep 17 00:00:00 2001 From: Paul Turner Date: Mon, 7 Nov 2011 20:26:34 -0800 Subject: sched: Fix buglet in return_cfs_rq_runtime() In return_cfs_rq_runtime() we want to return bandwidth when there are no remaining tasks, not "return" when this is the case. Signed-off-by: Paul Turner Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/20111108042736.623812423@google.com Signed-off-by: Ingo Molnar --- kernel/sched_fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index ba0e1f49a22f..a78ed2736ba7 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -1756,7 +1756,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq) static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) { - if (!cfs_rq->runtime_enabled || !cfs_rq->nr_running) + if (!cfs_rq->runtime_enabled || cfs_rq->nr_running) return; __return_cfs_rq_runtime(cfs_rq); -- cgit v1.2.3 From 2ed0e645f358c26f4f4a7aed56a9488db0020ad1 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 16 Nov 2011 12:27:39 +0000 Subject: genirq: Don't allow per cpu interrupts to be suspended The power management functions related to interrupts do not know (yet) about per-cpu interrupts and end up calling the wrong low-level methods to enable/disable interrupts. This leads to all kind of interesting issues (action taken on one CPU only, updating a refcount which is not used otherwise...). The workaround for the time being is simply to flag these interrupts with IRQF_NO_SUSPEND. At least on ARM, these interrupts are actually dealt with at the architecture level. Reported-by: Santosh Shilimkar Tested-by: Santosh Shilimkar Signed-off-by: Marc Zyngier Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/1321446459-31409-1-git-send-email-marc.zyngier@arm.com Signed-off-by: Thomas Gleixner --- kernel/irq/manage.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 67ce837ae52c..0e2b179bc7b3 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -1596,7 +1596,7 @@ int request_percpu_irq(unsigned int irq, irq_handler_t handler, return -ENOMEM; action->handler = handler; - action->flags = IRQF_PERCPU; + action->flags = IRQF_PERCPU | IRQF_NO_SUSPEND; action->name = devname; action->percpu_dev_id = dev_id; -- cgit v1.2.3 From d004e024058a0eaca097513ce62cbcf978913e0a Mon Sep 17 00:00:00 2001 From: Hector Palacios Date: Mon, 14 Nov 2011 11:15:25 +0100 Subject: timekeeping: add arch_offset hook to ktime_get functions ktime_get and ktime_get_ts were calling timekeeping_get_ns() but later they were not calling arch_gettimeoffset() so architectures using this mechanism returned 0 ns when calling these functions. This happened for example when running Busybox's ping which calls syscall(__NR_clock_gettime, CLOCK_MONOTONIC, ts) which eventually calls ktime_get. As a result the returned ping travel time was zero. CC: stable@kernel.org Signed-off-by: Hector Palacios Signed-off-by: John Stultz --- kernel/time/timekeeping.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'kernel') diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index e9f60d311436..237841378c03 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -249,6 +249,8 @@ ktime_t ktime_get(void) secs = xtime.tv_sec + wall_to_monotonic.tv_sec; nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec; nsecs += timekeeping_get_ns(); + /* If arch requires, add in gettimeoffset() */ + nsecs += arch_gettimeoffset(); } while (read_seqretry(&xtime_lock, seq)); /* @@ -280,6 +282,8 @@ void ktime_get_ts(struct timespec *ts) *ts = xtime; tomono = wall_to_monotonic; nsecs = timekeeping_get_ns(); + /* If arch requires, add in gettimeoffset() */ + nsecs += arch_gettimeoffset(); } while (read_seqretry(&xtime_lock, seq)); -- cgit v1.2.3 From 27c9cd7e601632b3794e1c3344d37b86917ffb43 Mon Sep 17 00:00:00 2001 From: Jeff Ohlstein Date: Fri, 18 Nov 2011 15:47:10 -0800 Subject: hrtimer: Fix extra wakeups from __remove_hrtimer() __remove_hrtimer() attempts to reprogram the clockevent device when the timer being removed is the next to expire. However, __remove_hrtimer() reprograms the clockevent *before* removing the timer from the timerqueue and thus when hrtimer_force_reprogram() finds the next timer to expire it finds the timer we're trying to remove. This is especially noticeable when the system switches to NOHz mode and the system tick is removed. The timer tick is removed from the system but the clockevent is programmed to wakeup in another HZ anyway. Silence the extra wakeup by removing the timer from the timerqueue before calling hrtimer_force_reprogram() so that we actually program the clockevent for the next timer to expire. This was broken by 998adc3 "hrtimers: Convert hrtimers to use timerlist infrastructure". Signed-off-by: Jeff Ohlstein Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1321660030-8520-1-git-send-email-johlstei@codeaurora.org Signed-off-by: Thomas Gleixner --- kernel/hrtimer.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index a9205e32a059..2043c08d36c8 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -885,10 +885,13 @@ static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, unsigned long newstate, int reprogram) { + struct timerqueue_node *next_timer; if (!(timer->state & HRTIMER_STATE_ENQUEUED)) goto out; - if (&timer->node == timerqueue_getnext(&base->active)) { + next_timer = timerqueue_getnext(&base->active); + timerqueue_del(&base->active, &timer->node); + if (&timer->node == next_timer) { #ifdef CONFIG_HIGH_RES_TIMERS /* Reprogram the clock event device. if enabled */ if (reprogram && hrtimer_hres_active()) { @@ -901,7 +904,6 @@ static void __remove_hrtimer(struct hrtimer *timer, } #endif } - timerqueue_del(&base->active, &timer->node); if (!timerqueue_getnext(&base->active)) base->cpu_base->active_bases &= ~(1 << base->index); out: -- cgit v1.2.3 From 884a45d964dd395eda945842afff5e16bcaedf56 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 22 Nov 2011 07:44:47 -0800 Subject: cgroup_freezer: fix freezing groups with stopped tasks 2d3cbf8b (cgroup_freezer: update_freezer_state() does incorrect state transitions) removed is_task_frozen_enough and replaced it with a simple frozen call. This, however, breaks freezing for a group with stopped tasks because those cannot be frozen and so the group remains in CGROUP_FREEZING state (update_if_frozen doesn't count stopped tasks) and never reaches CGROUP_FROZEN. Let's add is_task_frozen_enough back and use it at the original locations (update_if_frozen and try_to_freeze_cgroup). Semantically we consider stopped tasks as frozen enough so we should consider both cases when testing frozen tasks. Testcase: mkdir /dev/freezer mount -t cgroup -o freezer none /dev/freezer mkdir /dev/freezer/foo sleep 1h & pid=$! kill -STOP $pid echo $pid > /dev/freezer/foo/tasks echo FROZEN > /dev/freezer/foo/freezer.state while true do cat /dev/freezer/foo/freezer.state [ "`cat /dev/freezer/foo/freezer.state`" = "FROZEN" ] && break sleep 1 done echo OK Signed-off-by: Michal Hocko Acked-by: Li Zefan Cc: Tomasz Buchert Cc: Paul Menage Cc: Andrew Morton Cc: stable@kernel.org Signed-off-by: Tejun Heo --- kernel/cgroup_freezer.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c index 5e828a2ca8e6..213c0351dad8 100644 --- a/kernel/cgroup_freezer.c +++ b/kernel/cgroup_freezer.c @@ -153,6 +153,13 @@ static void freezer_destroy(struct cgroup_subsys *ss, kfree(cgroup_freezer(cgroup)); } +/* task is frozen or will freeze immediately when next it gets woken */ +static bool is_task_frozen_enough(struct task_struct *task) +{ + return frozen(task) || + (task_is_stopped_or_traced(task) && freezing(task)); +} + /* * The call to cgroup_lock() in the freezer.state write method prevents * a write to that file racing against an attach, and hence the @@ -231,7 +238,7 @@ static void update_if_frozen(struct cgroup *cgroup, cgroup_iter_start(cgroup, &it); while ((task = cgroup_iter_next(cgroup, &it))) { ntotal++; - if (frozen(task)) + if (is_task_frozen_enough(task)) nfrozen++; } @@ -284,7 +291,7 @@ static int try_to_freeze_cgroup(struct cgroup *cgroup, struct freezer *freezer) while ((task = cgroup_iter_next(cgroup, &it))) { if (!freeze_task(task, true)) continue; - if (frozen(task)) + if (is_task_frozen_enough(task)) continue; if (!freezing(task) && !freezer_should_skip(task)) num_cant_freeze_now++; -- cgit v1.2.3 From 52553ddffad76ccf192d4dd9ce88d5818f57f62a Mon Sep 17 00:00:00 2001 From: Edward Donovan Date: Sun, 27 Nov 2011 23:07:34 -0500 Subject: genirq: fix regression in irqfixup, irqpoll MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Commit fa27271bc8d2("genirq: Fixup poll handling") introduced a regression that broke irqfixup/irqpoll for some hardware configurations. Amidst reorganizing 'try_one_irq', that patch removed a test that checked for 'action->handler' returning IRQ_HANDLED, before acting on the interrupt. Restoring this test back returns the functionality lost since 2.6.39. In the current set of tests, after 'action' is set, it must precede '!action->next' to take effect. With this and my previous patch to irq/spurious.c, c75d720fca8a, all IRQ regressions that I have encountered are fixed. Signed-off-by: Edward Donovan Reported-and-tested-by: Rogério Brito Cc: Thomas Gleixner Cc: stable@kernel.org (2.6.39+) Signed-off-by: Linus Torvalds --- kernel/irq/spurious.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c index b5f4742693c0..dc813a948be2 100644 --- a/kernel/irq/spurious.c +++ b/kernel/irq/spurious.c @@ -84,7 +84,9 @@ static int try_one_irq(int irq, struct irq_desc *desc, bool force) */ action = desc->action; if (!action || !(action->flags & IRQF_SHARED) || - (action->flags & __IRQF_TIMER) || !action->next) + (action->flags & __IRQF_TIMER) || + (action->handler(irq, action->dev_id) == IRQ_HANDLED) || + !action->next) goto out; /* Already running on another processor */ -- cgit v1.2.3 From b1f919664d04a8d0ba29cb76673c7ca3325a2006 Mon Sep 17 00:00:00 2001 From: "Yang Honggang (Joseph)" Date: Thu, 1 Dec 2011 22:22:41 -0500 Subject: clocksource: Fix bug with max_deferment margin calculation In order to leave a margin of 12.5% we should >> 3 not >> 5. CC: stable@kernel.org Signed-off-by: Yang Honggang (Joseph) [jstultz: Modified commit subject] Signed-off-by: John Stultz --- kernel/time/clocksource.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index cfc65e1eb9fb..da2f760e780c 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -548,7 +548,7 @@ static u64 clocksource_max_deferment(struct clocksource *cs) * note a margin of 12.5% is used because this can be computed with * a shift, versus say 10% which would require division. */ - return max_nsecs - (max_nsecs >> 5); + return max_nsecs - (max_nsecs >> 3); } #ifndef CONFIG_ARCH_USES_GETTIMEOFFSET @@ -669,7 +669,7 @@ void __clocksource_updatefreq_scale(struct clocksource *cs, u32 scale, u32 freq) * ~ 0.06ppm granularity for NTP. We apply the same 12.5% * margin as we do in clocksource_max_deferment() */ - sec = (cs->mask - (cs->mask >> 5)); + sec = (cs->mask - (cs->mask >> 3)); do_div(sec, freq); do_div(sec, scale); if (!sec) -- cgit v1.2.3 From d3d9acf646679c1981032b0985b386d12fccc60c Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Wed, 23 Nov 2011 08:49:49 -0800 Subject: trace_events_filter: Use rcu_assign_pointer() when setting ftrace_event_call->filter ftrace_event_call->filter is sched RCU protected but didn't use rcu_assign_pointer(). Use it. TODO: Add proper __rcu annotation to call->filter and all its users. -v2: Use RCU_INIT_POINTER() for %NULL clearing as suggested by Eric. Link: http://lkml.kernel.org/r/20111123164949.GA29639@google.com Cc: Eric Dumazet Cc: Frederic Weisbecker Cc: Jiri Olsa Cc: stable@kernel.org # (2.6.39+) Signed-off-by: Tejun Heo Signed-off-by: Steven Rostedt --- kernel/trace/trace_events_filter.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index 816d3d074979..d6e7926dcd26 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -1686,7 +1686,7 @@ static int replace_system_preds(struct event_subsystem *system, * replace the filter for the call. */ filter = call->filter; - call->filter = filter_item->filter; + rcu_assign_pointer(call->filter, filter_item->filter); filter_item->filter = filter; fail = false; @@ -1741,7 +1741,7 @@ int apply_event_filter(struct ftrace_event_call *call, char *filter_string) filter = call->filter; if (!filter) goto out_unlock; - call->filter = NULL; + RCU_INIT_POINTER(call->filter, NULL); /* Make sure the filter is not being used */ synchronize_sched(); __free_filter(filter); @@ -1782,7 +1782,7 @@ out: * string */ tmp = call->filter; - call->filter = filter; + rcu_assign_pointer(call->filter, filter); if (tmp) { /* Make sure the call is done with the filter */ synchronize_sched(); -- cgit v1.2.3 From 550acb19269d65f32e9ac4ddb26c2b2070e37f1c Mon Sep 17 00:00:00 2001 From: Ido Yariv Date: Thu, 1 Dec 2011 13:55:08 +0200 Subject: genirq: Fix race condition when stopping the irq thread In irq_wait_for_interrupt(), the should_stop member is verified before setting the task's state to TASK_INTERRUPTIBLE and calling schedule(). In case kthread_stop sets should_stop and wakes up the process after should_stop is checked by the irq thread but before the task's state is changed, the irq thread might never exit: kthread_stop irq_wait_for_interrupt ------------ ---------------------- ... ... while (!kthread_should_stop()) { kthread->should_stop = 1; wake_up_process(k); wait_for_completion(&kthread->exited); ... set_current_state(TASK_INTERRUPTIBLE); ... schedule(); } Fix this by checking if the thread should stop after modifying the task's state. [ tglx: Simplified it a bit ] Signed-off-by: Ido Yariv Link: http://lkml.kernel.org/r/1322740508-22640-1-git-send-email-ido@wizery.com Signed-off-by: Thomas Gleixner Cc: stable@kernel.org --- kernel/irq/manage.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 0e2b179bc7b3..1da999f5e746 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -623,8 +623,9 @@ static irqreturn_t irq_nested_primary_handler(int irq, void *dev_id) static int irq_wait_for_interrupt(struct irqaction *action) { + set_current_state(TASK_INTERRUPTIBLE); + while (!kthread_should_stop()) { - set_current_state(TASK_INTERRUPTIBLE); if (test_and_clear_bit(IRQTF_RUNTHREAD, &action->thread_flags)) { @@ -632,7 +633,9 @@ static int irq_wait_for_interrupt(struct irqaction *action) return 0; } schedule(); + set_current_state(TASK_INTERRUPTIBLE); } + __set_current_state(TASK_RUNNING); return -1; } -- cgit v1.2.3 From c1be84309c58b1e7c6d626e28fba41a22b364c3d Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 2 Dec 2011 12:34:16 +0100 Subject: tick-broadcast: Stop active broadcast device when replacing it When a better rated broadcast device is installed, then the current active device is not disabled, which results in two running broadcast devices. Signed-off-by: Thomas Gleixner Cc: stable@vger.kernel.org --- kernel/time/tick-broadcast.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index f954282d9a82..fd4a7b1625a2 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -71,7 +71,7 @@ int tick_check_broadcast_device(struct clock_event_device *dev) (dev->features & CLOCK_EVT_FEAT_C3STOP)) return 0; - clockevents_exchange_device(NULL, dev); + clockevents_exchange_device(tick_broadcast_device.evtdev, dev); tick_broadcast_device.evtdev = dev; if (!cpumask_empty(tick_get_broadcast_mask())) tick_broadcast_start_periodic(dev); -- cgit v1.2.3 From de28f25e8244c7353abed8de0c7792f5f883588c Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 2 Dec 2011 16:02:45 +0100 Subject: clockevents: Set noop handler in clockevents_exchange_device() If a device is shutdown, then there might be a pending interrupt, which will be processed after we reenable interrupts, which causes the original handler to be run. If the old handler is the (broadcast) periodic handler the shutdown state might hang the kernel completely. Signed-off-by: Thomas Gleixner Cc: stable@vger.kernel.org --- kernel/time/clockevents.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 1ecd6ba36d6c..c4eb71c8b2ea 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -387,6 +387,7 @@ void clockevents_exchange_device(struct clock_event_device *old, * released list and do a notify add later. */ if (old) { + old->event_handler = clockevents_handle_noop; clockevents_set_mode(old, CLOCK_EVT_MODE_UNUSED); list_del(&old->list); list_add(&old->list, &clockevents_released); -- cgit v1.2.3 From 10c6db110d0eb4466b59812c49088ab56218fc2e Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Sat, 26 Nov 2011 02:47:31 +0100 Subject: perf: Fix loss of notification with multi-event When you do: $ perf record -e cycles,cycles,cycles noploop 10 You expect about 10,000 samples for each event, i.e., 10s at 1000samples/sec. However, this is not what's happening. You get much fewer samples, maybe 3700 samples/event: $ perf report -D | tail -15 Aggregated stats: TOTAL events: 10998 MMAP events: 66 COMM events: 2 SAMPLE events: 10930 cycles stats: TOTAL events: 3644 SAMPLE events: 3644 cycles stats: TOTAL events: 3642 SAMPLE events: 3642 cycles stats: TOTAL events: 3644 SAMPLE events: 3644 On a Intel Nehalem or even AMD64, there are 4 counters capable of measuring cycles, so there is plenty of space to measure those events without multiplexing (even with the NMI watchdog active). And even with multiplexing, we'd expect roughly the same number of samples per event. The root of the problem was that when the event that caused the buffer to become full was not the first event passed on the cmdline, the user notification would get lost. The notification was sent to the file descriptor of the overflowed event but the perf tool was not polling on it. The perf tool aggregates all samples into a single buffer, i.e., the buffer of the first event. Consequently, it assumes notifications for any event will come via that descriptor. The seemingly straight forward solution of moving the waitq into the ringbuffer object doesn't work because of life-time issues. One could perf_event_set_output() on a fd that you're also blocking on and cause the old rb object to be freed while its waitq would still be referenced by the blocked thread -> FAIL. Therefore link all events to the ringbuffer and broadcast the wakeup from the ringbuffer object to all possible events that could be waited upon. This is rather ugly, and we're open to better solutions but it works for now. Reported-by: Stephane Eranian Finished-by: Stephane Eranian Reviewed-by: Stephane Eranian Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/20111126014731.GA7030@quad Signed-off-by: Ingo Molnar --- include/linux/perf_event.h | 1 + kernel/events/core.c | 86 +++++++++++++++++++++++++++++++++++++++++++-- kernel/events/internal.h | 3 ++ kernel/events/ring_buffer.c | 3 ++ 4 files changed, 91 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 1e9ebe5e0091..b1f89122bf6a 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -822,6 +822,7 @@ struct perf_event { int mmap_locked; struct user_struct *mmap_user; struct ring_buffer *rb; + struct list_head rb_entry; /* poll related */ wait_queue_head_t waitq; diff --git a/kernel/events/core.c b/kernel/events/core.c index b0c1186fd97b..600c1629b64d 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -185,6 +185,9 @@ static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx, static void update_context_time(struct perf_event_context *ctx); static u64 perf_event_time(struct perf_event *event); +static void ring_buffer_attach(struct perf_event *event, + struct ring_buffer *rb); + void __weak perf_event_print_debug(void) { } extern __weak const char *perf_pmu_name(void) @@ -3191,12 +3194,33 @@ static unsigned int perf_poll(struct file *file, poll_table *wait) struct ring_buffer *rb; unsigned int events = POLL_HUP; + /* + * Race between perf_event_set_output() and perf_poll(): perf_poll() + * grabs the rb reference but perf_event_set_output() overrides it. + * Here is the timeline for two threads T1, T2: + * t0: T1, rb = rcu_dereference(event->rb) + * t1: T2, old_rb = event->rb + * t2: T2, event->rb = new rb + * t3: T2, ring_buffer_detach(old_rb) + * t4: T1, ring_buffer_attach(rb1) + * t5: T1, poll_wait(event->waitq) + * + * To avoid this problem, we grab mmap_mutex in perf_poll() + * thereby ensuring that the assignment of the new ring buffer + * and the detachment of the old buffer appear atomic to perf_poll() + */ + mutex_lock(&event->mmap_mutex); + rcu_read_lock(); rb = rcu_dereference(event->rb); - if (rb) + if (rb) { + ring_buffer_attach(event, rb); events = atomic_xchg(&rb->poll, 0); + } rcu_read_unlock(); + mutex_unlock(&event->mmap_mutex); + poll_wait(file, &event->waitq, wait); return events; @@ -3497,6 +3521,49 @@ unlock: return ret; } +static void ring_buffer_attach(struct perf_event *event, + struct ring_buffer *rb) +{ + unsigned long flags; + + if (!list_empty(&event->rb_entry)) + return; + + spin_lock_irqsave(&rb->event_lock, flags); + if (!list_empty(&event->rb_entry)) + goto unlock; + + list_add(&event->rb_entry, &rb->event_list); +unlock: + spin_unlock_irqrestore(&rb->event_lock, flags); +} + +static void ring_buffer_detach(struct perf_event *event, + struct ring_buffer *rb) +{ + unsigned long flags; + + if (list_empty(&event->rb_entry)) + return; + + spin_lock_irqsave(&rb->event_lock, flags); + list_del_init(&event->rb_entry); + wake_up_all(&event->waitq); + spin_unlock_irqrestore(&rb->event_lock, flags); +} + +static void ring_buffer_wakeup(struct perf_event *event) +{ + struct ring_buffer *rb; + + rcu_read_lock(); + rb = rcu_dereference(event->rb); + list_for_each_entry_rcu(event, &rb->event_list, rb_entry) { + wake_up_all(&event->waitq); + } + rcu_read_unlock(); +} + static void rb_free_rcu(struct rcu_head *rcu_head) { struct ring_buffer *rb; @@ -3522,9 +3589,19 @@ static struct ring_buffer *ring_buffer_get(struct perf_event *event) static void ring_buffer_put(struct ring_buffer *rb) { + struct perf_event *event, *n; + unsigned long flags; + if (!atomic_dec_and_test(&rb->refcount)) return; + spin_lock_irqsave(&rb->event_lock, flags); + list_for_each_entry_safe(event, n, &rb->event_list, rb_entry) { + list_del_init(&event->rb_entry); + wake_up_all(&event->waitq); + } + spin_unlock_irqrestore(&rb->event_lock, flags); + call_rcu(&rb->rcu_head, rb_free_rcu); } @@ -3547,6 +3624,7 @@ static void perf_mmap_close(struct vm_area_struct *vma) atomic_long_sub((size >> PAGE_SHIFT) + 1, &user->locked_vm); vma->vm_mm->pinned_vm -= event->mmap_locked; rcu_assign_pointer(event->rb, NULL); + ring_buffer_detach(event, rb); mutex_unlock(&event->mmap_mutex); ring_buffer_put(rb); @@ -3701,7 +3779,7 @@ static const struct file_operations perf_fops = { void perf_event_wakeup(struct perf_event *event) { - wake_up_all(&event->waitq); + ring_buffer_wakeup(event); if (event->pending_kill) { kill_fasync(&event->fasync, SIGIO, event->pending_kill); @@ -5823,6 +5901,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, INIT_LIST_HEAD(&event->group_entry); INIT_LIST_HEAD(&event->event_entry); INIT_LIST_HEAD(&event->sibling_list); + INIT_LIST_HEAD(&event->rb_entry); + init_waitqueue_head(&event->waitq); init_irq_work(&event->pending, perf_pending_event); @@ -6029,6 +6109,8 @@ set: old_rb = event->rb; rcu_assign_pointer(event->rb, rb); + if (old_rb) + ring_buffer_detach(event, old_rb); ret = 0; unlock: mutex_unlock(&event->mmap_mutex); diff --git a/kernel/events/internal.h b/kernel/events/internal.h index 09097dd8116c..64568a699375 100644 --- a/kernel/events/internal.h +++ b/kernel/events/internal.h @@ -22,6 +22,9 @@ struct ring_buffer { local_t lost; /* nr records lost */ long watermark; /* wakeup watermark */ + /* poll crap */ + spinlock_t event_lock; + struct list_head event_list; struct perf_event_mmap_page *user_page; void *data_pages[0]; diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index a2a29205cc0f..7f3011c6b57f 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -209,6 +209,9 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags) rb->writable = 1; atomic_set(&rb->refcount, 1); + + INIT_LIST_HEAD(&rb->event_list); + spin_lock_init(&rb->event_lock); } #ifndef CONFIG_PERF_USE_VMALLOC -- cgit v1.2.3 From cb59974742aea24adf6637eb0c4b8e7b48bca6fb Mon Sep 17 00:00:00 2001 From: Ilya Dryomov Date: Mon, 31 Oct 2011 11:07:42 +0200 Subject: tracing: fix event_subsystem ref counting Fix a bug introduced by e9dbfae5, which prevents event_subsystem from ever being released. Ref_count was added to keep track of subsystem users, not for counting events. Subsystem is created with ref_count = 1, so there is no need to increment it for every event, we have nr_events for that. Fix this by touching ref_count only when we actually have a new user - subsystem_open(). Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov Link: http://lkml.kernel.org/r/1320052062-7846-1-git-send-email-idryomov@gmail.com Signed-off-by: Steven Rostedt --- kernel/trace/trace_events.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index 581876f9f387..c212a7f934ec 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -1078,7 +1078,6 @@ event_subsystem_dir(const char *name, struct dentry *d_events) /* First see if we did not already create this dir */ list_for_each_entry(system, &event_subsystems, list) { if (strcmp(system->name, name) == 0) { - __get_system(system); system->nr_events++; return system->entry; } -- cgit v1.2.3 From 27b14b56af081ec7edeefb3a38b2c9577cc5ef48 Mon Sep 17 00:00:00 2001 From: Li Zefan Date: Tue, 1 Nov 2011 09:09:35 +0800 Subject: tracing: Restore system filter behavior Though not all events have field 'prev_pid', it was allowed to do this: # echo 'prev_pid == 100' > events/sched/filter but commit 75b8e98263fdb0bfbdeba60d4db463259f1fe8a2 (tracing/filter: Swap entire filter of events) broke it without any reason. Link: http://lkml.kernel.org/r/4EAF46CF.8040408@cn.fujitsu.com Signed-off-by: Li Zefan Signed-off-by: Steven Rostedt --- include/linux/ftrace_event.h | 2 ++ kernel/trace/trace_events_filter.c | 7 ++++++- 2 files changed, 8 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h index 96efa6794ea5..c3da42dd22ba 100644 --- a/include/linux/ftrace_event.h +++ b/include/linux/ftrace_event.h @@ -172,6 +172,7 @@ enum { TRACE_EVENT_FL_FILTERED_BIT, TRACE_EVENT_FL_RECORDED_CMD_BIT, TRACE_EVENT_FL_CAP_ANY_BIT, + TRACE_EVENT_FL_NO_SET_FILTER_BIT, }; enum { @@ -179,6 +180,7 @@ enum { TRACE_EVENT_FL_FILTERED = (1 << TRACE_EVENT_FL_FILTERED_BIT), TRACE_EVENT_FL_RECORDED_CMD = (1 << TRACE_EVENT_FL_RECORDED_CMD_BIT), TRACE_EVENT_FL_CAP_ANY = (1 << TRACE_EVENT_FL_CAP_ANY_BIT), + TRACE_EVENT_FL_NO_SET_FILTER = (1 << TRACE_EVENT_FL_NO_SET_FILTER_BIT), }; struct ftrace_event_call { diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index d6e7926dcd26..95dc31efd6dd 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -1649,7 +1649,9 @@ static int replace_system_preds(struct event_subsystem *system, */ err = replace_preds(call, NULL, ps, filter_string, true); if (err) - goto fail; + call->flags |= TRACE_EVENT_FL_NO_SET_FILTER; + else + call->flags &= ~TRACE_EVENT_FL_NO_SET_FILTER; } list_for_each_entry(call, &ftrace_events, list) { @@ -1658,6 +1660,9 @@ static int replace_system_preds(struct event_subsystem *system, if (strcmp(call->class->system, system->name) != 0) continue; + if (call->flags & TRACE_EVENT_FL_NO_SET_FILTER) + continue; + filter_item = kzalloc(sizeof(*filter_item), GFP_KERNEL); if (!filter_item) goto fail_mem; -- cgit v1.2.3 From c7c6ec8becaf742b223c7b491f4893014be23a07 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Fri, 4 Nov 2011 10:45:23 -0400 Subject: ftrace: Remove force undef config value left for testing A forced undef of a config value was used for testing and was accidently left in during the final commit. This causes x86 to run slower than needed while running function tracing as well as causes the function graph selftest to fail when DYNMAIC_FTRACE is not set. This is because the code in MCOUNT expects the ftrace code to be processed with the config value set that happened to be forced not set. The forced config option was left in by: commit 6331c28c962561aee59e5a493b7556a4bb585957 ftrace: Fix dynamic selftest failure on some archs Link: http://lkml.kernel.org/r/20111102150255.GA6973@debian Cc: stable@vger.kernel.org Reported-by: Rabin Vincent Signed-off-by: Steven Rostedt --- kernel/trace/ftrace.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 900b409543db..1276b3ca1598 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -152,7 +152,6 @@ void clear_ftrace_function(void) ftrace_pid_function = ftrace_stub; } -#undef CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST #ifndef CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST /* * For those archs that do not test ftrace_trace_stop in their -- cgit v1.2.3 From bbbf7af4bf8fc69bc751818cf30521080fa47dcb Mon Sep 17 00:00:00 2001 From: Gleb Natapov Date: Tue, 18 Oct 2011 19:55:51 +0200 Subject: jump_label: jump_label_inc may return before the code is patched If cpu A calls jump_label_inc() just after atomic_add_return() is called by cpu B, atomic_inc_not_zero() will return value greater then zero and jump_label_inc() will return to a caller before jump_label_update() finishes its job on cpu B. Link: http://lkml.kernel.org/r/20111018175551.GH17571@redhat.com Cc: stable@vger.kernel.org Cc: Peter Zijlstra Acked-by: Jason Baron Signed-off-by: Gleb Natapov Signed-off-by: Steven Rostedt --- kernel/jump_label.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/jump_label.c b/kernel/jump_label.c index bbdfe2a462a0..66ff7109f697 100644 --- a/kernel/jump_label.c +++ b/kernel/jump_label.c @@ -66,8 +66,9 @@ void jump_label_inc(struct jump_label_key *key) return; jump_label_lock(); - if (atomic_add_return(1, &key->enabled) == 1) + if (atomic_read(&key->enabled) == 0) jump_label_update(key, JUMP_LABEL_ENABLE); + atomic_inc(&key->enabled); jump_label_unlock(); } -- cgit v1.2.3 From ddf6e0e50723b62ac76ed18eb53e9417c6eefba7 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Fri, 4 Nov 2011 20:32:39 -0400 Subject: ftrace: Fix hash record accounting bug If the set_ftrace_filter is cleared by writing just whitespace to it, then the filter hash refcounts will be decremented but not updated. This causes two bugs: 1) No functions will be enabled for tracing when they all should be 2) If the users clears the set_ftrace_filter twice, it will crash ftrace: ------------[ cut here ]------------ WARNING: at /home/rostedt/work/git/linux-trace.git/kernel/trace/ftrace.c:1384 __ftrace_hash_rec_update.part.27+0x157/0x1a7() Modules linked in: Pid: 2330, comm: bash Not tainted 3.1.0-test+ #32 Call Trace: [] warn_slowpath_common+0x83/0x9b [] warn_slowpath_null+0x1a/0x1c [] __ftrace_hash_rec_update.part.27+0x157/0x1a7 [] ? ftrace_regex_release+0xa7/0x10f [] ? kfree+0xe5/0x115 [] ftrace_hash_move+0x2e/0x151 [] ftrace_regex_release+0xba/0x10f [] fput+0xfd/0x1c2 [] filp_close+0x6d/0x78 [] sys_dup3+0x197/0x1c1 [] sys_dup2+0x4f/0x54 [] system_call_fastpath+0x16/0x1b ---[ end trace 77a3a7ee73794a02 ]--- Link: http://lkml.kernel.org/r/20111101141420.GA4918@debian Reported-by: Rabin Vincent Signed-off-by: Steven Rostedt --- kernel/trace/ftrace.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 1276b3ca1598..b1e8943fed1d 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1211,7 +1211,9 @@ ftrace_hash_move(struct ftrace_ops *ops, int enable, if (!src->count) { free_ftrace_hash_rcu(*dst); rcu_assign_pointer(*dst, EMPTY_HASH); - return 0; + /* still need to update the function records */ + ret = 0; + goto out; } /* -- cgit v1.2.3 From c9c024b3f3e07d087974db4c0dc46217fff3a6c0 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 5 Dec 2011 21:20:23 +0100 Subject: alarmtimers: Fix time comparison The expiry function compares the timer against current time and does not expire the timer when the expiry time is >= now. That's wrong. If the timer is set for now, then it must expire. Make the condition expiry > now for breaking out the loop. Signed-off-by: Thomas Gleixner Acked-by: John Stultz Cc: stable@kernel.org --- kernel/time/alarmtimer.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c index c436e790b21b..8a46f5d64504 100644 --- a/kernel/time/alarmtimer.c +++ b/kernel/time/alarmtimer.c @@ -195,7 +195,7 @@ static enum hrtimer_restart alarmtimer_fired(struct hrtimer *timer) struct alarm *alarm; ktime_t expired = next->expires; - if (expired.tv64 >= now.tv64) + if (expired.tv64 > now.tv64) break; alarm = container_of(next, struct alarm, node); -- cgit v1.2.3 From a33caeb118198286309859f014c0662f3ed54ed4 Mon Sep 17 00:00:00 2001 From: Yong Zhang Date: Wed, 9 Nov 2011 16:04:51 +0800 Subject: lockdep, kmemcheck: Annotate ->lock in lockdep_init_map() Since commit f59de89 ("lockdep: Clear whole lockdep_map on initialization"), lockdep_init_map() will clear all the struct. But it will break lock_set_class()/lock_set_subclass(). A typical race condition is like below: CPU A CPU B lock_set_subclass(lockA); lock_set_class(lockA); lockdep_init_map(lockA); /* lockA->name is cleared */ memset(lockA); __lock_acquire(lockA); /* lockA->class_cache[] is cleared */ register_lock_class(lockA); look_up_lock_class(lockA); WARN_ON_ONCE(class->name != lock->name); lock->name = name; So restore to what we have done before commit f59de89 but annotate ->lock with kmemcheck_mark_initialized() to suppress the kmemcheck warning reported in commit f59de89. Reported-by: Sergey Senozhatsky Reported-by: Borislav Petkov Suggested-by: Vegard Nossum Signed-off-by: Yong Zhang Cc: Tejun Heo Cc: David Rientjes Cc: Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/20111109080451.GB8124@zhy Signed-off-by: Ingo Molnar --- kernel/lockdep.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/lockdep.c b/kernel/lockdep.c index e69434b070da..b2e08c932d91 100644 --- a/kernel/lockdep.c +++ b/kernel/lockdep.c @@ -44,6 +44,7 @@ #include #include #include +#include #include @@ -2948,7 +2949,12 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this, void lockdep_init_map(struct lockdep_map *lock, const char *name, struct lock_class_key *key, int subclass) { - memset(lock, 0, sizeof(*lock)); + int i; + + kmemcheck_mark_initialized(lock, sizeof(*lock)); + + for (i = 0; i < NR_LOCKDEP_CACHING_CLASSES; i++) + lock->class_cache[i] = NULL; #ifdef CONFIG_LOCK_STAT lock->cpu = raw_smp_processor_id(); -- cgit v1.2.3 From 86b47c25494b824da655b95f6fdb4fdb3f17aa77 Mon Sep 17 00:00:00 2001 From: Gleb Natapov Date: Tue, 22 Nov 2011 16:08:21 +0200 Subject: perf: Do no try to schedule task events if there are none perf_event_sched_in() shouldn't try to schedule task events if there are none otherwise task's ctx->is_active will be set and will not be cleared during sched_out. This will prevent newly added events from being scheduled into the task context. Fixes a boo-boo in commit 1d5f003f5a9 ("perf: Do not set task_ctx pointer in cpuctx if there are no events in the context"). Signed-off-by: Gleb Natapov Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/20111122140821.GF2557@redhat.com Signed-off-by: Ingo Molnar --- kernel/events/core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 600c1629b64d..d3b9df5962c2 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -2174,11 +2174,11 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx, */ cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE); - perf_event_sched_in(cpuctx, ctx, task); - if (ctx->nr_events) cpuctx->task_ctx = ctx; + perf_event_sched_in(cpuctx, cpuctx->task_ctx, task); + perf_pmu_enable(ctx->pmu); perf_ctx_unlock(cpuctx, ctx); -- cgit v1.2.3 From 09dc3cf93f7d16fdd37a0ad8486faebb5e2769ec Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 8 Dec 2011 14:34:13 -0800 Subject: printk: avoid double lock acquire Commit 4f2a8d3cf5e ("printk: Fix console_sem vs logbuf_lock unlock race") introduced another silly bug where we would want to acquire an already held lock. Avoid this. Reported-by: Andrea Arcangeli Signed-off-by: Peter Zijlstra Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/printk.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/printk.c b/kernel/printk.c index 1455a0d4eedd..7982a0a841ea 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -1293,10 +1293,11 @@ again: raw_spin_lock(&logbuf_lock); if (con_start != log_end) retry = 1; + raw_spin_unlock_irqrestore(&logbuf_lock, flags); + if (retry && console_trylock()) goto again; - raw_spin_unlock_irqrestore(&logbuf_lock, flags); if (wake_klogd) wake_up_klogd(); } -- cgit v1.2.3 From 031af165b1cd295ef04d2bfbcae9bc3cb9180735 Mon Sep 17 00:00:00 2001 From: Mandeep Singh Baines Date: Thu, 8 Dec 2011 14:34:44 -0800 Subject: sys_getppid: add missing rcu_dereference In order to safely dereference current->real_parent inside an rcu_read_lock, we need an rcu_dereference. Signed-off-by: Mandeep Singh Baines Cc: Thomas Gleixner Cc: Pavel Emelyanov Cc: Oleg Nesterov Cc: Kees Cook Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/timer.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/timer.c b/kernel/timer.c index dbaa62422b13..9c3c62b0c4bc 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -1368,7 +1368,7 @@ SYSCALL_DEFINE0(getppid) int pid; rcu_read_lock(); - pid = task_tgid_vnr(current->real_parent); + pid = task_tgid_vnr(rcu_dereference(current->real_parent)); rcu_read_unlock(); return pid; -- cgit v1.2.3 From 44b7f4b98d8877e2a4427f2a2f2e42ae8227a58f Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Tue, 13 Dec 2011 20:40:45 +0100 Subject: perf events: Fix ring_buffer_wakeup() brown paperbag bug Commit 10c6db11 ("perf: Fix loss of notification with multi-event") seems to unconditionally dereference event->rb in the wakeup handler, this is wrong, there might not be a buffer attached. Signed-off-by: Will Deacon Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/20111213152651.GP20297@mudshark.cambridge.arm.com [ minor edits ] Signed-off-by: Ingo Molnar --- kernel/events/core.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index d3b9df5962c2..58690af323e4 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3558,9 +3558,13 @@ static void ring_buffer_wakeup(struct perf_event *event) rcu_read_lock(); rb = rcu_dereference(event->rb); - list_for_each_entry_rcu(event, &rb->event_list, rb_entry) { + if (!rb) + goto unlock; + + list_for_each_entry_rcu(event, &rb->event_list, rb_entry) wake_up_all(&event->waitq); - } + +unlock: rcu_read_unlock(); } -- cgit v1.2.3 From ab2789213d224202237292d78aaa0c386c7b28b2 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 15 Dec 2011 11:47:00 -0800 Subject: sched: Fix select_idle_sibling() regression in selecting an idle SMT sibling Mike Galbraith reported that this recent commit: commit 4dcfe1025b513c2c1da5bf5586adb0e80148f612 Author: Peter Zijlstra Date: Thu Nov 10 13:01:10 2011 +0100 sched: Avoid SMT siblings in select_idle_sibling() if possible stopped selecting an idle SMT sibling when there are no idle cores in a single socket system. Intent of the select_idle_sibling() was to fallback to an idle SMT sibling, if it fails to identify an idle core. But this fallback was not happening on systems where all the scheduler domains had `SD_SHARE_PKG_RESOURCES' flag set. Fix it. Slightly bigger patch of cleaning all these goto's etc is queued up for the next release. Reported-by: Mike Galbraith Reported-by: Alex Shi Signed-off-by: Peter Zijlstra Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1323978421.1984.244.camel@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar --- kernel/sched_fair.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index a78ed2736ba7..8a39fa3e3c6c 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -2352,13 +2352,11 @@ again: if (!smt && (sd->flags & SD_SHARE_CPUPOWER)) continue; - if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) { - if (!smt) { - smt = 1; - goto again; - } + if (smt && !(sd->flags & SD_SHARE_CPUPOWER)) + break; + + if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) break; - } sg = sd->groups; do { @@ -2378,6 +2376,10 @@ next: sg = sg->next; } while (sg != sd->groups); } + if (!smt) { + smt = 1; + goto again; + } done: rcu_read_unlock(); -- cgit v1.2.3 From b1b73d095084e754562961c443aa8f6587a55f8e Mon Sep 17 00:00:00 2001 From: Kusanagi Kouichi Date: Mon, 19 Dec 2011 18:13:19 +0900 Subject: time/clocksource: Fix kernel-doc warnings Fix various KernelDoc build warnings. Signed-off-by: Kusanagi Kouichi Cc: John Stultz Link: http://lkml.kernel.org/r/20111219091320.0D5AF6FC03D@msa105.auone-net.jp Signed-off-by: Ingo Molnar --- include/linux/clocksource.h | 11 ++++++++--- kernel/time/clocksource.c | 12 +++++++++--- 2 files changed, 17 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index c86c940d1de3..081147da0564 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -71,7 +71,7 @@ struct timecounter { /** * cyclecounter_cyc2ns - converts cycle counter cycles to nanoseconds - * @tc: Pointer to cycle counter. + * @cc: Pointer to cycle counter. * @cycles: Cycles * * XXX - This could use some mult_lxl_ll() asm optimization. Same code @@ -114,7 +114,7 @@ extern u64 timecounter_read(struct timecounter *tc); * time base as values returned by * timecounter_read() * @tc: Pointer to time counter. - * @cycle: a value returned by tc->cc->read() + * @cycle_tstamp: a value returned by tc->cc->read() * * Cycle counts that are converted correctly as long as they * fall into the interval [-1/2 max cycle count, +1/2 max cycle count], @@ -156,11 +156,12 @@ extern u64 timecounter_cyc2time(struct timecounter *tc, * @mult: cycle to nanosecond multiplier * @shift: cycle to nanosecond divisor (power of two) * @max_idle_ns: max idle time permitted by the clocksource (nsecs) - * @maxadj maximum adjustment value to mult (~11%) + * @maxadj: maximum adjustment value to mult (~11%) * @flags: flags describing special properties * @archdata: arch-specific data * @suspend: suspend function for the clocksource, if necessary * @resume: resume function for the clocksource, if necessary + * @cycle_last: most recent cycle counter value seen by ::read() */ struct clocksource { /* @@ -187,6 +188,7 @@ struct clocksource { void (*suspend)(struct clocksource *cs); void (*resume)(struct clocksource *cs); + /* private: */ #ifdef CONFIG_CLOCKSOURCE_WATCHDOG /* Watchdog related data, used by the framework */ struct list_head wd_list; @@ -261,6 +263,9 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant) /** * clocksource_cyc2ns - converts clocksource cycles to nanoseconds + * @cycles: cycles + * @mult: cycle to nanosecond multiplier + * @shift: cycle to nanosecond divisor (power of two) * * Converts cycles to nanoseconds, using the given mult and shift. * diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index da2f760e780c..d3ad022136e5 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -647,7 +647,7 @@ static void clocksource_enqueue(struct clocksource *cs) /** * __clocksource_updatefreq_scale - Used update clocksource with new freq - * @t: clocksource to be registered + * @cs: clocksource to be registered * @scale: Scale factor multiplied against freq to get clocksource hz * @freq: clocksource frequency (cycles per second) divided by scale * @@ -699,7 +699,7 @@ EXPORT_SYMBOL_GPL(__clocksource_updatefreq_scale); /** * __clocksource_register_scale - Used to install new clocksources - * @t: clocksource to be registered + * @cs: clocksource to be registered * @scale: Scale factor multiplied against freq to get clocksource hz * @freq: clocksource frequency (cycles per second) divided by scale * @@ -727,7 +727,7 @@ EXPORT_SYMBOL_GPL(__clocksource_register_scale); /** * clocksource_register - Used to install new clocksources - * @t: clocksource to be registered + * @cs: clocksource to be registered * * Returns -EBUSY if registration fails, zero otherwise. */ @@ -761,6 +761,8 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating) /** * clocksource_change_rating - Change the rating of a registered clocksource + * @cs: clocksource to be changed + * @rating: new rating */ void clocksource_change_rating(struct clocksource *cs, int rating) { @@ -772,6 +774,7 @@ EXPORT_SYMBOL(clocksource_change_rating); /** * clocksource_unregister - remove a registered clocksource + * @cs: clocksource to be unregistered */ void clocksource_unregister(struct clocksource *cs) { @@ -787,6 +790,7 @@ EXPORT_SYMBOL(clocksource_unregister); /** * sysfs_show_current_clocksources - sysfs interface for current clocksource * @dev: unused + * @attr: unused * @buf: char buffer to be filled with clocksource list * * Provides sysfs interface for listing current clocksource. @@ -807,6 +811,7 @@ sysfs_show_current_clocksources(struct sys_device *dev, /** * sysfs_override_clocksource - interface for manually overriding clocksource * @dev: unused + * @attr: unused * @buf: name of override clocksource * @count: length of buffer * @@ -842,6 +847,7 @@ static ssize_t sysfs_override_clocksource(struct sys_device *dev, /** * sysfs_show_available_clocksources - sysfs interface for listing clocksource * @dev: unused + * @attr: unused * @buf: char buffer to be filled with clocksource list * * Provides sysfs interface for listing registered clocksources -- cgit v1.2.3 From e0197aae59e55c06db172bfbe1a1cdb8c0e1cab3 Mon Sep 17 00:00:00 2001 From: Mandeep Singh Baines Date: Thu, 15 Dec 2011 11:36:43 -0800 Subject: cgroups: fix a css_set not found bug in cgroup_attach_proc There is a BUG when migrating a PF_EXITING proc. Since css_set_prefetch() is not called for the PF_EXITING case, find_existing_css_set() will return NULL inside cgroup_task_migrate() causing a BUG. This bug is easy to reproduce. Create a zombie and echo its pid to cgroup.procs. $ cat zombie.c \#include int main() { if (fork()) pause(); return 0; } $ We are hitting this bug pretty regularly on ChromeOS. This bug is already fixed by Tejun Heo's cgroup patchset which is targetted for the next merge window: https://lkml.org/lkml/2011/11/1/356 I've create a smaller patch here which just fixes this bug so that a fix can be merged into the current release and stable. Signed-off-by: Mandeep Singh Baines Downstream-Bug-Report: http://crosbug.com/23953 Reviewed-by: Li Zefan Signed-off-by: Tejun Heo Cc: containers@lists.linux-foundation.org Cc: cgroups@vger.kernel.org Cc: stable@kernel.org Cc: KAMEZAWA Hiroyuki Cc: Frederic Weisbecker Cc: Oleg Nesterov Cc: Andrew Morton Cc: Paul Menage Cc: Olof Johansson --- kernel/cgroup.c | 5 ----- 1 file changed, 5 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index d9d5648f3cdc..a184470cf9b5 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -2098,11 +2098,6 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader) continue; /* get old css_set pointer */ task_lock(tsk); - if (tsk->flags & PF_EXITING) { - /* ignore this task if it's going away */ - task_unlock(tsk); - continue; - } oldcg = tsk->cgroups; get_css_set(oldcg); task_unlock(tsk); -- cgit v1.2.3 From b246272ecc5ac68c743b15c9e41a2275f7ce70e2 Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Mon, 19 Dec 2011 17:11:52 -0800 Subject: cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a new set of allowed cpuset nodes where the two nodemasks, as a result of the remap, are now disjoint. c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") adds get_mems_allowed() to prevent the set of allowed nodes from changing for a thread. This causes any update to a set of allowed nodes to stall until put_mems_allowed() is called. This stall is unncessary, however, if at least one node remains unchanged in the update to the set of allowed nodes. This was addressed by 89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one node remains set"), but it's still possible that an empty nodemask may be read from a mempolicy because the old nodemask may be remapped to the new nodemask during rebind. To prevent this, only avoid the stall if there is no mempolicy for the thread being changed. This is a temporary solution until all reads from mempolicy nodemasks can be guaranteed to not be empty without the get_mems_allowed() synchronization. Also moves the check for nodemask intersection inside task_lock() so that tsk->mems_allowed cannot change. This ensures that nothing can set this tsk's mems_allowed out from under us and also protects tsk->mempolicy. Reported-by: Miao Xie Signed-off-by: David Rientjes Cc: KOSAKI Motohiro Cc: Paul Menage Cc: Stephen Rothwell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 29 ++++++++++++++++++++++++----- 1 file changed, 24 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 9fe58c46a426..0b1712dba587 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -123,6 +123,19 @@ static inline struct cpuset *task_cs(struct task_struct *task) struct cpuset, css); } +#ifdef CONFIG_NUMA +static inline bool task_has_mempolicy(struct task_struct *task) +{ + return task->mempolicy; +} +#else +static inline bool task_has_mempolicy(struct task_struct *task) +{ + return false; +} +#endif + + /* bits in struct cpuset flags field */ typedef enum { CS_CPU_EXCLUSIVE, @@ -949,7 +962,7 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from, static void cpuset_change_task_nodemask(struct task_struct *tsk, nodemask_t *newmems) { - bool masks_disjoint = !nodes_intersects(*newmems, tsk->mems_allowed); + bool need_loop; repeat: /* @@ -962,6 +975,14 @@ repeat: return; task_lock(tsk); + /* + * Determine if a loop is necessary if another thread is doing + * get_mems_allowed(). If at least one node remains unchanged and + * tsk does not have a mempolicy, then an empty nodemask will not be + * possible when mems_allowed is larger than a word. + */ + need_loop = task_has_mempolicy(tsk) || + !nodes_intersects(*newmems, tsk->mems_allowed); nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1); @@ -981,11 +1002,9 @@ repeat: /* * Allocation of memory is very fast, we needn't sleep when waiting - * for the read-side. No wait is necessary, however, if at least one - * node remains unchanged. + * for the read-side. */ - while (masks_disjoint && - ACCESS_ONCE(tsk->mems_allowed_change_disable)) { + while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) { task_unlock(tsk); if (!task_curr(tsk)) yield(); -- cgit v1.2.3 From 3d3c8f93a237b64580c5c5e138edeb1377e98230 Mon Sep 17 00:00:00 2001 From: Michel Lespinasse Date: Mon, 19 Dec 2011 17:12:06 -0800 Subject: binary_sysctl(): fix memory leak binary_sysctl() calls sysctl_getname() which allocates from names_cache slab usin __getname() The matching function to free the name is __putname(), and not putname() which should be used only to match getname() allocations. This is because when auditing is enabled, putname() calls audit_putname *instead* (not in addition) to __putname(). Then, if a syscall is in progress, audit_putname does not release the name - instead, it expects the name to get released when the syscall completes, but that will happen only if audit_getname() was called previously, i.e. if the name was allocated with getname() rather than the naked __getname(). So, __getname() followed by putname() ends up leaking memory. Signed-off-by: Michel Lespinasse Acked-by: Al Viro Cc: Christoph Hellwig Cc: Eric Paris Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sysctl_binary.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c index 6318b511afa1..a650694883a1 100644 --- a/kernel/sysctl_binary.c +++ b/kernel/sysctl_binary.c @@ -1354,7 +1354,7 @@ static ssize_t binary_sysctl(const int *name, int nlen, fput(file); out_putname: - putname(pathname); + __putname(pathname); out: return result; } -- cgit v1.2.3