Age | Commit message | Author |
|
synchronize_rcu can be very expensive, averaging 100 ms in
some cases. In cgroup_attach_task, it is used to prevent
a task->cgroups pointer dereferenced in an RCU read side
critical section from being invalidated, by delaying the
call to put_css_set until after an RCU grace period.
To avoid the call to synchronize_rcu, make the put_css_set
call rcu-safe by moving the deletion of the css_set links
into free_css_set_work, scheduled by the rcu callback
free_css_set_rcu.
The decrement of the cgroup refcount is no longer
synchronous with the call to put_css_set, which can result
in the cgroup refcount staying positive after the last call
to cgroup_attach_task returns. To allow the cgroup to be
deleted with cgroup_rmdir synchronously after
cgroup_attach_task, have rmdir check the refcount of all
associated css_sets. If cgroup_rmdir is called on a cgroup
for which the css_sets all have refcount zero but the
cgroup refcount is nonzero, reuse the rmdir waitqueue to
block the rmdir until free_css_set_work is called.
Change-Id: I41fcb90395cc8401866f14d0beb00e9682835402
Signed-off-by: Colin Cross <ccross@android.com>
|
|
Changes the meaning of CGRP_RELEASABLE to be set on any cgroup
that has ever had a task or cgroup in it, or had css_get called
on it. The bit is set in cgroup_attach_task, cgroup_create,
and __css_get. It is not necessary to set the bit in
cgroup_fork, as the task is either in the root cgroup,
which can never be released, or the task it was forked from
already set the bit in cgroup_attach_task.
Change-Id: I0d9579e205c437efa73266dd3b34eb4f2dfe4fd2
Signed-off-by: Colin Cross <ccross@android.com>
|
|
* android-2.6.35:
ARM: etm: Support multiple ETMs/PTMs.
ARM: etm: Return the entire trace buffer if it is empty after reset
ARM: etm: Add some missing locks and error checks
ARM: etm: Configure data tracing
ARM: etm: Allow range selection
ARM: etm: Don't try to clear the buffer full status after reading the buffer
ARM: etm: Don't limit tracing to only non-secure code.
ARM: etm: Don't require clock control
ARM: 6293/1: coresight: cosmetic fixes
ARM: 6294/1: etm: do a dummy read from OSSRR during initialization
ARM: 6291/1: coresight: move struct tracectx inside etm driver
ARM: 6292/1: coresight: add ETM management registers
staging: binder: Fix use of uninitialized variable.
net: Add UDP stats and pkt count to uid_stat
cgroup: leave cg_list valid upon cgroup_exit
rtc: alarm: Update hrtimer if alarm at the head of the queue is reprogrammed
net: wireless: bcm4329: Cumulative update to Version 4.218.248-18
Change-Id: I3505a792764d03e93877f0dd8b7f273b49eda959
|
|
One criterion of C-states selection is based on the load factor.
High load prevents deep C-states. The load is evaluated and
updated at each scheduler tick.
During the time when NOHZ mode is active the evaluation of the
load will not be accurate. So when a high load is evaluated on
a tick happening on a burst of activity, this load value is kept
until the next tick, which can take a few ms to happen in NOHZ mode.
Therefore, this patch ensures that the load is taken into
consideration before going tickless, so that the load-based
calculations done in CPUidle are done using an accurate value.
Delaying NOHZ decisions until the load is zero improved the load
estimation on our ARM/OMAP4 platform, where HZ = 128, and increased
the time spent in deep C-states (~50% of idle time in C-states
deeper than C1). A power saving of ~20mA at battery level is
observed during MP3 playback on the OMAP4/Blaze board.
Change-Id: Ib94e2e94ade843fd0500f8484aac2f88722b6e8d
Signed-off-by: Nicole Chalhoub <n-chalhoub@ti.com>
Signed-off-by: Vincent Bour <v-bour@ti.com>
Acked-by: Akash Choudhari <akashc@ti.com>
|
|
A thread/process in cgroup_attach_task() could have called
list_del(&tsk->cg_list) after cgroup_exit() had already called
list_del() on the same list. Since it only checked for
!list_empty(&tsk->cg_list) before doing this, the list_del()
call would thus be made twice.
The solution is to leave tsk->cg_list in a valid state in
cgroup_exit() with list_del_init(&tsk->cg_list), which leaves
an empty list.
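A minimal userspace sketch of why list_del_init() makes the second
deletion harmless. The struct and helpers below are simplified stand-ins
written for illustration, not the kernel's list.h, and unlink_if_linked()
is a hypothetical name for the list_empty() guard described above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

bool list_empty(const struct list_head *h)
{
    return h->next == h;
}

void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/*
 * Unlink the entry and point it back at itself. A plain list_del() would
 * leave the entry's pointers stale, so list_empty(entry) would still be
 * false and a second unlink would corrupt the neighbours.
 */
void list_del_init(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    init_list_head(entry);
}

/* Hypothetical guard: unlink only if the entry is still on a list. */
bool unlink_if_linked(struct list_head *entry)
{
    assert(entry != NULL);
    if (list_empty(entry))
        return false;
    list_del_init(entry);
    return true;
}
```

With this shape, whichever of the two paths runs second sees an empty
list and skips the deletion.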
Signed-off-by: Simon Wilson <simonwilson@google.com>
|
|
* android-2.6.35: (456 commits)
ext4: initialize the percpu counters before replaying the journal
staging: android: lowmemorykiller: Ignore shmem pages in page-cache
staging: android: lowmemorykiller: Don't wait more than one second for a process to die
lowmemorykiller: don't unregister notifier from atomic context
net: wireless: bcm4329: Fix race conditions for sysioc_thread
net: wireless: bcm4329: Add check for out of bounds scan buffer
net: wireless: bcm4329: Check for out of bounds in scan results parsing
ext4: fix kernel oops if the journal superblock has a non-zero j_errno
staging: remove Greg's TODO, now obsolete.
yaffs: Import yaffs from Thu Oct 7 10:05:05 2010 +1300
pmem: remove the extra up_write on data sem in a rare path
mmc: Fix pm_notifier obeying deferred resume (part 2)
mmc: Fix pm_notifier obeying deferred resume
mmc: make pm_notifier obey deferred resume
mmc: Add "ignore mmc pm notify" functionality
Revert "net: Fix CONFIG_RPS option to be turned off"
Linux 2.6.35.7
Xen: fix typo in previous patch
Linux 2.6.35.6
alpha: Fix printk format errors
...
Conflicts:
drivers/usb/gadget/rndis.c
drivers/usb/host/ehci-sched.c
drivers/usb/serial/ftdi_sio.c
Signed-off-by: Leed Aguilar <leed.aguilar@ti.com>
|
|
Conflicts:
drivers/mmc/core/core.c
Change-Id: If12b25725eccb07b385f5898be75d052ff75a3f2
|
|
Change-Id: I9478e4c609493c8795b5a1fd9e08a2ab4e9b7975
Signed-off-by: Vikram Pandita <vikram.pandita@ti.com>
|
|
After pulling the thread off the run-queue during a cgroup change,
the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime
then gets normalized to this new value. This can then lead to the thread
getting an unfair boost in the new group if the vruntime of the next
task in the old run-queue was way further ahead.
Cc: Arve Hjønnevåg <arve@android.com>
Signed-off-by: Dima Zavin <dima@android.com>
|
|
If you switch the cgroup of a sleeping thread, its vruntime does
not get adjusted correctly for the difference between the
min_vruntime values of the two groups.
This patch adds a new callback, prep_move_task, to struct sched_class
to give sched_fair the opportunity to adjust the task's vruntime
just before setting its new group. This allows us to properly normalize
a sleeping task's vruntime when moving it between different cgroups.
More details about the problem:
http://lkml.org/lkml/2010/9/28/24
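The normalization can be sketched in userspace as follows. This is only
an illustration of the arithmetic, assuming hypothetical stand-in structs
(cfs_rq_sketch, task_sketch) and a hypothetical helper name; it is not
the code of the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down stand-ins for the scheduler structures. */
struct cfs_rq_sketch {
    uint64_t min_vruntime;
};

struct task_sketch {
    uint64_t vruntime;
};

/*
 * Keep the sleeping task's offset relative to its queue: subtract the
 * old queue's min_vruntime, then add the new queue's. Unsigned
 * wraparound keeps the result correct even when the intermediate value
 * would be "negative".
 */
void move_sleeping_task(struct task_sketch *t,
                        const struct cfs_rq_sketch *old_rq,
                        const struct cfs_rq_sketch *new_rq)
{
    assert(t != NULL && old_rq != NULL && new_rq != NULL);
    t->vruntime -= old_rq->min_vruntime;
    t->vruntime += new_rq->min_vruntime;
}
```

A task that was 200 units ahead of the old group's min_vruntime ends up
200 units ahead of the new group's, regardless of how far apart the two
groups' clocks are.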
Cc: Arve Hjønnevåg <arve@android.com>
Signed-off-by: Dima Zavin <dima@android.com>
|
|
commit f362b73244fb16ea4ae127ced1467dd8adaa7733 upstream.
Using a program like the following:
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
int main() {
id_t id;
siginfo_t infop;
pid_t res;
id = fork();
if (id == 0) { sleep(1); exit(0); }
kill(id, SIGSTOP);
alarm(1);
waitid(P_PID, id, &infop, WCONTINUED);
return 0;
}
to call waitid() on a stopped process results in access to the child task's
credentials without the RCU read lock being held - which may be replaced in the
meantime - eliciting the following warning:
===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
kernel/exit.c:1460 invoked rcu_dereference_check() without protection!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 1
2 locks held by waitid02/22252:
#0: (tasklist_lock){.?.?..}, at: [<ffffffff81061ce5>] do_wait+0xc5/0x310
#1: (&(&sighand->siglock)->rlock){-.-...}, at: [<ffffffff810611da>]
wait_consider_task+0x19a/0xbe0
stack backtrace:
Pid: 22252, comm: waitid02 Not tainted 2.6.35-323cd+ #3
Call Trace:
[<ffffffff81095da4>] lockdep_rcu_dereference+0xa4/0xc0
[<ffffffff81061b31>] wait_consider_task+0xaf1/0xbe0
[<ffffffff81061d15>] do_wait+0xf5/0x310
[<ffffffff810620b6>] sys_waitid+0x86/0x1f0
[<ffffffff8105fce0>] ? child_wait_callback+0x0/0x70
[<ffffffff81003282>] system_call_fastpath+0x16/0x1b
This is fixed by holding the RCU read lock in wait_task_continued() to ensure
that the task's current credentials aren't destroyed between us reading the
cred pointer and us reading the UID from those credentials.
Furthermore, protect wait_task_stopped() in the same way.
We don't need to keep holding the RCU read lock once we've read the UID from
the credentials as holding the RCU read lock doesn't stop the target task from
changing its creds under us - so the credentials may be outdated immediately
after we've read the pointer, lock or no lock.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 6715045ddc7472a22be5e49d4047d2d89b391f45 upstream.
There is a problem in hibernate_preallocate_memory() that it calls
preallocate_image_memory() with an argument that may be greater than
the total number of available non-highmem memory pages. If that's
the case, the OOM condition is guaranteed to trigger, which in turn
can cause significant slowdown to occur during hibernation.
To avoid that, make preallocate_image_memory() adjust its argument
before calling preallocate_image_pages(), so that the total number of
saveable non-highmem pages left is not less than the minimum size of
a hibernation image. Change hibernate_preallocate_memory() to try to
allocate from highmem if the number of pages allocated by
preallocate_image_memory() is too low.
Modify free_unnecessary_pages() to take all possible memory
allocation patterns into account.
Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Tested-by: M. Vefa Bicakci <bicave@superonline.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit e75e863dd5c7d96b91ebbd241da5328fc38a78cc upstream.
A 32-bit variable can overflow during the multiplication in the
task_times() and thread_group_times() functions. When the
overflow happens, the scaled utime value becomes erroneously
small and the scaled stime becomes erroneously big.
Reported here:
https://bugzilla.redhat.com/show_bug.cgi?id=633037
https://bugzilla.kernel.org/show_bug.cgi?id=16559
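A small userspace sketch of this class of overflow, assuming 32-bit
time values and a scaling of the form utime * rtime / total. The
function names are hypothetical and the code only illustrates the
arithmetic, not the kernel functions themselves:

```c
#include <assert.h>
#include <stdint.h>

/* Overflowing shape: the product wraps modulo 2^32 before the divide. */
uint32_t scale_utime_32(uint32_t utime, uint32_t rtime, uint32_t total)
{
    assert(total != 0);
    return utime * rtime / total;
}

/*
 * Widened shape: form the product in 64 bits, then divide. The result
 * fits in 32 bits whenever utime <= total.
 */
uint32_t scale_utime_64(uint32_t utime, uint32_t rtime, uint32_t total)
{
    uint64_t product;

    assert(total != 0);
    product = (uint64_t)utime * rtime;
    return (uint32_t)(product / total);
}
```

For utime = 100000 and rtime = total = 200000 the widened version
returns 100000, while the 32-bit version returns a much smaller value
because the product no longer fits in 32 bits.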
Reported-by: Michael Chapman <redhat-bugzilla@very.puzzling.org>
Reported-by: Ciriaco Garcia de Celis <sysman@etherpilot.com>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
LKML-Reference: <20100914143513.GB8415@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 950eaaca681c44aab87a46225c9e44f902c080aa upstream.
[ 23.584719]
[ 23.584720] ===================================================
[ 23.585059] [ INFO: suspicious rcu_dereference_check() usage. ]
[ 23.585176] ---------------------------------------------------
[ 23.585176] kernel/pid.c:419 invoked rcu_dereference_check() without protection!
[ 23.585176]
[ 23.585176] other info that might help us debug this:
[ 23.585176]
[ 23.585176]
[ 23.585176] rcu_scheduler_active = 1, debug_locks = 1
[ 23.585176] 1 lock held by rc.sysinit/728:
[ 23.585176] #0: (tasklist_lock){.+.+..}, at: [<ffffffff8104771f>] sys_setpgid+0x5f/0x193
[ 23.585176]
[ 23.585176] stack backtrace:
[ 23.585176] Pid: 728, comm: rc.sysinit Not tainted 2.6.36-rc2 #2
[ 23.585176] Call Trace:
[ 23.585176] [<ffffffff8105b436>] lockdep_rcu_dereference+0x99/0xa2
[ 23.585176] [<ffffffff8104c324>] find_task_by_pid_ns+0x50/0x6a
[ 23.585176] [<ffffffff8104c35b>] find_task_by_vpid+0x1d/0x1f
[ 23.585176] [<ffffffff81047727>] sys_setpgid+0x67/0x193
[ 23.585176] [<ffffffff810029eb>] system_call_fastpath+0x16/0x1b
[ 24.959669] type=1400 audit(1282938522.956:4): avc: denied { module_request } for pid=766 comm="hwclock" kmod="char-major-10-135" scontext=system_u:system_r:hwclock_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclas
It turns out that the setpgid() system call fails to enter an RCU
read-side critical section before doing a PID-to-task_struct translation.
This commit therefore does rcu_read_lock() before the translation, and
also does rcu_read_unlock() after the last use of the returned pointer.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 068e35eee9ef98eb4cab55181977e24995d273be upstream.
Hardware breakpoints can't be registered within pid namespaces
because tsk->pid is passed rather than the pid in the current
namespace.
(See https://bugzilla.kernel.org/show_bug.cgi?id=17281 )
This is a quick fix demonstrating the problem but is not the
best method of solving the problem since passing pids internally
is not the best way to avoid pid namespace bugs. Subsequent patches
will show a better solution.
Much thanks to Frederic Weisbecker <fweisbec@gmail.com> for doing
the bulk of the work finding this bug.
Reported-by: Robin Green <greenrd@greenrd.org>
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
LKML-Reference: <f63454af09fb1915717251570423eb9ddd338340.1284407762.git.matthltc@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit c41d68a513c71e35a14f66d71782d27a79a81ea6 upstream.
compat_alloc_user_space() expects the caller to independently call
access_ok() to verify the returned area. A missing call could
introduce problems on some architectures.
This patch incorporates the access_ok() check into
compat_alloc_user_space() and also adds a sanity check on the length.
The existing compat_alloc_user_space() implementations are renamed
arch_compat_alloc_user_space() and are used as part of the
implementation of the new global function.
This patch assumes NULL will cause __get_user()/__put_user() to either
fail or access userspace on all architectures. This should be
followed by checking the return value of compat_alloc_user_space()
for NULL in the callers, at which time the access_ok() in the callers
can also be removed.
Reported-by: Ben Hawkes <hawkes@sota.gen.nz>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: James Bottomley <jejb@parisc-linux.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 1c24de60e50fb19b94d94225458da17c720f0729 upstream.
gid_t is an unsigned int. If group_info contains a gid greater than
INT_MAX, the groups_search() function may look on the wrong side of the
search tree.
This solves some unfair "permission denied" problems.
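A userspace sketch of the failure mode, assuming the search decides its
direction from the sign of a 32-bit difference. The type and function
names are hypothetical; this illustrates the comparison problem rather
than reproducing groups_search():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t gid_sketch_t;

/*
 * Problematic shape: the sign of (grp - gids[mid]) is taken as the
 * ordering. That is wrong once the two values differ by more than
 * INT32_MAX. (The cast is implementation-defined for out-of-range
 * values; typical compilers wrap.)
 */
int search_by_difference(const gid_sketch_t *gids, size_t n, gid_sketch_t grp)
{
    size_t left = 0, right = n;

    assert(gids != NULL || n == 0);
    while (left < right) {
        size_t mid = left + (right - left) / 2;
        int32_t cmp = (int32_t)(grp - gids[mid]);

        if (cmp > 0)
            left = mid + 1;
        else if (cmp < 0)
            right = mid;
        else
            return 1;
    }
    return 0;
}

/* Safer shape: compare the unsigned values directly. */
int search_by_comparison(const gid_sketch_t *gids, size_t n, gid_sketch_t grp)
{
    size_t left = 0, right = n;

    assert(gids != NULL || n == 0);
    while (left < right) {
        size_t mid = left + (right - left) / 2;

        if (grp > gids[mid])
            left = mid + 1;
        else if (grp < gids[mid])
            right = mid;
        else
            return 1;
    }
    return 0;
}
```

With the sorted set { 5, 3000000000 }, looking up gid 5 by difference
moves right instead of left and misses it, which is the kind of
spurious "permission denied" described above.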
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 85a0fdfd0f967507f3903e8419bc7e408f5a59de upstream.
The gcov-kernel infrastructure expects that each object file is loaded
only once. This may not be true, e.g. when loading multiple kernel
modules which are linked to the same object file. As a result, loading
such kernel modules will result in incorrect gcov results while unloading
will cause a null-pointer dereference.
This patch fixes these problems by changing the gcov-kernel infrastructure
so that multiple profiling data sets can be associated with one debugfs
entry. It applies to 2.6.36-rc1.
Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Reported-by: Werner Spies <werner.spies@thalesgroup.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit df09162550fbb53354f0c88e85b5d0e6129ee9cc upstream.
Be sure to avoid entering t_show() with FTRACE_ITER_HASH set without
having properly started the iterator to iterate the hash. This case is
degenerate and, as discovered by Robert Swiecki, can cause t_hash_show()
to misuse a pointer. This causes a NULL ptr deref with possible security
implications. Tracked as CVE-2010-3079.
Cc: Robert Swiecki <swiecki@google.com>
Cc: Eugene Teo <eugene@redhat.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 9c55cb12c1c172e2d51e85fbb5a4796ca86b77e7 upstream.
Reading the file set_ftrace_filter does three things.
1) shows whether or not filters are set for the function tracer
2) shows what functions are set for the function tracer
3) shows what triggers are set on any functions
3 is independent from 1 and 2.
The way this file currently works is that it is a state machine,
and as you read it, it may change state. But this assumption breaks
when you use lseek() on the file. The state machine gets out of sync
and the t_show() may use the wrong pointer and cause a kernel oops.
Luckily, this will only kill the app that does the lseek, but the app
dies while holding a mutex. This prevents anyone else from using the
set_ftrace_filter file (or any other function tracing file for that matter).
A real fix for this is to rewrite the code, but that is too much for
a -rc release or stable. This patch simply disables llseek on the
set_ftrace_filter file for now, and we can do the proper fix for the
next major release.
Reported-by: Robert Swiecki <swiecki@google.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Eugene Teo <eugene@redhat.com>
Cc: vendor-sec@lst.de
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 3aaba20f26f58843e8f20611e5c0b1c06954310f upstream.
While we are reading trace_stat/functionX and someone just
disabled function_profile at that time, we can trigger this:
divide error: 0000 [#1] PREEMPT SMP
...
EIP is at function_stat_show+0x90/0x230
...
This fix just takes the ftrace_profile_lock and checks if
rec->counter is 0. If it's 0, we know the profile buffer
has been reset.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4C723644.4040708@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
1) Add 2.6.35 android-common code to vanilla .35 kernel
2) Add android_4430sdp_defconfig
3) fixed compilation break with drivers/misc/pmem.c
Conflicts:
arch/arm/kernel/entry-armv.S
drivers/misc/Kconfig
drivers/misc/Makefile
drivers/mmc/core/core.c
drivers/mmc/core/mmc.c
drivers/mmc/core/sdio.c
drivers/mmc/core/sdio_bus.c
drivers/usb/gadget/composite.c
include/linux/mmc/host.h
Signed-off-by: Vikram Pandita <vikram.pandita@ti.com>
|
|
This patch is an attempt to keep ES1 working and some PM options
configurable. This is intended only for internal use and will be
eventually removed, when we no longer need ES1 to be supported
and when PM features are well tested and expected to work.
Signed-off-by: Rajendra Nayak <rnayak@ti.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
|
|
When a secondary CPU is being brought up, it is not uncommon for
printk() to be invoked when cpu_online(smp_processor_id()) == 0. The
case that I witnessed personally was on MIPS:
http://lkml.org/lkml/2010/5/30/4
If (can_use_console() == 0), printk() will spool its output to log_buf
and it will be visible in "dmesg", but that output will NOT be echoed to
the console until somebody calls release_console_sem() from a CPU that
is online. Therefore, the boot time messages from the new CPU can get
stuck in "limbo" for a long time, and might suddenly appear on the
screen when a completely unrelated event (e.g. "eth0: link is down")
occurs.
This patch modifies the console code so that any pending messages are
automatically flushed out to the console whenever a CPU hotplug
operation completes successfully or aborts.
This is also true when a CPU is being hot-unplugged (taken offline), so
additional hotplug events need to be added.
The issue was seen on 2.6.34.
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Signed-off-by: Kevin Cernekee <cernekee@gmail.com>
|
|
Enable relevant errata and disable PM
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
|
|
commit 9d0f4dcc5c4d1c5dd01172172684a45b5f49d740 upstream.
There is a scalability issue with the current implementation of optimistic
mutex spin in the kernel. It was found on an 8-node, 64-core Nehalem-EX
system (HT mode).
The intention of the optimistic mutex spin is to busy wait and spin on a
mutex if the owner of the mutex is running, in the hope that the mutex
will be released soon and be acquired, without the thread trying to
acquire the mutex going to sleep. However, when we have a large number of
threads contending for the mutex, we could have the mutex grabbed by
another thread, and then another, and so on, and we will keep spinning,
wasting cpu cycles and adding to the contention. One possible fix is to
quit spinning and put the current thread on the wait-list if the mutex
lock switches to a new owner while we spin, indicating heavy contention
(see the patch included).
I did some testing on an 8-socket Nehalem-EX system with a total of 64
cores. Using Ingo's test-mutex program that creates/deletes files with 256
threads (http://lkml.org/lkml/2006/1/8/50), I see the following speed up
after putting in the mutex spin fix:
./mutex-test V 256 10
            Ops/sec
2.6.34      62864
With fix    197200
Repeating the test with Aim7 fserver workload, again there is a speed up
with the fix:
            Jobs/min
2.6.34      91657
With fix    149325
To look at the impact on the distribution of mutex acquisition time, I
collected the mutex acquisition time on Aim7 fserver workload with some
instrumentation. The average acquisition time is reduced by 48% and
number of contentions reduced by 32%.
            #contentions   Time to acquire mutex (cycles)
2.6.34      72973          44765791
With fix    49210          23067129
The histogram of mutex acquisition time is listed below. The acquisition
time is in 2^bin cycles. We see that without the fix, the acquisition
time is mostly around 2^26 cycles. With the fix, we see the distribution
spread out a lot more towards the lower cycles, starting from 2^13.
However, there is an increase of the tail distribution with the fix at
2^28 and 2^29 cycles. It seems a small price to pay for the reduced
average acquisition time and also getting the cpu to do useful work.
Mutex acquisition time distribution (acq time = 2^bin cycles):
     2.6.34               With Fix
bin  #occurrence  %       #occurrence  %
11   2            0.00%   120          0.24%
12   10           0.01%   790          1.61%
13   14           0.02%   2058         4.18%
14   86           0.12%   3378         6.86%
15   393          0.54%   4831         9.82%
16   710          0.97%   4893         9.94%
17   815          1.12%   4667         9.48%
18   790          1.08%   5147         10.46%
19   580          0.80%   6250         12.70%
20   429          0.59%   6870         13.96%
21   311          0.43%   1809         3.68%
22   255          0.35%   2305         4.68%
23   317          0.44%   916          1.86%
24   610          0.84%   233          0.47%
25   3128         4.29%   95           0.19%
26   63902        87.69%  122          0.25%
27   619          0.85%   286          0.58%
28   0            0.00%   3536         7.19%
29   0            0.00%   903          1.83%
30   0            0.00%   0            0.00%
I've done similar experiments with the 2.6.35 kernel on smaller boxes as
well. One is on a dual-socket Westmere box (12 cores total, with HT).
Another experiment is on an old dual-socket Core 2 box (4 cores total, no
HT).
On the 12-core Westmere box, I see a 250% increase for Ingo's mutex-test
program with my mutex patch but no significant difference in aim7's
fserver workload.
On the 4-core Core 2 box, the difference with the patch for both
mutex-test and aim7 fserver is negligible.
So far, it seems like the patch has not caused regressions on smaller
systems.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1282168827.9542.72.camel@schen9-DESK>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit c7dcf87a6881bf796faee83003163eb3de41a309 upstream.
Early 4.3 versions of gcc apparently aggressively optimize the raw
time accumulation loop, replacing it with a divide.
On 32bit systems, this causes the following link errors:
undefined reference to `__umoddi3'
undefined reference to `__udivdi3'
The gcc issue has been fixed in 4.4 and greater.
This patch replaces the accumulation loop with a do_div, as suggested
by Linus.
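The two forms can be compared in a userspace sketch. The struct and
function names are hypothetical; in the kernel the division would go
through do_div(), which divides a u64 in place by a 32-bit divisor and
returns the remainder, and is emulated here with plain / and %:

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in for the raw time being accumulated. */
struct raw_time_sketch {
    uint64_t sec;
    uint64_t nsec;
};

/* Loop form: a compiler may rewrite this into a 64-bit divide anyway. */
struct raw_time_sketch accumulate_loop(struct raw_time_sketch t,
                                       uint64_t raw_nsecs)
{
    assert(t.nsec < NSEC_PER_SEC);
    raw_nsecs += t.nsec;
    while (raw_nsecs >= NSEC_PER_SEC) {
        raw_nsecs -= NSEC_PER_SEC;
        t.sec++;
    }
    t.nsec = raw_nsecs;
    return t;
}

/* Division form: one explicit divide plus remainder. */
struct raw_time_sketch accumulate_div(struct raw_time_sketch t,
                                      uint64_t raw_nsecs)
{
    assert(t.nsec < NSEC_PER_SEC);
    raw_nsecs += t.nsec;
    t.sec += raw_nsecs / NSEC_PER_SEC;
    t.nsec = raw_nsecs % NSEC_PER_SEC;
    return t;
}
```

Both forms give the same result; making the division explicit lets the
kernel route it through its own helper instead of libgcc's __udivdi3 and
__umoddi3.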
Signed-off-by: John Stultz <johnstul@us.ibm.com>
CC: Jason Wessel <jason.wessel@windriver.com>
CC: Larry Finger <Larry.Finger@lwfinger.net>
CC: Ingo Molnar <mingo@elte.hu>
CC: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit deda2e81961e96be4f2c09328baca4710a2fd1a0 upstream.
The tv_nsec is a long and when added to the shifted interval it can wrap
and become negative which later causes looping problems in the
getrawmonotonic(). The edge case occurs when the system has slept for
a short period of time of ~2 seconds.
A trace printk of the values in this patch illustrates the problem:
ftrace time stamp: log
43.716079: logarithmic_accumulation: raw: 3d0913 tv_nsec d687faa
43.718513: logarithmic_accumulation: raw: 3d0913 tv_nsec da588bd
43.722161: logarithmic_accumulation: raw: 3d0913 tv_nsec de291d0
46.349925: logarithmic_accumulation: raw: 7a122600 tv_nsec e1f9ae3
46.349930: logarithmic_accumulation: raw: 1e848980 tv_nsec 8831c0e3
The kernel starts looping at 46.349925 in the getrawmonotonic() due to
the negative value from adding the raw value to tv_nsec.
A simple solution is to accumulate into a u64, and then normalize it
to a timespec_t.
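A userspace sketch of the wrap and of the 64-bit accumulation, assuming
a 32-bit long for tv_nsec. The struct and function names are
hypothetical. The sample values come from the trace above, where
0xe1f9ae3 + 0x7a122600 = 0x8831c0e3 appears to be the sum that no longer
fits in a signed 32-bit value:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Models a timespec on a system where long is 32 bits wide. */
struct timespec32_sketch {
    int32_t tv_sec;
    int32_t tv_nsec;
};

/* True if tv_nsec + raw no longer fits in a signed 32-bit tv_nsec. */
int sum_overflows_s32(int32_t tv_nsec, uint32_t raw)
{
    assert(tv_nsec >= 0);
    return (uint64_t)tv_nsec + raw > (uint64_t)INT32_MAX;
}

/* Accumulate in 64 bits, then carry whole seconds into tv_sec. */
void accumulate_raw(struct timespec32_sketch *ts, uint64_t raw)
{
    uint64_t nsecs;

    assert(ts != NULL && ts->tv_nsec >= 0);
    nsecs = raw + (uint64_t)ts->tv_nsec;
    while (nsecs >= NSEC_PER_SEC) {
        nsecs -= NSEC_PER_SEC;
        ts->tv_sec++;
    }
    /* nsecs is now below one second, so it fits in 32 bits. */
    ts->tv_nsec = (int32_t)nsecs;
}
```

Because the sum is only narrowed after whole seconds have been carried
out, tv_nsec stays non-negative and below one second.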
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
[ Reworked variable names and simplified some of the code. - John ]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 297c5eee372478fc32fec5fe8eed711eedb13f3d upstream.
It's a really simple list, and several of the users want to go backwards
in it to find the previous vma. So rather than have to look up the
previous entry with 'find_vma_prev()' or something similar, just make it
doubly linked instead.
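A minimal sketch of what the extra back pointer buys, using a
hypothetical cut-down struct and helper (not the kernel's
vm_area_struct or its linking code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for a vma list node. */
struct vma_sketch {
    unsigned long vm_start, vm_end;
    struct vma_sketch *vm_next, *vm_prev;
};

/*
 * Link vma right after prev, or at the head when prev is NULL, keeping
 * both the forward and the backward pointers consistent.
 */
void vma_link_after(struct vma_sketch **head, struct vma_sketch *prev,
                    struct vma_sketch *vma)
{
    struct vma_sketch *next;

    assert(head != NULL && vma != NULL);
    if (prev) {
        next = prev->vm_next;
        prev->vm_next = vma;
    } else {
        next = *head;
        *head = vma;
    }
    vma->vm_prev = prev;
    vma->vm_next = next;
    if (next)
        next->vm_prev = vma;
}
```

Once the list is maintained this way, the previous vma is just
vma->vm_prev and needs no separate lookup.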
Tested-by: Ian Campbell <ijc@hellion.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 18fab912d4fa70133df164d2dcf3310be0c38c34 upstream.
With the configuration: CONFIG_DEBUG_PAGEALLOC=y and Shaohua's patch:
[PATCH]x86: make spurious_fault check correct pte bit
Function call graph trace with the following will trigger a page fault.
# cd /sys/kernel/debug/tracing/
# echo function_graph > current_tracer
# cat per_cpu/cpu1/trace_pipe_raw > /dev/null
BUG: unable to handle kernel paging request at ffff880006e99000
IP: [<ffffffff81085572>] rb_event_length+0x1/0x3f
PGD 1b19063 PUD 1b1d063 PMD 3f067 PTE 6e99160
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
last sysfs file: /sys/devices/virtual/net/lo/operstate
CPU 1
Modules linked in:
Pid: 1982, comm: cat Not tainted 2.6.35-rc6-aes+ #300 /Bochs
RIP: 0010:[<ffffffff81085572>] [<ffffffff81085572>] rb_event_length+0x1/0x3f
RSP: 0018:ffff880006475e38 EFLAGS: 00010006
RAX: 0000000000000ff0 RBX: ffff88000786c630 RCX: 000000000000001d
RDX: ffff880006e98000 RSI: 0000000000000ff0 RDI: ffff880006e99000
RBP: ffff880006475eb8 R08: 000000145d7008bd R09: 0000000000000000
R10: 0000000000008000 R11: ffffffff815d9336 R12: ffff880006d08000
R13: ffff880006e605d8 R14: 0000000000000000 R15: 0000000000000018
FS: 00007f2b83e456f0(0000) GS:ffff880002100000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff880006e99000 CR3: 00000000064a8000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process cat (pid: 1982, threadinfo ffff880006474000, task ffff880006e40770)
Stack:
ffff880006475eb8 ffffffff8108730f 0000000000000ff0 000000145d7008bd
<0> ffff880006e98010 ffff880006d08010 0000000000000296 ffff88000786c640
<0> ffffffff81002956 0000000000000000 ffff8800071f4680 ffff8800071f4680
Call Trace:
[<ffffffff8108730f>] ? ring_buffer_read_page+0x15a/0x24a
[<ffffffff81002956>] ? return_to_handler+0x15/0x2f
[<ffffffff8108a575>] tracing_buffers_read+0xb9/0x164
[<ffffffff810debfe>] vfs_read+0xaf/0x150
[<ffffffff81002941>] return_to_handler+0x0/0x2f
[<ffffffff810248b0>] __bad_area_nosemaphore+0x17e/0x1a1
[<ffffffff81002941>] return_to_handler+0x0/0x2f
[<ffffffff810248e6>] bad_area_nosemaphore+0x13/0x15
Code: 80 25 b2 16 b3 00 fe c9 c3 55 48 89 e5 f0 80 0d a4 16 b3 00 02 c9 c3 55 31 c0 48 89 e5 48 83 3d 94 16 b3 00 01 c9 0f 94 c0 c3 55 <8a> 0f 48 89 e5 83 e1 1f b8 08 00 00 00 0f b6 d1 83 fa 1e 74 27
RIP [<ffffffff81085572>] rb_event_length+0x1/0x3f
RSP <ffff880006475e38>
CR2: ffff880006e99000
---[ end trace a6877bb92ccb36bb ]---
The root cause is that ring_buffer_read_page() may read out of page
boundary, because the boundary checking is done after reading. This is
fixed via doing boundary checking before reading.
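The check-before-read ordering can be sketched in userspace as below.
The record layout (a one-byte header holding the total event length) is
a made-up assumption for illustration and is not the ring buffer's real
event format:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count the events in page[0..commit). Assumed layout: each event
 * starts with one byte giving its total length, header included.
 */
size_t count_events(const uint8_t *page, size_t commit)
{
    size_t pos = 0, n = 0;

    assert(page != NULL || commit == 0);
    /* Boundary check first: never touch page[pos] once pos == commit. */
    while (pos < commit) {
        size_t len = page[pos];

        /* Stop on a malformed or truncated event. */
        if (len == 0 || len > commit - pos)
            break;
        pos += len;
        n++;
    }
    return n;
}
```

A loop that reads the next header first and tests the position
afterwards would touch the byte just past a completely filled page,
which is the access that DEBUG_PAGEALLOC catches.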
Reported-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
LKML-Reference: <1280297641.2771.307.camel@yhuang-dev>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 575570f02761bd680ba5731c1dfd4701062e7fb2 upstream.
With CONFIG_DEBUG_PAGEALLOC, I observed an unallocated memory access in
function_graph trace. It appears we find a small size entry in ring buffer,
but we access it as a big size entry. The access overflows the page size
and touches an unallocated page.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
LKML-Reference: <1280217994.32400.76.camel@sli10-desk.sh.intel.com>
[ Added a comment to explain the problem - SDR ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
Change-Id: I21366ace371d1b8f4684ddbe4ea8d555a926ac21
Signed-off-by: Colin Cross <ccross@google.com>
|
|
commit 685fd0b4ea3f0f1d5385610b0d5b57775a8d5842 upstream.
A small number of users of IRQF_TIMER are using it for the implied no
suspend behaviour on interrupts which are not timer interrupts.
Therefore add a new IRQF_NO_SUSPEND flag, rename IRQF_TIMER to
__IRQF_TIMER and redefine IRQF_TIMER in terms of these new flags.
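The resulting flag layout can be sketched as follows. The numeric values
are assumptions chosen for illustration and should be checked against
include/linux/interrupt.h; irq_stays_enabled_on_suspend() is a
hypothetical helper, not a kernel function:

```c
#include <assert.h>

/* Assumed bit values, for illustration only. */
#define IRQF_NO_SUSPEND 0x00004000u
#define __IRQF_TIMER    0x00000200u
/* A timer interrupt is a timer bit plus the no-suspend behaviour. */
#define IRQF_TIMER      (__IRQF_TIMER | IRQF_NO_SUSPEND)

/* Hypothetical helper: should this IRQ be left enabled across suspend? */
int irq_stays_enabled_on_suspend(unsigned int flags)
{
    /* The two bits must be distinct for the split to mean anything. */
    assert((__IRQF_TIMER & IRQF_NO_SUSPEND) == 0);
    return (flags & IRQF_NO_SUSPEND) != 0;
}
```

Non-timer users can then request IRQF_NO_SUSPEND directly, while
existing IRQF_TIMER users keep their current behaviour.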
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: xen-devel@lists.xensource.com
Cc: linux-input@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1280398595-29708-1-git-send-email-ian.campbell@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 396e894d289d69bacf5acd983c97cd6e21a14c08 upstream.
Norbert reported that nohz_ratelimit() causes his laptop to burn about
4W (40%) extra. For now back out the change and see if we can adjust
the power management code to make better decisions.
Reported-by: Norbert Preining <preining@logic.at>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
Cc: Arjan van de Ven <arjan@infradead.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 694f690d27dadccc8cb9d90532e76593b61fe098 upstream.
Commit 8f92054e7ca1 ("CRED: Fix __task_cred()'s lockdep check and banner
comment") fixed the lockdep checks on __task_cred(). This has shown up
a place in the signalling code where a lock should be held - namely that
check_kill_permission() requires its callers to hold the RCU lock.
Fix group_send_sig_info() to get the RCU read lock around its call to
check_kill_permission().
Without this patch, the following warning can occur:
===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
kernel/signal.c:660 invoked rcu_dereference_check() without protection!
...
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
|
|
It's possible for get_task_cred() as it currently stands to 'corrupt' a set of
credentials by incrementing their usage count after their replacement by the
task being accessed.
What happens is that get_task_cred() can race with commit_creds():
TASK_1                        TASK_2                        RCU_CLEANER
-->get_task_cred(TASK_2)
rcu_read_lock()
__cred = __task_cred(TASK_2)
                              -->commit_creds()
                              old_cred = TASK_2->real_cred
                              TASK_2->real_cred = ...
                              put_cred(old_cred)
                                call_rcu(old_cred)
        [__cred->usage == 0]
get_cred(__cred)
        [__cred->usage == 1]
rcu_read_unlock()
                                                            -->put_cred_rcu()
                                                            [__cred->usage == 1]
                                                            panic()
However, since a task's credentials are generally not changed very often, we can
reasonably make use of a loop involving reading the creds pointer and using
atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero.
If successful, we can safely return the credentials in the knowledge that, even
if the task we're accessing has released them, they haven't gone to the RCU
cleanup code.
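A userspace sketch of the retry loop, using C11 atomics in place of the
kernel primitives. The struct layouts and the names inc_not_zero() and
get_task_cred_sketch() are assumptions for illustration; in the kernel
the loop body runs under rcu_read_lock(), which is what keeps the
pointed-to memory valid while the count is tested.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumption). */
struct cred { atomic_int usage; };
struct task { struct cred *_Atomic real_cred; };

/* Userspace stand-in for atomic_inc_not_zero(): increment unless zero. */
static bool inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* Re-read the creds pointer until a reference is successfully taken. */
struct cred *get_task_cred_sketch(struct task *t)
{
	struct cred *c;

	do {
		c = atomic_load(&t->real_cred);
	} while (!inc_not_zero(&c->usage));
	return c;
}
```

The loop only repeats when the count has already hit zero, which means
the task has replaced its credentials, so the next read of the pointer
picks up the new set.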
We then change task_state() in procfs to use get_task_cred() rather than
calling get_cred() on the result of __task_cred(), as that suffers from the
same problem.
Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be
tripped when it is noticed that the usage count is not zero as it ought to be,
for example:
kernel BUG at kernel/cred.c:168!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
CPU 0
Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex
745
RIP: 0010:[<ffffffff81069881>] [<ffffffff81069881>] __put_cred+0xc/0x45
RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff
RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0
RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0
R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001
FS: 00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0)
Stack:
ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45
<0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000
<0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246
Call Trace:
[<ffffffff810698cd>] put_cred+0x13/0x15
[<ffffffff81069b45>] commit_creds+0x16b/0x175
[<ffffffff8106aace>] set_current_groups+0x47/0x4e
[<ffffffff8106ac89>] sys_setgroups+0xf6/0x105
[<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00
48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b
04 25 00 cc 00 00 48 3b b8 58 04 00 00 75
RIP [<ffffffff81069881>] __put_cred+0xc/0x45
RSP <ffff88019e7e9eb8>
---[ end trace df391256a100ebdd ]---
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The command
echo "file ec.c +p" >/sys/kernel/debug/dynamic_debug/control
causes an oops.
Move the call to ddebug_remove_module() down into free_module(). In this
way it should be called from all error paths. Currently, we are missing
the remove if the module init routine fails.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Reported-by: Thomas Renninger <trenn@suse.de>
Tested-by: Thomas Renninger <trenn@suse.de>
Cc: <stable@kernel.org> [2.6.32+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Platforms must register a cpu power function that returns power in
milliWatt seconds.
Change-Id: I1caa0335e316c352eee3b1ddf326fcd4942bcbe8
Signed-off-by: Mike Chan <mike@android.com>
|
|
Introduce new platform callback hooks for cpuacct for tracking CPU frequencies
Not all platforms / architectures have a set CPU_FREQ_TABLE defined
for CPU transition speeds. In order to track time spent at various
CPU frequencies, we enable platform callbacks from cpuacct for this accounting.
Architectures that support overclock boosting, or don't have pre-defined
frequency tables can implement their own bucketing system that makes sense
given their cpufreq scaling abilities.
New file:
cpuacct.cpufreq reports the CPU time (in nanoseconds) spent at each CPU
frequency.
Change-Id: I10a80b3162e6fff3a8a2f74dd6bb37e88b12ba96
Signed-off-by: Mike Chan <mike@android.com>
|
|
This patch adds a notifier which can be used by subsystems that may
be interested in when a task has completely died and is about to
have its last resource freed.
The Android lowmemory killer uses this to determine when a task
it has killed has finally given up its goods.
Signed-off-by: San Mehat <san@google.com>
|
|
When DEBUG_SUSPEND is enabled, print active wakelocks when we check
if there are any active wakelocks.
In print_active_locks(), print expired wakelocks if DEBUG_EXPIRE is enabled.
Change-Id: Ib1cb795555e71ff23143a2bac7c8a58cbce16547
Signed-off-by: Mike Chan <mike@android.com>
|
|
Signed-off-by: San Mehat <san@google.com>
|
|
If idx was non-zero and the log had wrapped, len did not get truncated
to stop at the last byte written to the log.
|
|
This reverts commit acff181d3574244e651913df77332e897b88bff4.
|
|
Change-Id: Ib82f6a716686a3ebb4592112400fc5e4a2ce066c
Signed-off-by: Colin Cross <ccross@android.com>
|
|
Rather than using explicit euid == 0 checks when trying to move
tasks into a cgroup via CFS, move permission checks into each
specific cgroup subsystem. If a subsystem does not specify a
'can_attach' handler, then we fall back to doing our checks the old way.
This way non-root processes can add arbitrary processes to
a cgroup if all the registered subsystems on that cgroup agree.
Also change explicit euid == 0 check to CAP_SYS_ADMIN
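A simplified sketch of the decision this describes: every subsystem with
a can_attach handler must agree, and subsystems without one fall back to
a privileged-caller check. The types, the cap_sys_admin field and the
names may_attach() and allow_any() are hypothetical; the real code calls
into the cgroup subsystem structures and uses capable(CAP_SYS_ADMIN).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the caller's credentials and a subsystem. */
struct caller { bool cap_sys_admin; };
struct subsys { bool (*can_attach)(const struct caller *who); };

/* Example handler: a subsystem that lets any caller attach tasks. */
bool allow_any(const struct caller *who)
{
	(void)who;
	return true;
}

/* Attach is allowed only if every subsystem on the cgroup agrees. */
bool may_attach(const struct subsys *const *subsys, size_t n,
		const struct caller *who)
{
	for (size_t i = 0; i < n; i++) {
		if (subsys[i]->can_attach) {
			if (!subsys[i]->can_attach(who))
				return false;
		} else if (!who->cap_sys_admin) {
			return false;	/* no handler: old-style privileged check */
		}
	}
	return true;
}
```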
Signed-off-by: San Mehat <san@google.com>
|
|
Rather than signaling a full update of the display from userspace via a
console switch, this patch introduces two files in /sys/power,
wait_for_fb_sleep and wait_for_fb_wake. Reading these files will block
until the requested state has been entered. When a read from
wait_for_fb_sleep returns, userspace should stop drawing. When a read from
wait_for_fb_wake returns, it should do a full update. If either file is read
when the fb driver is already in the requested state, the read will return
immediately.
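A sketch of how a userspace drawing loop might consume these files. The
helper name wait_for_fb_state() is hypothetical; it only relies on the
behaviour described above, namely that a read blocks until the state has
been entered.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Block until the state named by path has been entered, e.g.
 * "/sys/power/wait_for_fb_sleep" or "/sys/power/wait_for_fb_wake".
 * Returns 0 once the read completes, -1 on error.
 */
int wait_for_fb_state(const char *path)
{
	char buf[16];
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	n = read(fd, buf, sizeof(buf));	/* blocks until the state is entered */
	close(fd);
	return n < 0 ? -1 : 0;
}
```

A caller would alternate between the two files: wait on
wait_for_fb_sleep, stop drawing, then wait on wait_for_fb_wake and do a
full update.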
Signed-off-by: Rebecca Schultz <rschultz@google.com>
Signed-off-by: Arve Hjønnevåg <arve@android.com>
|
|
vt_waitactive now needs a 1-based console number
Change-Id: I07ab9a3773c93d67c09d928c8d5494ce823ffa2e
|
|
Signed-off-by: Arve Hjønnevåg <arve@android.com>
|