summaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
authorStephen Rothwell <sfr@canb.auug.org.au>2013-11-05 14:58:01 +1100
committerStephen Rothwell <sfr@canb.auug.org.au>2013-11-05 14:58:04 +1100
commit7f1546329db7b573f3a640d75cc1af40dc5ee9ed (patch)
tree2fe1446a800cfd5a5f52b3bdc16f9dd43968eb57 /Documentation
parentddb758d5bb04265c29afbec3496f4b2f2fc2aa09 (diff)
parentf497c33a948932e2a11b71d1ea07b2eedbc2f0d7 (diff)
Merge remote-tracking branch 'tip/auto-latest'
Conflicts: arch/h8300/include/asm/Kbuild include/linux/acpi.h include/linux/wait.h tools/perf/config/Makefile tools/perf/config/feature-tests.mak
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/DocBook/device-drivers.tmpl5
-rw-r--r--Documentation/DocBook/genericirq.tmpl64
-rw-r--r--Documentation/RCU/checklist.txt4
-rw-r--r--Documentation/RCU/stallwarn.txt22
-rw-r--r--Documentation/devicetree/bindings/timer/efm32,timer.txt23
-rw-r--r--Documentation/efi-stub.txt (renamed from Documentation/x86/efi-stub.txt)0
-rw-r--r--Documentation/kernel-parameters.txt113
-rw-r--r--Documentation/kernel-per-CPU-kthreads.txt17
-rw-r--r--Documentation/lockstat.txt123
-rw-r--r--Documentation/sysctl/kernel.txt76
10 files changed, 293 insertions, 154 deletions
diff --git a/Documentation/DocBook/device-drivers.tmpl b/Documentation/DocBook/device-drivers.tmpl
index fe397f90a34f..6c9d9d37c83a 100644
--- a/Documentation/DocBook/device-drivers.tmpl
+++ b/Documentation/DocBook/device-drivers.tmpl
@@ -87,7 +87,10 @@ X!Iinclude/linux/kobject.h
!Ekernel/printk/printk.c
!Ekernel/panic.c
!Ekernel/sys.c
-!Ekernel/rcupdate.c
+!Ekernel/rcu/srcu.c
+!Ekernel/rcu/tree.c
+!Ekernel/rcu/tree_plugin.h
+!Ekernel/rcu/update.c
</sect1>
<sect1><title>Device Resource Management</title>
diff --git a/Documentation/DocBook/genericirq.tmpl b/Documentation/DocBook/genericirq.tmpl
index d16d21b7a3b7..46347f603353 100644
--- a/Documentation/DocBook/genericirq.tmpl
+++ b/Documentation/DocBook/genericirq.tmpl
@@ -87,7 +87,7 @@
<chapter id="rationale">
<title>Rationale</title>
<para>
- The original implementation of interrupt handling in Linux is using
+ The original implementation of interrupt handling in Linux uses
the __do_IRQ() super-handler, which is able to deal with every
type of interrupt logic.
</para>
@@ -111,19 +111,19 @@
</itemizedlist>
</para>
<para>
- This split implementation of highlevel IRQ handlers allows us to
+ This split implementation of high-level IRQ handlers allows us to
optimize the flow of the interrupt handling for each specific
- interrupt type. This reduces complexity in that particular codepath
+ interrupt type. This reduces complexity in that particular code path
and allows the optimized handling of a given type.
</para>
<para>
The original general IRQ implementation used hw_interrupt_type
structures and their ->ack(), ->end() [etc.] callbacks to
differentiate the flow control in the super-handler. This leads to
- a mix of flow logic and lowlevel hardware logic, and it also leads
- to unnecessary code duplication: for example in i386, there is a
- ioapic_level_irq and a ioapic_edge_irq irq-type which share many
- of the lowlevel details but have different flow handling.
+ a mix of flow logic and low-level hardware logic, and it also leads
+ to unnecessary code duplication: for example in i386, there is an
+ ioapic_level_irq and an ioapic_edge_irq IRQ-type which share many
+ of the low-level details but have different flow handling.
</para>
<para>
A more natural abstraction is the clean separation of the
@@ -132,23 +132,23 @@
<para>
Analysing a couple of architecture's IRQ subsystem implementations
reveals that most of them can use a generic set of 'irq flow'
- methods and only need to add the chip level specific code.
+ methods and only need to add the chip-level specific code.
The separation is also valuable for (sub)architectures
- which need specific quirks in the irq flow itself but not in the
- chip-details - and thus provides a more transparent IRQ subsystem
+ which need specific quirks in the IRQ flow itself but not in the
+ chip details - and thus provides a more transparent IRQ subsystem
design.
</para>
<para>
- Each interrupt descriptor is assigned its own highlevel flow
+ Each interrupt descriptor is assigned its own high-level flow
handler, which is normally one of the generic
- implementations. (This highlevel flow handler implementation also
+ implementations. (This high-level flow handler implementation also
makes it simple to provide demultiplexing handlers which can be
found in embedded platforms on various architectures.)
</para>
<para>
The separation makes the generic interrupt handling layer more
flexible and extensible. For example, an (sub)architecture can
- use a generic irq-flow implementation for 'level type' interrupts
+ use a generic IRQ-flow implementation for 'level type' interrupts
and add a (sub)architecture specific 'edge type' implementation.
</para>
<para>
@@ -172,9 +172,9 @@
<para>
There are three main levels of abstraction in the interrupt code:
<orderedlist>
- <listitem><para>Highlevel driver API</para></listitem>
- <listitem><para>Highlevel IRQ flow handlers</para></listitem>
- <listitem><para>Chiplevel hardware encapsulation</para></listitem>
+ <listitem><para>High-level driver API</para></listitem>
+ <listitem><para>High-level IRQ flow handlers</para></listitem>
+ <listitem><para>Chip-level hardware encapsulation</para></listitem>
</orderedlist>
</para>
<sect1 id="Interrupt_control_flow">
@@ -189,16 +189,16 @@
which are assigned to this interrupt.
</para>
<para>
- Whenever an interrupt triggers, the lowlevel arch code calls into
- the generic interrupt code by calling desc->handle_irq().
- This highlevel IRQ handling function only uses desc->irq_data.chip
+ Whenever an interrupt triggers, the low-level architecture code calls
+ into the generic interrupt code by calling desc->handle_irq().
+ This high-level IRQ handling function only uses desc->irq_data.chip
primitives referenced by the assigned chip descriptor structure.
</para>
</sect1>
<sect1 id="Highlevel_Driver_API">
- <title>Highlevel Driver API</title>
+ <title>High-level Driver API</title>
<para>
- The highlevel Driver API consists of following functions:
+ The high-level Driver API consists of following functions:
<itemizedlist>
<listitem><para>request_irq()</para></listitem>
<listitem><para>free_irq()</para></listitem>
@@ -216,7 +216,7 @@
</para>
</sect1>
<sect1 id="Highlevel_IRQ_flow_handlers">
- <title>Highlevel IRQ flow handlers</title>
+ <title>High-level IRQ flow handlers</title>
<para>
The generic layer provides a set of pre-defined irq-flow methods:
<itemizedlist>
@@ -228,7 +228,7 @@
<listitem><para>handle_edge_eoi_irq</para></listitem>
<listitem><para>handle_bad_irq</para></listitem>
</itemizedlist>
- The interrupt flow handlers (either predefined or architecture
+ The interrupt flow handlers (either pre-defined or architecture
specific) are assigned to specific interrupts by the architecture
either during bootup or during device initialization.
</para>
@@ -297,7 +297,7 @@ desc->irq_data.chip->irq_unmask();
<para>
handle_fasteoi_irq provides a generic implementation
for interrupts, which only need an EOI at the end of
- the handler
+ the handler.
</para>
<para>
The following control flow is implemented (simplified excerpt):
@@ -394,7 +394,7 @@ if (desc->irq_data.chip->irq_eoi)
The generic functions are intended for 'clean' architectures and chips,
which have no platform-specific IRQ handling quirks. If an architecture
needs to implement quirks on the 'flow' level then it can do so by
- overriding the highlevel irq-flow handler.
+ overriding the high-level irq-flow handler.
</para>
</sect2>
<sect2 id="Delayed_interrupt_disable">
@@ -419,9 +419,9 @@ if (desc->irq_data.chip->irq_eoi)
</sect2>
</sect1>
<sect1 id="Chiplevel_hardware_encapsulation">
- <title>Chiplevel hardware encapsulation</title>
+ <title>Chip-level hardware encapsulation</title>
<para>
- The chip level hardware descriptor structure irq_chip
+ The chip-level hardware descriptor structure irq_chip
contains all the direct chip relevant functions, which
can be utilized by the irq flow implementations.
<itemizedlist>
@@ -429,14 +429,14 @@ if (desc->irq_data.chip->irq_eoi)
<listitem><para>irq_mask_ack() - Optional, recommended for performance</para></listitem>
<listitem><para>irq_mask()</para></listitem>
<listitem><para>irq_unmask()</para></listitem>
- <listitem><para>irq_eoi() - Optional, required for eoi flow handlers</para></listitem>
+ <listitem><para>irq_eoi() - Optional, required for EOI flow handlers</para></listitem>
<listitem><para>irq_retrigger() - Optional</para></listitem>
<listitem><para>irq_set_type() - Optional</para></listitem>
<listitem><para>irq_set_wake() - Optional</para></listitem>
</itemizedlist>
These primitives are strictly intended to mean what they say: ack means
ACK, masking means masking of an IRQ line, etc. It is up to the flow
- handler(s) to use these basic units of lowlevel functionality.
+ handler(s) to use these basic units of low-level functionality.
</para>
</sect1>
</chapter>
@@ -445,7 +445,7 @@ if (desc->irq_data.chip->irq_eoi)
<title>__do_IRQ entry point</title>
<para>
The original implementation __do_IRQ() was an alternative entry
- point for all types of interrupts. It not longer exists.
+ point for all types of interrupts. It no longer exists.
</para>
<para>
This handler turned out to be not suitable for all
@@ -468,11 +468,11 @@ if (desc->irq_data.chip->irq_eoi)
<chapter id="genericchip">
<title>Generic interrupt chip</title>
<para>
- To avoid copies of identical implementations of irq chips the
+ To avoid copies of identical implementations of IRQ chips the
core provides a configurable generic interrupt chip
implementation. Developers should check carefuly whether the
generic chip fits their needs before implementing the same
- functionality slightly different themself.
+ functionality slightly differently themselves.
</para>
!Ekernel/irq/generic-chip.c
</chapter>
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 7703ec73a9bb..91266193b8f4 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -202,8 +202,8 @@ over a rather long period of time, but improvements are always welcome!
updater uses call_rcu_sched() or synchronize_sched(), then
the corresponding readers must disable preemption, possibly
by calling rcu_read_lock_sched() and rcu_read_unlock_sched().
- If the updater uses synchronize_srcu() or call_srcu(),
- the the corresponding readers must use srcu_read_lock() and
+ If the updater uses synchronize_srcu() or call_srcu(), then
+ the corresponding readers must use srcu_read_lock() and
srcu_read_unlock(), and with the same srcu_struct. The rules for
the expedited primitives are the same as for their non-expedited
counterparts. Mixing things up will result in confusion and
diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
index 8e9359de1d28..6f3a0057548e 100644
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -12,12 +12,12 @@ CONFIG_RCU_CPU_STALL_TIMEOUT
This kernel configuration parameter defines the period of time
that RCU will wait from the beginning of a grace period until it
issues an RCU CPU stall warning. This time period is normally
- sixty seconds.
+ 21 seconds.
This configuration parameter may be changed at runtime via the
/sys/module/rcutree/parameters/rcu_cpu_stall_timeout, however
this parameter is checked only at the beginning of a cycle.
- So if you are 30 seconds into a 70-second stall, setting this
+ So if you are 10 seconds into a 40-second stall, setting this
sysfs parameter to (say) five will shorten the timeout for the
-next- stall, or the following warning for the current stall
(assuming the stall lasts long enough). It will not affect the
@@ -32,7 +32,7 @@ CONFIG_RCU_CPU_STALL_VERBOSE
also dump the stacks of any tasks that are blocking the current
RCU-preempt grace period.
-RCU_CPU_STALL_INFO
+CONFIG_RCU_CPU_STALL_INFO
This kernel configuration parameter causes the stall warning to
print out additional per-CPU diagnostic information, including
@@ -43,7 +43,8 @@ RCU_STALL_DELAY_DELTA
Although the lockdep facility is extremely useful, it does add
some overhead. Therefore, under CONFIG_PROVE_RCU, the
RCU_STALL_DELAY_DELTA macro allows five extra seconds before
- giving an RCU CPU stall warning message.
+ giving an RCU CPU stall warning message. (This is a cpp
+ macro, not a kernel configuration parameter.)
RCU_STALL_RAT_DELAY
@@ -52,7 +53,8 @@ RCU_STALL_RAT_DELAY
However, if the offending CPU does not detect its own stall in
the number of jiffies specified by RCU_STALL_RAT_DELAY, then
some other CPU will complain. This delay is normally set to
- two jiffies.
+ two jiffies. (This is a cpp macro, not a kernel configuration
+ parameter.)
When a CPU detects that it is stalling, it will print a message similar
to the following:
@@ -86,7 +88,12 @@ printing, there will be a spurious stall-warning message:
INFO: rcu_bh_state detected stalls on CPUs/tasks: { } (detected by 4, 2502 jiffies)
-This is rare, but does happen from time to time in real life.
+This is rare, but does happen from time to time in real life. It is also
+possible for a zero-jiffy stall to be flagged in this case, depending
+on how the stall warning and the grace-period initialization happen to
+interact. Please note that it is not possible to entirely eliminate this
+sort of false positive without resorting to things like stop_machine(),
+which is overkill for this sort of problem.
If the CONFIG_RCU_CPU_STALL_INFO kernel configuration parameter is set,
more information is printed with the stall-warning message, for example:
@@ -216,4 +223,5 @@ that portion of the stack which remains the same from trace to trace.
If you can reliably trigger the stall, ftrace can be quite helpful.
RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE
-and with RCU's event tracing.
+and with RCU's event tracing. For information on RCU's event tracing,
+see include/trace/events/rcu.h.
diff --git a/Documentation/devicetree/bindings/timer/efm32,timer.txt b/Documentation/devicetree/bindings/timer/efm32,timer.txt
new file mode 100644
index 000000000000..97a568f696c9
--- /dev/null
+++ b/Documentation/devicetree/bindings/timer/efm32,timer.txt
@@ -0,0 +1,23 @@
+* EFM32 timer hardware
+
+The efm32 Giant Gecko SoCs come with four 16 bit timers. Two counters can be
+connected to form a 32 bit counter. Each timer has three Compare/Capture
+channels and can be used as PWM or Quadrature Decoder. Available clock sources
+are the cpu's HFPERCLK (with a 10-bit prescaler) or an external pin.
+
+Required properties:
+- compatible : Should be efm32,timer
+- reg : Address and length of the register set
+- clocks : Should contain a reference to the HFPERCLK
+
+Optional properties:
+- interrupts : Reference to the timer interrupt
+
+Example:
+
+timer@40010c00 {
+ compatible = "efm32,timer";
+ reg = <0x40010c00 0x400>;
+ interrupts = <14>;
+ clocks = <&cmu clk_HFPERCLKTIMER3>;
+};
diff --git a/Documentation/x86/efi-stub.txt b/Documentation/efi-stub.txt
index 44e6bb6ead10..44e6bb6ead10 100644
--- a/Documentation/x86/efi-stub.txt
+++ b/Documentation/efi-stub.txt
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f416ea9de5a6..dcab4b64fb3b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -847,6 +847,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
earlyprintk= [X86,SH,BLACKFIN,ARM]
earlyprintk=vga
+ earlyprintk=efi
earlyprintk=xen
earlyprintk=serial[,ttySn[,baudrate]]
earlyprintk=serial[,0x...[,baudrate]]
@@ -860,7 +861,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
Append ",keep" to not disable it when the real console
takes over.
- Only vga or serial or usb debug port at a time.
+ Only one of vga, efi, serial, or usb debug port can
+ be used at a time.
Currently only ttyS0 and ttyS1 may be specified by
name. Other I/O ports may be explicitly specified
@@ -874,8 +876,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
Interaction with the standard serial driver is not
very good.
- The VGA output is eventually overwritten by the real
- console.
+ The VGA and EFI output is eventually overwritten by
+ the real console.
The xen output can only be used by Xen PV guests.
@@ -1984,6 +1986,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
noapic [SMP,APIC] Tells the kernel to not make use of any
IOAPICs that may be present in the system.
+ nokaslr [X86]
+ Disable kernel base offset ASLR (Address Space
+ Layout Randomization) if built into the kernel.
+
noautogroup Disable scheduler automatic task group creation.
nobats [PPC] Do not use BATs for mapping kernel lowmem
@@ -2608,7 +2614,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
ramdisk_size= [RAM] Sizes of RAM disks in kilobytes
See Documentation/blockdev/ramdisk.txt.
- rcu_nocbs= [KNL,BOOT]
+ rcu_nocbs= [KNL]
In kernels built with CONFIG_RCU_NOCB_CPU=y, set
the specified list of CPUs to be no-callback CPUs.
Invocation of these CPUs' RCU callbacks will
@@ -2621,7 +2627,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
real-time workloads. It can also improve energy
efficiency for asymmetric multiprocessors.
- rcu_nocb_poll [KNL,BOOT]
+ rcu_nocb_poll [KNL]
Rather than requiring that offloaded CPUs
(specified by rcu_nocbs= above) explicitly
awaken the corresponding "rcuoN" kthreads,
@@ -2632,126 +2638,145 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
energy efficiency by requiring that the kthreads
periodically wake up to do the polling.
- rcutree.blimit= [KNL,BOOT]
+ rcutree.blimit= [KNL]
Set maximum number of finished RCU callbacks to process
in one batch.
- rcutree.fanout_leaf= [KNL,BOOT]
+ rcutree.rcu_fanout_leaf= [KNL]
Increase the number of CPUs assigned to each
leaf rcu_node structure. Useful for very large
systems.
- rcutree.jiffies_till_first_fqs= [KNL,BOOT]
+ rcutree.jiffies_till_first_fqs= [KNL]
Set delay from grace-period initialization to
first attempt to force quiescent states.
Units are jiffies, minimum value is zero,
and maximum value is HZ.
- rcutree.jiffies_till_next_fqs= [KNL,BOOT]
+ rcutree.jiffies_till_next_fqs= [KNL]
Set delay between subsequent attempts to force
quiescent states. Units are jiffies, minimum
value is one, and maximum value is HZ.
- rcutree.qhimark= [KNL,BOOT]
+ rcutree.qhimark= [KNL]
Set threshold of queued
RCU callbacks over which batch limiting is disabled.
- rcutree.qlowmark= [KNL,BOOT]
+ rcutree.qlowmark= [KNL]
Set threshold of queued RCU callbacks below which
batch limiting is re-enabled.
- rcutree.rcu_cpu_stall_suppress= [KNL,BOOT]
- Suppress RCU CPU stall warning messages.
-
- rcutree.rcu_cpu_stall_timeout= [KNL,BOOT]
- Set timeout for RCU CPU stall warning messages.
-
- rcutree.rcu_idle_gp_delay= [KNL,BOOT]
+ rcutree.rcu_idle_gp_delay= [KNL]
Set wakeup interval for idle CPUs that have
RCU callbacks (RCU_FAST_NO_HZ=y).
- rcutree.rcu_idle_lazy_gp_delay= [KNL,BOOT]
+ rcutree.rcu_idle_lazy_gp_delay= [KNL]
Set wakeup interval for idle CPUs that have
only "lazy" RCU callbacks (RCU_FAST_NO_HZ=y).
Lazy RCU callbacks are those which RCU can
prove do nothing more than free memory.
- rcutorture.fqs_duration= [KNL,BOOT]
+ rcutorture.fqs_duration= [KNL]
Set duration of force_quiescent_state bursts.
- rcutorture.fqs_holdoff= [KNL,BOOT]
+ rcutorture.fqs_holdoff= [KNL]
Set holdoff time within force_quiescent_state bursts.
- rcutorture.fqs_stutter= [KNL,BOOT]
+ rcutorture.fqs_stutter= [KNL]
Set wait time between force_quiescent_state bursts.
- rcutorture.irqreader= [KNL,BOOT]
- Test RCU readers from irq handlers.
+ rcutorture.gp_exp= [KNL]
+ Use expedited update-side primitives.
+
+ rcutorture.gp_normal= [KNL]
+ Use normal (non-expedited) update-side primitives.
+ If both gp_exp and gp_normal are set, do both.
+ If neither gp_exp nor gp_normal are set, still
+ do both.
- rcutorture.n_barrier_cbs= [KNL,BOOT]
+ rcutorture.n_barrier_cbs= [KNL]
Set callbacks/threads for rcu_barrier() testing.
- rcutorture.nfakewriters= [KNL,BOOT]
+ rcutorture.nfakewriters= [KNL]
Set number of concurrent RCU writers. These just
stress RCU, they don't participate in the actual
test, hence the "fake".
- rcutorture.nreaders= [KNL,BOOT]
+ rcutorture.nreaders= [KNL]
Set number of RCU readers.
- rcutorture.onoff_holdoff= [KNL,BOOT]
+ rcutorture.object_debug= [KNL]
+ Enable debug-object double-call_rcu() testing.
+
+ rcutorture.onoff_holdoff= [KNL]
Set time (s) after boot for CPU-hotplug testing.
- rcutorture.onoff_interval= [KNL,BOOT]
+ rcutorture.onoff_interval= [KNL]
Set time (s) between CPU-hotplug operations, or
zero to disable CPU-hotplug testing.
- rcutorture.shuffle_interval= [KNL,BOOT]
+ rcutorture.rcutorture_runnable= [BOOT]
+ Start rcutorture running at boot time.
+
+ rcutorture.shuffle_interval= [KNL]
Set task-shuffle interval (s). Shuffling tasks
allows some CPUs to go into dyntick-idle mode
during the rcutorture test.
- rcutorture.shutdown_secs= [KNL,BOOT]
+ rcutorture.shutdown_secs= [KNL]
Set time (s) after boot system shutdown. This
is useful for hands-off automated testing.
- rcutorture.stall_cpu= [KNL,BOOT]
+ rcutorture.stall_cpu= [KNL]
Duration of CPU stall (s) to test RCU CPU stall
warnings, zero to disable.
- rcutorture.stall_cpu_holdoff= [KNL,BOOT]
+ rcutorture.stall_cpu_holdoff= [KNL]
Time to wait (s) after boot before inducing stall.
- rcutorture.stat_interval= [KNL,BOOT]
+ rcutorture.stat_interval= [KNL]
Time (s) between statistics printk()s.
- rcutorture.stutter= [KNL,BOOT]
+ rcutorture.stutter= [KNL]
Time (s) to stutter testing, for example, specifying
five seconds causes the test to run for five seconds,
wait for five seconds, and so on. This tests RCU's
ability to transition abruptly to and from idle.
- rcutorture.test_boost= [KNL,BOOT]
+ rcutorture.test_boost= [KNL]
Test RCU priority boosting? 0=no, 1=maybe, 2=yes.
"Maybe" means test if the RCU implementation
under test support RCU priority boosting.
- rcutorture.test_boost_duration= [KNL,BOOT]
+ rcutorture.test_boost_duration= [KNL]
Duration (s) of each individual boost test.
- rcutorture.test_boost_interval= [KNL,BOOT]
+ rcutorture.test_boost_interval= [KNL]
Interval (s) between each boost test.
- rcutorture.test_no_idle_hz= [KNL,BOOT]
+ rcutorture.test_no_idle_hz= [KNL]
Test RCU's dyntick-idle handling. See also the
rcutorture.shuffle_interval parameter.
- rcutorture.torture_type= [KNL,BOOT]
+ rcutorture.torture_type= [KNL]
Specify the RCU implementation to test.
- rcutorture.verbose= [KNL,BOOT]
+ rcutorture.verbose= [KNL]
Enable additional printk() statements.
+ rcupdate.rcu_expedited= [KNL]
+ Use expedited grace-period primitives, for
+ example, synchronize_rcu_expedited() instead
+ of synchronize_rcu(). This reduces latency,
+ but can increase CPU utilization, degrade
+ real-time latency, and degrade energy efficiency.
+
+ rcupdate.rcu_cpu_stall_suppress= [KNL]
+ Suppress RCU CPU stall warning messages.
+
+ rcupdate.rcu_cpu_stall_timeout= [KNL]
+ Set timeout for RCU CPU stall warning messages.
+
rdinit= [KNL]
Format: <full_path>
Run specified binary instead of /init from the ramdisk,
@@ -3480,11 +3505,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
default x2apic cluster mode on platforms
supporting x2apic.
- x86_mrst_timer= [X86-32,APBT]
- Choose timer option for x86 Moorestown MID platform.
+ x86_intel_mid_timer= [X86-32,APBT]
+ Choose timer option for x86 Intel MID platform.
Two valid options are apbt timer only and lapic timer
plus one apbt timer for broadcast timer.
- x86_mrst_timer=apbt_only | lapic_and_apbt
+ x86_intel_mid_timer=apbt_only | lapic_and_apbt
xen_emul_unplug= [HW,X86,XEN]
Unplug Xen emulated devices
diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
index 32351bfabf20..827104fb9364 100644
--- a/Documentation/kernel-per-CPU-kthreads.txt
+++ b/Documentation/kernel-per-CPU-kthreads.txt
@@ -181,12 +181,17 @@ To reduce its OS jitter, do any of the following:
make sure that this is safe on your particular system.
d. It is not possible to entirely get rid of OS jitter
from vmstat_update() on CONFIG_SMP=y systems, but you
- can decrease its frequency by writing a large value to
- /proc/sys/vm/stat_interval. The default value is HZ,
- for an interval of one second. Of course, larger values
- will make your virtual-memory statistics update more
- slowly. Of course, you can also run your workload at
- a real-time priority, thus preempting vmstat_update().
+ can decrease its frequency by writing a large value
+ to /proc/sys/vm/stat_interval. The default value is
+ HZ, for an interval of one second. Of course, larger
+ values will make your virtual-memory statistics update
+ more slowly. Of course, you can also run your workload
+ at a real-time priority, thus preempting vmstat_update(),
+ but if your workload is CPU-bound, this is a bad idea.
+ However, there is an RFC patch from Christoph Lameter
+ (based on an earlier one from Gilad Ben-Yossef) that
+ reduces or even eliminates vmstat overhead for some
+ workloads at https://lkml.org/lkml/2013/9/4/379.
e. If running on high-end powerpc servers, build with
CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS
daemon from running on each CPU every second or so.
diff --git a/Documentation/lockstat.txt b/Documentation/lockstat.txt
index dd2f7b26ca30..72d010689751 100644
--- a/Documentation/lockstat.txt
+++ b/Documentation/lockstat.txt
@@ -46,16 +46,14 @@ With these hooks we provide the following statistics:
contentions - number of lock acquisitions that had to wait
wait time min - shortest (non-0) time we ever had to wait for a lock
max - longest time we ever had to wait for a lock
- total - total time we spend waiting on this lock
+ total - total time we spend waiting on this lock
+ avg - average time spent waiting on this lock
acq-bounces - number of lock acquisitions that involved x-cpu data
acquisitions - number of times we took the lock
hold time min - shortest (non-0) time we ever held the lock
- max - longest time we ever held the lock
- total - total time this lock was held
-
-From these number various other statistics can be derived, such as:
-
- hold time average = hold time total / acquisitions
+ max - longest time we ever held the lock
+ total - total time this lock was held
+ avg - average time this lock was held
These numbers are gathered per lock class, per read/write state (when
applicable).
@@ -84,37 +82,38 @@ Look at the current lock statistics:
# less /proc/lock_stat
-01 lock_stat version 0.3
-02 -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-03 class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
-04 -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+01 lock_stat version 0.4
+02-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+03 class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
+04-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
05
-06 &mm->mmap_sem-W: 233 538 18446744073708 22924.27 607243.51 1342 45806 1.71 8595.89 1180582.34
-07 &mm->mmap_sem-R: 205 587 18446744073708 28403.36 731975.00 1940 412426 0.58 187825.45 6307502.88
-08 ---------------
-09 &mm->mmap_sem 487 [<ffffffff8053491f>] do_page_fault+0x466/0x928
-10 &mm->mmap_sem 179 [<ffffffff802a6200>] sys_mprotect+0xcd/0x21d
-11 &mm->mmap_sem 279 [<ffffffff80210a57>] sys_mmap+0x75/0xce
-12 &mm->mmap_sem 76 [<ffffffff802a490b>] sys_munmap+0x32/0x59
-13 ---------------
-14 &mm->mmap_sem 270 [<ffffffff80210a57>] sys_mmap+0x75/0xce
-15 &mm->mmap_sem 431 [<ffffffff8053491f>] do_page_fault+0x466/0x928
-16 &mm->mmap_sem 138 [<ffffffff802a490b>] sys_munmap+0x32/0x59
-17 &mm->mmap_sem 145 [<ffffffff802a6200>] sys_mprotect+0xcd/0x21d
+06 &mm->mmap_sem-W: 46 84 0.26 939.10 16371.53 194.90 47291 2922365 0.16 2220301.69 17464026916.32 5975.99
+07 &mm->mmap_sem-R: 37 100 1.31 299502.61 325629.52 3256.30 212344 34316685 0.10 7744.91 95016910.20 2.77
+08 ---------------
+09 &mm->mmap_sem 1 [<ffffffff811502a7>] khugepaged_scan_mm_slot+0x57/0x280
+19 &mm->mmap_sem 96 [<ffffffff815351c4>] __do_page_fault+0x1d4/0x510
+11 &mm->mmap_sem 34 [<ffffffff81113d77>] vm_mmap_pgoff+0x87/0xd0
+12 &mm->mmap_sem 17 [<ffffffff81127e71>] vm_munmap+0x41/0x80
+13 ---------------
+14 &mm->mmap_sem 1 [<ffffffff81046fda>] dup_mmap+0x2a/0x3f0
+15 &mm->mmap_sem 60 [<ffffffff81129e29>] SyS_mprotect+0xe9/0x250
+16 &mm->mmap_sem 41 [<ffffffff815351c4>] __do_page_fault+0x1d4/0x510
+17 &mm->mmap_sem 68 [<ffffffff81113d77>] vm_mmap_pgoff+0x87/0xd0
18
-19 ...............................................................................................................................................................................................
+19.............................................................................................................................................................................................................................
20
-21 dcache_lock: 621 623 0.52 118.26 1053.02 6745 91930 0.29 316.29 118423.41
-22 -----------
-23 dcache_lock 179 [<ffffffff80378274>] _atomic_dec_and_lock+0x34/0x54
-24 dcache_lock 113 [<ffffffff802cc17b>] d_alloc+0x19a/0x1eb
-25 dcache_lock 99 [<ffffffff802ca0dc>] d_rehash+0x1b/0x44
-26 dcache_lock 104 [<ffffffff802cbca0>] d_instantiate+0x36/0x8a
-27 -----------
-28 dcache_lock 192 [<ffffffff80378274>] _atomic_dec_and_lock+0x34/0x54
-29 dcache_lock 98 [<ffffffff802ca0dc>] d_rehash+0x1b/0x44
-30 dcache_lock 72 [<ffffffff802cc17b>] d_alloc+0x19a/0x1eb
-31 dcache_lock 112 [<ffffffff802cbca0>] d_instantiate+0x36/0x8a
+21 unix_table_lock: 110 112 0.21 49.24 163.91 1.46 21094 66312 0.12 624.42 31589.81 0.48
+22 ---------------
+23 unix_table_lock 45 [<ffffffff8150ad8e>] unix_create1+0x16e/0x1b0
+24 unix_table_lock 47 [<ffffffff8150b111>] unix_release_sock+0x31/0x250
+25 unix_table_lock 15 [<ffffffff8150ca37>] unix_find_other+0x117/0x230
+26 unix_table_lock 5 [<ffffffff8150a09f>] unix_autobind+0x11f/0x1b0
+27 ---------------
+28 unix_table_lock 39 [<ffffffff8150b111>] unix_release_sock+0x31/0x250
+29 unix_table_lock 49 [<ffffffff8150ad8e>] unix_create1+0x16e/0x1b0
+30 unix_table_lock 20 [<ffffffff8150ca37>] unix_find_other+0x117/0x230
+31 unix_table_lock 4 [<ffffffff8150a09f>] unix_autobind+0x11f/0x1b0
+
This excerpt shows the first two lock class statistics. Line 01 shows the
output version - each time the format changes this will be updated. Line 02-04
@@ -131,30 +130,30 @@ The integer part of the time values is in us.
Dealing with nested locks, subclasses may appear:
-32...............................................................................................................................................................................................
+32...........................................................................................................................................................................................................................
33
-34 &rq->lock: 13128 13128 0.43 190.53 103881.26 97454 3453404 0.00 401.11 13224683.11
+34 &rq->lock: 13128 13128 0.43 190.53 103881.26 7.91 97454 3453404 0.00 401.11 13224683.11 3.82
35 ---------
-36 &rq->lock 645 [<ffffffff8103bfc4>] task_rq_lock+0x43/0x75
-37 &rq->lock 297 [<ffffffff8104ba65>] try_to_wake_up+0x127/0x25a
-38 &rq->lock 360 [<ffffffff8103c4c5>] select_task_rq_fair+0x1f0/0x74a
-39 &rq->lock 428 [<ffffffff81045f98>] scheduler_tick+0x46/0x1fb
+36 &rq->lock 645 [<ffffffff8103bfc4>] task_rq_lock+0x43/0x75
+37 &rq->lock 297 [<ffffffff8104ba65>] try_to_wake_up+0x127/0x25a
+38 &rq->lock 360 [<ffffffff8103c4c5>] select_task_rq_fair+0x1f0/0x74a
+39 &rq->lock 428 [<ffffffff81045f98>] scheduler_tick+0x46/0x1fb
40 ---------
-41 &rq->lock 77 [<ffffffff8103bfc4>] task_rq_lock+0x43/0x75
-42 &rq->lock 174 [<ffffffff8104ba65>] try_to_wake_up+0x127/0x25a
-43 &rq->lock 4715 [<ffffffff8103ed4b>] double_rq_lock+0x42/0x54
-44 &rq->lock 893 [<ffffffff81340524>] schedule+0x157/0x7b8
+41 &rq->lock 77 [<ffffffff8103bfc4>] task_rq_lock+0x43/0x75
+42 &rq->lock 174 [<ffffffff8104ba65>] try_to_wake_up+0x127/0x25a
+43 &rq->lock 4715 [<ffffffff8103ed4b>] double_rq_lock+0x42/0x54
+44 &rq->lock 893 [<ffffffff81340524>] schedule+0x157/0x7b8
45
-46...............................................................................................................................................................................................
+46...........................................................................................................................................................................................................................
47
-48 &rq->lock/1: 11526 11488 0.33 388.73 136294.31 21461 38404 0.00 37.93 109388.53
+48 &rq->lock/1: 1526 11488 0.33 388.73 136294.31 11.86 21461 38404 0.00 37.93 109388.53 2.84
49 -----------
-50 &rq->lock/1 11526 [<ffffffff8103ed58>] double_rq_lock+0x4f/0x54
+50 &rq->lock/1 11526 [<ffffffff8103ed58>] double_rq_lock+0x4f/0x54
51 -----------
-52 &rq->lock/1 5645 [<ffffffff8103ed4b>] double_rq_lock+0x42/0x54
-53 &rq->lock/1 1224 [<ffffffff81340524>] schedule+0x157/0x7b8
-54 &rq->lock/1 4336 [<ffffffff8103ed58>] double_rq_lock+0x4f/0x54
-55 &rq->lock/1 181 [<ffffffff8104ba65>] try_to_wake_up+0x127/0x25a
+52 &rq->lock/1 5645 [<ffffffff8103ed4b>] double_rq_lock+0x42/0x54
+53 &rq->lock/1 1224 [<ffffffff81340524>] schedule+0x157/0x7b8
+54 &rq->lock/1 4336 [<ffffffff8103ed58>] double_rq_lock+0x4f/0x54
+55 &rq->lock/1 181 [<ffffffff8104ba65>] try_to_wake_up+0x127/0x25a
Line 48 shows statistics for the second subclass (/1) of &rq->lock class
(subclass starts from 0), since in this case, as line 50 suggests,
@@ -163,16 +162,16 @@ double_rq_lock actually acquires a nested lock of two spinlocks.
View the top contending locks:
# grep : /proc/lock_stat | head
- &inode->i_data.tree_lock-W: 15 21657 0.18 1093295.30 11547131054.85 58 10415 0.16 87.51 6387.60
- &inode->i_data.tree_lock-R: 0 0 0.00 0.00 0.00 23302 231198 0.25 8.45 98023.38
- dcache_lock: 1037 1161 0.38 45.32 774.51 6611 243371 0.15 306.48 77387.24
- &inode->i_mutex: 161 286 18446744073709 62882.54 1244614.55 3653 20598 18446744073709 62318.60 1693822.74
- &zone->lru_lock: 94 94 0.53 7.33 92.10 4366 32690 0.29 59.81 16350.06
- &inode->i_data.i_mmap_mutex: 79 79 0.40 3.77 53.03 11779 87755 0.28 116.93 29898.44
- &q->__queue_lock: 48 50 0.52 31.62 86.31 774 13131 0.17 113.08 12277.52
- &rq->rq_lock_key: 43 47 0.74 68.50 170.63 3706 33929 0.22 107.99 17460.62
- &rq->rq_lock_key#2: 39 46 0.75 6.68 49.03 2979 32292 0.17 125.17 17137.63
- tasklist_lock-W: 15 15 1.45 10.87 32.70 1201 7390 0.58 62.55 13648.47
+ clockevents_lock: 2926159 2947636 0.15 46882.81 1784540466.34 605.41 3381345 3879161 0.00 2260.97 53178395.68 13.71
+ tick_broadcast_lock: 346460 346717 0.18 2257.43 39364622.71 113.54 3642919 4242696 0.00 2263.79 49173646.60 11.59
+ &mapping->i_mmap_mutex: 203896 203899 3.36 645530.05 31767507988.39 155800.21 3361776 8893984 0.17 2254.15 14110121.02 1.59
+ &rq->lock: 135014 136909 0.18 606.09 842160.68 6.15 1540728 10436146 0.00 728.72 17606683.41 1.69
+ &(&zone->lru_lock)->rlock: 93000 94934 0.16 59.18 188253.78 1.98 1199912 3809894 0.15 391.40 3559518.81 0.93
+ tasklist_lock-W: 40667 41130 0.23 1189.42 428980.51 10.43 270278 510106 0.16 653.51 3939674.91 7.72
+ tasklist_lock-R: 21298 21305 0.20 1310.05 215511.12 10.12 186204 241258 0.14 1162.33 1179779.23 4.89
+ rcu_node_1: 47656 49022 0.16 635.41 193616.41 3.95 844888 1865423 0.00 764.26 1656226.96 0.89
+ &(&dentry->d_lockref.lock)->rlock: 39791 40179 0.15 1302.08 88851.96 2.21 2790851 12527025 0.10 1910.75 3379714.27 0.27
+ rcu_node_0: 29203 30064 0.16 786.55 1555573.00 51.74 88963 244254 0.00 398.87 428872.51 1.76
Clear the statistics:
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 9d4c1d18ad44..4273b2d71a27 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -355,6 +355,82 @@ utilize.
==============================================================
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
+numa_balancing_scan_size_mb, numa_balancing_settle_count sysctls and
+numa_balancing_migrate_deferred.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans tasks address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running. Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases. The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases. The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a tasks memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_settle_count is how many scan periods must complete before
+the schedule balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
+numa_balancing_migrate_deferred is how many page migrations get skipped
+unconditionally, after a page migration is skipped because a page is shared
+with other tasks. This reduces page migration overhead, and determines
+how much stronger the "move task near its memory" policy scheduler becomes,
+versus the "move memory near its task" memory management policy, for workloads
+with shared memory.
+
+==============================================================
+
osrelease, ostype & version:
# cat osrelease