summaryrefslogtreecommitdiff
path: root/Documentation/admin-guide
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/admin-guide')
-rw-r--r--Documentation/admin-guide/cgroup-v1/memcg_test.rst2
-rw-r--r--Documentation/admin-guide/cgroup-v1/memory.rst8
-rw-r--r--Documentation/admin-guide/cgroup-v2.rst2
-rw-r--r--Documentation/admin-guide/devices.txt16
-rw-r--r--Documentation/admin-guide/dynamic-debug-howto.rst5
-rw-r--r--Documentation/admin-guide/hw-vuln/gather_data_sampling.rst109
-rw-r--r--Documentation/admin-guide/hw-vuln/index.rst14
-rw-r--r--Documentation/admin-guide/hw-vuln/spectre.rst11
-rw-r--r--Documentation/admin-guide/hw-vuln/srso.rst150
-rw-r--r--Documentation/admin-guide/kdump/vmcoreinfo.rst20
-rw-r--r--Documentation/admin-guide/kernel-parameters.rst25
-rw-r--r--Documentation/admin-guide/kernel-parameters.txt250
-rw-r--r--Documentation/admin-guide/media/qcom_camss.rst6
-rw-r--r--Documentation/admin-guide/mm/damon/usage.rst80
-rw-r--r--Documentation/admin-guide/mm/ksm.rst27
-rw-r--r--Documentation/admin-guide/mm/memory-hotplug.rst22
-rw-r--r--Documentation/admin-guide/mm/numa_memory_policy.rst2
-rw-r--r--Documentation/admin-guide/mm/userfaultfd.rst15
-rw-r--r--Documentation/admin-guide/mm/zswap.rst14
-rw-r--r--Documentation/admin-guide/module-signing.rst2
-rw-r--r--Documentation/admin-guide/perf/alibaba_pmu.rst5
-rw-r--r--Documentation/admin-guide/serial-console.rst2
-rw-r--r--Documentation/admin-guide/sysctl/kernel.rst56
-rw-r--r--Documentation/admin-guide/xfs.rst2
24 files changed, 672 insertions, 173 deletions
diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index a402359abb99..1f128458ddea 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -62,7 +62,7 @@ Please note that implementation details can be changed.
At cancel(), simply usage -= PAGE_SIZE.
-Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
+Under below explanation, we assume CONFIG_SWAP=y.
4. Anonymous
============
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index fabaad3fd9c2..5f502bf68fbc 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -92,8 +92,6 @@ Brief summary of control files.
memory.oom_control set/show oom controls.
memory.numa_stat show the number of memory usage per numa
node
- memory.kmem.limit_in_bytes This knob is deprecated and writing to
- it will return -ENOTSUPP.
memory.kmem.usage_in_bytes show current kernel memory allocation
memory.kmem.failcnt show the number of kernel memory usage
hits limits
@@ -197,11 +195,11 @@ are not accounted. We just account pages under usual VM management.
RSS pages are accounted at page_fault unless they've already been accounted
for earlier. A file page will be accounted for as Page Cache when it's
-inserted into inode (radix-tree). While it's mapped into the page tables of
+inserted into inode (xarray). While it's mapped into the page tables of
processes, duplicate accounting is carefully avoided.
An RSS page is unaccounted when it's fully unmapped. A PageCache page is
-unaccounted when it's removed from radix-tree. Even if RSS pages are fully
+unaccounted when it's removed from xarray. Even if RSS pages are fully
unmapped (by kswapd), they may exist as SwapCache in the system until they
are really freed. Such SwapCaches are also accounted.
A swapped-in page is accounted after adding into swapcache.
@@ -909,7 +907,7 @@ experiences some pressure. In this situation, only group C will receive the
notification, i.e. groups A and B will not receive it. This is done to avoid
excessive "broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing. Group B, will receive
-notification only if there are no event listers for group C.
+notification only if there are no event listeners for group C.
There are three optional modes that specify different propagation behavior:
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 4ef890191196..b26b5274eaaf 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1045,7 +1045,7 @@ All time durations are in microseconds.
- user_usec
- system_usec
- and the following three when the controller is enabled:
+ and the following five when the controller is enabled:
- nr_periods
- nr_throttled
diff --git a/Documentation/admin-guide/devices.txt b/Documentation/admin-guide/devices.txt
index 06c525e01ea5..839054923530 100644
--- a/Documentation/admin-guide/devices.txt
+++ b/Documentation/admin-guide/devices.txt
@@ -2691,18 +2691,9 @@
45 = /dev/ttyMM1 Marvell MPSC - port 1 (obsolete unused)
46 = /dev/ttyCPM0 PPC CPM (SCC or SMC) - port 0
...
- 47 = /dev/ttyCPM5 PPC CPM (SCC or SMC) - port 5
- 50 = /dev/ttyIOC0 Altix serial card
- ...
- 81 = /dev/ttyIOC31 Altix serial card
+ 51 = /dev/ttyCPM5 PPC CPM (SCC or SMC) - port 5
82 = /dev/ttyVR0 NEC VR4100 series SIU
83 = /dev/ttyVR1 NEC VR4100 series DSIU
- 84 = /dev/ttyIOC84 Altix ioc4 serial card
- ...
- 115 = /dev/ttyIOC115 Altix ioc4 serial card
- 116 = /dev/ttySIOC0 Altix ioc3 serial card
- ...
- 147 = /dev/ttySIOC31 Altix ioc3 serial card
148 = /dev/ttyPSC0 PPC PSC - port 0
...
153 = /dev/ttyPSC5 PPC PSC - port 5
@@ -2761,10 +2752,7 @@
43 = /dev/ttycusmx2 Callout device for ttySMX2
46 = /dev/cucpm0 Callout device for ttyCPM0
...
- 49 = /dev/cucpm5 Callout device for ttyCPM5
- 50 = /dev/cuioc40 Callout device for ttyIOC40
- ...
- 81 = /dev/cuioc431 Callout device for ttyIOC431
+ 51 = /dev/cucpm5 Callout device for ttyCPM5
82 = /dev/cuvr0 Callout device for ttyVR0
83 = /dev/cuvr1 Callout device for ttyVR1
diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst
index 8dc668cc1216..0b3d39c610d9 100644
--- a/Documentation/admin-guide/dynamic-debug-howto.rst
+++ b/Documentation/admin-guide/dynamic-debug-howto.rst
@@ -216,13 +216,14 @@ The flags are::
t Include thread ID, or <intr>
m Include module name
f Include the function name
+ s Include the source file name
l Include line number
For ``print_hex_dump_debug()`` and ``print_hex_dump_bytes()``, only
the ``p`` flag has meaning, other flags are ignored.
-Note the regexp ``^[-+=][flmpt_]+$`` matches a flags specification.
-To clear all flags at once, use ``=_`` or ``-flmpt``.
+Note the regexp ``^[-+=][fslmpt_]+$`` matches a flags specification.
+To clear all flags at once, use ``=_`` or ``-fslmpt``.
Debug messages during Boot Process
diff --git a/Documentation/admin-guide/hw-vuln/gather_data_sampling.rst b/Documentation/admin-guide/hw-vuln/gather_data_sampling.rst
new file mode 100644
index 000000000000..264bfa937f7d
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/gather_data_sampling.rst
@@ -0,0 +1,109 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+GDS - Gather Data Sampling
+==========================
+
+Gather Data Sampling is a hardware vulnerability which allows unprivileged
+speculative access to data which was previously stored in vector registers.
+
+Problem
+-------
+When a gather instruction performs loads from memory, different data elements
+are merged into the destination vector register. However, when a gather
+instruction that is transiently executed encounters a fault, stale data from
+architectural or internal vector registers may get transiently forwarded to the
+destination vector register instead. This will allow a malicious attacker to
+infer stale data using typical side channel techniques like cache timing
+attacks. GDS is a purely sampling-based attack.
+
+The attacker uses gather instructions to infer the stale vector register data.
+The victim does not need to do anything special other than use the vector
+registers. The victim does not need to use gather instructions to be
+vulnerable.
+
+Because the buffers are shared between Hyper-Threads cross Hyper-Thread attacks
+are possible.
+
+Attack scenarios
+----------------
+Without mitigation, GDS can infer stale data across virtually all
+permission boundaries:
+
+ Non-enclaves can infer SGX enclave data
+ Userspace can infer kernel data
+ Guests can infer data from hosts
+ Guest can infer guest from other guests
+ Users can infer data from other users
+
+Because of this, it is important to ensure that the mitigation stays enabled in
+lower-privilege contexts like guests and when running outside SGX enclaves.
+
+The hardware enforces the mitigation for SGX. Likewise, VMMs should ensure
+that guests are not allowed to disable the GDS mitigation. If a host erred and
+allowed this, a guest could theoretically disable GDS mitigation, mount an
+attack, and re-enable it.
+
+Mitigation mechanism
+--------------------
+This issue is mitigated in microcode. The microcode defines the following new
+bits:
+
+ ================================ === ============================
+ IA32_ARCH_CAPABILITIES[GDS_CTRL] R/O Enumerates GDS vulnerability
+ and mitigation support.
+ IA32_ARCH_CAPABILITIES[GDS_NO] R/O Processor is not vulnerable.
+ IA32_MCU_OPT_CTRL[GDS_MITG_DIS] R/W Disables the mitigation
+ 0 by default.
+ IA32_MCU_OPT_CTRL[GDS_MITG_LOCK] R/W Locks GDS_MITG_DIS=0. Writes
+ to GDS_MITG_DIS are ignored
+ Can't be cleared once set.
+ ================================ === ============================
+
+GDS can also be mitigated on systems that don't have updated microcode by
+disabling AVX. This can be done by setting gather_data_sampling="force" or
+"clearcpuid=avx" on the kernel command-line.
+
+If used, these options will disable AVX use by turning off XSAVE YMM support.
+However, the processor will still enumerate AVX support. Userspace that
+does not follow proper AVX enumeration to check both AVX *and* XSAVE YMM
+support will break.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+The mitigation can be disabled by setting "gather_data_sampling=off" or
+"mitigations=off" on the kernel command line. Not specifying either will default
+to the mitigation being enabled. Specifying "gather_data_sampling=force" will
+use the microcode mitigation when available or disable AVX on affected systems
+where the microcode hasn't been updated to include the mitigation.
+
+GDS System Information
+------------------------
+The kernel provides vulnerability status information through sysfs. For
+GDS this can be accessed by the following sysfs file:
+
+/sys/devices/system/cpu/vulnerabilities/gather_data_sampling
+
+The possible values contained in this file are:
+
+ ============================== =============================================
+ Not affected Processor not vulnerable.
+ Vulnerable Processor vulnerable and mitigation disabled.
+ Vulnerable: No microcode Processor vulnerable and microcode is missing
+ mitigation.
+ Mitigation: AVX disabled,
+ no microcode Processor is vulnerable and microcode is missing
+ mitigation. AVX disabled as mitigation.
+ Mitigation: Microcode Processor is vulnerable and mitigation is in
+ effect.
+ Mitigation: Microcode (locked) Processor is vulnerable and mitigation is in
+ effect and cannot be disabled.
+ Unknown: Dependent on
+ hypervisor status Running on a virtual guest processor that is
+ affected but with no way to know if host
+ processor is mitigated or vulnerable.
+ ============================== =============================================
+
+GDS Default mitigation
+----------------------
+The updated microcode will enable the mitigation by default. The kernel's
+default action is to leave the mitigation enabled.
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index e0614760a99e..de99caabf65a 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -13,9 +13,11 @@ are configurable at compile, boot or run time.
l1tf
mds
tsx_async_abort
- multihit.rst
- special-register-buffer-data-sampling.rst
- core-scheduling.rst
- l1d_flush.rst
- processor_mmio_stale_data.rst
- cross-thread-rsb.rst
+ multihit
+ special-register-buffer-data-sampling
+ core-scheduling
+ l1d_flush
+ processor_mmio_stale_data
+ cross-thread-rsb
+ srso
+ gather_data_sampling
diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst
index 4d186f599d90..32a8893e5617 100644
--- a/Documentation/admin-guide/hw-vuln/spectre.rst
+++ b/Documentation/admin-guide/hw-vuln/spectre.rst
@@ -484,11 +484,14 @@ Spectre variant 2
Systems which support enhanced IBRS (eIBRS) enable IBRS protection once at
boot, by setting the IBRS bit, and they're automatically protected against
- Spectre v2 variant attacks, including cross-thread branch target injections
- on SMT systems (STIBP). In other words, eIBRS enables STIBP too.
+ Spectre v2 variant attacks.
- Legacy IBRS systems clear the IBRS bit on exit to userspace and
- therefore explicitly enable STIBP for that
+ On Intel's enhanced IBRS systems, this includes cross-thread branch target
+ injections on SMT systems (STIBP). In other words, Intel eIBRS enables
+ STIBP, too.
+
+ AMD Automatic IBRS does not protect userspace, and Legacy IBRS systems clear
+ the IBRS bit on exit to userspace, therefore both explicitly enable STIBP.
The retpoline mitigation is turned on by default on vulnerable
CPUs. It can be forced on or off by the administrator
diff --git a/Documentation/admin-guide/hw-vuln/srso.rst b/Documentation/admin-guide/hw-vuln/srso.rst
new file mode 100644
index 000000000000..b6cfb51cb0b4
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/srso.rst
@@ -0,0 +1,150 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Speculative Return Stack Overflow (SRSO)
+========================================
+
+This is a mitigation for the speculative return stack overflow (SRSO)
+vulnerability found on AMD processors. The mechanism is by now the well
+known scenario of poisoning CPU functional units - the Branch Target
+Buffer (BTB) and Return Address Predictor (RAP) in this case - and then
+tricking the elevated privilege domain (the kernel) into leaking
+sensitive data.
+
+AMD CPUs predict RET instructions using a Return Address Predictor (aka
+Return Address Stack/Return Stack Buffer). In some cases, a non-architectural
+CALL instruction (i.e., an instruction predicted to be a CALL but is
+not actually a CALL) can create an entry in the RAP which may be used
+to predict the target of a subsequent RET instruction.
+
+The specific circumstances that lead to this varies by microarchitecture
+but the concern is that an attacker can mis-train the CPU BTB to predict
+non-architectural CALL instructions in kernel space and use this to
+control the speculative target of a subsequent kernel RET, potentially
+leading to information disclosure via a speculative side-channel.
+
+The issue is tracked under CVE-2023-20569.
+
+Affected processors
+-------------------
+
+AMD Zen, generations 1-4. That is, all families 0x17 and 0x19. Older
+processors have not been investigated.
+
+System information and options
+------------------------------
+
+First of all, it is required that the latest microcode be loaded for
+mitigations to be effective.
+
+The sysfs file showing SRSO mitigation status is:
+
+ /sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow
+
+The possible values in this file are:
+
+ * 'Not affected':
+
+ The processor is not vulnerable
+
+ * 'Vulnerable: no microcode':
+
+ The processor is vulnerable, no microcode extending IBPB
+ functionality to address the vulnerability has been applied.
+
+ * 'Mitigation: microcode':
+
+ Extended IBPB functionality microcode patch has been applied. It does
+ not address User->Kernel and Guest->Host transitions protection but it
+ does address User->User and VM->VM attack vectors.
+
+ Note that User->User mitigation is controlled by how the IBPB aspect in
+ the Spectre v2 mitigation is selected:
+
+ * conditional IBPB:
+
+ where each process can select whether it needs an IBPB issued
+ around it PR_SPEC_DISABLE/_ENABLE etc, see :doc:`spectre`
+
+ * strict:
+
+ i.e., always on - by supplying spectre_v2_user=on on the kernel
+ command line
+
+ (spec_rstack_overflow=microcode)
+
+ * 'Mitigation: safe RET':
+
+ Software-only mitigation. It complements the extended IBPB microcode
+ patch functionality by addressing User->Kernel and Guest->Host
+ transitions protection.
+
+ Selected by default or by spec_rstack_overflow=safe-ret
+
+ * 'Mitigation: IBPB':
+
+ Similar protection as "safe RET" above but employs an IBPB barrier on
+ privilege domain crossings (User->Kernel, Guest->Host).
+
+ (spec_rstack_overflow=ibpb)
+
+ * 'Mitigation: IBPB on VMEXIT':
+
+ Mitigation addressing the cloud provider scenario - the Guest->Host
+ transitions only.
+
+ (spec_rstack_overflow=ibpb-vmexit)
+
+
+
+In order to exploit vulnerability, an attacker needs to:
+
+ - gain local access on the machine
+
+ - break kASLR
+
+ - find gadgets in the running kernel in order to use them in the exploit
+
+ - potentially create and pin an additional workload on the sibling
+ thread, depending on the microarchitecture (not necessary on fam 0x19)
+
+ - run the exploit
+
+Considering the performance implications of each mitigation type, the
+default one is 'Mitigation: safe RET' which should take care of most
+attack vectors, including the local User->Kernel one.
+
+As always, the user is advised to keep her/his system up-to-date by
+applying software updates regularly.
+
+The default setting will be reevaluated when needed and especially when
+new attack vectors appear.
+
+As one can surmise, 'Mitigation: safe RET' does come at the cost of some
+performance depending on the workload. If one trusts her/his userspace
+and does not want to suffer the performance impact, one can always
+disable the mitigation with spec_rstack_overflow=off.
+
+Similarly, 'Mitigation: IBPB' is another full mitigation type employing
+an indrect branch prediction barrier after having applied the required
+microcode patch for one's system. This mitigation comes also at
+a performance cost.
+
+Mitigation: safe RET
+--------------------
+
+The mitigation works by ensuring all RET instructions speculate to
+a controlled location, similar to how speculation is controlled in the
+retpoline sequence. To accomplish this, the __x86_return_thunk forces
+the CPU to mispredict every function return using a 'safe return'
+sequence.
+
+To ensure the safety of this mitigation, the kernel must ensure that the
+safe return sequence is itself free from attacker interference. In Zen3
+and Zen4, this is accomplished by creating a BTB alias between the
+untraining function srso_alias_untrain_ret() and the safe return
+function srso_alias_safe_ret() which results in evicting a potentially
+poisoned BTB entry and using that safe one for all function returns.
+
+In older Zen1 and Zen2, this is accomplished using a reinterpretation
+technique similar to Retbleed one: srso_untrain_ret() and
+srso_safe_ret().
diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index c18d94fa6470..599e8d3bcbc3 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -141,8 +141,8 @@ nodemask_t
The size of a nodemask_t type. Used to compute the number of online
nodes.
-(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|compound_order|compound_head)
--------------------------------------------------------------------------------------------------
+(page, flags|_refcount|mapping|lru|_mapcount|private|compound_order|compound_head)
+----------------------------------------------------------------------------------
User-space tools compute their values based on the offset of these
variables. The variables are used when excluding unnecessary pages.
@@ -325,8 +325,8 @@ NR_FREE_PAGES
On linux-2.6.21 or later, the number of free pages is in
vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.
-PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask
-------------------------------------------------------------------------------
+PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask|PG_hugetlb
+-----------------------------------------------------------------------------------------
Page attributes. These flags are used to filter various unnecessary for
dumping pages.
@@ -338,12 +338,6 @@ More page attributes. These flags are used to filter various unnecessary for
dumping pages.
-HUGETLB_PAGE_DTOR
------------------
-
-The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
-excludes these pages.
-
x86_64
======
@@ -624,3 +618,9 @@ Used to get the correct ranges:
* VMALLOC_START ~ VMALLOC_END : vmalloc() / ioremap() space.
* VMEMMAP_START ~ VMEMMAP_END : vmemmap space, used for struct page array.
* KERNEL_LINK_ADDR : start address of Kernel link and BPF
+
+va_kernel_pa_offset
+-------------------
+
+Indicates the offset between the kernel virtual and physical mappings.
+Used to translate virtual to physical addresses.
diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst
index 1ba8f2a44aac..102937bc8443 100644
--- a/Documentation/admin-guide/kernel-parameters.rst
+++ b/Documentation/admin-guide/kernel-parameters.rst
@@ -80,7 +80,7 @@ The special case-tolerant group name "all" has a meaning of selecting all CPUs,
so that "nohz_full=all" is the equivalent of "nohz_full=0-N".
The semantics of "N" and "all" is supported on a level of bitmaps and holds for
-all users of bitmap_parse().
+all users of bitmap_parselist().
This document may not be entirely up to date and comprehensive. The command
"modinfo -p ${modulename}" shows a current list of all parameters of a loadable
@@ -89,10 +89,11 @@ reveal their parameters in /sys/module/${modulename}/parameters/. Some of these
parameters may be changed at runtime by the command
``echo -n ${value} > /sys/module/${modulename}/parameters/${parm}``.
-The parameters listed below are only valid if certain kernel build options were
-enabled and if respective hardware is present. The text in square brackets at
-the beginning of each description states the restrictions within which a
-parameter is applicable::
+The parameters listed below are only valid if certain kernel build options
+were enabled and if respective hardware is present. This list should be kept
+in alphabetical order. The text in square brackets at the beginning
+of each description states the restrictions within which a parameter
+is applicable::
ACPI ACPI support is enabled.
AGP AGP (Accelerated Graphics Port) is enabled.
@@ -127,9 +128,9 @@ parameter is applicable::
KGDB Kernel debugger support is enabled.
KVM Kernel Virtual Machine support is enabled.
LIBATA Libata driver is enabled
- LP Printer support is enabled.
LOONGARCH LoongArch architecture is enabled.
LOOP Loopback device support is enabled.
+ LP Printer support is enabled.
M68k M68k architecture is enabled.
These options have more detailed description inside of
Documentation/arch/m68k/kernel-options.rst.
@@ -139,10 +140,9 @@ parameter is applicable::
MSI Message Signaled Interrupts (PCI).
MTD MTD (Memory Technology Device) support is enabled.
NET Appropriate network support is enabled.
- NUMA NUMA support is enabled.
NFS Appropriate NFS support is enabled.
+ NUMA NUMA support is enabled.
OF Devicetree is enabled.
- PV_OPS A paravirtualized kernel is enabled.
PARISC The PA-RISC architecture is enabled.
PCI PCI bus support is enabled.
PCIE PCI Express support is enabled.
@@ -151,9 +151,10 @@ parameter is applicable::
PPC PowerPC architecture is enabled.
PPT Parallel port support is enabled.
PS2 Appropriate PS/2 support is enabled.
+ PV_OPS A paravirtualized kernel is enabled.
RAM RAM disk support is enabled.
- RISCV RISCV architecture is enabled.
RDT Intel Resource Director Technology.
+ RISCV RISCV architecture is enabled.
S390 S390 architecture is enabled.
SCSI Appropriate SCSI support is enabled.
A lot of drivers have their options described inside
@@ -164,15 +165,15 @@ parameter is applicable::
SH SuperH architecture is enabled.
SMP The kernel is an SMP kernel.
SPARC Sparc architecture is enabled.
- SWSUSP Software suspend (hibernation) is enabled.
SUSPEND System suspend states are enabled.
+ SWSUSP Software suspend (hibernation) is enabled.
TPM TPM drivers are enabled.
UMS USB Mass Storage support is enabled.
USB USB support is enabled.
USBHID USB Human Interface Device support is enabled.
V4L Video For Linux support is enabled.
- VMMIO Driver for memory mapped virtio devices is enabled.
VGA The VGA console has been enabled.
+ VMMIO Driver for memory mapped virtio devices is enabled.
VT Virtual terminal support is enabled.
WDT Watchdog support is enabled.
X86-32 X86-32, aka i386 architecture is enabled.
@@ -186,9 +187,9 @@ parameter is applicable::
In addition, the following text indicates that the option::
+ BOOT Is a boot loader parameter.
BUGS= Relates to possible processor bugs on the said processor.
KNL Is a kernel start-up parameter.
- BOOT Is a boot loader parameter.
Parameters denoted with BOOT are actually interpreted by the boot
loader, and have no meaning to the kernel directly.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a1457995fd41..0a1731a0f0ef 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -418,20 +418,20 @@
arm64.nobti [ARM64] Unconditionally disable Branch Target
Identification support
- arm64.nopauth [ARM64] Unconditionally disable Pointer Authentication
- support
+ arm64.nomops [ARM64] Unconditionally disable Memory Copy and Memory
+ Set instructions support
arm64.nomte [ARM64] Unconditionally disable Memory Tagging Extension
support
- arm64.nosve [ARM64] Unconditionally disable Scalable Vector
- Extension support
+ arm64.nopauth [ARM64] Unconditionally disable Pointer Authentication
+ support
arm64.nosme [ARM64] Unconditionally disable Scalable Matrix
Extension support
- arm64.nomops [ARM64] Unconditionally disable Memory Copy and Memory
- Set instructions support
+ arm64.nosve [ARM64] Unconditionally disable Scalable Vector
+ Extension support
ataflop= [HW,M68k]
@@ -553,7 +553,7 @@
others).
ccw_timeout_log [S390]
- See Documentation/s390/common_io.rst for details.
+ See Documentation/arch/s390/common_io.rst for details.
cgroup_disable= [KNL] Disable a particular controller or optional feature
Format: {name of the controller(s) or feature(s) to disable}
@@ -598,7 +598,7 @@
Setting checkreqprot to 1 is deprecated.
cio_ignore= [S390]
- See Documentation/s390/common_io.rst for details.
+ See Documentation/arch/s390/common_io.rst for details.
clearcpuid=X[,X...] [X86]
Disable CPUID feature X for the kernel. See
@@ -696,7 +696,7 @@
kernel/dma/contiguous.c
cma_pernuma=nn[MG]
- [ARM64,KNL,CMA]
+ [KNL,CMA]
Sets the size of kernel per-numa memory area for
contiguous memory allocations. A value of 0 disables
per-numa CMA altogether. And If this option is not
@@ -706,6 +706,17 @@
which is located in node nid, if the allocation fails,
they will fallback to the global default memory area.
+ numa_cma=<node>:nn[MG][,<node>:nn[MG]]
+ [KNL,CMA]
+ Sets the size of kernel numa memory area for
+ contiguous memory allocations. It will reserve CMA
+ area for the specified node.
+
+ With numa CMA enabled, DMA users on node nid will
+ first try to allocate buffer from the numa area
+ which is located in node nid, if the allocation fails,
+ they will fallback to the global default memory area.
+
cmo_free_hint= [PPC] Format: { yes | no }
Specify whether pages are marked as being inactive
when they are freed. This is used in CMO environments
@@ -862,7 +873,7 @@
memory region [offset, offset + size] for that kernel
image. If '@offset' is omitted, then a suitable offset
is selected automatically.
- [KNL, X86-64, ARM64] Select a region under 4G first, and
+ [KNL, X86-64, ARM64, RISCV] Select a region under 4G first, and
fall back to reserve region above 4G when '@offset'
hasn't been specified.
See Documentation/admin-guide/kdump/kdump.rst for further details.
@@ -875,14 +886,14 @@
Documentation/admin-guide/kdump/kdump.rst for an example.
crashkernel=size[KMG],high
- [KNL, X86-64, ARM64] range could be above 4G. Allow kernel
- to allocate physical memory region from top, so could
- be above 4G if system have more than 4G ram installed.
- Otherwise memory region will be allocated below 4G, if
- available.
+ [KNL, X86-64, ARM64, RISCV] range could be above 4G.
+ Allow kernel to allocate physical memory region from top,
+ so could be above 4G if system have more than 4G ram
+ installed. Otherwise memory region will be allocated
+ below 4G, if available.
It will be ignored if crashkernel=X is specified.
crashkernel=size[KMG],low
- [KNL, X86-64, ARM64] range under 4G. When crashkernel=X,high
+ [KNL, X86-64, ARM64, RISCV] range under 4G. When crashkernel=X,high
is passed, kernel could allocate physical memory region
above 4G, that cause second kernel crash on system
that require some amount of low memory, e.g. swiotlb
@@ -893,6 +904,7 @@
size is platform dependent.
--> x86: max(swiotlb_size_or_default() + 8MiB, 256MiB)
--> arm64: 128MiB
+ --> riscv: 128MiB
This one lets the user specify own low range under 4G
for second kernel instead.
0: to disable low allocation.
@@ -1623,6 +1635,26 @@
Format: off | on
default: on
+ gather_data_sampling=
+ [X86,INTEL] Control the Gather Data Sampling (GDS)
+ mitigation.
+
+ Gather Data Sampling is a hardware vulnerability which
+ allows unprivileged speculative access to data which was
+ previously stored in vector registers.
+
+ This issue is mitigated by default in updated microcode.
+ The mitigation may have a performance impact but can be
+ disabled. On systems without the microcode mitigation
+ disabling AVX serves as a mitigation.
+
+ force: Disable AVX to mitigate systems without
+ microcode mitigation. No effect if the microcode
+ mitigation is present. Known to cause crashes in
+ userspace with buggy AVX enumeration.
+
+ off: Disable GDS mitigation.
+
gcov_persist= [GCOV] When non-zero (default), profiling data for
kernel modules is saved and remains accessible via
debugfs, even when the module is unloaded/reloaded.
@@ -2624,7 +2656,7 @@
kvm-intel.flexpriority=
[KVM,Intel] Control KVM's use of FlexPriority feature
- (TPR shadow). Default is 1 (enabled). Disalbe by KVM if
+ (TPR shadow). Default is 1 (enabled). Disable by KVM if
hardware lacks support for it.
kvm-intel.nested=
@@ -2918,6 +2950,10 @@
locktorture.torture_type= [KNL]
Specify the locking implementation to test.
+ locktorture.writer_fifo= [KNL]
+ Run the write-side locktorture kthreads at
+ sched_set_fifo() real-time priority.
+
locktorture.verbose= [KNL]
Enable additional printk() statements.
@@ -3105,7 +3141,7 @@
[KNL,SH] Allow user to override the default size for
per-device physically contiguous DMA buffers.
- memhp_default_state=online/offline
+ memhp_default_state=online/offline/online_kernel/online_movable
[KNL] Set the initial state for the memory hotplug
onlining policy. If not specified, the default value is
set according to the
@@ -3273,24 +3309,25 @@
Disable all optional CPU mitigations. This
improves system performance, but it may also
expose users to several CPU vulnerabilities.
- Equivalent to: nopti [X86,PPC]
- if nokaslr then kpti=0 [ARM64]
- nospectre_v1 [X86,PPC]
- nobp=0 [S390]
- nospectre_v2 [X86,PPC,S390,ARM64]
- spectre_v2_user=off [X86]
- spec_store_bypass_disable=off [X86,PPC]
- ssbd=force-off [ARM64]
- nospectre_bhb [ARM64]
+ Equivalent to: if nokaslr then kpti=0 [ARM64]
+ gather_data_sampling=off [X86]
+ kvm.nx_huge_pages=off [X86]
l1tf=off [X86]
mds=off [X86]
- tsx_async_abort=off [X86]
- kvm.nx_huge_pages=off [X86]
- srbds=off [X86,INTEL]
+ mmio_stale_data=off [X86]
no_entry_flush [PPC]
no_uaccess_flush [PPC]
- mmio_stale_data=off [X86]
+ nobp=0 [S390]
+ nopti [X86,PPC]
+ nospectre_bhb [ARM64]
+ nospectre_v1 [X86,PPC]
+ nospectre_v2 [X86,PPC,S390,ARM64]
retbleed=off [X86]
+ spec_store_bypass_disable=off [X86,PPC]
+ spectre_v2_user=off [X86]
+ srbds=off [X86,INTEL]
+ ssbd=force-off [ARM64]
+ tsx_async_abort=off [X86]
Exceptions:
This does not have any effect on
@@ -3717,7 +3754,7 @@
nohibernate [HIBERNATION] Disable hibernation and resume.
- nohlt [ARM,ARM64,MICROBLAZE,MIPS,SH] Forces the kernel to
+ nohlt [ARM,ARM64,MICROBLAZE,MIPS,PPC,SH] Forces the kernel to
busy wait in do_idle() and not use the arch_cpu_idle()
implementation; requires CONFIG_GENERIC_IDLE_POLL_SETUP
to be effective. This is useful on platforms where the
@@ -3853,10 +3890,10 @@
nosmp [SMP] Tells an SMP kernel to act as a UP kernel,
and disable the IO APIC. legacy for "maxcpus=0".
- nosmt [KNL,MIPS,S390] Disable symmetric multithreading (SMT).
+ nosmt [KNL,MIPS,PPC,S390] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
- [KNL,X86] Disable symmetric multithreading (SMT).
+ [KNL,X86,PPC] Disable symmetric multithreading (SMT).
nosmt=force: Force disable SMT, cannot be undone
via the sysfs control file.
@@ -4037,20 +4074,6 @@
timeout < 0: reboot immediately
Format: <timeout>
- panic_print= Bitmask for printing system info when panic happens.
- User can chose combination of the following bits:
- bit 0: print all tasks info
- bit 1: print system memory info
- bit 2: print timer info
- bit 3: print locks info if CONFIG_LOCKDEP is on
- bit 4: print ftrace buffer
- bit 5: print all printk messages in buffer
- bit 6: print all CPUs backtrace (if available in the arch)
- *Be aware* that this option may print a _lot_ of lines,
- so there are risks of losing older messages in the log.
- Use this option carefully, maybe worth to setup a
- bigger log buffer with "log_buf_len" along with this.
-
panic_on_taint= Bitmask for conditionally calling panic() in add_taint()
Format: <hex>[,nousertaint]
Hexadecimal bitmask representing the set of TAINT flags
@@ -4067,6 +4090,20 @@
panic_on_warn=1 panic() instead of WARN(). Useful to cause kdump
on a WARN().
+ panic_print= Bitmask for printing system info when panic happens.
+ User can chose combination of the following bits:
+ bit 0: print all tasks info
+ bit 1: print system memory info
+ bit 2: print timer info
+ bit 3: print locks info if CONFIG_LOCKDEP is on
+ bit 4: print ftrace buffer
+ bit 5: print all printk messages in buffer
+ bit 6: print all CPUs backtrace (if available in the arch)
+ *Be aware* that this option may print a _lot_ of lines,
+ so there are risks of losing older messages in the log.
+ Use this option carefully, maybe worth to setup a
+ bigger log buffer with "log_buf_len" along with this.
+
parkbd.port= [HW] Parallel port number the keyboard adapter is
connected to, default is 0.
Format: <parport#>
@@ -4186,7 +4223,7 @@
mode 0, bit 1 is for mode 1, and so on. Mode 0 only
allowed by default.
- pause_on_oops=
+ pause_on_oops=<int>
Halt all CPUs after the first oops has been printed for
the specified number of seconds. This is to be used if
your oopses keep scrolling off the screen.
@@ -4928,6 +4965,15 @@
test until boot completes in order to avoid
interference.
+ rcuscale.kfree_by_call_rcu= [KNL]
+ In kernels built with CONFIG_RCU_LAZY=y, test
+ call_rcu() instead of kfree_rcu().
+
+ rcuscale.kfree_mult= [KNL]
+ Instead of allocating an object of size kfree_obj,
+ allocate one of kfree_mult * sizeof(kfree_obj).
+ Defaults to 1.
+
rcuscale.kfree_rcu_test= [KNL]
Set to measure performance of kfree_rcu() flooding.
@@ -4953,6 +4999,12 @@
Number of loops doing rcuscale.kfree_alloc_num number
of allocations and frees.
+ rcuscale.minruntime= [KNL]
+ Set the minimum test run time in seconds. This
+ does not affect the data-collection interval,
+ but instead allows better measurement of things
+ like CPU consumption.
+
rcuscale.nreaders= [KNL]
Set number of RCU readers. The value -1 selects
N, where N is the number of CPUs. A value
@@ -4967,7 +5019,7 @@
the same as for rcuscale.nreaders.
N, where N is the number of CPUs
- rcuscale.perf_type= [KNL]
+ rcuscale.scale_type= [KNL]
Specify the RCU implementation to test.
rcuscale.shutdown= [KNL]
@@ -4983,6 +5035,11 @@
in microseconds. The default of zero says
no holdoff.
+ rcuscale.writer_holdoff_jiffies= [KNL]
+ Additional write-side holdoff between grace
+ periods, but in jiffies. The default of zero
+ says no holdoff.
+
rcutorture.fqs_duration= [KNL]
Set duration of force_quiescent_state bursts
in microseconds.
@@ -5264,6 +5321,13 @@
number avoids disturbing real-time workloads,
but lengthens grace periods.
+ rcupdate.rcu_task_lazy_lim= [KNL]
+ Number of callbacks on a given CPU that will
+ cancel laziness on that CPU. Use -1 to disable
+ cancellation of laziness, but be advised that
+ doing so increases the danger of OOM due to
+ callback flooding.
+
rcupdate.rcu_task_stall_info= [KNL]
Set initial timeout in jiffies for RCU task stall
informational messages, which give some indication
@@ -5293,6 +5357,29 @@
A change in value does not take effect until
the beginning of the next grace period.
+ rcupdate.rcu_tasks_lazy_ms= [KNL]
+ Set timeout in milliseconds RCU Tasks asynchronous
+ callback batching for call_rcu_tasks().
+ A negative value will take the default. A value
+ of zero will disable batching. Batching is
+ always disabled for synchronize_rcu_tasks().
+
+ rcupdate.rcu_tasks_rude_lazy_ms= [KNL]
+ Set timeout in milliseconds RCU Tasks
+ Rude asynchronous callback batching for
+ call_rcu_tasks_rude(). A negative value
+ will take the default. A value of zero will
+ disable batching. Batching is always disabled
+ for synchronize_rcu_tasks_rude().
+
+ rcupdate.rcu_tasks_trace_lazy_ms= [KNL]
+ Set timeout in milliseconds RCU Tasks
+ Trace asynchronous callback batching for
+ call_rcu_tasks_trace(). A negative value
+ will take the default. A value of zero will
+ disable batching. Batching is always disabled
+ for synchronize_rcu_tasks_trace().
+
rcupdate.rcu_self_test= [KNL]
Run the RCU early boot self tests
@@ -5468,6 +5555,13 @@
[KNL] Disable ring 3 MONITOR/MWAIT feature on supported
CPUs.
+ riscv_isa_fallback [RISCV]
+ When CONFIG_RISCV_ISA_FALLBACK is not enabled, permit
+ falling back to detecting extension support by parsing
+ "riscv,isa" property on devicetree systems when the
+ replacement properties are not found. See the Kconfig
+ entry for RISCV_ISA_FALLBACK.
+
ro [KNL] Mount root device read-only on boot
rodata= [KNL]
@@ -5501,6 +5595,10 @@
Useful for devices that are detected asynchronously
(e.g. USB and MMC devices).
+ rootwait= [KNL] Maximum time (in seconds) to wait for root device
+ to show up before attempting to mount the root
+ filesystem.
+
rproc_mem=nn[KMG][@address]
[KNL,ARM,CMA] Remoteproc physical memory block.
Memory area to be used by remote processor image,
@@ -5875,6 +5973,17 @@
Not specifying this option is equivalent to
spectre_v2_user=auto.
+ spec_rstack_overflow=
+ [X86] Control RAS overflow mitigation on AMD Zen CPUs
+
+ off - Disable mitigation
+ microcode - Enable microcode mitigation only
+ safe-ret - Enable sw-only safe RET mitigation (default)
+ ibpb - Enable mitigation by issuing IBPB on
+ kernel entry
+ ibpb-vmexit - Issue IBPB only on VMEXIT
+ (cloud-specific mitigation)
+
spec_store_bypass_disable=
[HW] Control Speculative Store Bypass (SSB) Disable mitigation
(Speculative Store Bypass vulnerability)
@@ -6243,10 +6352,6 @@
-1: disable all critical trip points in all thermal zones
<degrees C>: override all critical trip points
- thermal.nocrt= [HW,ACPI]
- Set to disable actions on ACPI thermal zone
- critical and hot trip points.
-
thermal.off= [HW,ACPI]
1: disable ACPI thermal control
@@ -6308,6 +6413,13 @@
This will guarantee that all the other pcrs
are saved.
+ tpm_tis.interrupts= [HW,TPM]
+ Enable interrupts for the MMIO based physical layer
+ for the FIFO interface. By default it is set to false
+ (0). For more information about TPM hardware interfaces
+ defined by Trusted Computing Group (TCG) see
+ https://trustedcomputinggroup.org/resource/pc-client-platform-tpm-profile-ptp-specification/
+
tp_printk [FTRACE]
Have the tracepoints sent to printk as well as the
tracing ring buffer. This is useful for early boot up
@@ -6607,7 +6719,7 @@
usbcore.authorized_default=
[USB] Default USB device authorization:
- (default -1 = authorized except for wireless USB,
+ (default -1 = authorized (same as 1),
0 = not authorized, 1 = authorized, 2 = authorized
if device connected to internal port)
@@ -6964,6 +7076,13 @@
disables both lockup detectors. Default is 10
seconds.
+ workqueue.unbound_cpus=
+ [KNL,SMP] Specify to constrain one or some CPUs
+ to use in unbound workqueues.
+ Format: <cpu-list>
+ By default, all online CPUs are available for
+ unbound workqueues.
+
workqueue.watchdog_thresh=
If CONFIG_WQ_WATCHDOG is configured, workqueue can
warn stall conditions and dump internal state to
@@ -6985,15 +7104,6 @@
threshold repeatedly. They are likely good
candidates for using WQ_UNBOUND workqueues instead.
- workqueue.disable_numa
- By default, all work items queued to unbound
- workqueues are affine to the NUMA nodes they're
- issued on, which results in better behavior in
- general. If NUMA affinity needs to be disabled for
- whatever reason, this option can be used. Note
- that this also can be controlled per-workqueue for
- workqueues visible under /sys/bus/workqueue/.
-
workqueue.power_efficient
Per-cpu workqueues are generally preferred because
they show better performance thanks to cache
@@ -7009,6 +7119,18 @@
The default value of this parameter is determined by
the config option CONFIG_WQ_POWER_EFFICIENT_DEFAULT.
+ workqueue.default_affinity_scope=
+ Select the default affinity scope to use for unbound
+ workqueues. Can be one of "cpu", "smt", "cache",
+ "numa" and "system". Default is "cache". For more
+ information, see the Affinity Scopes section in
+ Documentation/core-api/workqueue.rst.
+
+ This can be changed after boot by writing to the
+ matching /sys/module/workqueue/parameters file. All
+ workqueues with the "default" affinity scope will be
+ updated accordignly.
+
workqueue.debug_force_rr_cpu
Workqueue used to implicitly guarantee that work
items queued without explicit CPU specified are put
diff --git a/Documentation/admin-guide/media/qcom_camss.rst b/Documentation/admin-guide/media/qcom_camss.rst
index a72e17d09cb7..8a8f3ff40105 100644
--- a/Documentation/admin-guide/media/qcom_camss.rst
+++ b/Documentation/admin-guide/media/qcom_camss.rst
@@ -18,7 +18,7 @@ The driver implements V4L2, Media controller and V4L2 subdev interfaces.
Camera sensor using V4L2 subdev interface in the kernel is supported.
The driver is implemented using as a reference the Qualcomm Camera Subsystem
-driver for Android as found in Code Aurora [#f1]_ [#f2]_.
+driver for Android as found in Code Linaro [#f1]_ [#f2]_.
Qualcomm Camera Subsystem hardware
@@ -181,5 +181,5 @@ Referenced 2018-06-22.
References
----------
-.. [#f1] https://source.codeaurora.org/quic/la/kernel/msm-3.10/
-.. [#f2] https://source.codeaurora.org/quic/la/kernel/msm-3.18/
+.. [#f1] https://git.codelinaro.org/clo/la/kernel/msm-3.10/
+.. [#f2] https://git.codelinaro.org/clo/la/kernel/msm-3.18/
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 2d495fa85a0e..8da1b7281827 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -87,7 +87,7 @@ comma (","). ::
│ │ │ │ │ │ │ filters/nr_filters
│ │ │ │ │ │ │ │ 0/type,matching,memcg_id
│ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
- │ │ │ │ │ │ │ tried_regions/
+ │ │ │ │ │ │ │ tried_regions/total_bytes
│ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
│ │ │ │ │ │ │ │ ...
│ │ │ │ │ │ ...
@@ -99,7 +99,7 @@ Root
The root of the DAMON sysfs interface is ``<sysfs>/kernel/mm/damon/``, and it
has one directory named ``admin``. The directory contains the files for
-privileged user space programs' control of DAMON. User space tools or deamons
+privileged user space programs' control of DAMON. User space tools or daemons
having the root permission could use this directory.
kdamonds/
@@ -127,14 +127,18 @@ in the state. Writing ``commit`` to the ``state`` file makes kdamond reads the
user inputs in the sysfs files except ``state`` file again. Writing
``update_schemes_stats`` to ``state`` file updates the contents of stats files
for each DAMON-based operation scheme of the kdamond. For details of the
-stats, please refer to :ref:`stats section <sysfs_schemes_stats>`. Writing
-``update_schemes_tried_regions`` to ``state`` file updates the DAMON-based
-operation scheme action tried regions directory for each DAMON-based operation
-scheme of the kdamond. Writing ``clear_schemes_tried_regions`` to ``state``
-file clears the DAMON-based operating scheme action tried regions directory for
-each DAMON-based operation scheme of the kdamond. For details of the
-DAMON-based operation scheme action tried regions directory, please refer to
-:ref:`tried_regions section <sysfs_schemes_tried_regions>`.
+stats, please refer to :ref:`stats section <sysfs_schemes_stats>`.
+
+Writing ``update_schemes_tried_regions`` to ``state`` file updates the
+DAMON-based operation scheme action tried regions directory for each
+DAMON-based operation scheme of the kdamond. Writing
+``update_schemes_tried_bytes`` to ``state`` file updates only
+``.../tried_regions/total_bytes`` files. Writing
+``clear_schemes_tried_regions`` to ``state`` file clears the DAMON-based
+operating scheme action tried regions directory for each DAMON-based operation
+scheme of the kdamond. For details of the DAMON-based operation scheme action
+tried regions directory, please refer to :ref:`tried_regions section
+<sysfs_schemes_tried_regions>`.
If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
@@ -359,15 +363,21 @@ number (``N``) to the file creates the number of child directories named ``0``
to ``N-1``. Each directory represents each filter. The filters are evaluated
in the numeric order.
-Each filter directory contains three files, namely ``type``, ``matcing``, and
-``memcg_path``. You can write one of two special keywords, ``anon`` for
-anonymous pages, or ``memcg`` for specific memory cgroup filtering. In case of
-the memory cgroup filtering, you can specify the memory cgroup of the interest
-by writing the path of the memory cgroup from the cgroups mount point to
-``memcg_path`` file. You can write ``Y`` or ``N`` to ``matching`` file to
-filter out pages that does or does not match to the type, respectively. Then,
-the scheme's action will not be applied to the pages that specified to be
-filtered out.
+Each filter directory contains six files, namely ``type``, ``matcing``,
+``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. To ``type``
+file, you can write one of four special keywords: ``anon`` for anonymous pages,
+``memcg`` for specific memory cgroup, ``addr`` for specific address range (an
+open-ended interval), or ``target`` for specific DAMON monitoring target
+filtering. In case of the memory cgroup filtering, you can specify the memory
+cgroup of the interest by writing the path of the memory cgroup from the
+cgroups mount point to ``memcg_path`` file. In case of the address range
+filtering, you can specify the start and end address of the range to
+``addr_start`` and ``addr_end`` files, respectively. For the DAMON monitoring
+target filtering, you can specify the index of the target between the list of
+the DAMON context's monitoring targets list to ``target_idx`` file. You can
+write ``Y`` or ``N`` to ``matching`` file to filter out pages that does or does
+not match to the type, respectively. Then, the scheme's action will not be
+applied to the pages that specified to be filtered out.
For example, below restricts a DAMOS action to be applied to only non-anonymous
pages of all memory cgroups except ``/having_care_already``.::
@@ -381,8 +391,14 @@ pages of all memory cgroups except ``/having_care_already``.::
echo /having_care_already > 1/memcg_path
echo N > 1/matching
-Note that filters are currently supported only when ``paddr``
-`implementation <sysfs_contexts>` is being used.
+Note that ``anon`` and ``memcg`` filters are currently supported only when
+``paddr`` `implementation <sysfs_contexts>` is being used.
+
+Also, memory regions that are filtered out by ``addr`` or ``target`` filters
+are not counted as the scheme has tried to those, while regions that filtered
+out by other type filters are counted as the scheme has tried to. The
+difference is applied to :ref:`stats <damos_stats>` and
+:ref:`tried regions <sysfs_schemes_tried_regions>`.
.. _sysfs_schemes_stats:
@@ -397,7 +413,7 @@ be used for online analysis or tuning of the schemes.
The statistics can be retrieved by reading the files under ``stats`` directory
(``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, and
``qt_exceeds``), respectively. The files are not updated in real time, so you
-should ask DAMON sysfs interface to updte the content of the files for the
+should ask DAMON sysfs interface to update the content of the files for the
stats by writing a special keyword, ``update_schemes_stats`` to the relevant
``kdamonds/<N>/state`` file.
@@ -406,13 +422,21 @@ stats by writing a special keyword, ``update_schemes_stats`` to the relevant
schemes/<N>/tried_regions/
--------------------------
+This directory initially has one file, ``total_bytes``.
+
When a special keyword, ``update_schemes_tried_regions``, is written to the
-relevant ``kdamonds/<N>/state`` file, DAMON creates directories named integer
-starting from ``0`` under this directory. Each directory contains files
-exposing detailed information about each of the memory region that the
-corresponding scheme's ``action`` has tried to be applied under this directory,
-during next :ref:`aggregation interval <sysfs_monitoring_attrs>`. The
-information includes address range, ``nr_accesses``, and ``age`` of the region.
+relevant ``kdamonds/<N>/state`` file, DAMON updates the ``total_bytes`` file so
+that reading it returns the total size of the scheme tried regions, and creates
+directories named integer starting from ``0`` under this directory. Each
+directory contains files exposing detailed information about each of the memory
+region that the corresponding scheme's ``action`` has tried to be applied under
+this directory, during next :ref:`aggregation interval
+<sysfs_monitoring_attrs>`. The information includes address range,
+``nr_accesses``, and ``age`` of the region.
+
+Writing ``update_schemes_tried_bytes`` to the relevant ``kdamonds/<N>/state``
+file will only update the ``total_bytes`` file, and will not create the
+subdirectories.
The directories will be removed when another special keyword,
``clear_schemes_tried_regions``, is written to the relevant
diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
index 7626392fe82c..776f244bdae4 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -159,6 +159,8 @@ The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
general_profit
how effective is KSM. The calculation is explained below.
+pages_scanned
+ how many pages are being scanned for ksm
pages_shared
how many shared pages are being used
pages_sharing
@@ -173,6 +175,13 @@ stable_node_chains
the number of KSM pages that hit the ``max_page_sharing`` limit
stable_node_dups
number of duplicated KSM pages
+ksm_zero_pages
+ how many zero pages that are still mapped into processes were mapped by
+ KSM when deduplicating.
+
+When ``use_zero_pages`` is/was enabled, the sum of ``pages_sharing`` +
+``ksm_zero_pages`` represents the actual number of pages saved by KSM.
+if ``use_zero_pages`` has never been enabled, ``ksm_zero_pages`` is 0.
A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``
@@ -196,21 +205,25 @@ several times, which are unprofitable memory consumed.
1) How to determine whether KSM save memory or consume memory in system-wide
range? Here is a simple approximate calculation for reference::
- general_profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
+ general_profit =~ ksm_saved_pages * sizeof(page) - (all_rmap_items) *
sizeof(rmap_item);
- where all_rmap_items can be easily obtained by summing ``pages_sharing``,
- ``pages_shared``, ``pages_unshared`` and ``pages_volatile``.
+ where ksm_saved_pages equals to the sum of ``pages_sharing`` +
+ ``ksm_zero_pages`` of the system, and all_rmap_items can be easily
+ obtained by summing ``pages_sharing``, ``pages_shared``, ``pages_unshared``
+ and ``pages_volatile``.
2) The KSM profit inner a single process can be similarly obtained by the
following approximate calculation::
- process_profit =~ ksm_merging_pages * sizeof(page) -
+ process_profit =~ ksm_saved_pages * sizeof(page) -
ksm_rmap_items * sizeof(rmap_item).
- where ksm_merging_pages is shown under the directory ``/proc/<pid>/``,
- and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``. The process profit
- is also shown in ``/proc/<pid>/ksm_stat`` as ksm_process_profit.
+ where ksm_saved_pages equals to the sum of ``ksm_merging_pages`` and
+ ``ksm_zero_pages``, both of which are shown under the directory
+ ``/proc/<pid>/ksm_stat``, and ksm_rmap_items is also shown in
+ ``/proc/<pid>/ksm_stat``. The process profit is also shown in
+ ``/proc/<pid>/ksm_stat`` as ksm_process_profit.
From the perspective of application, a high ratio of ``ksm_rmap_items`` to
``ksm_merging_pages`` means a bad madvise-applied policy, so developers or
diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
index 1b02fe5807cc..cfe034cf1e87 100644
--- a/Documentation/admin-guide/mm/memory-hotplug.rst
+++ b/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -291,6 +291,14 @@ The following files are currently defined:
Availability depends on the CONFIG_ARCH_MEMORY_PROBE
kernel configuration option.
``uevent`` read-write: generic udev file for device subsystems.
+``crash_hotplug`` read-only: when changes to the system memory map
+ occur due to hot un/plug of memory, this file contains
+ '1' if the kernel updates the kdump capture kernel memory
+ map itself (via elfcorehdr), or '0' if userspace must update
+ the kdump capture kernel memory map.
+
+ Availability depends on the CONFIG_MEMORY_HOTPLUG kernel
+ configuration option.
====================== =========================================================
.. note::
@@ -433,6 +441,18 @@ The following module parameters are currently defined:
memory in a way that huge pages in bigger
granularity cannot be formed on hotplugged
memory.
+
+ With value "force" it could result in memory
+ wastage due to memmap size limitations. For
+ example, if the memmap for a memory block
+ requires 1 MiB, but the pageblock size is 2
+ MiB, 1 MiB of hotplugged memory will be wasted.
+ Note that there are still cases where the
+ feature cannot be enforced: for example, if the
+ memmap is smaller than a single page, or if the
+ architecture does not support the forced mode
+ in all configurations.
+
``online_policy`` read-write: Set the basic policy used for
automatic zone selection when onlining memory
blocks without specifying a target zone.
@@ -669,7 +689,7 @@ when still encountering permanently unmovable pages within ZONE_MOVABLE
(-> BUG), memory offlining will keep retrying until it eventually succeeds.
When offlining is triggered from user space, the offlining context can be
-terminated by sending a fatal signal. A timeout based offlining can easily be
+terminated by sending a signal. A timeout based offlining can easily be
implemented via::
% timeout $TIMEOUT offline_block | failure_handling
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 46515ad2337f..eca38fa81e0f 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -109,7 +109,7 @@ VMA Policy
* A task may install a new VMA policy on a sub-range of a
previously mmap()ed region. When this happens, Linux splits
the existing virtual memory area into 2 or 3 VMAs, each with
- it's own policy.
+ its own policy.
* By default, VMA policy applies only to pages allocated after
the policy is installed. Any pages already faulted into the
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 7c304e432205..4349a8c2b978 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -244,6 +244,21 @@ write-protected (so future writes will also result in a WP fault). These ioctls
support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
respectively) to configure the mapping this way.
+Memory Poisioning Emulation
+---------------------------
+
+In response to a fault (either missing or minor), an action userspace can
+take to "resolve" it is to issue a ``UFFDIO_POISON``. This will cause any
+future faulters to either get a SIGBUS, or in KVM's case the guest will
+receive an MCE as if there were hardware memory poisoning.
+
+This is used to emulate hardware memory poisoning. Imagine a VM running on a
+machine which experiences a real hardware memory error. Later, we live migrate
+the VM to another physical machine. Since we want the migration to be
+transparent to the guest, we want that same address range to act as if it was
+still poisoned, even though it's on a new physical host which ostensibly
+doesn't have a memory error in the exact same spot.
+
QEMU/KVM
========
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index c5c2c7dbb155..45b98390e938 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -49,7 +49,7 @@ compressed pool.
Design
======
-Zswap receives pages for compression through the Frontswap API and is able to
+Zswap receives pages for compression from the swap subsystem and is able to
evict pages from its own compressed pool on an LRU basis and write them back to
the backing swap device in the case that the compressed pool is full.
@@ -70,19 +70,19 @@ means the compression ratio will always be 2:1 or worse (because of half-full
zbud pages). The zsmalloc type zpool has a more complex compressed page
storage method, and it can achieve greater storage densities.
-When a swap page is passed from frontswap to zswap, zswap maintains a mapping
+When a swap page is passed from swapout to zswap, zswap maintains a mapping
of the swap entry, a combination of the swap type and swap offset, to the zpool
handle that references that compressed swap page. This mapping is achieved
with a red-black tree per swap type. The swap offset is the search key for the
tree nodes.
-During a page fault on a PTE that is a swap entry, frontswap calls the zswap
-load function to decompress the page into the page allocated by the page fault
-handler.
+During a page fault on a PTE that is a swap entry, the swapin code calls the
+zswap load function to decompress the page into the page allocated by the page
+fault handler.
Once there are no PTEs referencing a swap page stored in zswap (i.e. the count
-in the swap_map goes to 0) the swap code calls the zswap invalidate function,
-via frontswap, to free the compressed entry.
+in the swap_map goes to 0) the swap code calls the zswap invalidate function
+to free the compressed entry.
Zswap seeks to be simple in its policies. Sysfs attributes allow for one user
controlled policy:
diff --git a/Documentation/admin-guide/module-signing.rst b/Documentation/admin-guide/module-signing.rst
index 7d7c7c8a545c..2898b2703297 100644
--- a/Documentation/admin-guide/module-signing.rst
+++ b/Documentation/admin-guide/module-signing.rst
@@ -266,7 +266,7 @@ for which it has a public key. Otherwise, it will also load modules that are
unsigned. Any module for which the kernel has a key, but which proves to have
a signature mismatch will not be permitted to load.
-Any module that has an unparseable signature will be rejected.
+Any module that has an unparsable signature will be rejected.
=========================================
diff --git a/Documentation/admin-guide/perf/alibaba_pmu.rst b/Documentation/admin-guide/perf/alibaba_pmu.rst
index 11de998bb480..7d840023903f 100644
--- a/Documentation/admin-guide/perf/alibaba_pmu.rst
+++ b/Documentation/admin-guide/perf/alibaba_pmu.rst
@@ -88,6 +88,11 @@ data bandwidth::
-e ali_drw_27080/hif_rmw/ \
-e ali_drw_27080/cycle/ -- sleep 10
+Example usage of counting all memory read/write bandwidth by metric::
+
+ perf stat -M ddr_read_bandwidth.all -- sleep 10
+ perf stat -M ddr_write_bandwidth.all -- sleep 10
+
The average DRAM bandwidth can be calculated as follows:
- Read Bandwidth = perf_hif_rd * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle
diff --git a/Documentation/admin-guide/serial-console.rst b/Documentation/admin-guide/serial-console.rst
index 8c8b94e54e26..a3dfc2c66e01 100644
--- a/Documentation/admin-guide/serial-console.rst
+++ b/Documentation/admin-guide/serial-console.rst
@@ -59,7 +59,7 @@ times. In this case, there are the following two rules:
the hardware is not available.
The result might be surprising. For example, the following two command
-lines have the same result:
+lines have the same result::
console=ttyS1,9600 console=tty0 console=tty1
console=tty0 console=ttyS1,9600 console=tty1
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 3800fab1619b..cf33de56da27 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -450,6 +450,35 @@ this allows system administrators to override the
``IA64_THREAD_UAC_NOPRINT`` ``prctl`` and avoid logs being flooded.
+io_uring_disabled
+=================
+
+Prevents all processes from creating new io_uring instances. Enabling this
+shrinks the kernel's attack surface.
+
+= ======================================================================
+0 All processes can create io_uring instances as normal. This is the
+ default setting.
+1 io_uring creation is disabled (io_uring_setup() will fail with
+ -EPERM) for unprivileged processes not in the io_uring_group group.
+ Existing io_uring instances can still be used. See the
+ documentation for io_uring_group for more information.
+2 io_uring creation is disabled for all processes. io_uring_setup()
+ always fails with -EPERM. Existing io_uring instances can still be
+ used.
+= ======================================================================
+
+
+io_uring_group
+==============
+
+When io_uring_disabled is set to 1, a process must either be
+privileged (CAP_SYS_ADMIN) or be in the io_uring_group group in order
+to create an io_uring instance. If io_uring_group is set to -1 (the
+default), only processes with the CAP_SYS_ADMIN capability may create
+io_uring instances.
+
+
kexec_load_disabled
===================
@@ -941,16 +970,35 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
The default value is 8.
-perf_user_access (arm64 only)
-=================================
+perf_user_access (arm64 and riscv only)
+=======================================
+
+Controls user space access for reading perf event counters.
-Controls user space access for reading perf event counters. When set to 1,
-user space can read performance monitor counter registers directly.
+arm64
+=====
The default value is 0 (access disabled).
+When set to 1, user space can read performance monitor counter registers
+directly.
+
See Documentation/arch/arm64/perf.rst for more information.
+riscv
+=====
+
+When set to 0, user space access is disabled.
+
+The default value is 1, user space can read performance monitor counter
+registers through perf, any direct access without perf intervention will trigger
+an illegal instruction.
+
+When set to 2, which enables legacy mode (user space has direct access to cycle
+and insret CSRs only). Note that this legacy value is deprecated and will be
+removed once all user space applications are fixed.
+
+Note that the time CSR is always directly accessible to all modes.
pid_max
=======
diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
index 3a9c041d7f6c..b67772cf36d6 100644
--- a/Documentation/admin-guide/xfs.rst
+++ b/Documentation/admin-guide/xfs.rst
@@ -192,7 +192,7 @@ When mounting an XFS filesystem, the following options are accepted.
are any integer multiple of a valid ``sunit`` value.
Typically the only time these mount options are necessary if
- after an underlying RAID device has had it's geometry
+ after an underlying RAID device has had its geometry
modified, such as adding a new disk to a RAID5 lun and
reshaping it.