summaryrefslogtreecommitdiff
path: root/arch/x86/kernel
AgeCommit message (Collapse)Author
2023-11-19Merge tag 'x86_urgent_for_v6.7_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Ignore invalid x2APIC entries in order to not waste per-CPU data - Fix a back-to-back signals handling scenario when shadow stack is in use - A documentation fix - Add Kirill as TDX maintainer * tag 'x86_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/acpi: Ignore invalid x2APIC entries x86/shstk: Delay signal entry SSP write until after user accesses x86/Documentation: Indent 'note::' directive for protocol version number note MAINTAINERS: Add Intel TDX entry
2023-11-09x86/acpi: Ignore invalid x2APIC entriesZhang Rui
Currently, the kernel enumerates the possible CPUs by parsing both ACPI MADT Local APIC entries and x2APIC entries. So CPUs with "valid" APIC IDs, even if they have duplicated APIC IDs in Local APIC and x2APIC, are always enumerated. Below is what ACPI MADT Local APIC and x2APIC describes on an Ivebridge-EP system, [02Ch 0044 1] Subtable Type : 00 [Processor Local APIC] [02Fh 0047 1] Local Apic ID : 00 ... [164h 0356 1] Subtable Type : 00 [Processor Local APIC] [167h 0359 1] Local Apic ID : 39 [16Ch 0364 1] Subtable Type : 00 [Processor Local APIC] [16Fh 0367 1] Local Apic ID : FF ... [3ECh 1004 1] Subtable Type : 09 [Processor Local x2APIC] [3F0h 1008 4] Processor x2Apic ID : 00000000 ... [B5Ch 2908 1] Subtable Type : 09 [Processor Local x2APIC] [B60h 2912 4] Processor x2Apic ID : 00000077 As a result, kernel shows "smpboot: Allowing 168 CPUs, 120 hotplug CPUs". And this wastes significant amount of memory for the per-cpu data. Plus this also breaks https://lore.kernel.org/all/87edm36qqb.ffs@tglx/, because __max_logical_packages is over-estimated by the APIC IDs in the x2APIC entries. According to https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#processor-local-x2apic-structure: "[Compatibility note] On some legacy OSes, Logical processors with APIC ID values less than 255 (whether in XAPIC or X2APIC mode) must use the Processor Local APIC structure to convey their APIC information to OSPM, and those processors must be declared in the DSDT using the Processor() keyword. Logical processors with APIC ID values 255 and greater must use the Processor Local x2APIC structure and be declared using the Device() keyword." Therefore prevent the registration of x2APIC entries with an APIC ID less than 255 if the local APIC table enumerates valid APIC IDs. [ tglx: Simplify the logic ] Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230702162802.344176-1-rui.zhang@intel.com
2023-11-08x86/shstk: Delay signal entry SSP write until after user accessesRick Edgecombe
When a signal is being delivered, the kernel needs to make accesses to userspace. These accesses could encounter an access error, in which case the signal delivery itself will trigger a segfault. Usually this would result in the kernel killing the process. But in the case of a SEGV signal handler being configured, the failure of the first signal delivery will result in *another* signal getting delivered. The second signal may succeed if another thread has resolved the issue that triggered the segfault (i.e. a well timed mprotect()/mmap()), or the second signal is being delivered to another stack (i.e. an alt stack). On x86, in the non-shadow stack case, all the accesses to userspace are done before changes to the registers (in pt_regs). The operation is aborted when an access error occurs, so although there may be writes done for the first signal, control flow changes for the signal (regs->ip, regs->sp, etc) are not committed until all the accesses have already completed successfully. This means that the second signal will be delivered as if it happened at the time of the first signal. It will effectively replace the first aborted signal, overwriting the half-written frame of the aborted signal. So on sigreturn from the second signal, control flow will resume happily from the point of control flow where the original signal was delivered. The problem is, when shadow stack is active, the shadow stack SSP register/MSR is updated *before* some of the userspace accesses. This means if the earlier accesses succeed and the later ones fail, the second signal will not be delivered at the same spot on the shadow stack as the first one. So on sigreturn from the second signal, the SSP will be pointing to the wrong location on the shadow stack (off by a frame). Pengfei privately reported that while using a shadow stack enabled glibc, the “signal06” test in the LTP test-suite hung. It turns out it is testing the above described double signal scenario. When this test was compiled with shadow stack, the first signal pushed a shadow stack sigframe, then the second pushed another. When the second signal was handled, the SSP was at the first shadow stack signal frame instead of the original location. The test then got stuck as the #CP from the twice incremented SSP was incorrect and generated segfaults in a loop. Fix this by adjusting the SSP register only after any userspace accesses, such that there can be no failures after the SSP is adjusted. Do this by moving the shadow stack sigframe push logic to happen after all other userspace accesses. Note, sigreturn (as opposed to the signal delivery dealt with in this patch) has ordering behavior that could lead to similar failures. The ordering issues there extend beyond shadow stack to include the alt stack restoration. Fixing that would require cross-arch changes, and the ordering today does not cause any known test or apps breakages. So leave it as is, for now. [ dhansen: minor changelog/subject tweak ] Fixes: 05e36022c054 ("x86/shstk: Handle signals for shadow stack") Reported-by: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20231107182251.91276-1-rick.p.edgecombe%40intel.com Link: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/signal/signal06.c
2023-11-04Merge tag 'x86_microcode_for_v6.7_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 microcode loading updates from Borislac Petkov: "Major microcode loader restructuring, cleanup and improvements by Thomas Gleixner: - Restructure the code needed for it and add a temporary initrd mapping on 32-bit so that the loader can access the microcode blobs. This in itself is a preparation for the next major improvement: - Do not load microcode on 32-bit before paging has been enabled. Handling this has caused an endless stream of headaches, issues, ugly code and unnecessary hacks in the past. And there really wasn't any sensible reason to do that in the first place. So switch the 32-bit loading to happen after paging has been enabled and turn the loader code "real purrty" again - Drop mixed microcode steppings loading on Intel - there, a single patch loaded on the whole system is sufficient - Rework late loading to track which CPUs have updated microcode successfully and which haven't, act accordingly - Move late microcode loading on Intel in NMI context in order to guarantee concurrent loading on all threads - Make the late loading CPU-hotplug-safe and have the offlined threads be woken up for the purpose of the update - Add support for a minimum revision which determines whether late microcode loading is safe on a machine and the microcode does not change software visible features which the machine cannot use anyway since feature detection has happened already. Roughly, the minimum revision is the smallest revision number which must be loaded currently on the system so that late updates can be allowed - Other nice leanups, fixess, etc all over the place" * tag 'x86_microcode_for_v6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits) x86/microcode/intel: Add a minimum required revision for late loading x86/microcode: Prepare for minimal revision check x86/microcode: Handle "offline" CPUs correctly x86/apic: Provide apic_force_nmi_on_cpu() x86/microcode: Protect against instrumentation x86/microcode: Rendezvous and load in NMI x86/microcode: Replace the all-in-one rendevous handler x86/microcode: Provide new control functions x86/microcode: Add per CPU control field x86/microcode: Add per CPU result state x86/microcode: Sanitize __wait_for_cpus() x86/microcode: Clarify the late load logic x86/microcode: Handle "nosmt" correctly x86/microcode: Clean up mc_cpu_down_prep() x86/microcode: Get rid of the schedule work indirection x86/microcode: Mop up early loading leftovers x86/microcode/amd: Use cached microcode for AP load x86/microcode/amd: Cache builtin/initrd microcode early x86/microcode/amd: Cache builtin microcode too x86/microcode/amd: Use correct per CPU ucode_cpu_info ...
2023-11-03Merge tag 'tty-6.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty and serial updates from Greg KH: "Here is the big set of tty/serial driver changes for 6.7-rc1. Included in here are: - console/vgacon cleanups and removals from Arnd - tty core and n_tty cleanups from Jiri - lots of 8250 driver updates and cleanups - sc16is7xx serial driver updates - dt binding updates - first set of port lock wrapers from Thomas for the printk fixes coming in future releases - other small serial and tty core cleanups and updates All of these have been in linux-next for a while with no reported issues" * tag 'tty-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (193 commits) serdev: Replace custom code with device_match_acpi_handle() serdev: Simplify devm_serdev_device_open() function serdev: Make use of device_set_node() tty: n_gsm: add copyright Siemens Mobility GmbH tty: n_gsm: fix race condition in status line change on dead connections serial: core: Fix runtime PM handling for pending tx vgacon: fix mips/sibyte build regression dt-bindings: serial: drop unsupported samsung bindings tty: serial: samsung: drop earlycon support for unsupported platforms tty: 8250: Add note for PX-835 tty: 8250: Fix IS-200 PCI ID comment tty: 8250: Add Brainboxes Oxford Semiconductor-based quirks tty: 8250: Add support for Intashield IX cards tty: 8250: Add support for additional Brainboxes PX cards tty: 8250: Fix up PX-803/PX-857 tty: 8250: Fix port count of PX-257 tty: 8250: Add support for Intashield IS-100 tty: 8250: Add support for Brainboxes UP cards tty: 8250: Add support for additional Brainboxes UC cards tty: 8250: Remove UC-257 and UC-431 ...
2023-11-02Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "As usual, lots of singleton and doubleton patches all over the tree and there's little I can say which isn't in the individual changelogs. The lengthier patch series are - 'kdump: use generic functions to simplify crashkernel reservation in arch', from Baoquan He. This is mainly cleanups and consolidation of the 'crashkernel=' kernel parameter handling - After much discussion, David Laight's 'minmax: Relax type checks in min() and max()' is here. Hopefully reduces some typecasting and the use of min_t() and max_t() - A group of patches from Oleg Nesterov which clean up and slightly fix our handling of reads from /proc/PID/task/... and which remove task_struct.thread_group" * tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits) scripts/gdb/vmalloc: disable on no-MMU scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n .mailmap: add address mapping for Tomeu Vizoso mailmap: update email address for Claudiu Beznea tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions .mailmap: map Benjamin Poirier's address scripts/gdb: add lx_current support for riscv ocfs2: fix a spelling typo in comment proc: test ProtectionKey in proc-empty-vm test proc: fix proc-empty-vm test with vsyscall fs/proc/base.c: remove unneeded semicolon do_io_accounting: use sig->stats_lock do_io_accounting: use __for_each_thread() ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error() ocfs2: fix a typo in a comment scripts/show_delta: add __main__ judgement before main code treewide: mark stuff as __ro_after_init fs: ocfs2: check status values proc: test /proc/${pid}/statm compiler.h: move __is_constexpr() to compiler.h ...
2023-11-01Merge tag 'sysctl-6.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "To help make the move of sysctls out of kernel/sysctl.c not incur a size penalty sysctl has been changed to allow us to not require the sentinel, the final empty element on the sysctl array. Joel Granados has been doing all this work. On the v6.6 kernel we got the major infrastructure changes required to support this. For v6.7-rc1 we have all arch/ and drivers/ modified to remove the sentinel. Both arch and driver changes have been on linux-next for a bit less than a month. It is worth re-iterating the value: - this helps reduce the overall build time size of the kernel and run time memory consumed by the kernel by about ~64 bytes per array - the extra 64-byte penalty is no longer inncurred now when we move sysctls out from kernel/sysctl.c to their own files For v6.8-rc1 expect removal of all the sentinels and also then the unneeded check for procname == NULL. The last two patches are fixes recently merged by Krister Johansen which allow us again to use softlockup_panic early on boot. This used to work but the alias work broke it. This is useful for folks who want to detect softlockups super early rather than wait and spend money on cloud solutions with nothing but an eventual hung kernel. Although this hadn't gone through linux-next it's also a stable fix, so we might as well roll through the fixes now" * tag 'sysctl-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (23 commits) watchdog: move softlockup_panic back to early_param proc: sysctl: prevent aliased sysctls from getting passed to init intel drm: Remove now superfluous sentinel element from ctl_table array Drivers: hv: Remove now superfluous sentinel element from ctl_table array raid: Remove now superfluous sentinel element from ctl_table array fw loader: Remove the now superfluous sentinel element from ctl_table array sgi-xp: Remove the now superfluous sentinel element from ctl_table array vrf: Remove the now superfluous sentinel element from ctl_table array char-misc: Remove the now superfluous sentinel element from ctl_table array infiniband: Remove the now superfluous sentinel element from ctl_table array macintosh: Remove the now superfluous sentinel element from ctl_table array parport: Remove the now superfluous sentinel element from ctl_table array scsi: Remove now superfluous sentinel element from ctl_table array tty: Remove now superfluous sentinel element from ctl_table array xen: Remove now superfluous sentinel element from ctl_table array hpet: Remove now superfluous sentinel element from ctl_table array c-sky: Remove now superfluous sentinel element from ctl_talbe array powerpc: Remove now superfluous sentinel element from ctl_table arrays riscv: Remove now superfluous sentinel element from ctl_table array x86/vdso: Remove now superfluous sentinel element from ctl_table array ...
2023-11-01Merge tag 'x86_tdx_for_6.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 TDX updates from Dave Hansen: "The majority of this is a rework of the assembly and C wrappers that are used to talk to the TDX module and VMM. This is a nice cleanup in general but is also clearing the way for using this code when Linux is the TDX VMM. There are also some tidbits to make TDX guests play nicer with Hyper-V and to take advantage the hardware TSC. Summary: - Refactor and clean up TDX hypercall/module call infrastructure - Handle retrying/resuming page conversion hypercalls - Make sure to use the (shockingly) reliable TSC in TDX guests" [ TLA reminder: TDX is "Trust Domain Extensions", Intel's guest VM confidentiality technology ] * tag 'x86_tdx_for_6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tdx: Mark TSC reliable x86/tdx: Fix __noreturn build warning around __tdx_hypercall_failed() x86/virt/tdx: Make TDX_MODULE_CALL handle SEAMCALL #UD and #GP x86/virt/tdx: Wire up basic SEAMCALL functions x86/tdx: Remove 'struct tdx_hypercall_args' x86/tdx: Reimplement __tdx_hypercall() using TDX_MODULE_CALL asm x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL x86/tdx: Extend TDX_MODULE_CALL to support more TDCALL/SEAMCALL leafs x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure x86/tdx: Rename __tdx_module_call() to __tdcall() x86/tdx: Make macros of TDCALLs consistent with the spec x86/tdx: Skip saving output regs when SEAMCALL fails with VMFailInvalid x86/tdx: Zero out the missing RSI in TDX_HYPERCALL macro x86/tdx: Retry partially-completed page conversion hypercalls
2023-10-30Merge tag 'rcu-next-v6.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull RCU updates from Frederic Weisbecker: - RCU torture, locktorture and generic torture infrastructure updates that include various fixes, cleanups and consolidations. Among the user visible things, ftrace dumps can now be found into their own file, and module parameters get better documented and reported on dumps. - Generic and misc fixes all over the place. Some highlights: * Hotplug handling has seen some light cleanups and comments * An RCU barrier can now be triggered through sysfs to serialize memory stress testing and avoid OOM * Object information is now dumped in case of invalid callback invocation * Also various SRCU issues, too hard to trigger to deserve urgent pull requests, have been fixed - RCU documentation updates - RCU reference scalability test minor fixes and doc improvements. - RCU tasks minor fixes - Stall detection updates. Introduce RCU CPU Stall notifiers that allows a subsystem to provide informations to help debugging. Also cure some false positive stalls. * tag 'rcu-next-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (56 commits) srcu: Only accelerate on enqueue time locktorture: Check the correct variable for allocation failure srcu: Fix callbacks acceleration mishandling rcu: Comment why callbacks migration can't wait for CPUHP_RCUTREE_PREP rcu: Standardize explicit CPU-hotplug calls rcu: Conditionally build CPU-hotplug teardown callbacks rcu: Remove references to rcu_migrate_callbacks() from diagrams rcu: Assume rcu_report_dead() is always called locally rcu: Assume IRQS disabled from rcu_report_dead() rcu: Use rcu_segcblist_segempty() instead of open coding it rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects srcu: Fix srcu_struct node grpmask overflow on 64-bit systems torture: Convert parse-console.sh to mktemp rcutorture: Traverse possible cpu to set maxcpu in rcu_nocb_toggle() rcutorture: Replace schedule_timeout*() 1-jiffy waits with HZ/20 torture: Add kvm.sh --debug-info argument locktorture: Rename readers_bind/writers_bind to bind_readers/bind_writers doc: Catch-up update for locktorture module parameters locktorture: Add call_rcu_chains module parameter locktorture: Add new module parameters to lock_torture_print_module_parms() ...
2023-10-30Merge tag 'x86-core-2023-10-29-v2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 core updates from Thomas Gleixner: - Limit the hardcoded topology quirk for Hygon CPUs to those which have a model ID less than 4. The newer models have the topology CPUID leaf 0xB correctly implemented and are not affected. - Make SMT control more robust against enumeration failures SMT control was added to allow controlling SMT at boottime or runtime. The primary purpose was to provide a simple mechanism to disable SMT in the light of speculation attack vectors. It turned out that the code is sensible to enumeration failures and worked only by chance for XEN/PV. XEN/PV has no real APIC enumeration which means the primary thread mask is not set up correctly. By chance a XEN/PV boot ends up with smp_num_siblings == 2, which makes the hotplug control stay at its default value "enabled". So the mask is never evaluated. The ongoing rework of the topology evaluation caused XEN/PV to end up with smp_num_siblings == 1, which sets the SMT control to "not supported" and the empty primary thread mask causes the hotplug core to deny the bringup of the APS. Make the decision logic more robust and take 'not supported' and 'not implemented' into account for the decision whether a CPU should be booted or not. - Fake primary thread mask for XEN/PV Pretend that all XEN/PV vCPUs are primary threads, which makes the usage of the primary thread mask valid on XEN/PV. That is consistent with because all of the topology information on XEN/PV is fake or even non-existent. - Encapsulate topology information in cpuinfo_x86 Move the randomly scattered topology data into a separate data structure for readability and as a preparatory step for the topology evaluation overhaul. - Consolidate APIC ID data type to u32 It's fixed width hardware data and not randomly u16, int, unsigned long or whatever developers decided to use. - Cure the abuse of cpuinfo for persisting logical IDs. Per CPU cpuinfo is used to persist the logical package and die IDs. That's really not the right place simply because cpuinfo is subject to be reinitialized when a CPU goes through an offline/online cycle. Use separate per CPU data for the persisting to enable the further topology management rework. It will be removed once the new topology management is in place. - Provide a debug interface for inspecting topology information Useful in general and extremly helpful for validating the topology management rework in terms of correctness or "bug" compatibility. * tag 'x86-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/apic, x86/hyperv: Use u32 in hv_snp_boot_ap() too x86/cpu: Provide debug interface x86/cpu/topology: Cure the abuse of cpuinfo for persisting logical ids x86/apic: Use u32 for wakeup_secondary_cpu[_64]() x86/apic: Use u32 for [gs]et_apic_id() x86/apic: Use u32 for phys_pkg_id() x86/apic: Use u32 for cpu_present_to_apicid() x86/apic: Use u32 for check_apicid_used() x86/apic: Use u32 for APIC IDs in global data x86/apic: Use BAD_APICID consistently x86/cpu: Move cpu_l[l2]c_id into topology info x86/cpu: Move logical package and die IDs into topology info x86/cpu: Remove pointless evaluation of x86_coreid_bits x86/cpu: Move cu_id into topology info x86/cpu: Move cpu_core_id into topology info hwmon: (fam15h_power) Use topology_core_id() scsi: lpfc: Use topology_core_id() x86/cpu: Move cpu_die_id into topology info x86/cpu: Move phys_proc_id into topology info x86/cpu: Encapsulate topology information in cpuinfo_x86 ...
2023-10-30Merge tag 'x86-apic-2023-10-29-v2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 APIC updates from Thomas Gleixner: - Make the quirk for non-maskable MSI interrupts in the affinity setter functional again. It was broken by a MSI core code update, which restructured the code in a way that the quirk flag was not longer set correctly. Trying to restore the core logic caused a deeper inspection and it turned out that the extra quirk flag is not required at all because it's the inverse of the reservation mode bit, which only can be set when the MSI interrupt is maskable. So the trivial fix is to use the reservation mode check in the affinity setter function and remove almost 40 lines of code related to the no-mask quirk flag. - Cure a Kconfig dependency issue which causes compile failures by correcting the conditionals in the affected header files. - Clean up coding style in the UV APIC driver. * tag 'x86-apic-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/apic/msi: Fix misconfigured non-maskable MSI quirk x86/msi: Fix compile error caused by CONFIG_GENERIC_MSI_IRQ=y && !CONFIG_X86_LOCAL_APIC x86/platform/uv/apic: Clean up inconsistent indenting
2023-10-30Merge tag 'x86-mm-2023-10-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm handling updates from Ingo Molnar: - Add new NX-stack self-test - Improve NUMA partial-CFMWS handling - Fix #VC handler bugs resulting in SEV-SNP boot failures - Drop the 4MB memory size restriction on minimal NUMA nodes - Reorganize headers a bit, in preparation to header dependency reduction efforts - Misc cleanups & fixes * tag 'x86-mm-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Drop the 4 MB restriction on minimal NUMA node memory size selftests/x86/lam: Zero out buffer for readlink() x86/sev: Drop unneeded #include x86/sev: Move sev_setup_arch() to mem_encrypt.c x86/tdx: Replace deprecated strncpy() with strtomem_pad() selftests/x86/mm: Add new test that userspace stack is in fact NX x86/sev: Make boot_ghcb_page[] static x86/boot: Move x86_cache_alignment initialization to correct spot x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach x86/sev-es: Allow copy_from_kernel_nofault() in earlier boot x86_64: Show CR4.PSE on auxiliaries like on BSP x86/iommu/docs: Update AMD IOMMU specification document URL x86/sev/docs: Update document URL in amd-memory-encryption.rst x86/mm: Move arch_memory_failure() and arch_is_platform_page() definitions from <asm/processor.h> to <asm/pgtable.h> ACPI/NUMA: Apply SRAT proximity domain to entire CFMWS window x86/numa: Introduce numa_fill_memblks()
2023-10-30Merge tag 'x86-irq-2023-10-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 irq fix from Ingo Molnar: "Fix out-of-order NMI nesting checks resulting in false positive warnings" * tag 'x86-irq-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/nmi: Fix out-of-order NMI nesting checks & false positive warning
2023-10-30Merge tag 'x86-entry-2023-10-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 entry updates from Ingo Molnar: - Make IA32_EMULATION boot time configurable with the new ia32_emulation=<bool> boot option - Clean up fast syscall return validation code: convert it to C and refactor the code - As part of this, optimize the canonical RIP test code * tag 'x86-entry-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/entry/32: Clean up syscall fast exit tests x86/entry/64: Use TASK_SIZE_MAX for canonical RIP test x86/entry/64: Convert SYSRET validation tests to C x86/entry/32: Remove SEP test for SYSEXIT x86/entry/32: Convert do_fast_syscall_32() to bool return type x86/entry/compat: Combine return value test from syscall handler x86/entry/64: Remove obsolete comment on tracing vs. SYSRET x86: Make IA32_EMULATION boot time configurable x86/entry: Make IA32 syscalls' availability depend on ia32_enabled() x86/elf: Make loading of 32bit processes depend on ia32_enabled() x86/entry: Compile entry_SYSCALL32_ignore() unconditionally x86/entry: Rename ignore_sysret() x86: Introduce ia32_enabled()
2023-10-30Merge tag 'x86-boot-2023-10-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 boot updates from Ingo Molnar: - Rework PE header generation, primarily to generate a modern, 4k aligned kernel image view with narrower W^X permissions. - Further refine init-lifetime annotations - Misc cleanups & fixes * tag 'x86-boot-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/boot: efistub: Assign global boot_params variable x86/boot: Rename conflicting 'boot_params' pointer to 'boot_params_ptr' x86/head/64: Move the __head definition to <asm/init.h> x86/head/64: Add missing __head annotation to startup_64_load_idt() x86/head/64: Mark 'startup_gdt[]' and 'startup_gdt_descr' as __initdata x86/boot: Harmonize the style of array-type parameter for fixup_pointer() calls x86/boot: Fix incorrect startup_gdt_descr.size x86/boot: Compile boot code with -std=gnu11 too x86/boot: Increase section and file alignment to 4k/512 x86/boot: Split off PE/COFF .data section x86/boot: Drop PE/COFF .reloc section x86/boot: Construct PE/COFF .text section from assembler x86/boot: Derive file size from _edata symbol x86/boot: Define setup size in linker script x86/boot: Set EFI handover offset directly in header asm x86/boot: Grab kernel_info offset from zoffset header directly x86/boot: Drop references to startup_64 x86/boot: Drop redundant code setting the root device x86/boot: Omit compression buffer from PE/COFF image memory footprint x86/boot: Remove the 'bugger off' message ...
2023-10-30Merge tag 'x86-headers-2023-10-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 header file cleanup from Ingo Molnar: "Replace <asm/export.h> uses with <linux/export.h> and then remove <asm/export.h>" * tag 'x86-headers-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/headers: Remove <asm/export.h> x86/headers: Replace #include <asm/export.h> with #include <linux/export.h> x86/headers: Remove unnecessary #include <asm/export.h>
2023-10-30Merge tag 'objtool-core-2023-10-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molnar: "Misc fixes and cleanups: - Fix potential MAX_NAME_LEN limit related build failures - Fix scripts/faddr2line symbol filtering bug - Fix scripts/faddr2line on LLVM=1 - Fix scripts/faddr2line to accept readelf output with mapping symbols - Minor cleanups" * tag 'objtool-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: scripts/faddr2line: Skip over mapping symbols in output from readelf scripts/faddr2line: Use LLVM addr2line and readelf if LLVM=1 scripts/faddr2line: Don't filter out non-function symbols from readelf objtool: Remove max symbol name length limitation objtool: Propagate early errors objtool: Use 'the fallthrough' pseudo-keyword x86/speculation, objtool: Use absolute relocations for annotations x86/unwind/orc: Remove redundant initialization of 'mid' pointer in __orc_find()
2023-10-30Merge tag 'sched-core-2023-10-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Fair scheduler (SCHED_OTHER) improvements: - Remove the old and now unused SIS_PROP code & option - Scan cluster before LLC in the wake-up path - Use candidate prev/recent_used CPU if scanning failed for cluster wakeup NUMA scheduling improvements: - Improve the VMA access-PID code to better skip/scan VMAs - Extend tracing to cover VMA-skipping decisions - Improve/fix the recently introduced sched_numa_find_nth_cpu() code - Generalize numa_map_to_online_node() Energy scheduling improvements: - Remove the EM_MAX_COMPLEXITY limit - Add tracepoints to track energy computation - Make the behavior of the 'sched_energy_aware' sysctl more consistent - Consolidate and clean up access to a CPU's max compute capacity - Fix uclamp code corner cases RT scheduling improvements: - Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates - Drive the ->rto_mask with rt_rq->pushable_tasks updates Scheduler scalability improvements: - Rate-limit updates to tg->load_avg - On x86 disable IBRS when CPU is offline to improve single-threaded performance - Micro-optimize in_task() and in_interrupt() - Micro-optimize the PSI code - Avoid updating PSI triggers and ->rtpoll_total when there are no state changes Core scheduler infrastructure improvements: - Use saved_state to reduce some spurious freezer wakeups - Bring in a handful of fast-headers improvements to scheduler headers - Make the scheduler UAPI headers more widely usable by user-space - Simplify the control flow of scheduler syscalls by using lock guards - Fix sched_setaffinity() vs. CPU hotplug race Scheduler debuggability improvements: - Disallow writing invalid values to sched_rt_period_us - Fix a race in the rq-clock debugging code triggering warnings - Fix a warning in the bandwidth distribution code - Micro-optimize in_atomic_preempt_off() checks - Enforce that the tasklist_lock is held in for_each_thread() - Print the TGID in sched_show_task() - Remove the /proc/sys/kernel/sched_child_runs_first sysctl ... and misc cleanups & fixes" * tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits) sched/fair: Remove SIS_PROP sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup sched/fair: Scan cluster before scanning LLC in wake-up path sched: Add cpus_share_resources API sched/core: Fix RQCF_ACT_SKIP leak sched/fair: Remove unused 'curr' argument from pick_next_entity() sched/nohz: Update comments about NEWILB_KICK sched/fair: Remove duplicate #include sched/psi: Update poll => rtpoll in relevant comments sched: Make PELT acronym definition searchable sched: Fix stop_one_cpu_nowait() vs hotplug sched/psi: Bail out early from irq time accounting sched/topology: Rename 'DIE' domain to 'PKG' sched/psi: Delete the 'update_total' function parameter from update_triggers() sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes sched/headers: Remove comment referring to rq::cpu_load, since this has been removed sched/numa: Complete scanning of inactive VMAs when there is no alternative sched/numa: Complete scanning of partial VMAs regardless of PID activity sched/numa: Move up the access pid reset logic sched/numa: Trace decisions related to skipping VMAs ...
2023-10-30Merge tag 'x86_fpu_for_6.7_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fpu fixlet from Borislav Petkov: - kernel-doc fix * tag 'x86_fpu_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu/xstate: Address kernel-doc warning
2023-10-30Merge tag 'x86_platform_for_6.7_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 platform updates from Borislav Petkov: - Make sure PCI function 4 IDs of AMD family 0x19, models 0x60-0x7f are actually used in the amd_nb.c enumeration - Add support for extracting NUMA information from devicetree for Hyper-V usages - Add PCI device IDs for the new AMD MI300 AI accelerators - Annotate an array in struct uv_rtc_timer_head with the new __counted_by attribute - Rework UV's NMI action parameter handling * tag 'x86_platform_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/amd_nb: Use Family 19h Models 60h-7Fh Function 4 IDs x86/numa: Add Devicetree support x86/of: Move the x86_flattree_get_config() call out of x86_dtb_init() x86/amd_nb: Add AMD Family MI300 PCI IDs x86/platform/uv: Annotate struct uv_rtc_timer_head with __counted_by x86/platform/uv: Rework NMI "action" modparam handling
2023-10-30Merge tag 'x86_cpu_for_6.7_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpuid updates from Borislav Petkov: - Make sure the "svm" feature flag is cleared from /proc/cpuinfo when virtualization support is disabled in the BIOS on AMD and Hygon platforms - A minor cleanup * tag 'x86_cpu_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu/amd: Remove redundant 'break' statement x86/cpu: Clear SVM feature if disabled by BIOS
2023-10-30Merge tag 'x86_cache_for_6.7_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 resource control updates from Borislav Petkov: - Add support for non-contiguous capacity bitmasks being added to Intel's CAT implementation - Other improvements to resctrl code: better configuration, simplifications, debugging support, fixes * tag 'x86_cache_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/resctrl: Display RMID of resource group x86/resctrl: Add support for the files of MON groups only x86/resctrl: Display CLOSID for resource group x86/resctrl: Introduce "-o debug" mount option x86/resctrl: Move default group file creation to mount x86/resctrl: Unwind properly from rdt_enable_ctx() x86/resctrl: Rename rftype flags for consistency x86/resctrl: Simplify rftype flag definitions x86/resctrl: Add multiple tasks to the resctrl group at once Documentation/x86: Document resctrl's new sparse_masks x86/resctrl: Add sparse_masks file in info x86/resctrl: Enable non-contiguous CBMs in Intel CAT x86/resctrl: Rename arch_has_sparse_bitmaps x86/resctrl: Fix remaining kernel-doc warnings
2023-10-30Merge tag 'x86_bugs_for_6.7_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 hw mitigation updates from Borislav Petkov: - A bunch of improvements, cleanups and fixlets to the SRSO mitigation machinery and other, general cleanups to the hw mitigations code, by Josh Poimboeuf - Improve the return thunk detection by objtool as it is absolutely important that the default return thunk is not used after returns have been patched. Future work to detect and report this better is pending - Other misc cleanups and fixes * tag 'x86_bugs_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) x86/retpoline: Document some thunk handling aspects x86/retpoline: Make sure there are no unconverted return thunks due to KCSAN x86/callthunks: Delete unused "struct thunk_desc" x86/vdso: Run objtool on vdso32-setup.o objtool: Fix return thunk patching in retpolines x86/srso: Remove unnecessary semicolon x86/pti: Fix kernel warnings for pti= and nopti cmdline options x86/calldepth: Rename __x86_return_skl() to call_depth_return_thunk() x86/nospec: Refactor UNTRAIN_RET[_*] x86/rethunk: Use SYM_CODE_START[_LOCAL]_NOALIGN macros x86/srso: Disentangle rethunk-dependent options x86/srso: Move retbleed IBPB check into existing 'has_microcode' code block x86/bugs: Remove default case for fully switched enums x86/srso: Remove 'pred_cmd' label x86/srso: Unexport untraining functions x86/srso: Improve i-cache locality for alias mitigation x86/srso: Fix unret validation dependencies x86/srso: Fix vulnerability reporting for missing microcode x86/srso: Print mitigation for retbleed IBPB case x86/srso: Print actual mitigation if requested mitigation isn't possible ...
2023-10-30Merge tag 'ras_core_for_6.7_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RAS updates from Borislav Petkov: - Specify what error addresses reported on AMD are actually usable memory error addresses for further decoding * tag 'ras_core_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Cleanup mce_usable_address() x86/mce: Define amd_mce_usable_address() x86/MCE/AMD: Split amd_mce_is_memory_error()
2023-10-28Merge tag 'x86-urgent-2023-10-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Ingo Molnar: - Fix a possible CPU hotplug deadlock bug caused by the new TSC synchronization code - Fix a legacy PIC discovery bug that results in device troubles on affected systems, such as non-working keybards, etc - Add a new Intel CPU model number to <asm/intel-family.h> * tag 'x86-urgent-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tsc: Defer marking TSC unstable to a worker x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibility x86/cpu: Add model number for Intel Arrow Lake mobile processor
2023-10-27x86/tsc: Defer marking TSC unstable to a workerThomas Gleixner
Tetsuo reported the following lockdep splat when the TSC synchronization fails during CPU hotplug: tsc: Marking TSC unstable due to check_tsc_sync_source failed WARNING: inconsistent lock state inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. ffffffff8cfa1c78 (watchdog_lock){?.-.}-{2:2}, at: clocksource_watchdog+0x23/0x5a0 {IN-HARDIRQ-W} state was registered at: _raw_spin_lock_irqsave+0x3f/0x60 clocksource_mark_unstable+0x1b/0x90 mark_tsc_unstable+0x41/0x50 check_tsc_sync_source+0x14f/0x180 sysvec_call_function_single+0x69/0x90 Possible unsafe locking scenario: lock(watchdog_lock); <Interrupt> lock(watchdog_lock); stack backtrace: _raw_spin_lock+0x30/0x40 clocksource_watchdog+0x23/0x5a0 run_timer_softirq+0x2a/0x50 sysvec_apic_timer_interrupt+0x6e/0x90 The reason is the recent conversion of the TSC synchronization function during CPU hotplug on the control CPU to a SMP function call. In case that the synchronization with the upcoming CPU fails, the TSC has to be marked unstable via clocksource_mark_unstable(). clocksource_mark_unstable() acquires 'watchdog_lock', but that lock is taken with interrupts enabled in the watchdog timer callback to minimize interrupt disabled time. That's obviously a possible deadlock scenario, Before that change the synchronization function was invoked in thread context so this could not happen. As it is not crucical whether the unstable marking happens slightly delayed, defer the call to a worker thread which avoids the lock context problem. Fixes: 9d349d47f0e3 ("x86/smpboot: Make TSC synchronization function call based") Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87zg064ceg.ffs@tglx
2023-10-27x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibilityThomas Gleixner
David and a few others reported that on certain newer systems some legacy interrupts fail to work correctly. Debugging revealed that the BIOS of these systems leaves the legacy PIC in uninitialized state which makes the PIC detection fail and the kernel switches to a dummy implementation. Unfortunately this fallback causes quite some code to fail as it depends on checks for the number of legacy PIC interrupts or the availability of the real PIC. In theory there is no reason to use the PIC on any modern system when IO/APIC is available, but the dependencies on the related checks cannot be resolved trivially and on short notice. This needs lots of analysis and rework. The PIC detection has been added to avoid quirky checks and force selection of the dummy implementation all over the place, especially in VM guest scenarios. So it's not an option to revert the relevant commit as that would break a lot of other scenarios. One solution would be to try to initialize the PIC on detection fail and retry the detection, but that puts the burden on everything which does not have a PIC. Fortunately the ACPI/MADT table header has a flag field, which advertises in bit 0 that the system is PCAT compatible, which means it has a legacy 8259 PIC. Evaluate that bit and if set avoid the detection routine and keep the real PIC installed, which then gets initialized (for nothing) and makes the rest of the code with all the dependencies work again. Fixes: e179f6914152 ("x86, irq, pic: Probe for legacy PIC and set legacy_pic appropriately") Reported-by: David Lazar <dlazar@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: David Lazar <dlazar@gmail.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Cc: stable@vger.kernel.org Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218003 Link: https://lore.kernel.org/r/875y2u5s8g.ffs@tglx
2023-10-26x86/apic/msi: Fix misconfigured non-maskable MSI quirkKoichiro Den
commit ef8dd01538ea ("genirq/msi: Make interrupt allocation less convoluted"), reworked the code so that the x86 specific quirk for affinity setting of non-maskable PCI/MSI interrupts is not longer activated if necessary. This could be solved by restoring the original logic in the core MSI code, but after a deeper analysis it turned out that the quirk flag is not required at all. The quirk is only required when the PCI/MSI device cannot mask the MSI interrupts, which in turn also prevents reservation mode from being enabled for the affected interrupt. This allows ot remove the NOMASK quirk bit completely as msi_set_affinity() can instead check whether reservation mode is enabled for the interrupt, which gives exactly the same answer. Even in the momentary non-existing case that the reservation mode would be not set for a maskable MSI interrupt this would not cause any harm as it just would cause msi_set_affinity() to go needlessly through the functionaly equivalent slow path, which works perfectly fine with maskable interrupts as well. Rework msi_set_affinity() to query the reservation mode and remove all NOMASK quirk logic from the core code. [ tglx: Massaged changelog ] Fixes: ef8dd01538ea ("genirq/msi: Make interrupt allocation less convoluted") Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Koichiro Den <den@valinux.co.jp> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20231026032036.2462428-1-den@valinux.co.jp
2023-10-24x86/microcode/intel: Add a minimum required revision for late loadingAshok Raj
In general users, don't have the necessary information to determine whether late loading of a new microcode version is safe and does not modify anything which the currently running kernel uses already, e.g. removal of CPUID bits or behavioural changes of MSRs. To address this issue, Intel has added a "minimum required version" field to a previously reserved field in the microcode header. Microcode updates should only be applied if the current microcode version is equal to, or greater than this minimum required version. Thomas made some suggestions on how meta-data in the microcode file could provide Linux with information to decide if the new microcode is suitable candidate for late loading. But even the "simpler" option requires a lot of metadata and corresponding kernel code to parse it, so the final suggestion was to add the 'minimum required version' field in the header. When microcode changes visible features, microcode will set the minimum required version to its own revision which prevents late loading. Old microcode blobs have the minimum revision field always set to 0, which indicates that there is no information and the kernel considers it unsafe. This is a pure OS software mechanism. The hardware/firmware ignores this header field. For early loading there is no restriction because OS visible features are enumerated after the early load and therefore a change has no effect. The check is always enabled, but by default not enforced. It can be enforced via Kconfig or kernel command line. If enforced, the kernel refuses to late load microcode with a minimum required version field which is zero or when the currently loaded microcode revision is smaller than the minimum required revision. If not enforced the load happens independent of the revision check to stay compatible with the existing behaviour, but it influences the decision whether the kernel is tainted or not. If the check signals that the late load is safe, then the kernel is not tainted. Early loading is not affected by this. [ tglx: Massaged changelog and fixed up the implementation ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231002115903.776467264@linutronix.de
2023-10-24x86/microcode: Prepare for minimal revision checkThomas Gleixner
Applying microcode late can be fatal for the running kernel when the update changes functionality which is in use already in a non-compatible way, e.g. by removing a CPUID bit. There is no way for admins which do not have access to the vendors deep technical support to decide whether late loading of such a microcode is safe or not. Intel has added a new field to the microcode header which tells the minimal microcode revision which is required to be active in the CPU in order to be safe. Provide infrastructure for handling this in the core code and a command line switch which allows to enforce it. If the update is considered safe the kernel is not tainted and the annoying warning message not emitted. If it's enforced and the currently loaded microcode revision is not safe for late loading then the load is aborted. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231017211724.079611170@linutronix.de
2023-10-24x86/microcode: Handle "offline" CPUs correctlyThomas Gleixner
Offline CPUs need to be parked in a safe loop when microcode update is in progress on the primary CPU. Currently, offline CPUs are parked in mwait_play_dead(), and for Intel CPUs, its not a safe instruction, because the MWAIT instruction can be patched in the new microcode update that can cause instability. - Add a new microcode state 'UCODE_OFFLINE' to report status on per-CPU basis. - Force NMI on the offline CPUs. Wake up offline CPUs while the update is in progress and then return them back to mwait_play_dead() after microcode update is complete. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231002115903.660850472@linutronix.de
2023-10-24x86/apic: Provide apic_force_nmi_on_cpu()Thomas Gleixner
When SMT siblings are soft-offlined and parked in one of the play_dead() variants they still react on NMI, which is problematic on affected Intel CPUs. The default play_dead() variant uses MWAIT on modern CPUs, which is not guaranteed to be safe when updated concurrently. Right now late loading is prevented when not all SMT siblings are online, but as they still react on NMI, it is possible to bring them out of their park position into a trivial rendezvous handler. Provide a function which allows to do that. I does sanity checks whether the target is in the cpus_booted_once_mask and whether the APIC driver supports it. Mark X2APIC and XAPIC as capable, but exclude 32bit and the UV and NUMACHIP variants as that needs feedback from the relevant experts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231002115903.603100036@linutronix.de
2023-10-24x86/microcode: Protect against instrumentationThomas Gleixner
The wait for control loop in which the siblings are waiting for the microcode update on the primary thread must be protected against instrumentation as instrumentation can end up in #INT3, #DB or #PF, which then returns with IRET. That IRET reenables NMI which is the opposite of what the NMI rendezvous is trying to achieve. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231002115903.545969323@linutronix.de
2023-10-24x86/microcode: Rendezvous and load in NMIThomas Gleixner
stop_machine() does not prevent the spin-waiting sibling from handling an NMI, which is obviously violating the whole concept of rendezvous. Implement a static branch right in the beginning of the NMI handler which is nopped out except when enabled by the late loading mechanism. The late loader enables the static branch before stop_machine() is invoked. Each CPU has an nmi_enable in its control structure which indicates whether the CPU should go into the update routine. This is required to bridge the gap between enabling the branch and actually being at the point where it is required to enter the loader wait loop. Each CPU which arrives in the stopper thread function sets that flag and issues a self NMI right after that. If the NMI function sees the flag clear, it returns. If it's set it clears the flag and enters the rendezvous. This is safe against a real NMI which hits in between setting the flag and sending the NMI to itself. The real NMI will be swallowed by the microcode update and the self NMI will then let stuff continue. Otherwise this would end up with a spurious NMI. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231002115903.489900814@linutronix.de
2023-10-24x86/microcode: Replace the all-in-one rendevous handlerThomas Gleixner
with a new handler which just separates the control flow of primary and secondary CPUs. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231002115903.433704135@linutronix.de
2023-10-24x86/microcode: Provide new control functionsThomas Gleixner
The current all in one code is unreadable and really not suited for adding future features like uniform loading with package or system scope. Provide a set of new control functions which split the handling of the primary and secondary CPUs. These will replace the current rendezvous all in one function in the next step. This is intentionally a separate change because diff makes an complete unreadable mess otherwise. So the flow separates the primary and the secondary CPUs into their own functions which use the control field in the per CPU ucode_ctrl struct. primary() secondary() wait_for_all() wait_for_all() apply_ucode() wait_for_release() release() apply_ucode() Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231002115903.377922731@linutronix.de
2023-10-24x86/microcode: Add per CPU control fieldThomas Gleixner
Add a per CPU control field to ucode_ctrl and define constants for it which are going to be used to control the loading state machine. In theory this could be a global control field, but a global control does not cover the following case: 15 primary CPUs load microcode successfully 1 primary CPU fails and returns with an error code With global control the sibling of the failed CPU would either try again or the whole operation would be aborted with the consequence that the 15 siblings do not invoke the apply path and end up with inconsistent software state. The result in dmesg would be inconsistent too. There are two additional fields added and initialized: ctrl_cpu and secondaries. ctrl_cpu is the CPU number of the primary thread for now, but with the upcoming uniform loading at package or system scope this will be one CPU per package or just one CPU. Secondaries hands the control CPU a CPU mask which will be required to release the secondary CPUs out of the wait loop. Preparatory change for implementing a properly split control flow for primary and secondary CPUs. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231002115903.319959519@linutronix.de
2023-10-24x86/microcode: Add per CPU result stateThomas Gleixner
The microcode rendezvous is purely acting on global state, which does not allow to analyze fails in a coherent way. Introduce per CPU state where the results are written into, which allows to analyze the return codes of the individual CPUs. Initialize the state when walking the cpu_present_mask in the online check to avoid another for_each_cpu() loop. Enhance the result print out with that. The structure is intentionally named ucode_ctrl as it will gain control fields in subsequent changes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231017211723.632681010@linutronix.de
2023-10-24x86/microcode: Sanitize __wait_for_cpus()Thomas Gleixner
The code is too complicated for no reason: - The return value is pointless as this is a strict boolean. - It's way simpler to count down from num_online_cpus() and check for zero. - The timeout argument is pointless as this is always one second. - Touching the NMI watchdog every 100ns does not make any sense, neither does checking every 100ns. This is really not a hotpath operation. Preload the atomic counter with the number of online CPUs and simplify the whole timeout logic. Delay for one microsecond and touch the NMI watchdog once per millisecond. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231002115903.204251527@linutronix.de
2023-10-24x86/microcode: Clarify the late load logicThomas Gleixner
reload_store() is way too complicated. Split the inner workings out and make the following enhancements: - Taint the kernel only when the microcode was actually updated. If. e.g. the rendezvous fails, then nothing happened and there is no reason for tainting. - Return useful error codes Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/r/20231002115903.145048840@linutronix.de
2023-10-24x86/microcode: Handle "nosmt" correctlyThomas Gleixner
On CPUs where microcode loading is not NMI-safe the SMT siblings which are parked in one of the play_dead() variants still react to NMIs. So if an NMI hits while the primary thread updates the microcode the resulting behaviour is undefined. The default play_dead() implementation on modern CPUs is using MWAIT which is not guaranteed to be safe against a microcode update which affects MWAIT. Take the cpus_booted_once_mask into account to detect this case and refuse to load late if the vendor specific driver does not advertise that late loading is NMI safe. AMD stated that this is safe, so mark the AMD driver accordingly. This requirement will be partially lifted in later changes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231002115903.087472735@linutronix.de
2023-10-24x86/microcode: Clean up mc_cpu_down_prep()Thomas Gleixner
This function has nothing to do with suspend. It's a hotplug callback. Remove the bogus comment. Drop the pointless debug printk. The hotplug core provides tracepoints which track the invocation of those callbacks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231002115903.028651784@linutronix.de
2023-10-24x86/microcode: Get rid of the schedule work indirectionThomas Gleixner
Scheduling work on all CPUs to collect the microcode information is just another extra step for no value. Let the CPU hotplug callback registration do it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231017211723.354748138@linutronix.de
2023-10-24x86/microcode: Mop up early loading leftoversThomas Gleixner
Get rid of the initrd_gone hack which was required to keep find_microcode_in_initrd() functional after init. As find_microcode_in_initrd() is now only used during init, mark it accordingly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231017211723.298854846@linutronix.de
2023-10-24x86/microcode/amd: Use cached microcode for AP loadThomas Gleixner
Now that the microcode cache is initialized before the APs are brought up, there is no point in scanning builtin/initrd microcode during AP loading. Convert the AP loader to utilize the cache, which in turn makes the CPU hotplug callback which applies the microcode after initrd/builtin is gone, obsolete as the early loading during late hotplug operations including the resume path depends now only on the cache. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231017211723.243426023@linutronix.de
2023-10-24x86/microcode/amd: Cache builtin/initrd microcode earlyThomas Gleixner
There is no reason to scan builtin/initrd microcode on each AP. Cache the builtin/initrd microcode in an early initcall so that the early AP loader can utilize the cache. The existing fs initcall which invoked save_microcode_in_initrd_amd() is still required to maintain the initrd_gone flag. Rename it accordingly. This will be removed once the AP loader code is converted to use the cache. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231017211723.187566507@linutronix.de
2023-10-24x86/microcode/amd: Cache builtin microcode tooThomas Gleixner
save_microcode_in_initrd_amd() fails to cache builtin microcode and only scans initrd. Use find_blobs_in_containers() instead which covers both. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231010150702.495139089@linutronix.de
2023-10-24x86/microcode/amd: Use correct per CPU ucode_cpu_infoThomas Gleixner
find_blobs_in_containers() is invoked on every CPU but overwrites unconditionally ucode_cpu_info of CPU0. Fix this by using the proper CPU data and move the assignment into the call site apply_ucode_from_containers() so that the function can be reused. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231010150702.433454320@linutronix.de
2023-10-24x86/microcode: Remove pointless apply() invocationThomas Gleixner
Microcode is applied on the APs during early bringup. There is no point in trying to apply the microcode again during the hotplug operations and neither at the point where the microcode device is initialized. Collect CPU info and microcode revision in setup_online_cpu() for now. This will move to the CPU hotplug callback later. [ bp: Leave the starting notifier for the following scenario: - boot, late load, suspend to disk, resume without the starting notifier, only the last core manages to update the microcode upon resume: # rdmsr -a 0x8b 10000bf 10000bf 10000bf 10000bf 10000bf 10000dc <---- This is on an AMD F10h machine. For the future, one should check whether potential unification of the CPU init path could cover the resume path too so that this can be simplified even more. tglx: This is caused by the odd handling of APs which try to find the microcode blob in builtin or initrd instead of caching the microcode blob during early init before the APs are brought up. Will be cleaned up in a later step. ] Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20231017211723.018821624@linutronix.de
2023-10-24x86/microcode/intel: Rework intel_find_matching_signature()Thomas Gleixner
Take a cpu_signature argument and work from there. Move the match() helper next to the callsite as there is no point for having it in a header. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231002115902.797820205@linutronix.de