path: root/arch/x86
Age | Commit message | Author
2018-06-26 | x86/intel_rdt: Enable CMT and MBM on new Skylake stepping | Tony Luck
commit 1d9f3e20a56d33e55748552aeec597f58542f92d upstream. New stepping of Skylake has fixes for cache occupancy and memory bandwidth monitoring. Update the code to enable these by default on newer steppings. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: stable@vger.kernel.org # v4.14 Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Link: https://lkml.kernel.org/r/20180608160732.9842-1-tony.luck@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-26 | x86/platform/uv: Use apic_ack_irq() | Thomas Gleixner
commit 839b0f1c4ef674cd929a42304c078afca278581a upstream. To address the EBUSY fail of interrupt affinity settings in case that the previous setting has not been cleaned up yet, use the new apic_ack_irq() function instead of the special uv_ack_apic() implementation which is merely a wrapper around ack_APIC_irq(). Preparatory change for the real fix. Fixes: dccfe3147b42 ("x86/vector: Simplify vector move cleanup") Reported-by: Song Liu <liu.song.a23@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Song Liu <songliubraving@fb.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: stable@vger.kernel.org Cc: Mike Travis <mike.travis@hpe.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Tariq Toukan <tariqt@mellanox.com> Link: https://lkml.kernel.org/r/20180604162224.721691398@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-26 | x86/ioapic: Use apic_ack_irq() | Thomas Gleixner
commit 2b04e46d8d0b9b7ac08ded672e3eab823f01d77a upstream. To address the EBUSY fail of interrupt affinity settings in case that the previous setting has not been cleaned up yet, use the new apic_ack_irq() function instead of directly invoking ack_APIC_irq(). Preparatory change for the real fix Fixes: dccfe3147b42 ("x86/vector: Simplify vector move cleanup") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Song Liu <songliubraving@fb.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <liu.song.a23@gmail.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: stable@vger.kernel.org Cc: Mike Travis <mike.travis@hpe.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Tariq Toukan <tariqt@mellanox.com> Link: https://lkml.kernel.org/r/20180604162224.639011135@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-26 | x86/apic: Provide apic_ack_irq() | Thomas Gleixner
commit c0255770ccdc77ef2184d2a0a2e0cde09d2b44a4 upstream. apic_ack_edge() is explicitly for handling interrupt affinity cleanup when interrupt remapping is not available or disabled. Remapped interrupts and also some of the platform specific special interrupts, e.g. UV, invoke ack_APIC_irq() directly. To address the issue of failing an affinity update with -EBUSY the delayed affinity mechanism can be reused, but ack_APIC_irq() does not handle that. Adding this to ack_APIC_irq() is not possible, because that function is also used for exceptions and directly handled interrupts like IPIs. Create a new function, which just contains the conditional invocation of irq_move_irq() and the final ack_APIC_irq(). Reuse the new function in apic_ack_edge(). Preparatory change for the real fix. Fixes: dccfe3147b42 ("x86/vector: Simplify vector move cleanup") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Song Liu <songliubraving@fb.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <liu.song.a23@gmail.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: stable@vger.kernel.org Cc: Mike Travis <mike.travis@hpe.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Tariq Toukan <tariqt@mellanox.com> Link: https://lkml.kernel.org/r/20180604162224.471925894@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-26 | x86/apic/vector: Prevent hlist corruption and leaks | Thomas Gleixner
commit 80ae7b1a918e78b0bae88b0c0ad413d3fdced968 upstream. Several people observed the WARN_ON() in irq_matrix_free() which triggers when the caller tries to free a vector which is not in the allocation range. Song provided the trace information which made it possible to decode the root cause. The rework of the vector allocation mechanism failed to preserve a sanity check, which prevents setting a new target vector/CPU when the previous affinity change has not fully completed. As a result a half finished affinity change can be overwritten, which can cause the leak of an irq descriptor pointer on the previous target CPU and double enqueue of the hlist head into the cleanup lists of two or more CPUs. After one CPU cleaned up its vector the next CPU will invoke the cleanup handler with vector 0, which triggers the out of range warning in the matrix allocator. Prevent this by checking the apic_data of the interrupt whether the move_in_progress flag is false and the hlist node is not hashed. Return -EBUSY if not. This prevents the damage and restores the behaviour before the vector allocation rework, but due to other changes in that area it also widens the chance that user space can observe -EBUSY. In theory this should be fine, but actually not all user space tools handle -EBUSY correctly. Addressing that is not part of this fix, but will be addressed in follow up patches.
Fixes: 69cde0004a4b ("x86/vector: Use matrix allocator for vector assignment") Reported-by: Dmitry Safonov <0x7f454c46@gmail.com> Reported-by: Tariq Toukan <tariqt@mellanox.com> Reported-by: Song Liu <liu.song.a23@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Song Liu <songliubraving@fb.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Cc: Mike Travis <mike.travis@hpe.com> Cc: Borislav Petkov <bp@alien8.de> Link: https://lkml.kernel.org/r/20180604162224.303870257@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
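The check described in the message above can be illustrated with a minimal user-space sketch. The struct and function names below are hypothetical stand-ins, not the kernel's actual data structures; only the two conditions (move still in progress, cleanup node still hashed) and the -EBUSY result come from the commit message.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical, simplified stand-in for the per-interrupt APIC data.
 * The real kernel structure has more fields and uses an hlist node
 * rather than a plain flag.
 */
struct apic_data_sketch {
	bool move_in_progress;     /* previous affinity change not finished */
	bool cleanup_node_hashed;  /* still queued on some CPU's cleanup list */
};

/*
 * Refuse a new target vector/CPU while the previous move has not been
 * fully cleaned up. Returns 0 when a new assignment is safe, -EBUSY
 * otherwise.
 */
static int can_assign_new_vector(const struct apic_data_sketch *d)
{
	if (d->move_in_progress || d->cleanup_node_hashed)
		return -EBUSY;
	return 0;
}
```

A caller that receives -EBUSY is expected to retry later, which is why the message notes that user space tools need to handle that error.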
2018-06-26 | x86/vector: Fix the args of vector_alloc tracepoint | Dou Liyang
commit 838d76d63ec4eaeaa12bedfa50f261480f615200 upstream. The vector_alloc tracepoint reversed the reserved and ret args, which made the trace output wrong. Exchange them. Fixes: 8d1e3dca7de6 ("x86/vector: Add tracepoints for vector management") Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: hpa@zytor.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20180601065031.21872-1-douly.fnst@cn.fujitsu.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-26 | x86/MCE: Fix stack out-of-bounds write in mce-inject.c: Flags_read() | Tony Luck
commit 985c78d3ff8e9c74450fa2bb08eb55e680d999ca upstream. Each of the strings that we want to put into the buf[MAX_FLAG_OPT_SIZE] in flags_read() is two characters long. But the sprintf() adds a trailing newline and will add a terminating NUL byte. So MAX_FLAG_OPT_SIZE needs to be 4. sprintf() calls vsnprintf() and *that* does return: " * The return value is the number of characters which would * be generated for the given input, excluding the trailing * '\0', as per ISO C99." Note the "excluding". Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20180427163707.ktaiysvbk3yhk4wm@agluck-desk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
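The size arithmetic in the message above can be checked in user space. This is a minimal sketch, not the kernel code: `format_flag()` and `flag_storage_needed()` are hypothetical helpers, and `"sw"` in the usage note is only an example two-character string. It relies on the standard snprintf() behaviour of returning the character count excluding the terminating NUL.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the buffer size the commit message argues for. */
#define MAX_FLAG_OPT_SIZE 4

/*
 * Format a flag name followed by a newline. Returns the number of
 * characters that would be written, excluding the terminating NUL
 * (standard snprintf() semantics).
 */
static int format_flag(char *buf, size_t size, const char *flag)
{
	return snprintf(buf, size, "%s\n", flag);
}

/* Bytes of storage the formatted flag needs, including the NUL byte. */
static size_t flag_storage_needed(const char *flag)
{
	return strlen(flag) + 1 /* '\n' */ + 1 /* '\0' */;
}
```

For a two-character flag such as `"sw"`, `format_flag()` reports 3 characters while `flag_storage_needed()` reports 4 bytes, which is the off-by-one the fix addresses.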
2018-06-21 | KVM: X86: Lower the default timer frequency limit to 200us | Wanpeng Li
[ Upstream commit 4c27625b7a67eb9006963ed2bcf8e53b259b43af ] Anthoine reported: The period used by Windows change over time but it can be 1 milliseconds or less. I saw the limit_periodic_timer_frequency print so 500 microseconds is sometimes reached. As suggested by Paolo, lower the default timer frequency limit to a smaller interval of 200 us (5000 Hz) to leave some headroom. This is required due to Windows 10 changing the scheduler tick limit from 1024 Hz to 2048 Hz. Reported-by: Anthoine Bourgeois <anthoine.bourgeois@blade-group.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Anthoine Bourgeois <anthoine.bourgeois@blade-group.com> Cc: Darren Kenny <darren.kenny@oracle.com> Cc: Jan Kiszka <jan.kiszka@web.de> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
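The numbers in the message above can be verified with a small sketch. The helper names are hypothetical and the clamping rule (a requested period shorter than the minimum is limited) is a simplification of what the message describes, not the KVM implementation.

```c
#include <stdint.h>

/* Convert a timer period in microseconds to a frequency in Hz. */
static uint64_t period_us_to_hz(uint64_t period_us)
{
	return 1000000ULL / period_us;
}

/* Period in nanoseconds of a timer running at the given frequency. */
static uint64_t hz_to_period_ns(uint64_t hz)
{
	return 1000000000ULL / hz;
}

/*
 * Returns 1 if a requested period (ns) is shorter than the minimum
 * allowed period (us) and would therefore be limited, 0 otherwise.
 */
static int period_is_limited(uint64_t requested_ns, uint64_t min_period_us)
{
	return requested_ns < min_period_us * 1000ULL;
}
```

A 2048 Hz tick has a period of roughly 488 us, so it falls below a 500 us minimum but stays above the new 200 us (5000 Hz) minimum.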
2018-06-21 | uprobes/x86: Prohibit probing on MOV SS instruction | Masami Hiramatsu
[ Upstream commit 13ebe18c94f5b0665c01ae7fad2717ae959f4212 ] Since MOV SS and POP SS instructions will delay the exceptions until the next instruction is executed, single-stepping on it by uprobes must be prohibited. uprobe already rejects probing on POP SS (0x1f), but allows probing on MOV SS (0x8e and reg == 2). This checks the target instruction and if it is MOV SS or POP SS, returns -ENOTSUPP to reject probing. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Cc: Francis Deslauriers <francis.deslauriers@efficios.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Yonghong Song <yhs@fb.com> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "David S . Miller" <davem@davemloft.net> Link: https://lkml.kernel.org/r/152587072544.17316.5950935243917346341.stgit@devbox Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
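The MOV SS condition quoted above (opcode 0x8e with ModRM reg == 2) can be sketched as a standalone predicate. This is an illustration only: the function name is hypothetical, it takes an already decoded opcode byte and ModRM byte rather than using the kernel's instruction decoder, and it does not cover the POP SS case.

```c
#include <stdint.h>

#define MOV_SREG_OPCODE 0x8e                   /* MOV Sreg, r/m16 */
#define MODRM_REG(modrm) (((modrm) >> 3) & 0x7) /* bits 5:3 of ModRM */
#define SREG_SS 2                               /* ES=0, CS=1, SS=2, DS=3 */

/*
 * Returns 1 if the decoded opcode/ModRM pair is a MOV to SS.
 * Prefix handling and POP SS are deliberately left out of this sketch.
 */
static int is_mov_ss(uint8_t opcode, uint8_t modrm)
{
	return opcode == MOV_SREG_OPCODE && MODRM_REG(modrm) == SREG_SS;
}
```

For example, the byte pair `8e d0` (mov %ax, %ss) matches, while `8e d8` (mov %ax, %ds) does not because its reg field is 3.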
2018-06-21 | kprobes/x86: Prohibit probing on exception masking instructions | Masami Hiramatsu
[ Upstream commit ee6a7354a3629f9b65bc18dbe393503e9440d6f5 ] Since MOV SS and POP SS instructions will delay the exceptions until the next instruction is executed, single-stepping on them by kprobes must be prohibited. However, kprobes usually executes those instructions directly on a trampoline buffer (a.k.a. kprobe-booster), except for kprobes which have a post_handler. Thus if a kprobe user probes MOV SS with a post_handler, it will do single-stepping on the MOV SS. This means probing is safe when used via ftrace or perf/bpf, since those don't use the post_handler. Anyway, since stack switching is a rare case, it is safer to just reject kprobes on such instructions. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Cc: Francis Deslauriers <francis.deslauriers@efficios.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Yonghong Song <yhs@fb.com> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "David S . Miller" <davem@davemloft.net> Link: https://lkml.kernel.org/r/152587069574.17316.3311695234863248641.stgit@devbox Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-21 | x86: Delay skip of emulated hypercall instruction | Marian Rotariu
[ Upstream commit 6356ee0c9602004e0a3b4b2dad68ee2ee9385b17 ] The IP increment should be done after the hypercall emulation, after calling the various handlers. In this way, these handlers can accurately identify the IP of the VMCALL if they need it. This patch keeps the same functionality for the Hyper-V handler which does not use the return code of the standard kvm_skip_emulated_instruction() call. Signed-off-by: Marian Rotariu <mrotariu@bitdefender.com> [Hyper-V hypercalls also need kvm_skip_emulated_instruction() - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-21 | x86/xen: Reset VCPU0 info pointer after shared_info remap | van der Linden, Frank
[ Upstream commit d1ecfa9d1f402366b1776fbf84e635678a51414f ] This patch fixes crashes during boot for HVM guests on older (pre HVM vector callback) Xen versions. Without this, current kernels will always fail to boot on those Xen versions. Sample stack trace: BUG: unable to handle kernel paging request at ffffffffff200000 IP: __xen_evtchn_do_upcall+0x1e/0x80 PGD 1e0e067 P4D 1e0e067 PUD 1e10067 PMD 235c067 PTE 0 Oops: 0002 [#1] SMP PTI Modules linked in: CPU: 0 PID: 512 Comm: kworker/u2:0 Not tainted 4.14.33-52.13.amzn1.x86_64 #1 Hardware name: Xen HVM domU, BIOS 3.4.3.amazon 11/11/2016 task: ffff88002531d700 task.stack: ffffc90000480000 RIP: 0010:__xen_evtchn_do_upcall+0x1e/0x80 RSP: 0000:ffff880025403ef0 EFLAGS: 00010046 RAX: ffffffff813cc760 RBX: ffffffffff200000 RCX: ffffc90000483ef0 RDX: ffff880020540a00 RSI: ffff880023c78000 RDI: 000000000000001c RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffff880025403f5c R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff880025400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffff200000 CR3: 0000000001e0a000 CR4: 00000000000006f0 Call Trace: <IRQ> do_hvm_evtchn_intr+0xa/0x10 __handle_irq_event_percpu+0x43/0x1a0 handle_irq_event_percpu+0x20/0x50 handle_irq_event+0x39/0x60 handle_fasteoi_irq+0x80/0x140 handle_irq+0xaf/0x120 do_IRQ+0x41/0xd0 common_interrupt+0x7d/0x7d </IRQ> During boot, the HYPERVISOR_shared_info page gets remapped to make it work with KASLR. This means that any pointer derived from it needs to be adjusted. The only value that this applies to is the vcpu_info pointer for VCPU 0. For PV and HVM with the callback vector feature, this gets done via the smp_ops prepare_boot_cpu callback. Older Xen versions do not support the HVM callback vector, so there is no Xen-specific smp_ops set up in that scenario. 
So, the vcpu_info pointer for VCPU 0 never gets set to the proper value, and the first reference of it will be bad. Fix this by resetting it immediately after the remap. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Reviewed-by: Eduardo Valentin <eduval@amazon.com> Reviewed-by: Alakesh Haloi <alakeshh@amazon.com> Reviewed-by: Vallish Vaidyeshwara <vallish@amazon.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: xen-devel@lists.xenproject.org Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-21 | bpf, x64: fix memleak when not converging on calls | Daniel Borkmann
[ Upstream commit 39f56ca945af86112753646316c4c92dcd4acd82 ] The JIT logic in jit_subprogs() is as follows: for all subprogs we allocate a bpf_prog_alloc(), populate it (prog->is_func = 1 here), and pass it to bpf_int_jit_compile(). If a failure occurred during JIT and prog->jited is not set, then we bail out from attempting to JIT the whole program, and punt to the interpreter instead. In case JITing went successful, we fixup BPF call offsets and do another pass to bpf_int_jit_compile() (extra_pass is true at that point) to complete JITing calls. Given that requires to pass JIT context around addrs and jit_data from x86 JIT are freed in the extra_pass in bpf_int_jit_compile() when calls are involved (if not, they can be freed immediately). However, if in the original pass, the JIT image didn't converge then we leak addrs and jit_data since image itself is NULL, the prog->is_func is set and extra_pass is false in that case, meaning both will become unreachable and are never cleaned up, therefore we need to free as well on !image. Only x64 JIT is affected. Fixes: 1c2a088a6626 ("bpf: x64: add JIT support for multi-function programs") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-21 | bpf, x64: fix memleak when not converging after image | Daniel Borkmann
[ Upstream commit 3aab8884c9eb99189a3569ac4e6b205371c9ac0b ] While reviewing x64 JIT code, I noticed that we leak the prior allocated JIT image in the case where proglen != oldproglen during the JIT passes. Prior to the commit e0ee9c12157d ("x86: bpf_jit: fix two bugs in eBPF JIT compiler") we would just break out of the loop, and using the image as the JITed prog since it could only shrink in size anyway. After e0ee9c12157d, we would bail out to out_addrs label where we free addrs and jit_data but not the image coming from bpf_jit_binary_alloc(). Fixes: e0ee9c12157d ("x86: bpf_jit: fix two bugs in eBPF JIT compiler") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-21 | kvm: apic: Flush TLB after APIC mode/address change if VPIDs are in use | Junaid Shahid
[ Upstream commit a468f2dbf921d02f5107378501693137a812999b ] Currently, KVM flushes the TLB after a change to the APIC access page address or the APIC mode when EPT mode is enabled. However, even in shadow paging mode, a TLB flush is needed if VPIDs are being used, as specified in the Intel SDM Section 29.4.5. So replace vmx_flush_tlb_ept_only() with vmx_flush_tlb(), which will flush if either EPT or VPIDs are in use. Signed-off-by: Junaid Shahid <junaids@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-21 | x86/cpu/intel: Add missing TLB cpuid values | jacek.tomaka@poczta.fm
[ Upstream commit b837913fc2d9061bf9b8c0dd6bf2d24e2f98b84a ] Make kernel print the correct number of TLB entries on Intel Xeon Phi 7210 (and others) Before: [ 0.320005] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0 After: [ 0.320005] Last level dTLB entries: 4KB 256, 2MB 128, 4MB 128, 1GB 16 The entries do exist in the official Intel SDM but the type column there is incorrect (states "Cache" where it should read "TLB"), but the entries for the values 0x6B, 0x6C and 0x6D are correctly described as 'Data TLB'. Signed-off-by: Jacek Tomaka <jacek.tomaka@poczta.fm> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20180423161425.24366-1-jacekt@dugeo.com Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-21 | perf/x86/intel: Don't enable freeze-on-smi for PerfMon V1 | Kan Liang
[ Upstream commit 4e949e9b9d1e3edcdab3b54656c5851bd9e49c67 ] The SMM freeze feature was introduced since PerfMon V2. But the current code unconditionally enables the feature for all platforms. It can generate #GP exception, if the related FREEZE_WHILE_SMM bit is set for the machine with PerfMon V1. To disable the feature for PerfMon V1, perf needs to - Remove the freeze_on_smi sysfs entry by moving intel_pmu_attrs to intel_pmu, which is only applied to PerfMon V2 and later. - Check the PerfMon version before flipping the SMM bit when starting CPU Fixes: 6089327f5424 ("perf/x86: Add sysfs entry to freeze counters on SMI") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: ak@linux.intel.com Cc: eranian@google.com Cc: acme@redhat.com Link: https://lkml.kernel.org/r/1524682637-63219-1-git-send-email-kan.liang@linux.intel.com Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-21 | bpf, x64: fix JIT emission for dead code | Gianluca Borello
[ Upstream commit 1612a981b76688c598dc944bbfbe29a2b33e3973 ] Commit 2a5418a13fcf ("bpf: improve dead code sanitizing") replaced dead code with a series of ja-1 instructions, for safety. That made JIT compilation much more complex for some BPF programs. One instance of such programs is, for example: bool flag = false ... /* A bunch of other code */ ... if (flag) do_something() In some cases llvm is not able to remove at compile time the code for do_something(), so the generated BPF program ends up with a large amount of dead instructions. In one specific real life example, there are two series of ~500 and ~1000 dead instructions in the program. When the verifier replaces them with a series of ja-1 instructions, it causes an interesting behavior at JIT time. During the first pass, since all the instructions are estimated at 64 bytes, the ja-1 instructions end up being translated as 5 bytes JMP instructions (0xE9), since the jump offsets become increasingly large (> 127) as each instruction gets discovered to be 5 bytes instead of the estimated 64. Starting from the second pass, the first N instructions of the ja-1 sequence get translated into 2 bytes JMPs (0xEB) because the jump offsets become <= 127 this time. In particular, N is defined as roughly 127 / (5 - 2) ~= 42. So, each further pass will make the subsequent N JMP instructions shrink from 5 to 2 bytes, making the image shrink every time. This means that in order to have the entire program converge, there need to be, in the real example above, at least ~1000 / 42 ~= 24 passes just for translating the dead code. If we add this number to the passes needed to translate the other non dead code, it brings such program to 40+ passes, and JIT doesn't complete. Ultimately the userspace loader fails because such BPF program was supposed to be part of a prog array owner being JITed. 
While it is certainly possible to try to refactor such programs to help the compiler remove dead code, the behavior is not really intuitive and it puts further burden on the BPF developer who is not expecting such behavior. To make things worse, such programs are working just fine in all the kernel releases prior to the ja-1 fix. A possible approach to mitigate this behavior consists into noticing that for ja-1 instructions we don't really need to rely on the estimated size of the previous and current instructions, we know that a -1 BPF jump offset can be safely translated into a 0xEB instruction with a jump offset of -2. Such fix brings the BPF program in the previous example to complete again in ~9 passes. Fixes: 2a5418a13fcf ("bpf: improve dead code sanitizing") Signed-off-by: Gianluca Borello <g.borello@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
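The pass-count estimate and the proposed encoding from the message above can be reproduced with a short sketch. The function names are hypothetical; the constants (5-byte 0xE9 jump, 2-byte 0xEB jump, short-jump reach of 127) are the ones quoted in the message, and the code is not the x86 BPF JIT itself.

```c
#include <stdint.h>

#define LONG_JMP_SIZE  5   /* 0xE9 + rel32 */
#define SHORT_JMP_SIZE 2   /* 0xEB + rel8  */
#define SHORT_JMP_REACH 127

/* Roughly how many ja-1 jumps can shrink from 5 to 2 bytes per pass. */
static unsigned jumps_shrunk_per_pass(void)
{
	return SHORT_JMP_REACH / (LONG_JMP_SIZE - SHORT_JMP_SIZE);
}

/* Estimated passes needed to shrink a run of dead ja-1 instructions. */
static unsigned passes_for_dead_run(unsigned dead_insns)
{
	unsigned n = jumps_shrunk_per_pass();

	return (dead_insns + n - 1) / n; /* round up */
}

/*
 * Emit the fixed encoding for a BPF "ja -1": a short x86 jump whose
 * rel8 displacement is -2, i.e. a jump back to its own first byte.
 */
static void emit_ja_minus_one(uint8_t out[SHORT_JMP_SIZE])
{
	out[0] = 0xEB;
	out[1] = (uint8_t)-2; /* 0xFE */
}
```

With these numbers a run of about 1000 dead instructions needs about 24 passes to converge, whereas emitting the two-byte form directly makes the size of each ja-1 known from the first pass.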
2018-06-21 | kexec_file: do not add extra alignment to efi memmap | Dave Young
[ Upstream commit a841aa83dff0af75c88aa846ba610a8af4c5ee21 ] Chun-Yi reported a kernel warning message below: WARNING: CPU: 0 PID: 0 at ../mm/early_ioremap.c:182 early_iounmap+0x4f/0x12c() early_iounmap(ffffffffff200180, 00000118) [0] size not consistent 00000120 The problem is x86 kexec_file_load adds extra alignment to the efi memmap: in bzImage64_load(): efi_map_sz = efi_get_runtime_map_size(); efi_map_sz = ALIGN(efi_map_sz, 16); And __efi_memmap_init maps with the size including the alignment bytes but efi_memmap_unmap uses nr_maps * desc_size which does not include the extra bytes. The alignment in kexec code is only needed for the kexec buffer internal use. Actually kexec should pass the exact size of the efi memmap to the 2nd kernel. Link: http://lkml.kernel.org/r/20180417083600.GA1972@dhcp-128-65.nay.redhat.com Signed-off-by: Dave Young <dyoung@redhat.com> Reported-by: joeyli <jlee@suse.com> Tested-by: Randy Wright <rwright@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
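The two sizes in the warning above (0x118 and 0x120) follow directly from rounding up to 16 bytes. A minimal sketch, assuming a user-space `ALIGN_UP` macro with the same power-of-two rounding behaviour as the kernel's ALIGN(); the helper name `padded_map_size()` is hypothetical.

```c
#include <stddef.h>

/* Round x up to the next multiple of a, where a is a power of two. */
#define ALIGN_UP(x, a) (((x) + ((size_t)(a) - 1)) & ~((size_t)(a) - 1))

/* Size that results when the exact memmap size is padded to 16 bytes. */
static size_t padded_map_size(size_t exact_size)
{
	return ALIGN_UP(exact_size, 16);
}
```

An exact size of 0x118 is padded to 0x120, so mapping with the padded size and unmapping with the exact size disagree by 8 bytes, which is what the consistency check in early_iounmap() reports.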
2018-06-21 | kvm: x86: move MSR_IA32_TSC handling to x86.c | Paolo Bonzini
[ Upstream commit dd259935e4eec844dc3e5b8a7cd951cd658b4fb6 ] This is not specific to Intel/AMD anymore. The TSC offset is available in vcpu->arch.tsc_offset. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-21 | X86/KVM: Properly update 'tsc_offset' to represent the running guest | KarimAllah Ahmed
[ Upstream commit e79f245ddec17bbd89d73cd0169dba4be46c9b55 ] Update 'tsc_offset' on vmentry/vmexit of L2 guests to ensure that it always captures the TSC_OFFSET of the running guest whether it is the L1 or L2 guest. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> [AMD changes, fix update_ia32_tsc_adjust_msr. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-21 | x86: Add check for APIC access address for vmentry of L2 guests | Krish Sadhukhan
[ Upstream commit f0f4cf5b306620282db0c59ff963012e1973e025 ] According to the sub-section titled 'VM-Execution Control Fields' in the section titled 'Basic VM-Entry Checks' in Intel SDM vol. 3C, the following vmentry check must be enforced: If the 'virtualize APIC-accesses' VM-execution control is 1, the APIC-access address must satisfy the following checks: - Bits 11:0 of the address must be 0. - The address should not set any bits beyond the processor's physical-address width. This patch adds the necessary check to conform to this rule. If the check fails, we cause the L2 VMENTRY to fail which is what the associated unit test (following patch) expects. Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
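The two conditions listed in the message above translate into a simple predicate. This is a sketch under assumptions: the function name and the `maxphyaddr` parameter (the processor's physical-address width in bits) are hypothetical, and the code is not the KVM check itself.

```c
#include <stdint.h>

/*
 * Returns 1 if addr is acceptable as an APIC-access address:
 *  - bits 11:0 are zero (4 KiB aligned), and
 *  - no bits at or above the physical-address width are set.
 */
static int apic_access_addr_valid(uint64_t addr, unsigned maxphyaddr)
{
	if (addr & 0xfffULL)
		return 0;
	/* Guard the shift: shifting a 64-bit value by 64 is undefined. */
	if (maxphyaddr < 64 && (addr >> maxphyaddr) != 0)
		return 0;
	return 1;
}
```

With a 36-bit physical-address width, 0xfee00000 passes, 0xfee00001 fails the alignment test, and an address with bit 40 set fails the width test.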
2018-06-21 | KVM: X86: fix incorrect reference of trace_kvm_pi_irte_update | hu huajun
[ Upstream commit 2698d82e519413c6ad287e6f14b29e0373ed37f8 ] In arch/x86/kvm/trace.h, this function is declared with host_irq as the first input and vcpu_id as the second, not the other way around. Signed-off-by: hu huajun <huhuajun@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-16 | kvm: x86: use correct privilege level for sgdt/sidt/fxsave/fxrstor access | Paolo Bonzini
commit 3c9fa24ca7c9c47605672916491f79e8ccacb9e6 upstream. The functions that were used in the emulation of fxrstor, fxsave, sgdt and sidt were originally meant for task switching, and as such they did not check privilege levels. This is very bad when the same functions are used in the emulation of unprivileged instructions. This is CVE-2018-10853. The obvious fix is to add a new argument to ops->read_std and ops->write_std, which decides whether the access is a "system" access or should use the processor's CPL. Fixes: 129a72a0d3c8 ("KVM: x86: Introduce segmented_write_std", 2017-01-12) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-16 | KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_system | Paolo Bonzini
commit ce14e868a54edeb2e30cb7a7b104a2fc4b9d76ca upstream. In the next patch the emulator's .read_std and .write_std callbacks will grow another argument, which is not needed in kvm_read_guest_virt and kvm_write_guest_virt_system's callers. Since we have to make separate functions, let's give the currently existing names a nicer interface, too. Fixes: 129a72a0d3c8 ("KVM: x86: Introduce segmented_write_std", 2017-01-12) Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-16 | kvm: nVMX: Enforce cpl=0 for VMX instructions | Felix Wilhelm
commit 727ba748e110b4de50d142edca9d6a9b7e6111d8 upstream. VMX instructions executed inside a L1 VM will always trigger a VM exit even when executed with cpl 3. This means we must perform the privilege check in software. Fixes: 70f3aac964ae("kvm: nVMX: Remove superfluous VMX instruction fault checks") Cc: stable@vger.kernel.org Signed-off-by: Felix Wilhelm <fwilhelm@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-16 | KVM: x86: introduce linear_{read,write}_system | Paolo Bonzini
commit 79367a65743975e5cac8d24d08eccc7fdae832b0 upstream. Wrap the common invocation of ctxt->ops->read_std and ctxt->ops->write_std, so as to have a smaller patch when the functions grow another argument. Fixes: 129a72a0d3c8 ("KVM: x86: Introduce segmented_write_std", 2017-01-12) Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-16 | KVM: X86: Fix reserved bits check for MOV to CR3 | Wanpeng Li
commit a780a3ea628268b2ad0ed43d7f28d90db0ff18be upstream. MSB of CR3 is a reserved bit if the PCIDE bit is not set in CR4. It should be checked when PCIDE bit is not set, however commit 'd1cd3ce900441 ("KVM: MMU: check guest CR3 reserved bits based on its physical address width")' removes the bit 63 checking unconditionally. This patch fixes it by checking bit 63 of CR3 when PCIDE bit is not set in CR4. Fixes: d1cd3ce900441 (KVM: MMU: check guest CR3 reserved bits based on its physical address width) Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Liran Alon <liran.alon@oracle.com> Cc: stable@vger.kernel.org Reviewed-by: Junaid Shahid <junaids@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
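The rule stated in the message above (bit 63 of CR3 is reserved unless CR4.PCIDE is set) can be sketched as a standalone check. The function name is hypothetical and the sketch covers only bit 63, not the other reserved-bit checks based on physical-address width.

```c
#include <stdint.h>

#define CR4_PCIDE  (1ULL << 17) /* CR4.PCIDE is bit 17 */
#define CR3_BIT_63 (1ULL << 63)

/*
 * Returns 1 if the bit-63 state of the new CR3 value is acceptable
 * for the given CR4, 0 if bit 63 is set while PCIDE is clear.
 */
static int cr3_bit63_allowed(uint64_t cr3, uint64_t cr4)
{
	if (!(cr4 & CR4_PCIDE) && (cr3 & CR3_BIT_63))
		return 0;
	return 1;
}
```

A caller would reject the MOV to CR3 when this returns 0 and continue with the remaining checks otherwise.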
2018-06-05 | x86/MCE/AMD: Cache SMCA MISC block addresses | Borislav Petkov
commit 78ce241099bb363b19dbd0245442e66c8de8f567 upstream. ... into a global, two-dimensional array and service subsequent reads from that cache to avoid rdmsr_on_cpu() calls during CPU hotplug (IPIs with IRQs disabled). In addition, this fixes a KASAN slab-out-of-bounds read due to wrong usage of the bank->blocks pointer. Fixes: 27bd59502702 ("x86/mce/AMD: Get address from already initialized block") Reported-by: Johannes Hirte <johannes.hirte@datenkhaos.de> Tested-by: Johannes Hirte <johannes.hirte@datenkhaos.de> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Yazen Ghannam <yazen.ghannam@amd.com> Link: http://lkml.kernel.org/r/20180414004230.GA2033@probook Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-05x86/mce/AMD: Carve out SMCA get_block_address() codeYazen Ghannam
commit 8a331f4a0863bea758561c921b94b4d28f7c4029 upstream. Carve out the SMCA code in get_block_address() into a separate helper function. No functional change. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> [ Save an indentation level. ] Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20180215210943.11530-4-Yazen.Ghannam@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30x86/apic: Set up through-local-APIC mode on the boot CPU if 'noapic' specifiedBaoquan He
[ Upstream commit bee3204ec3c49f6f53add9c3962c9012a5c036fa ] Currently the kdump kernel becomes very slow if 'noapic' is specified. The normal kernel doesn't have this bug. The kernel parameter 'noapic' is used to disable the IO-APIC in the system for testing or special purposes. Here the root cause is that in the kdump kernel the LAPIC is disabled since commit: 522e664644 ("x86/apic: Disable I/O APIC before shutdown of the local APIC") In this case we need to set up through-local-APIC on the boot CPU in setup_local_APIC(). In the normal kernel the legacy irq mode is enabled by the BIOS. If it is virtual wire mode, the local-APIC has been enabled and set as through-local-APIC. Though we fixed the regression introduced by commit 522e664644, to further improve robustness set up the through-local-APIC mode explicitly, do not rely on the default boot IRQ mode. Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: douly.fnst@cn.fujitsu.com Cc: joro@8bytes.org Cc: prarit@redhat.com Cc: uobergfe@redhat.com Link: http://lkml.kernel.org/r/20180214054656.3780-7-bhe@redhat.com [ Rewrote the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30x86/devicetree: Fix device IRQ settings in DTIvan Gorinov
[ Upstream commit 0a5169add90e43ab45ab1ba34223b8583fcaf675 ] IRQ parameters for the SoC devices connected directly to I/O APIC lines (without PCI IRQ routing) may be specified in the Device Tree. Called from DT IRQ parser, irq_create_fwspec_mapping() calls irq_domain_alloc_irqs() with a pointer to irq_fwspec structure as @arg. But x86-specific DT IRQ allocation code casts @arg to of_phandle_args structure pointer and crashes trying to read the IRQ parameters. The function was not converted when the mapping descriptor was changed to irq_fwspec in the generic irqdomain code. Fixes: 11e4438ee330 ("irqdomain: Introduce a firmware-specific IRQ specifier structure") Signed-off-by: Ivan Gorinov <ivan.gorinov@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Rob Herring <robh+dt@kernel.org> Link: https://lkml.kernel.org/r/a234dee27ea60ce76141872da0d6bdb378b2a9ee.1520450752.git.ivan.gorinov@intel.com Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30x86/devicetree: Initialize device tree before using itIvan Gorinov
[ Upstream commit 628df9dc5ad886b0a9b33c75a7b09710eb859ca1 ] Commit 08d53aa58cb1 added CRC32 calculation in early_init_dt_verify() and checking in late initcall of_fdt_raw_init(), making early_init_dt_verify() mandatory. The required call to early_init_dt_verify() was not added to the x86-specific implementation, causing failure to create the sysfs entry in of_fdt_raw_init(). Fixes: 08d53aa58cb1 ("of/fdt: export fdt blob as /sys/firmware/fdt") Signed-off-by: Ivan Gorinov <ivan.gorinov@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Rob Herring <robh+dt@kernel.org> Link: https://lkml.kernel.org/r/c8c7e941efc63b5d25ebf9b6350b0f3df38f6098.1520450752.git.ivan.gorinov@intel.com Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30perf/x86/intel: Fix event update for auto-reloadKan Liang
[ Upstream commit d31fc13fdcb20e1c317f9a7dd6273c18fbd58308 ] There is a bug when reading event->count with large PEBS enabled. Here is an example: # ./read_count 0x71f0 0x122c0 0x1000000001c54 0x100000001257d 0x200000000bdc5 In fixed period mode, the auto-reload mechanism could be enabled for PEBS events, but the calculation of event->count does not take the auto-reload values into account. Anyone who reads event->count will get the wrong result, e.g. x86_pmu_read(). This bug was introduced with the auto-reload mechanism enabled since commit: 851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload mechanism when possible") Introduce intel_pmu_save_and_restart_reload() to calculate the event->count only for auto-reload. Since the counter increments a negative counter value and overflows on the sign switch, giving the interval: [-period, 0] the difference between two consecutive reads is: A) value2 - value1; when no overflows have happened in between, B) (0 - value1) + (value2 - (-period)); when one overflow happened in between, C) (0 - value1) + (n - 1) * (period) + (value2 - (-period)); when @n overflows happened in between. Here A) is the obvious difference, B) is the extension to the discrete interval, where the first term is to the top of the interval and the second term is from the bottom of the next interval and C) the extension to multiple intervals, where the middle term is the whole intervals covered. The equation for all cases is: value2 - value1 + n * period Previously the event->count was updated right before the sample output. But for case A, there is no PEBS record ready. It needs to be specially handled. Remove the auto-reload code from x86_perf_event_set_period() since we'll no longer call that function in this case. 
Based-on-code-from: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Fixes: 851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload mechanism when possible") Link: http://lkml.kernel.org/r/1518474035-21006-2-git-send-email-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
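The three cases A), B) and C) in the entry above, and the claim that they collapse into a single equation, can be checked with a small sketch. The function names below are hypothetical and are not taken from the patch; the counter values are modelled as signed integers in [-period, 0], as the changelog describes.

```c
#include <stdint.h>

/* Unified form from the changelog: value2 - value1 + n * period. */
static int64_t reload_delta(int64_t value1, int64_t value2,
                            int64_t period, int64_t n)
{
    return value2 - value1 + n * period;
}

/* The same quantity written out case by case, for comparison only. */
static int64_t reload_delta_by_cases(int64_t value1, int64_t value2,
                                     int64_t period, int64_t n)
{
    if (n == 0)
        return value2 - value1;                      /* case A */

    /* Cases B and C: distance to the top of the first interval, then
     * (n - 1) whole intervals, then the distance from the bottom of the
     * last interval. With n == 1 the middle term vanishes (case B). */
    return (0 - value1) + (n - 1) * period + (value2 - (-period));
}
```

For example, with a period of 1000, a first read of -200 and a second read of -900 after one overflow, both forms give 300 events.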
2018-05-30perf/x86/intel: Fix large period handling on Broadwell CPUsKan Liang
[ Upstream commit f605cfca8c39ffa2b98c06d2b9f30ba64f1e54e3 ] Large fixed period values could be truncated on Broadwell, for example: perf record -e cycles -c 10000000000 Here the fixed period is 0x2540BE400, but the period that is finally applied is 0x540BE400 - which is wrong. The reason is that x86_pmu::limit_period() uses a u32 parameter, so the high 32 bits of 'period' get truncated. This bug was introduced in: commit 294fe0f52a44 ("perf/x86/intel: Add INST_RETIRED.ALL workarounds") It's safe to use u64 instead of u32: - Although the 'left' is s64, the value of 'left' must be positive when calling limit_period(). - bdw_limit_period() only modifies the lowest 6 bits, it doesn't touch the higher 32 bits. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 294fe0f52a44 ("perf/x86/intel: Add INST_RETIRED.ALL workarounds") Link: http://lkml.kernel.org/r/1519926894-3520-1-git-send-email-kan.liang@linux.intel.com [ Rewrote unacceptably bad changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
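The truncation in the entry above is easy to reproduce outside the kernel. The sketch below is an assumption-laden stand-in: `limit_period_narrow` and `limit_period_wide` are hypothetical names that only mimic the property stated in the changelog (the callback touches the lowest 6 bits), not the real bdw_limit_period() logic.

```c
#include <stdint.h>

/* Hypothetical callback with the old 32-bit parameter type. */
static uint32_t limit_period_narrow(uint32_t left)
{
    return left & ~0x3fu; /* only the lowest 6 bits are modified */
}

/* Hypothetical callback with the widened 64-bit parameter type. */
static uint64_t limit_period_wide(uint64_t left)
{
    return left & ~0x3fULL;
}

/* Passing a 64-bit period through the narrow callback drops bits 63:32.
 * The cast makes explicit the conversion that the u32 parameter performed
 * implicitly. */
static uint64_t apply_narrow(uint64_t period)
{
    return limit_period_narrow((uint32_t)period);
}
```

Feeding 10000000000 (0x2540BE400) through `apply_narrow` yields 0x540BE400, the wrong value quoted in the changelog, while `limit_period_wide` returns the period unchanged.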
2018-05-30perf/x86/intel: Properly save/restore the PMU state in the NMI handlerKan Liang
[ Upstream commit 82d71ed0277efc45360828af8c4e4d40e1b45352 ] The PMU is disabled in intel_pmu_handle_irq(), but cpuc->enabled is not updated accordingly. This is fine in current usage because no-one checks it - but fix it for future code: for example, the drain_pebs() will be modified to fix an auto-reload bug. Properly save/restore the old PMU state. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Cc: kernel test robot <fengguang.wu@intel.com> Link: http://lkml.kernel.org/r/6f44ee84-56f8-79f1-559b-08e371eaeb78@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30KVM: lapic: stop advertising DIRECTED_EOI when in-kernel IOAPIC is in useVitaly Kuznetsov
[ Upstream commit 0bcc3fb95b97ac2ca223a5a870287b37f56265ac ] Devices which use level-triggered interrupts under Windows 2016 with Hyper-V role enabled don't work: Windows disables EOI broadcast in SPIV unconditionally. Our in-kernel IOAPIC implementation emulates an old IOAPIC version which has no EOI register so EOI never happens. The issue was discovered and discussed a while ago: https://www.spinics.net/lists/kvm/msg148098.html While this is a guest OS bug (it should check that IOAPIC has the required capabilities before disabling EOI broadcast) we can work around it in KVM: advertising DIRECTED_EOI with in-kernel IOAPIC makes little sense anyway. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30KVM: VMX: raise internal error for exception during invalid protected mode stateSean Christopherson
[ Upstream commit add5ff7a216ee545a214013f26d1ef2f44a9c9f8 ] Exit to userspace with KVM_INTERNAL_ERROR_EMULATION if we encounter an exception in Protected Mode while emulating guest due to invalid guest state. Unlike Big RM, KVM doesn't support emulating exceptions in PM, i.e. PM exceptions are always injected via the VMCS. Because we will never do VMRESUME due to emulation_required, the exception is never realized and we'll keep emulating the faulting instruction over and over until we receive a signal. Exit to userspace iff there is a pending exception, i.e. don't exit simply on a requested event. The purpose of this check and exit is to aid in debugging a guest that is in all likelihood already doomed. Invalid guest state in PM is extremely limited in normal operation, e.g. it generally only occurs for a few instructions early in BIOS, and any exception at this time is all but guaranteed to be fatal. Non-vectored interrupts, e.g. INIT, SIPI and SMI, can be cleanly handled/emulated, while checking for vectored interrupts, e.g. INTR and NMI, without hitting false positives would add a fair amount of complexity for almost no benefit (getting hit by lightning seems more likely than encountering this specific scenario). Add a WARN_ON_ONCE to vmx_queue_exception() if we try to inject an exception via the VMCS and emulation_required is true. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30x86/mm: Fix bogus warning during EFI bootup, use boot_cpu_has() instead of ↵Sai Praneeth
this_cpu_has() in build_cr3_noflush() [ Upstream commit 162ee5a8ab49be40d253f90e94aef712470a3a24 ] Linus reported the following boot warning: WARNING: CPU: 0 PID: 0 at arch/x86/include/asm/tlbflush.h:134 load_new_mm_cr3+0x114/0x170 [...] Call Trace: switch_mm_irqs_off+0x267/0x590 switch_mm+0xe/0x20 efi_switch_mm+0x3e/0x50 efi_enter_virtual_mode+0x43f/0x4da start_kernel+0x3bf/0x458 secondary_startup_64+0xa5/0xb0 ... after merging: 03781e40890c: x86/efi: Use efi_switch_mm() rather than manually twiddling with %cr3 When the platform supports PCID and if CONFIG_DEBUG_VM=y is enabled, build_cr3_noflush() (called via switch_mm()) does a sanity check to see if X86_FEATURE_PCID is set. Presently, build_cr3_noflush() uses "this_cpu_has(X86_FEATURE_PCID)" to perform the check but this_cpu_has() works only after SMP is initialized (i.e. per cpu cpu_info's should be populated) and this happens to be very late in the boot process (during rest_init()). As efi_runtime_services() are called during (early) kernel boot time and run time, modify build_cr3_noflush() to use boot_cpu_has() all the time. As suggested by Dave Hansen, this should be OK because all CPU's have same capabilities on x86. With this change the warning is fixed. ( Dave also suggested that we put a warning in this_cpu_has() if it's used early in the boot process. This is still work in progress as it affects MCE. ) Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Lee Chun-Yi <jlee@suse.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Michael S. 
Tsirkin <mst@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Shankar <ravi.v.shankar@intel.com> Cc: Ricardo Neri <ricardo.neri@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/1522870459-7432-1-git-send-email-sai.praneeth.prakhya@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30x86/mm: Do not forbid _PAGE_RW before init for __ro_after_initDave Hansen
[ Upstream commit 639d6aafe437a7464399d2a77d006049053df06f ] __ro_after_init data gets stuck in the .rodata section. That's normally fine because the kernel itself manages the R/W properties. But, if we run __change_page_attr() on an area which is __ro_after_init, the .rodata checks will trigger and force the area to be immediately read-only, even if it is early-ish in boot. This caused problems when trying to clear the _PAGE_GLOBAL bit for these areas in the PTI code: it cleared _PAGE_GLOBAL like I asked, but also took it upon itself to clear _PAGE_RW. The kernel then oopses the next time it writes to a __ro_after_init data structure. To fix this, add the kernel_set_to_readonly check, just like we have for kernel text, just a few lines below in this function. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180406205514.8D898241@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30x86/pgtable: Don't set huge PUD/PMD on non-leaf entriesJoerg Roedel
[ Upstream commit e3e288121408c3abeed5af60b87b95c847143845 ] The pmd_set_huge() and pud_set_huge() functions are used from the generic ioremap() code to establish large mappings where this is possible. But the generic ioremap() code does not check whether the PMD/PUD entries are already populated with a non-leaf entry, so that any page-table pages these entries point to will be lost. Further, on x86-32 with SHARED_KERNEL_PMD=0, this causes a BUG_ON() in vmalloc_sync_one() when PMD entries are synced from swapper_pg_dir to the current page-table. This happens because the PMD entry from swapper_pg_dir was promoted to a huge-page entry while the current PGD still contains the non-leaf entry. Because both entries are present and point to a different page, the BUG_ON() triggers. This was actually triggered with pti-x32 enabled in a KVM virtual machine by the graphics driver. A real and better fix for that would be to improve the page-table handling in the generic ioremap() code. But that is out-of-scope for this patch-set and left for later work. Reported-by: David H. 
Gutteridge <dhgutteridge@sympatico.ca> Signed-off-by: Joerg Roedel <jroedel@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Waiman Long <llong@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180411152437.GC15462@8bytes.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30x86/kvm: fix LAPIC timer drift when guest uses periodic modeDavid Vrabel
commit d8f2f498d9ed0c5010bc1bbc1146f94c8bf9f8cc upstream. Since 4.10, commit 8003c9ae204e (KVM: LAPIC: add APIC Timer periodic/oneshot mode VMX preemption timer support), guests using periodic LAPIC timers (such as FreeBSD 8.4) would see their timers drift significantly over time. Differences in the underlying clocks and numerical errors mean the periods of the two timers (hv and sw) are not the same. This difference will accumulate with every expiry resulting in a large error between the hv and sw timer. This means the sw timer may be running slow when compared to the hv timer. When the timer is switched from hv to sw, the now active sw timer will expire late. The guest VCPU is reentered and it switches to using the hv timer. This timer catches up, injecting multiple IRQs into the guest (of which the guest only sees one as it does not get to run until the hv timer has caught up) and thus the guest's timer rate is low (and becomes increasingly slower over time as the sw timer lags further and further behind). I believe a similar problem would occur if the hv timer is the slower one, but I have not observed this. Fix this by synchronizing the deadlines for both timers to the same time source on every tick. This prevents the errors from accumulating. Fixes: 8003c9ae204e21204e49816c5ea629357e283b06 Cc: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: David Vrabel <david.vrabel@nutanix.com> Cc: stable@vger.kernel.org Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30kvm: x86: IA32_ARCH_CAPABILITIES is always supportedJim Mattson
commit 1eaafe91a0df4157521b6417b3dd8430bf5f52f0 upstream. If there is a possibility that a VM may migrate to a Skylake host, then the hypervisor should report IA32_ARCH_CAPABILITIES.RSBA[bit 2] as being set (future work, of course). This implies that CPUID.(EAX=7,ECX=0):EDX.ARCH_CAPABILITIES[bit 29] should be set. Therefore, kvm should report this CPUID bit as being supported whether or not the host supports it. Userspace is still free to clear the bit if it chooses. For more information on RSBA, see Intel's white paper, "Retpoline: A Branch Target Injection Mitigation" (Document Number 337131-001), currently available at https://bugzilla.kernel.org/show_bug.cgi?id=199511. Since the IA32_ARCH_CAPABILITIES MSR is emulated in kvm, there is no dependency on hardware support for this feature. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Fixes: 28c1c9fabf48 ("KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES") Cc: stable@vger.kernel.org Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30KVM: x86: Update cpuid properly when CR4.OSXAVE or CR4.PKE is changedWei Huang
commit c4d2188206bafa177ea58e9a25b952baa0bf7712 upstream. The CPUID bits of OSXSAVE (function=0x1) and OSPKE (func=0x7, leaf=0x0) allow user apps to detect if the OS has set CR4.OSXSAVE or CR4.PKE. KVM is supposed to update these CPUID bits when CR4 is updated. Current KVM code doesn't handle some special cases when updates come from the emulator. Here is one example: Step 1: guest boots Step 2: guest OS enables XSAVE ==> CR4.OSXSAVE=1 and CPUID.OSXSAVE=1 Step 3: guest hot reboot ==> QEMU resets CR4 to 0, but CPUID.OSXSAVE==1 Step 4: guest OS checks CPUID.OSXSAVE, detects 1, then executes xgetbv Step 4 above will cause a #UD and a guest crash because the guest OS hasn't turned on OSXSAVE yet. This patch solves the problem by comparing the old_cr4 with cr4. If the related bits have been changed, kvm_update_cpuid() needs to be called. Signed-off-by: Wei Huang <wei@redhat.com> Reviewed-by: Bandan Das <bsd@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
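The comparison of old and new CR4 values described in the entry above can be sketched as a single XOR-and-mask test. The helper name is hypothetical and the sketch is not the patch itself; the bit positions used (CR4.OSXSAVE at bit 18, CR4.PKE at bit 22) are the architectural ones.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_OSXSAVE (1ULL << 18) /* mirrored in CPUID.1:ECX.OSXSAVE */
#define CR4_PKE     (1ULL << 22) /* mirrored in CPUID.(EAX=7,ECX=0):ECX.OSPKE */

/*
 * Hypothetical helper: returns true when a CR4 write changed a bit that is
 * reflected in CPUID, i.e. when the cached CPUID state needs a refresh.
 */
static bool cr4_change_needs_cpuid_update(uint64_t old_cr4, uint64_t new_cr4)
{
    return ((old_cr4 ^ new_cr4) & (CR4_OSXSAVE | CR4_PKE)) != 0;
}
```

In the hot-reboot scenario from the changelog, the old value has OSXSAVE set and the new value is 0, so the helper reports that an update is needed; a write that only flips unrelated bits does not.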
2018-05-30KVM/VMX: Expose SSBD properly to guestsKonrad Rzeszutek Wilk
commit 0aa48468d00959c8a37cd3ac727284f4f7359151 upstream. X86_FEATURE_SSBD is a synthetic CPU feature, that is, its bit location has no relevance to the real CPUID 0x7.EBX[31] bit position. For that we need the new CPU feature name. Fixes: 52817587e706 ("x86/cpufeatures: Disentangle SSBD enumeration") Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: kvm@vger.kernel.org Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: stable@vger.kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Link: https://lkml.kernel.org/r/20180521215449.26423-2-konrad.wilk@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-25x86/kexec: Avoid double free_page() upon do_kexec_load() failureTetsuo Handa
commit a466ef76b815b86748d9870ef2a430af7b39c710 upstream. syzbot is reporting crashes after memory allocation failure inside do_kexec_load() [1]. This is because free_transition_pgtable() is called by both init_transition_pgtable() and machine_kexec_cleanup() when memory allocation failed inside init_transition_pgtable(). Regarding 32bit code, machine_kexec_free_page_tables() is called by both machine_kexec_alloc_page_tables() and machine_kexec_cleanup() when memory allocation failed inside machine_kexec_alloc_page_tables(). Fix this by leaving the error handling to machine_kexec_cleanup() (and optionally setting NULL after free_page()). [1] https://syzkaller.appspot.com/bug?id=91e52396168cf2bdd572fe1e1bc0bc645c1c6b40 Fixes: f5deb79679af6eb4 ("x86: kexec: Use one page table in x86_64 machine_kexec") Fixes: 92be3d6bdf2cb349 ("kexec/i386: allocate page table pages dynamically") Reported-by: syzbot <syzbot+d96f60296ef613fe1d69@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Baoquan He <bhe@redhat.com> Cc: thomas.lendacky@amd.com Cc: prudo@linux.vnet.ibm.com Cc: Huang Ying <ying.huang@intel.com> Cc: syzkaller-bugs@googlegroups.com Cc: takahiro.akashi@linaro.org Cc: H. Peter Anvin <hpa@zytor.com> Cc: akpm@linux-foundation.org Cc: dyoung@redhat.com Cc: kirill.shutemov@linux.intel.com Link: https://lkml.kernel.org/r/201805091942.DGG12448.tMFVFSJFQOOLHO@I-love.SAKURA.ne.jp Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
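The ownership pattern behind the fix above (the allocation path returns an error without freeing, and one cleanup routine frees and clears the pointers) can be shown with a userspace sketch. Everything here is hypothetical: `struct image_tables`, `alloc_tables` and `cleanup_tables` stand in for the kernel structures and functions, and `malloc`/`free` stand in for the page allocator.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the page-table pointers kept per kexec image. */
struct image_tables {
    void *pgd;
    void *pmd;
};

/* Single owner of the free path. Clearing the pointers makes a second call
 * harmless, because free(NULL) is a no-op. */
static void cleanup_tables(struct image_tables *t)
{
    free(t->pgd);
    t->pgd = NULL;
    free(t->pmd);
    t->pmd = NULL;
}

/* Allocation path: on failure it returns an error and leaves any partial
 * allocation in place for cleanup_tables() to release. fail_second simulates
 * an allocation failure on the second table. */
static int alloc_tables(struct image_tables *t, int fail_second)
{
    t->pgd = malloc(64);
    if (!t->pgd)
        return -1;

    t->pmd = fail_second ? NULL : malloc(64);
    if (!t->pmd)
        return -1; /* do not free t->pgd here; cleanup owns that */

    return 0;
}
```

A caller that sees the error invokes `cleanup_tables()` once, and a second call on the same structure is safe because the pointers were reset to NULL.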
2018-05-22x86/bugs: Rename SSBD_NO to SSB_NOKonrad Rzeszutek Wilk
commit 240da953fcc6a9008c92fae5b1f727ee5ed167ab upstream. The "336996 Speculative Execution Side Channel Mitigations" document from May defines this as SSB_NO, hence let's sync up. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-22KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBDTom Lendacky
commit bc226f07dcd3c9ef0b7f6236fe356ea4a9cb4769 upstream Expose the new virtualized architectural mechanism, VIRT_SSBD, for using speculative store bypass disable (SSBD) under SVM. This will allow guests to use SSBD on hardware that uses non-architectural mechanisms for enabling SSBD. [ tglx: Folded the migration fixup from Paolo Bonzini ] Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-22x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFGThomas Gleixner
commit 47c61b3955cf712cadfc25635bf9bc174af030ea upstream Add the necessary logic for supporting the emulated VIRT_SPEC_CTRL MSR to x86_virt_spec_ctrl(). If either X86_FEATURE_LS_CFG_SSBD or X86_FEATURE_VIRT_SPEC_CTRL is set then use the new guest_virt_spec_ctrl argument to check whether the state must be modified on the host. The update reuses speculative_store_bypass_update() so the ZEN-specific sibling coordination can be reused. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-22x86/bugs: Rework spec_ctrl base and mask logicThomas Gleixner
commit be6fcb5478e95bb1c91f489121238deb3abca46a upstream. x86_spec_ctrl_mask is intended to mask out bits from a MSR_SPEC_CTRL value which are not to be modified. However the implementation is not really used and the bitmask was inverted to make a check easier, which was removed in "x86/bugs: Remove x86_spec_ctrl_set()". Aside from that it is missing the STIBP bit if it is supported by the platform, so if the mask were used in x86_virt_spec_ctrl() then it would prevent a guest from setting STIBP. Add the STIBP bit if supported and use the mask in x86_virt_spec_ctrl() to sanitize the value which is supplied by the guest. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
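One way to read the "sanitize the guest value with a mask" idea in the entry above is sketched below. This is an assumption about the shape of the logic, not the kernel code: `sanitize_guest_spec_ctrl` is a hypothetical helper, and the mask here lists the bits a guest may control. The bit positions (IBRS bit 0, STIBP bit 1, SSBD bit 2) are the architectural SPEC_CTRL layout.

```c
#include <stdint.h>

#define SPEC_CTRL_IBRS  (1ULL << 0)
#define SPEC_CTRL_STIBP (1ULL << 1)
#define SPEC_CTRL_SSBD  (1ULL << 2)

/*
 * Hypothetical helper: combine the host value with the guest-supplied value.
 * Bits inside `mask` come from the guest; all other bits keep the host
 * setting, so the guest cannot change them.
 */
static uint64_t sanitize_guest_spec_ctrl(uint64_t host, uint64_t guest,
                                         uint64_t mask)
{
    return (host & ~mask) | (guest & mask);
}
```

If STIBP is missing from the mask, a guest request to set STIBP is silently dropped, which is the problem the changelog points out; once the bit is added to the mask, the request passes through.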