summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/vmx.c
AgeCommit message (Collapse)Author
2019-03-23KVM: X86: Fix residual mmio emulation request to userspaceWanpeng Li
commit bbeac2830f4de270bb48141681cb730aadf8dce1 upstream. Reported by syzkaller: The kvm-intel.unrestricted_guest=0 WARNING: CPU: 5 PID: 1014 at /home/kernel/data/kvm/arch/x86/kvm//x86.c:7227 kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm] CPU: 5 PID: 1014 Comm: warn_test Tainted: G W OE 4.13.0-rc3+ #8 RIP: 0010:kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm] Call Trace: ? put_pid+0x3a/0x50 ? rcu_read_lock_sched_held+0x79/0x80 ? kmem_cache_free+0x2f2/0x350 kvm_vcpu_ioctl+0x340/0x700 [kvm] ? kvm_vcpu_ioctl+0x340/0x700 [kvm] ? __fget+0xfc/0x210 do_vfs_ioctl+0xa4/0x6a0 ? __fget+0x11d/0x210 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x23/0xc2 ? __this_cpu_preempt_check+0x13/0x20 The syszkaller folks reported a residual mmio emulation request to userspace due to vm86 fails to emulate inject real mode interrupt(fails to read CS) and incurs a triple fault. The vCPU returns to userspace with vcpu->mmio_needed == true and KVM_EXIT_SHUTDOWN exit reason. However, the syszkaller testcase constructs several threads to launch the same vCPU, the thread which lauch this vCPU after the thread whichs get the vcpu->mmio_needed == true and KVM_EXIT_SHUTDOWN will trigger the warning. #define _GNU_SOURCE #include <pthread.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/wait.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/mman.h> #include <fcntl.h> #include <unistd.h> #include <linux/kvm.h> #include <stdio.h> int kvmcpu; struct kvm_run *run; void* thr(void* arg) { int res; res = ioctl(kvmcpu, KVM_RUN, 0); printf("ret1=%d exit_reason=%d suberror=%d\n", res, run->exit_reason, run->internal.suberror); return 0; } void test() { int i, kvm, kvmvm; pthread_t th[4]; kvm = open("/dev/kvm", O_RDWR); kvmvm = ioctl(kvm, KVM_CREATE_VM, 0); kvmcpu = ioctl(kvmvm, KVM_CREATE_VCPU, 0); run = (struct kvm_run*)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, kvmcpu, 0); srand(getpid()); for (i = 0; i < 4; i++) { pthread_create(&th[i], 0, thr, 0); usleep(rand() % 10000); } for (i = 0; i < 4; i++) pthread_join(th[i], 0); } int main() { for (;;) { int pid = fork(); if (pid < 0) exit(1); if (pid == 0) { test(); exit(0); } int status; while (waitpid(pid, &status, __WALL) != pid) {} } return 0; } This patch fixes it by resetting the vcpu->mmio_needed once we receive the triple fault to avoid the residue. Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Zubin Mithra <zsm@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23KVM: nVMX: Ignore limit checks on VMX instructions using flat segmentsSean Christopherson
commit 34333cc6c2cb021662fd32e24e618d1b86de95bf upstream. Regarding segments with a limit==0xffffffff, the SDM officially states: When the effective limit is FFFFFFFFH (4 GBytes), these accesses may or may not cause the indicated exceptions. Behavior is implementation-specific and may vary from one execution to another. In practice, all CPUs that support VMX ignore limit checks for "flat segments", i.e. an expand-up data or code segment with base=0 and limit=0xffffffff. This is subtly different than wrapping the effective address calculation based on the address size, as the flat segment behavior also applies to accesses that would wrap the 4g boundary, e.g. a 4-byte access starting at 0xffffffff will access linear addresses 0xffffffff, 0x0, 0x1 and 0x2. Fixes: f9eb4af67c9d ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23KVM: nVMX: Sign extend displacements of VMX instr's mem operandsSean Christopherson
commit 946c522b603f281195af1df91837a1d4d1eb3bc9 upstream. The VMCS.EXIT_QUALIFCATION field reports the displacements of memory operands for various instructions, including VMX instructions, as a naturally sized unsigned value, but masks the value by the addr size, e.g. given a ModRM encoded as -0x28(%ebp), the -0x28 displacement is reported as 0xffffffd8 for a 32-bit address size. Despite some weird wording regarding sign extension, the SDM explicitly states that bits beyond the instructions address size are undefined: In all cases, bits of this field beyond the instruction’s address size are undefined. Failure to sign extend the displacement results in KVM incorrectly treating a negative displacement as a large positive displacement when the address size of the VMX instruction is smaller than KVM's native size, e.g. a 32-bit address size on a 64-bit KVM. The very original decoding, added by commit 064aea774768 ("KVM: nVMX: Decoding memory operands of VMX instructions"), sort of modeled sign extension by truncating the final virtual/linear address for a 32-bit address size. I.e. it messed up the effective address but made it work by adjusting the final address. When segmentation checks were added, the truncation logic was kept as-is and no sign extension logic was introduced. In other words, it kept calculating the wrong effective address while mostly generating the correct virtual/linear address. As the effective address is what's used in the segment limit checks, this results in KVM incorreclty injecting #GP/#SS faults due to non-existent segment violations when a nested VMM uses negative displacements with an address size smaller than KVM's native address size. Using the -0x28(%ebp) example, an EBP value of 0x1000 will result in KVM using 0x100000fd8 as the effective address when checking for a segment limit violation. This causes a 100% failure rate when running a 32-bit KVM build as L1 on top of a 64-bit KVM L0. Fixes: f9eb4af67c9d ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-23KVM: VMX: Fix x2apic check in vmx_msr_bitmap_mode()Joerg Roedel
The stable backport of upstream commit 904e14fb7cb96 KVM: VMX: make MSR bitmaps per-VCPU has a bug in vmx_msr_bitmap_mode(). It enables the x2apic MSR-bitmap when the kernel emulates x2apic for the guest in software. The upstream version of the commit checkes whether the hardware has virtualization enabled for x2apic emulation. Since KVM emulates x2apic for guests even when the host does not support x2apic in hardware, this causes the intercept of at least the X2APIC_TASKPRI MSR to be disabled on machines not supporting that MSR. The result is undefined behavior, on some machines (Intel Westmere based) it causes a crash of the guest kernel when it tries to access that MSR. Change the check in vmx_msr_bitmap_mode() to match the upstream code. This fixes the guest crashes observed with stable kernels starting with v4.4.168 through v4.4.175. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-20KVM: nVMX: unconditionally cancel preemption timer in free_nested ↵Peter Shier
(CVE-2019-7221) commit ecec76885bcfe3294685dc363fd1273df0d5d65f upstream. Bugzilla: 1671904 There are multiple code paths where an hrtimer may have been started to emulate an L1 VMX preemption timer that can result in a call to free_nested without an intervening L2 exit where the hrtimer is normally cancelled. Unconditionally cancel in free_nested to cover all cases. Embargoed until Feb 7th 2019. Signed-off-by: Peter Shier <pshier@google.com> Reported-by: Jim Mattson <jmattson@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reported-by: Felix Wilhelm <fwilhelm@google.com> Cc: stable@kernel.org Message-Id: <20181011184646.154065-1-pshier@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-13x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when ↵Vitaly Kuznetsov
running nested commit d391f1207067268261add0485f0f34503539c5b0 upstream. I was investigating an issue with seabios >= 1.10 which stopped working for nested KVM on Hyper-V. The problem appears to be in handle_ept_violation() function: when we do fast mmio we need to skip the instruction so we do kvm_skip_emulated_instruction(). This, however, depends on VM_EXIT_INSTRUCTION_LEN field being set correctly in VMCS. However, this is not the case. Intel's manual doesn't mandate VM_EXIT_INSTRUCTION_LEN to be set when EPT MISCONFIG occurs. While on real hardware it was observed to be set, some hypervisors follow the spec and don't set it; we end up advancing IP with some random value. I checked with Microsoft and they confirmed they don't fill VM_EXIT_INSTRUCTION_LEN on EPT MISCONFIG. Fix the issue by doing instruction skip through emulator when running nested. Fixes: 68c3b4d1676d870f0453c31d5a52e7e65c7448ae Suggested-by: Radim Krčmář <rkrcmar@redhat.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> [mhaboustak: backport to 4.9.y] Signed-off-by: Mike Haboustak <haboustak@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBDTom Lendacky
commit bc226f07dcd3c9ef0b7f6236fe356ea4a9cb4769 upstream. Expose the new virtualized architectural mechanism, VIRT_SSBD, for using speculative store bypass disable (SSBD) under SVM. This will allow guests to use SSBD on hardware that uses non-architectural mechanisms for enabling SSBD. [ tglx: Folded the migration fixup from Paolo Bonzini ] Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRLThomas Gleixner
commit ccbcd2674472a978b48c91c1fbfb66c0ff959f24 upstream. AMD is proposing a VIRT_SPEC_CTRL MSR to handle the Speculative Store Bypass Disable via MSR_AMD64_LS_CFG so that guests do not have to care about the bit position of the SSBD bit and thus facilitate migration. Also, the sibling coordination on Family 17H CPUs can only be done on the host. Extend x86_spec_ctrl_set_guest() and x86_spec_ctrl_restore_host() with an extra argument for the VIRT_SPEC_CTRL MSR. Hand in 0 from VMX and in SVM add a new virt_spec_ctrl member to the CPU data structure which is going to be used in later patches for the actual implementation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [bwh: Backported to 4.4: This was partly applied before; apply just the missing bits] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17x86/KVM/VMX: Expose SPEC_CTRL Bit(2) to the guestKonrad Rzeszutek Wilk
commit da39556f66f5cfe8f9c989206974f1cb16ca5d7c upstream. Expose the CPUID.7.EDX[31] bit to the guest, and also guard against various combinations of SPEC_CTRL MSR values. The handling of the MSR (to take into account the host value of SPEC_CTRL Bit(2)) is taken care of in patch: KVM/SVM/VMX/x86/spectre_v2: Support the combination of guest and host IBRS Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> [dwmw2: Handle 4.9 guest CPUID differences, rename guest_cpu_has_ibrs() → guest_cpu_has_spec_ctrl()] Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [bwh: Backported to 4.4: Update feature bit name] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17x86/bugs, KVM: Support the combination of guest and host IBRSKonrad Rzeszutek Wilk
commit 5cf687548705412da47c9cec342fd952d71ed3d5 upstream. A guest may modify the SPEC_CTRL MSR from the value used by the kernel. Since the kernel doesn't use IBRS, this means a value of zero is what is needed in the host. But the 336996-Speculative-Execution-Side-Channel-Mitigations.pdf refers to the other bits as reserved so the kernel should respect the boot time SPEC_CTRL value and use that. This allows to deal with future extensions to the SPEC_CTRL interface if any at all. Note: This uses wrmsrl() instead of native_wrmsl(). I does not make any difference as paravirt will over-write the callq *0xfff.. with the wrmsrl assembler code. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 4.4: This was partly applied before; apply just the missing bits] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17KVM/x86: Remove indirect MSR op calls from SPEC_CTRLPaolo Bonzini
commit ecb586bd29c99fb4de599dec388658e74388daad upstream. Having a paravirt indirect call in the IBRS restore path is not a good idea, since we are trying to protect from speculative execution of bogus indirect branch targets. It is also slower, so use native_wrmsrl() on the vmentry path too. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: KarimAllah Ahmed <karahmed@amazon.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Fixes: d28b387fb74da95d69d2615732f50cceb38e9a4d Link: http://lkml.kernel.org/r/20180222154318.20361-2-pbonzini@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [bwh: Backported to 4.4: adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRLKarimAllah Ahmed
commit d28b387fb74da95d69d2615732f50cceb38e9a4d upstream. [ Based on a patch from Ashok Raj <ashok.raj@intel.com> ] Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for guests that will only mitigate Spectre V2 through IBRS+IBPB and will not be using a retpoline+IBPB based approach. To avoid the overhead of saving and restoring the MSR_IA32_SPEC_CTRL for guests that do not actually use the MSR, only start saving and restoring when a non-zero is written to it. No attempt is made to handle STIBP here, intentionally. Filtering STIBP may be added in a future patch, which may require trapping all writes if we don't want to pass it through directly to the guest. [dwmw2: Clean up CPUID bits, save/restore manually, handle reset] Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jun Nakajima <jun.nakajima@intel.com> Cc: kvm@vger.kernel.org Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ashok Raj <ashok.raj@intel.com> Link: https://lkml.kernel.org/r/1517522386-18410-5-git-send-email-karahmed@amazon.de Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIESKarimAllah Ahmed
commit 28c1c9fabf48d6ad596273a11c46e0d0da3e14cd upstream. Intel processors use MSR_IA32_ARCH_CAPABILITIES MSR to indicate RDCL_NO (bit 0) and IBRS_ALL (bit 1). This is a read-only MSR. By default the contents will come directly from the hardware, but user-space can still override it. [dwmw2: The bit in kvm_cpuid_7_0_edx_x86_features can be unconditional] Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jun Nakajima <jun.nakajima@intel.com> Cc: kvm@vger.kernel.org Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Ashok Raj <ashok.raj@intel.com> Link: https://lkml.kernel.org/r/1517522386-18410-4-git-send-email-karahmed@amazon.de Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [bwh: Backported to 4.4: adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17KVM/x86: Add IBPB supportAshok Raj
commit 15d45071523d89b3fb7372e2135fbd72f6af9506 upstream. The Indirect Branch Predictor Barrier (IBPB) is an indirect branch control mechanism. It keeps earlier branches from influencing later ones. Unlike IBRS and STIBP, IBPB does not define a new mode of operation. It's a command that ensures predicted branch targets aren't used after the barrier. Although IBRS and IBPB are enumerated by the same CPUID enumeration, IBPB is very different. IBPB helps mitigate against three potential attacks: * Mitigate guests from being attacked by other guests. - This is addressed by issing IBPB when we do a guest switch. * Mitigate attacks from guest/ring3->host/ring3. These would require a IBPB during context switch in host, or after VMEXIT. The host process has two ways to mitigate - Either it can be compiled with retpoline - If its going through context switch, and has set !dumpable then there is a IBPB in that path. (Tim's patch: https://patchwork.kernel.org/patch/10192871) - The case where after a VMEXIT you return back to Qemu might make Qemu attackable from guest when Qemu isn't compiled with retpoline. There are issues reported when doing IBPB on every VMEXIT that resulted in some tsc calibration woes in guest. * Mitigate guest/ring0->host/ring0 attacks. When host kernel is using retpoline it is safe against these attacks. If host kernel isn't using retpoline we might need to do a IBPB flush on every VMEXIT. Even when using retpoline for indirect calls, in certain conditions 'ret' can use the BTB on Skylake-era CPUs. There are other mitigations available like RSB stuffing/clearing. * IBPB is issued only for SVM during svm_free_vcpu(). VMX has a vmclear and SVM doesn't. Follow discussion here: https://lkml.org/lkml/2018/1/15/146 Please refer to the following spec for more details on the enumeration and control. Refer here to get documentation about mitigations. https://software.intel.com/en-us/side-channel-security-support [peterz: rebase and changelog rewrite] [karahmed: - rebase - vmx: expose PRED_CMD if guest has it in CPUID - svm: only pass through IBPB if guest has it in CPUID - vmx: support !cpu_has_vmx_msr_bitmap()] - vmx: support nested] [dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS) PRED_CMD is a write-only MSR] Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: kvm@vger.kernel.org Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Jun Nakajima <jun.nakajima@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1515720739-43819-6-git-send-email-ashok.raj@intel.com Link: https://lkml.kernel.org/r/1517522386-18410-3-git-send-email-karahmed@amazon.de Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [bwh: Backported to 4.4: adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17KVM: VMX: make MSR bitmaps per-VCPUPaolo Bonzini
commit 904e14fb7cb96401a7dc803ca2863fd5ba32ffe6 upstream. Place the MSR bitmap in struct loaded_vmcs, and update it in place every time the x2apic or APICv state can change. This is rare and the loop can handle 64 MSRs per iteration, in a similar fashion as nested_vmx_prepare_msr_bitmap. This prepares for choosing, on a per-VM basis, whether to intercept the SPEC_CTRL and PRED_CMD MSRs. Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [bwh: Backported to 4.4: - APICv support looked different - We still need to intercept the APIC_ID MSR - Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17KVM: VMX: introduce alloc_loaded_vmcsPaolo Bonzini
commit f21f165ef922c2146cc5bdc620f542953c41714b upstream. Group together the calls to alloc_vmcs and loaded_vmcs_init. Soon we'll also allocate an MSR bitmap there. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [bwh: Backported to 4.4: - No loaded_vmcs::shadow_vmcs field to initialise - Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17KVM: nVMX: Eliminate vmcs02 poolJim Mattson
commit de3a0021a60635de96aa92713c1a31a96747d72c upstream. The potential performance advantages of a vmcs02 pool have never been realized. To simplify the code, eliminate the pool. Instead, a single vmcs02 is allocated per VCPU when the VCPU enters VMX operation. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Mark Kanda <mark.kanda@oracle.com> Reviewed-by: Ameya More <ameya.more@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [bwh: Backported to 4.4: - No loaded_vmcs::shadow_vmcs field to initialise - Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17KVM: nVMX: mark vmcs12 pages dirty on L2 exitDavid Matlack
commit c9f04407f2e0b3fc9ff7913c65fcfcb0a4b61570 upstream. The host physical addresses of L1's Virtual APIC Page and Posted Interrupt descriptor are loaded into the VMCS02. The CPU may write to these pages via their host physical address while L2 is running, bypassing address-translation-based dirty tracking (e.g. EPT write protection). Mark them dirty on every exit from L2 to prevent them from getting out of sync with dirty tracking. Also mark the virtual APIC page and the posted interrupt descriptor dirty when KVM is virtualizing posted interrupt processing. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17KVM: nVMX: fix msr bitmaps to prevent L2 from accessing L0 x2APICRadim Krčmář
commit d048c098218e91ed0e10dfa1f0f80e2567fe4ef7 upstream. msr bitmap can be used to avoid a VM exit (interception) on guest MSR accesses. In some configurations of VMX controls, the guest can even directly access host's x2APIC MSRs. See SDM 29.5 VIRTUALIZING MSR-BASED APIC ACCESSES. L2 could read all L0's x2APIC MSRs and write TPR, EOI, and SELF_IPI. To do so, L1 would first trick KVM to disable all possible interceptions by enabling APICv features and then would turn those features off; nested_vmx_merge_msr_bitmap() only disabled interceptions, so VMX would not intercept previously enabled MSRs even though they were not safe with the new configuration. Correctly re-enabling interceptions is not enough as a second bug would still allow L1+L2 to access host's MSRs: msr bitmap was shared for all VMCSs, so L1 could trigger a race to get the desired combination of msr bitmap and VMX controls. This fix allocates a msr bitmap for every L1 VCPU, allows only safe x2APIC MSRs from L1's msr bitmap, and disables msr bitmaps if they would have to intercept everything anyway. Fixes: 3af18d9c5fe9 ("KVM: nVMX: Prepare for using hardware MSR bitmap") Reported-by: Jim Mattson <jmattson@google.com> Suggested-by: Wincy Van <fanwenyi0529@gmail.com> Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> [bwh: Backported to 4.4: - handle_vmon() doesn't allocate a cached vmcs12 - Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-06kvm: x86: vmx: fix vpid leakRoman Kagan
commit 63aff65573d73eb8dda4732ad4ef222dd35e4862 upstream. VPID for the nested vcpu is allocated at vmx_create_vcpu whenever nested vmx is turned on with the module parameter. However, it's only freed if the L1 guest has executed VMXON which is not a given. As a result, on a system with nested==on every creation+deletion of an L1 vcpu without running an L2 guest results in leaking one vpid. Since the total number of vpids is limited to 64k, they can eventually get exhausted, preventing L2 from starting. Delay allocation of the L2 vpid until VMXON emulation, thus matching its freeing. Fixes: 5c614b3583e7b6dab0c86356fa36c2bcbb8322a0 Cc: stable@vger.kernel.org Signed-off-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-25x86/speculation: Create spec-ctrl.h to avoid include hellThomas Gleixner
commit 28a2775217b17208811fa43a9e96bd1fdf417b86 upstream Having everything in nospec-branch.h creates a hell of dependencies when adding the prctl based switching mechanism. Move everything which is not required in nospec-branch.h to spec-ctrl.h and fix up the includes in the relevant files. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu> Reviewed-by: Matt Helsley (VMware) <matt.helsley@gmail.com> Reviewed-by: Alexey Makhalov <amakhalov@vmware.com> Reviewed-by: Bo Gan <ganb@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-16KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_systemPaolo Bonzini
commit ce14e868a54edeb2e30cb7a7b104a2fc4b9d76ca upstream. Int the next patch the emulator's .read_std and .write_std callbacks will grow another argument, which is not needed in kvm_read_guest_virt and kvm_write_guest_virt_system's callers. Since we have to make separate functions, let's give the currently existing names a nicer interface, too. Fixes: 129a72a0d3c8 ("KVM: x86: Introduce segmented_write_std", 2017-01-12) Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30KVM: VMX: raise internal error for exception during invalid protected mode stateSean Christopherson
[ Upstream commit add5ff7a216ee545a214013f26d1ef2f44a9c9f8 ] Exit to userspace with KVM_INTERNAL_ERROR_EMULATION if we encounter an exception in Protected Mode while emulating guest due to invalid guest state. Unlike Big RM, KVM doesn't support emulating exceptions in PM, i.e. PM exceptions are always injected via the VMCS. Because we will never do VMRESUME due to emulation_required, the exception is never realized and we'll keep emulating the faulting instruction over and over until we receive a signal. Exit to userspace iff there is a pending exception, i.e. don't exit simply on a requested event. The purpose of this check and exit is to aid in debugging a guest that is in all likelihood already doomed. Invalid guest state in PM is extremely limited in normal operation, e.g. it generally only occurs for a few instructions early in BIOS, and any exception at this time is all but guaranteed to be fatal. Non-vectored interrupts, e.g. INIT, SIPI and SMI, can be cleanly handled/emulated, while checking for vectored interrupts, e.g. INTR and NMI, without hitting false positives would add a fair amount of complexity for almost no benefit (getting hit by lightning seems more likely than encountering this specific scenario). Add a WARN_ON_ONCE to vmx_queue_exception() if we try to inject an exception via the VMCS and emulation_required is true. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13KVM: nVMX: Fix handling of lmsw instructionJan H. Schönherr
[ Upstream commit e1d39b17e044e8ae819827810d87d809ba5f58c0 ] The decision whether or not to exit from L2 to L1 on an lmsw instruction is based on bogus values: instead of using the information encoded within the exit qualification, it uses the data also used for the mov-to-cr instruction, which boils down to using whatever is in %eax at that point. Use the correct values instead. Without this fix, an L1 may not get notified when a 32-bit Linux L2 switches its secondary CPUs to protected mode; the L1 is only notified on the next modification of CR0. This short time window poses a problem, when there is some other reason to exit to L1 in between. Then, L2 will be resumed in real mode and chaos ensues. Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28kvm/x86: fix icebp instruction handlingLinus Torvalds
commit 32d43cd391bacb5f0814c2624399a5dad3501d09 upstream. The undocumented 'icebp' instruction (aka 'int1') works pretty much like 'int3' in the absense of in-circuit probing equipment (except, obviously, that it raises #DB instead of raising #BP), and is used by some validation test-suites as such. But Andy Lutomirski noticed that his test suite acted differently in kvm than on bare hardware. The reason is that kvm used an inexact test for the icebp instruction: it just assumed that an all-zero VM exit qualification value meant that the VM exit was due to icebp. That is not unlike the guess that do_debug() does for the actual exception handling case, but it's purely a heuristic, not an absolute rule. do_debug() does it because it wants to ascribe _some_ reasons to the #DB that happened, and an empty %dr6 value means that 'icebp' is the most likely casue and we have no better information. But kvm can just do it right, because unlike the do_debug() case, kvm actually sees the real reason for the #DB in the VM-exit interruption information field. So instead of relying on an inexact heuristic, just use the actual VM exit information that says "it was 'icebp'". Right now the 'icebp' instruction isn't technically documented by Intel, but that will hopefully change. The special "privileged software exception" information _is_ actually mentioned in the Intel SDM, even though the cause of it isn't enumerated. Reported-by: Andy Lutomirski <luto@kernel.org> Tested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25KVM: nVMX: invvpid handling improvementsJan Dakinevich
commit bcdde302b8268ef7dbc4ddbdaffb5b44eafe9a1e upstream - Expose all invalidation types to the L1 - Reject invvpid instruction, if L1 passed zero vpid value to single context invalidations Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com> Tested-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> [jwang: port to 4.4] Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25kvm: nVMX: Fix kernel panics induced by illegal INVEPT/INVVPID typesJim Mattson
commit 85c856b39b479dde410ddd09df1da745343010c9 upstream Bitwise shifts by amounts greater than or equal to the width of the left operand are undefined. A malicious guest can exploit this to crash a 32-bit host, due to the BUG_ON(1)'s in handle_{invept,invvpid}. Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <1477496318-17681-1-git-send-email-jmattson@google.com> [Change 1UL to 1, to match the range check on the shift count. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [jwang: port from linux-4.9 to 4.4 ] Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25KVM: nVMX: vmx_complete_nested_posted_interrupt() can't failDavid Hildenbrand
(cherry picked from commit 6342c50ad12e8ce0736e722184a7dbdea4a3477f) vmx_complete_nested_posted_interrupt() can't fail, let's turn it into a void function. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> [jwang: port to 4.4] Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25KVM: nVMX: kmap() can't failDavid Hildenbrand
commit 42cf014d38d8822cce63703a467e00f65d000952 upstream. kmap() can't fail, therefore it will always return a valid pointer. Let's just get rid of the unnecessary checks. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [jwang: port to 4.4] Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25x86/kvm: Update spectre-v1 mitigationDan Williams
(cherry picked from commit 085331dfc6bbe3501fb936e657331ca943827600) Commit 75f139aaf896 "KVM: x86: Add memory barrier on vmcs field lookup" added a raw 'asm("lfence");' to prevent a bounds check bypass of 'vmcs_field_to_offset_table'. The lfence can be avoided in this path by using the array_index_nospec() helper designed for these types of fixes. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Andrew Honig <ahonig@google.com> Cc: kvm@vger.kernel.org Cc: Jim Mattson <jmattson@google.com> Link: https://lkml.kernel.org/r/151744959670.6342.3001723920950249067.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> [jwang: cherry pick to 4.4] Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25KVM: VMX: Make indirect call speculation safePeter Zijlstra
(cherry picked from commit c940a3fb1e2e9b7d03228ab28f375fb5a47ff699) Replace indirect call with CALL_NOSPEC. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ashok Raj <ashok.raj@intel.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Jun Nakajima <jun.nakajima@intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: rga@amazon.de Cc: Dave Hansen <dave.hansen@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lkml.kernel.org/r/20180125095843.645776917@infradead.org Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> [backport to 4.4] Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-16KVM: nVMX: Fix races when sending nested PI while dest enters/leaves L2Liran Alon
commit 6b6977117f50d60455ace86b2d256f6fb4f3de05 upstream. Consider the following scenario: 1. CPU A calls vmx_deliver_nested_posted_interrupt() to send an IPI to CPU B via virtual posted-interrupt mechanism. 2. CPU B is currently executing L2 guest. 3. vmx_deliver_nested_posted_interrupt() calls kvm_vcpu_trigger_posted_interrupt() which will note that vcpu->mode == IN_GUEST_MODE. 4. Assume that before CPU A sends the physical POSTED_INTR_NESTED_VECTOR IPI, CPU B exits from L2 to L0 during event-delivery (valid IDT-vectoring-info). 5. CPU A now sends the physical IPI. The IPI is received in host and it's handler (smp_kvm_posted_intr_nested_ipi()) does nothing. 6. Assume that before CPU A sets pi_pending=true and KVM_REQ_EVENT, CPU B continues to run in L0 and reach vcpu_enter_guest(). As KVM_REQ_EVENT is not set yet, vcpu_enter_guest() will continue and resume L2 guest. 7. At this point, CPU A sets pi_pending=true and KVM_REQ_EVENT but it's too late! CPU B already entered L2 and KVM_REQ_EVENT will only be consumed at next L2 entry! Another scenario to consider: 1. CPU A calls vmx_deliver_nested_posted_interrupt() to send an IPI to CPU B via virtual posted-interrupt mechanism. 2. Assume that before CPU A calls kvm_vcpu_trigger_posted_interrupt(), CPU B is at L0 and is about to resume into L2. Further assume that it is in vcpu_enter_guest() after check for KVM_REQ_EVENT. 3. At this point, CPU A calls kvm_vcpu_trigger_posted_interrupt() which will note that vcpu->mode != IN_GUEST_MODE. Therefore, do nothing and return false. Then, will set pi_pending=true and KVM_REQ_EVENT. 4. Now CPU B continue and resumes into L2 guest without processing the posted-interrupt until next L2 entry! To fix both issues, we just need to change vmx_deliver_nested_posted_interrupt() to set pi_pending=true and KVM_REQ_EVENT before calling kvm_vcpu_trigger_posted_interrupt(). It will fix the first scenario by chaging step (6) to note that KVM_REQ_EVENT and pi_pending=true and therefore process nested posted-interrupt. It will fix the second scenario by two possible ways: 1. If kvm_vcpu_trigger_posted_interrupt() is called while CPU B has changed vcpu->mode to IN_GUEST_MODE, physical IPI will be sent and will be received when CPU resumes into L2. 2. If kvm_vcpu_trigger_posted_interrupt() is called while CPU B hasn't yet changed vcpu->mode to IN_GUEST_MODE, then after CPU B will change vcpu->mode it will call kvm_request_pending() which will return true and therefore force another round of vcpu_enter_guest() which will note that KVM_REQ_EVENT and pi_pending=true and therefore process nested posted-interrupt. Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing") Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> [Add kvm_vcpu_kick to also handle the case where L1 doesn't intercept L2 HLT and L2 executes HLT instruction. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03KVM: VMX: Fix rflags cache during vCPU resetWanpeng Li
[ Upstream commit c37c28730bb031cc8a44a130c2555c0f3efbe2d0 ] Reported by syzkaller: *** Guest State *** CR0: actual=0x0000000080010031, shadow=0x0000000060000010, gh_mask=fffffffffffffff7 CR4: actual=0x0000000000002061, shadow=0x0000000000000000, gh_mask=ffffffffffffe8f1 CR3 = 0x000000002081e000 RSP = 0x000000000000fffa RIP = 0x0000000000000000 RFLAGS=0x00023000 DR7 = 0x00000000000000 ^^^^^^^^^^ ------------[ cut here ]------------ WARNING: CPU: 6 PID: 24431 at /home/kernel/linux/arch/x86/kvm//x86.c:7302 kvm_arch_vcpu_ioctl_run+0x651/0x2ea0 [kvm] CPU: 6 PID: 24431 Comm: reprotest Tainted: G W OE 4.14.0+ #26 RIP: 0010:kvm_arch_vcpu_ioctl_run+0x651/0x2ea0 [kvm] RSP: 0018:ffff880291d179e0 EFLAGS: 00010202 Call Trace: kvm_vcpu_ioctl+0x479/0x880 [kvm] do_vfs_ioctl+0x142/0x9a0 SyS_ioctl+0x74/0x80 entry_SYSCALL_64_fastpath+0x23/0x9a The failed vmentry is triggered by the following beautified testcase: #include <unistd.h> #include <sys/syscall.h> #include <string.h> #include <stdint.h> #include <linux/kvm.h> #include <fcntl.h> #include <sys/ioctl.h> long r[5]; int main() { struct kvm_debugregs dr = { 0 }; r[2] = open("/dev/kvm", O_RDONLY); r[3] = ioctl(r[2], KVM_CREATE_VM, 0); r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7); struct kvm_guest_debug debug = { .control = 0xf0403, .arch = { .debugreg[6] = 0x2, .debugreg[7] = 0x2 } }; ioctl(r[4], KVM_SET_GUEST_DEBUG, &debug); ioctl(r[4], KVM_RUN, 0); } which testcase tries to setup the processor specific debug registers and configure vCPU for handling guest debug events through KVM_SET_GUEST_DEBUG. The KVM_SET_GUEST_DEBUG ioctl will get and set rflags in order to set TF bit if single step is needed. All regs' caches are reset to avail and GUEST_RFLAGS vmcs field is reset to 0x2 during vCPU reset. However, the cache of rflags is not reset during vCPU reset. The function vmx_get_rflags() returns an unreset rflags cache value since the cache is marked avail, it is 0 after boot. Vmentry fails if the rflags reserved bit 1 is 0. This patch fixes it by resetting both the GUEST_RFLAGS vmcs field and its cache to 0x2 during vCPU reset. Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03KVM: x86: Don't re-execute instruction when not passing CR2 valueLiran Alon
[ Upstream commit 9b8ae63798cb97e785a667ff27e43fa6220cb734 ] In case of instruction-decode failure or emulation failure, x86_emulate_instruction() will call reexecute_instruction() which will attempt to use the cr2 value passed to x86_emulate_instruction(). However, when x86_emulate_instruction() is called from emulate_instruction(), cr2 is not passed (passed as 0) and therefore it doesn't make sense to execute reexecute_instruction() logic at all. Fixes: 51d8b66199e9 ("KVM: cleanup emulate_instruction") Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23x86/retpoline: Fill return stack buffer on vmexitDavid Woodhouse
commit 117cc7a908c83697b0b737d15ae1eb5943afe35b upstream. In accordance with the Intel and AMD documentation, we need to overwrite all entries in the RSB on exiting a guest, to prevent malicious branch target predictions from affecting the host kernel. This is needed both for retpoline and for IBRS. [ak: numbers again for the RSB stuffing labels] Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: gnomes@lxorguk.ukuu.org.uk Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: thomas.lendacky@amd.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kees Cook <keescook@google.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org> Cc: Paul Turner <pjt@google.com> Link: https://lkml.kernel.org/r/1515755487-8524-1-git-send-email-dwmw@amazon.co.uk Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Razvan Ghitulete <rga@amazon.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17KVM: x86: Add memory barrier on vmcs field lookupAndrew Honig
commit 75f139aaf896d6fdeec2e468ddfa4b2fe469bf40 upstream. This adds a memory barrier when performing a lookup into the vmcs_field_to_offset_table. This is related to CVE-2017-5753. Signed-off-by: Andrew Honig <ahonig@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17kvm: vmx: Scrub hardware GPRs at VM-exitJim Mattson
commit 0cb5b30698fdc8f6b4646012e3acb4ddce430788 upstream. Guest GPR values are live in the hardware GPRs at VM-exit. Do not leave any guest values in hardware GPRs after the guest GPR values are saved to the vcpu_vmx structure. This is a partial mitigation for CVE 2017-5715 and CVE 2017-5753. Specifically, it defeats the Project Zero PoC for CVE 2017-5715. Suggested-by: Eric Northup <digitaleric@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Eric Northup <digitaleric@google.com> Reviewed-by: Benjamin Serebrin <serebrin@google.com> Reviewed-by: Andrew Honig <ahonig@google.com> [Paolo: Add AMD bits, Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25KVM: VMX: Fix enable VPID conditionsWanpeng Li
[ Upstream commit 08d839c4b134b8328ec42f2157a9ca4b93227c03 ] This can be reproduced by running L2 on L1, and disable VPID on L0 if w/o commit "KVM: nVMX: Fix nested VPID vmx exec control", the L2 crash as below: KVM: entry failed, hardware error 0x7 EAX=00000000 EBX=00000000 ECX=00000000 EDX=000306c3 ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000 EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES =0000 00000000 0000ffff 00009300 CS =f000 ffff0000 0000ffff 00009b00 SS =0000 00000000 0000ffff 00009300 DS =0000 00000000 0000ffff 00009300 FS =0000 00000000 0000ffff 00009300 GS =0000 00000000 0000ffff 00009300 LDT=0000 00000000 0000ffff 00008200 TR =0000 00000000 0000ffff 00008b00 GDT= 00000000 0000ffff IDT= 00000000 0000ffff CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000 DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 DR6=00000000ffff0ff0 DR7=0000000000000400 EFER=0000000000000000 Reference SDM 30.3 INVVPID: Protected Mode Exceptions - #UD - If not in VMX operation. - If the logical processor does not support VPIDs (IA32_VMX_PROCBASED_CTLS2[37]=0). - If the logical processor supports VPIDs (IA32_VMX_PROCBASED_CTLS2[37]=1) but does not support the INVVPID instruction (IA32_VMX_EPT_VPID_CAP[32]=0). So we should check both VPID enable bit in vmx exec control and INVVPID support bit in vmx capability MSRs to enable VPID. This patch adds the guarantee to not enable VPID if either INVVPID or single-context/all-context invalidation is not exposed in vmx capability MSRs. Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-16KVM: nVMX: reset nested_run_pending if the vCPU is going to be resetWanpeng Li
[ Upstream commit 2f707d97982286b307ef2a9b034e19aabc1abb56 ] Reported by syzkaller: WARNING: CPU: 1 PID: 27742 at arch/x86/kvm/vmx.c:11029 nested_vmx_vmexit+0x5c35/0x74d0 arch/x86/kvm/vmx.c:11029 CPU: 1 PID: 27742 Comm: a.out Not tainted 4.10.0+ #229 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:15 [inline] dump_stack+0x2ee/0x3ef lib/dump_stack.c:51 panic+0x1fb/0x412 kernel/panic.c:179 __warn+0x1c4/0x1e0 kernel/panic.c:540 warn_slowpath_null+0x2c/0x40 kernel/panic.c:583 nested_vmx_vmexit+0x5c35/0x74d0 arch/x86/kvm/vmx.c:11029 vmx_leave_nested arch/x86/kvm/vmx.c:11136 [inline] vmx_set_msr+0x1565/0x1910 arch/x86/kvm/vmx.c:3324 kvm_set_msr+0xd4/0x170 arch/x86/kvm/x86.c:1099 do_set_msr+0x11e/0x190 arch/x86/kvm/x86.c:1128 __msr_io arch/x86/kvm/x86.c:2577 [inline] msr_io+0x24b/0x450 arch/x86/kvm/x86.c:2614 kvm_arch_vcpu_ioctl+0x35b/0x46a0 arch/x86/kvm/x86.c:3497 kvm_vcpu_ioctl+0x232/0x1120 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2721 vfs_ioctl fs/ioctl.c:43 [inline] do_vfs_ioctl+0x1bf/0x1790 fs/ioctl.c:683 SYSC_ioctl fs/ioctl.c:698 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:689 entry_SYSCALL_64_fastpath+0x1f/0xc2 The syzkaller folks reported a nested_run_pending warning during userspace clear VMX capability which is exposed to L1 before. The warning gets thrown while doing (*(uint32_t*)0x20aecfe8 = (uint32_t)0x1); (*(uint32_t*)0x20aecfec = (uint32_t)0x0); (*(uint32_t*)0x20aecff0 = (uint32_t)0x3a); (*(uint32_t*)0x20aecff4 = (uint32_t)0x0); (*(uint64_t*)0x20aecff8 = (uint64_t)0x0); r[29] = syscall(__NR_ioctl, r[4], 0x4008ae89ul, 0x20aecfe8ul, 0, 0, 0, 0, 0, 0); i.e. KVM_SET_MSR ioctl with struct kvm_msrs { .nmsrs = 1, .pad = 0, .entries = { {.index = MSR_IA32_FEATURE_CONTROL, .reserved = 0, .data = 0} } } The VMLANCH/VMRESUME emulation should be stopped since the CPU is going to reset here. This patch resets the nested_run_pending since the CPU is going to be reset hence there should be nothing pending. Reported-by: Dmitry Vyukov <dvyukov@google.com> Suggested-by: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-16kvm: nVMX: VMCLEAR should not cause the vCPU to shut downJim Mattson
[ Upstream commit 587d7e72aedca91cee80c0a56811649c3efab765 ] VMCLEAR should silently ignore a failure to clear the launch state of the VMCS referenced by the operand. Signed-off-by: Jim Mattson <jmattson@google.com> [Changed "kvm_write_guest(vcpu->kvm" to "kvm_vcpu_write_guest(vcpu".] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-16KVM: VMX: remove I/O port 0x80 bypass on Intel hostsAndrew Honig
commit d59d51f088014f25c2562de59b9abff4f42a7468 upstream. This fixes CVE-2017-1000407. KVM allows guests to directly access I/O port 0x80 on Intel hosts. If the guest floods this port with writes it generates exceptions and instability in the host kernel, leading to a crash. With this change guest writes to port 0x80 on Intel will behave the same as they currently behave on AMD systems. Prevent the flooding by removing the code that sets port 0x80 as a passthrough port. This is essentially the same as upstream patch 99f85a28a78e96d28907fe036e1671a218fee597, except that patch was for AMD chipsets and this patch is for Intel. Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Fixes: fdef3ad1b386 ("KVM: VMX: Enable io bitmaps to avoid IO port 0x80 VMEXITs") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-05KVM: x86: Exit to user-mode on #UD intercept when emulator requiresLiran Alon
commit 61cb57c9ed631c95b54f8e9090c89d18b3695b3c upstream. Instruction emulation after trapping a #UD exception can result in an MMIO access, for example when emulating a MOVBE on a processor that doesn't support the instruction. In this case, the #UD vmexit handler must exit to user mode, but there wasn't any code to do so. Add it for both VMX and SVM. Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30KVM: nVMX: set IDTR and GDTR limits when loading L1 host stateLadi Prosek
commit 21f2d551183847bc7fbe8d866151d00cdad18752 upstream. Intel SDM 27.5.2 Loading Host Segment and Descriptor-Table Registers: "The GDTR and IDTR limits are each set to FFFFH." Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-18KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exitHaozhong Zhang
commit 8eb3f87d903168bdbd1222776a6b1e281f50513e upstream. When KVM emulates an exit from L2 to L1, it loads L1 CR4 into the guest CR4. Before this CR4 loading, the guest CR4 refers to L2 CR4. Because these two CR4's are in different levels of guest, we should vmx_set_cr4() rather than kvm_set_cr4() here. The latter, which is used to handle guest writes to its CR4, checks the guest change to CR4 and may fail if the change is invalid. The failure may cause trouble. Consider we start a L1 guest with non-zero L1 PCID in use, (i.e. L1 CR4.PCIDE == 1 && L1 CR3.PCID != 0) and a L2 guest with L2 PCID disabled, (i.e. L2 CR4.PCIDE == 0) and following events may happen: 1. If kvm_set_cr4() is used in load_vmcs12_host_state() to load L1 CR4 into guest CR4 (in VMCS01) for L2 to L1 exit, it will fail because of PCID check. As a result, the guest CR4 recorded in L0 KVM (i.e. vcpu->arch.cr4) is left to the value of L2 CR4. 2. Later, if L1 attempts to change its CR4, e.g., clearing VMXE bit, kvm_set_cr4() in L0 KVM will think L1 also wants to enable PCID, because the wrong L2 CR4 is used by L0 KVM as L1 CR4. As L1 CR3.PCID != 0, L0 KVM will inject GP to L1 guest. Fixes: 4704d0befb072 ("KVM: nVMX: Exiting from L2 to L1") Cc: qemu-stable@nongnu.org Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05KVM: VMX: use cmpxchg64Paolo Bonzini
commit c0a1666bcb2a33e84187a15eabdcd54056be9a97 upstream. This fixes a compilation failure on 32-bit systems. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05KVM: VMX: remove WARN_ON_ONCE in kvm_vcpu_trigger_posted_interruptHaozhong Zhang
commit 5753743fa5108b8f98bd61e40dc63f641b26c768 upstream. WARN_ON_ONCE(pi_test_sn(&vmx->pi_desc)) in kvm_vcpu_trigger_posted_interrupt() intends to detect the violation of invariant that VT-d PI notification event is not suppressed when vcpu is in the guest mode. Because the two checks for the target vcpu mode and the target suppress field cannot be performed atomically, the target vcpu mode may change in between. If that does happen, WARN_ON_ONCE() here may raise false alarms. As the previous patch fixed the real invariant breaker, remove this WARN_ON_ONCE() to avoid false alarms, and document the allowed cases instead. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reported-by: "Ramamurthy, Venkatesh" <venkatesh.ramamurthy@intel.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Fixes: 28b835d60fcc ("KVM: Update Posted-Interrupts Descriptor when vCPU is preempted") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05KVM: VMX: do not change SN bit in vmx_update_pi_irte()Haozhong Zhang
commit dc91f2eb1a4021eb6705c15e474942f84ab9b211 upstream. In kvm_vcpu_trigger_posted_interrupt() and pi_pre_block(), KVM assumes that PI notification events should not be suppressed when the target vCPU is not blocked. vmx_update_pi_irte() sets the SN field before changing an interrupt from posting to remapping, but it does not check the vCPU mode. Therefore, the change of SN field may break above the assumption. Besides, I don't see reasons to suppress notification events here, so remove the changes of SN field to avoid race condition. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reported-by: "Ramamurthy, Venkatesh" <venkatesh.ramamurthy@intel.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Fixes: 28b835d60fcc ("KVM: Update Posted-Interrupts Descriptor when vCPU is preempted") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05kvm: nVMX: Don't allow L2 to access the hardware CR8Jim Mattson
commit 51aa68e7d57e3217192d88ce90fd5b8ef29ec94f upstream. If L1 does not specify the "use TPR shadow" VM-execution control in vmcs12, then L0 must specify the "CR8-load exiting" and "CR8-store exiting" VM-execution controls in vmcs02. Failure to do so will give the L2 VM unrestricted read/write access to the hardware CR8. This fixes CVE-2017-12154. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05KVM: VMX: Do not BUG() on out-of-bounds guest IRQJan H. Schönherr
commit 3a8b0677fc6180a467e26cc32ce6b0c09a32f9bb upstream. The value of the guest_irq argument to vmx_update_pi_irte() is ultimately coming from a KVM_IRQFD API call. Do not BUG() in vmx_update_pi_irte() if the value is out-of bounds. (Especially, since KVM as a whole seems to hang after that.) Instead, print a message only once if we find that we don't have a route for a certain IRQ (which can be out-of-bounds or within the array). This fixes CVE-2017-1000252. Fixes: efc644048ecde54 ("KVM: x86: Update IRTE for posted-interrupts") Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21kvm: vmx: allow host to access guest MSR_IA32_BNDCFGSHaozhong Zhang
commit 691bd4340bef49cf7e5855d06cf24444b5bf2d85 upstream. It's easier for host applications, such as QEMU, if they can always access guest MSR_IA32_BNDCFGS in VMCS, even though MPX is disabled in guest cpuid. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>