summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-01-05Linux 4.4.110v4.4.110Greg Kroah-Hartman
2018-01-05kaiser: Set _PAGE_NX only if supportedGuenter Roeck
This resolves a crash if loaded under qemu + haxm under windows. See https://www.spinics.net/lists/kernel/msg2689835.html for details. Here is a boot log (the log is from chromeos-4.4, but Tao Wu says that the same log is also seen with vanilla v4.4.110-rc1). [ 0.712750] Freeing unused kernel memory: 552K [ 0.721821] init: Corrupted page table at address 57b029b332e0 [ 0.722761] PGD 80000000bb238067 PUD bc36a067 PMD bc369067 PTE 45d2067 [ 0.722761] Bad pagetable: 000b [#1] PREEMPT SMP [ 0.722761] Modules linked in: [ 0.722761] CPU: 1 PID: 1 Comm: init Not tainted 4.4.96 #31 [ 0.722761] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014 [ 0.722761] task: ffff8800bc290000 ti: ffff8800bc28c000 task.ti: ffff8800bc28c000 [ 0.722761] RIP: 0010:[<ffffffff83f4129e>] [<ffffffff83f4129e>] __clear_user+0x42/0x67 [ 0.722761] RSP: 0000:ffff8800bc28fcf8 EFLAGS: 00010202 [ 0.722761] RAX: 0000000000000000 RBX: 00000000000001a4 RCX: 00000000000001a4 [ 0.722761] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000057b029b332e0 [ 0.722761] RBP: ffff8800bc28fd08 R08: ffff8800bc290000 R09: ffff8800bb2f4000 [ 0.722761] R10: ffff8800bc290000 R11: ffff8800bb2f4000 R12: 000057b029b332e0 [ 0.722761] R13: 0000000000000000 R14: 000057b029b33340 R15: ffff8800bb1e2a00 [ 0.722761] FS: 0000000000000000(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000 [ 0.722761] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 0.722761] CR2: 000057b029b332e0 CR3: 00000000bb2f8000 CR4: 00000000000006e0 [ 0.722761] Stack: [ 0.722761] 000057b029b332e0 ffff8800bb95fa80 ffff8800bc28fd18 ffffffff83f4120c [ 0.722761] ffff8800bc28fe18 ffffffff83e9e7a1 ffff8800bc28fd68 0000000000000000 [ 0.722761] ffff8800bc290000 ffff8800bc290000 ffff8800bc290000 ffff8800bc290000 [ 0.722761] Call Trace: [ 0.722761] [<ffffffff83f4120c>] clear_user+0x2e/0x30 [ 0.722761] [<ffffffff83e9e7a1>] load_elf_binary+0xa7f/0x18f7 [ 0.722761] [<ffffffff83de2088>] search_binary_handler+0x86/0x19c [ 0.722761] [<ffffffff83de389e>] do_execveat_common.isra.26+0x909/0xf98 [ 0.722761] [<ffffffff844febe0>] ? rest_init+0x87/0x87 [ 0.722761] [<ffffffff83de40be>] do_execve+0x23/0x25 [ 0.722761] [<ffffffff83c002e3>] run_init_process+0x2b/0x2d [ 0.722761] [<ffffffff844fec4d>] kernel_init+0x6d/0xda [ 0.722761] [<ffffffff84505b2f>] ret_from_fork+0x3f/0x70 [ 0.722761] [<ffffffff844febe0>] ? rest_init+0x87/0x87 [ 0.722761] Code: 86 84 be 12 00 00 00 e8 87 0d e8 ff 66 66 90 48 89 d8 48 c1 eb 03 4c 89 e7 83 e0 07 48 89 d9 be 08 00 00 00 31 d2 48 85 c9 74 0a <48> 89 17 48 01 f7 ff c9 75 f6 48 89 c1 85 c9 74 09 88 17 48 ff [ 0.722761] RIP [<ffffffff83f4129e>] __clear_user+0x42/0x67 [ 0.722761] RSP <ffff8800bc28fcf8> [ 0.722761] ---[ end trace def703879b4ff090 ]--- [ 0.722761] BUG: sleeping function called from invalid context at /mnt/host/source/src/third_party/kernel/v4.4/kernel/locking/rwsem.c:21 [ 0.722761] in_atomic(): 0, irqs_disabled(): 1, pid: 1, name: init [ 0.722761] CPU: 1 PID: 1 Comm: init Tainted: G D 4.4.96 #31 [ 0.722761] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014 [ 0.722761] 0000000000000086 dcb5d76098c89836 ffff8800bc28fa30 ffffffff83f34004 [ 0.722761] ffffffff84839dc2 0000000000000015 ffff8800bc28fa40 ffffffff83d57dc9 [ 0.722761] ffff8800bc28fa68 ffffffff83d57e6a ffffffff84a53640 0000000000000000 [ 0.722761] Call Trace: [ 0.722761] [<ffffffff83f34004>] dump_stack+0x4d/0x63 [ 0.722761] [<ffffffff83d57dc9>] ___might_sleep+0x13a/0x13c [ 0.722761] [<ffffffff83d57e6a>] __might_sleep+0x9f/0xa6 [ 0.722761] [<ffffffff84502788>] down_read+0x20/0x31 [ 0.722761] [<ffffffff83cc5d9b>] __blocking_notifier_call_chain+0x35/0x63 [ 0.722761] [<ffffffff83cc5ddd>] blocking_notifier_call_chain+0x14/0x16 [ 0.800374] usb 1-1: new full-speed USB device number 2 using uhci_hcd [ 0.722761] [<ffffffff83cefe97>] profile_task_exit+0x1a/0x1c [ 0.802309] [<ffffffff83cac84e>] do_exit+0x39/0xe7f [ 0.802309] [<ffffffff83ce5938>] ? vprintk_default+0x1d/0x1f [ 0.802309] [<ffffffff83d7bb95>] ? printk+0x57/0x73 [ 0.802309] [<ffffffff83c46e25>] oops_end+0x80/0x85 [ 0.802309] [<ffffffff83c7b747>] pgtable_bad+0x8a/0x95 [ 0.802309] [<ffffffff83ca7f4a>] __do_page_fault+0x8c/0x352 [ 0.802309] [<ffffffff83eefba5>] ? file_has_perm+0xc4/0xe5 [ 0.802309] [<ffffffff83ca821c>] do_page_fault+0xc/0xe [ 0.802309] [<ffffffff84507682>] page_fault+0x22/0x30 [ 0.802309] [<ffffffff83f4129e>] ? __clear_user+0x42/0x67 [ 0.802309] [<ffffffff83f4127f>] ? __clear_user+0x23/0x67 [ 0.802309] [<ffffffff83f4120c>] clear_user+0x2e/0x30 [ 0.802309] [<ffffffff83e9e7a1>] load_elf_binary+0xa7f/0x18f7 [ 0.802309] [<ffffffff83de2088>] search_binary_handler+0x86/0x19c [ 0.802309] [<ffffffff83de389e>] do_execveat_common.isra.26+0x909/0xf98 [ 0.802309] [<ffffffff844febe0>] ? rest_init+0x87/0x87 [ 0.802309] [<ffffffff83de40be>] do_execve+0x23/0x25 [ 0.802309] [<ffffffff83c002e3>] run_init_process+0x2b/0x2d [ 0.802309] [<ffffffff844fec4d>] kernel_init+0x6d/0xda [ 0.802309] [<ffffffff84505b2f>] ret_from_fork+0x3f/0x70 [ 0.802309] [<ffffffff844febe0>] ? rest_init+0x87/0x87 [ 0.830559] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 [ 0.830559] [ 0.831305] Kernel Offset: 0x2c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 0.831305] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 The crash part of this problem may be solved with the following patch (thanks to Hugh for the hint). There is still another problem, though - with this patch applied, the qemu session aborts with "VCPU Shutdown request", whatever that means. Cc: lepton <ytht.net@gmail.com> Signed-off-by: Guenter Roeck <groeck@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05x86/kasan: Clear kasan_zero_page after TLB flushAndrey Ryabinin
commit 69e0210fd01ff157d332102219aaf5c26ca8069b upstream. Currently we clear kasan_zero_page before __flush_tlb_all(). This works with current implementation of native_flush_tlb[_global]() because it doesn't cause do any writes to kasan shadow memory. But any subtle change made in native_flush_tlb*() could break this. Also current code seems doesn't work for paravirt guests (lguest). Only after the TLB flush we can be sure that kasan_zero_page is not used as early shadow anymore (instrumented code will not write to it). So it should cleared it only after the TLB flush. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1452516679-32040-2-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Jamie Iles <jamie.iles@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05x86/vdso: Get pvclock data from the vvar VMA instead of the fixmapAndy Lutomirski
commit dac16fba6fc590fa7239676b35ed75dae4c4cd2b upstream. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/9d37826fdc7e2d2809efe31d5345f97186859284.1449702533.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Jamie Iles <jamie.iles@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05x86, vdso, pvclock: Simplify and speed up the vdso pvclock readerAndy Lutomirski
commit 6b078f5de7fc0851af4102493c7b5bb07e49c4cb upstream. The pvclock vdso code was too abstracted to understand easily and excessively paranoid. Simplify it for a huge speedup. This opens the door for additional simplifications, as the vdso no longer accesses the pvti for any vcpu other than vcpu 0. Before, vclock_gettime using kvm-clock took about 45ns on my machine. With this change, it takes 29ns, which is almost as fast as the pure TSC implementation. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/6b51dcc41f1b101f963945c5ec7093d72bdac429.1449702533.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Jamie Iles <jamie.iles@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05KPTI: Report when enabledKees Cook
Make sure dmesg reports when KPTI is enabled. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05KPTI: Rename to PAGE_TABLE_ISOLATIONKees Cook
This renames CONFIG_KAISER to CONFIG_PAGE_TABLE_ISOLATION. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05x86/kaiser: Move feature detection upBorislav Petkov
... before the first use of kaiser_enabled as otherwise funky things happen: about to get started... (XEN) d0v0 Unhandled page fault fault/trap [#14, ec=0000] (XEN) Pagetable walk from ffff88022a449090: (XEN) L4[0x110] = 0000000229e0e067 0000000000001e0e (XEN) L3[0x008] = 0000000000000000 ffffffffffffffff (XEN) domain_crash_sync called from entry.S: fault at ffff82d08033fd08 entry.o#create_bounce_frame+0x135/0x14d (XEN) Domain 0 (vcpu#0) crashed on cpu#0: (XEN) ----[ Xen-4.9.1_02-3.21 x86_64 debug=n Not tainted ]---- (XEN) CPU: 0 (XEN) RIP: e033:[<ffffffff81007460>] (XEN) RFLAGS: 0000000000000286 EM: 1 CONTEXT: pv guest (d0v0) Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: disabled on Xen PVJiri Kosina
Kaiser cannot be used on paravirtualized MMUs (namely reading and writing CR3). This does not work with KAISER as the CR3 switch from and to user space PGD would require to map the whole XEN_PV machinery into both. More importantly, enabling KAISER on Xen PV doesn't make too much sense, as PV guests use distinct %cr3 values for kernel and user already. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05x86/kaiser: Reenable PARAVIRTBorislav Petkov
Now that the required bits have been addressed, reenable PARAVIRT. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05x86/paravirt: Dont patch flush_tlb_singleThomas Gleixner
commit a035795499ca1c2bd1928808d1a156eda1420383 upstream native_flush_tlb_single() will be changed with the upcoming PAGE_TABLE_ISOLATION feature. This requires to have more code in there than INVLPG. Remove the paravirt patching for it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: linux-mm@kvack.org Cc: michael.schwarz@iaik.tugraz.at Cc: moritz.lipp@iaik.tugraz.at Cc: richard.fellner@student.tugraz.at Link: https://lkml.kernel.org/r/20171204150606.828111617@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: kaiser_flush_tlb_on_return_to_user() check PCIDHugh Dickins
Let kaiser_flush_tlb_on_return_to_user() do the X86_FEATURE_PCID check, instead of each caller doing it inline first: nobody needs to optimize for the noPCID case, it's clearer this way, and better suits later changes. Replace those no-op X86_CR3_PCID_KERN_FLUSH lines by a BUILD_BUG_ON() in load_new_mm_cr3(), in case something changes. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: asm/tlbflush.h handle noPGE at lower levelHugh Dickins
I found asm/tlbflush.h too twisty, and think it safer not to avoid __native_flush_tlb_global_irq_disabled() in the kaiser_enabled case, but instead let it handle kaiser_enabled along with cr3: it can just use __native_flush_tlb() for that, no harm in re-disabling preemption. (This is not the same change as Kirill and Dave have suggested for upstream, flipping PGE in cr4: that's neat, but needs a cpu_has_pge check; cr3 is enough for kaiser, and thought to be cheaper than cr4.) Also delete the X86_FEATURE_INVPCID invpcid_flush_all_nonglobals() preference from __native_flush_tlb(): unlike the invpcid_flush_all() preference in __native_flush_tlb_global(), it's not seen in upstream 4.14, and was recently reported to be surprisingly slow. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: drop is_atomic arg to kaiser_pagetable_walk()Hugh Dickins
I have not observed a might_sleep() warning from setup_fixmap_gdt()'s use of kaiser_add_mapping() in our tree (why not?), but like upstream we have not provided a way for that to pass is_atomic true down to kaiser_pagetable_walk(), and at startup it's far from a likely source of trouble: so just delete the walk's is_atomic arg and might_sleep(). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: use ALTERNATIVE instead of x86_cr3_pcid_noflushHugh Dickins
Now that we're playing the ALTERNATIVE game, use that more efficient method: instead of user-mapping an extra page, and reading an extra cacheline each time for x86_cr3_pcid_noflush. Neel has found that __stringify(bts $X86_CR3_PCID_NOFLUSH_BIT, %rax) is a working substitute for the "bts $63, %rax" in these ALTERNATIVEs; but the one line with $63 in looks clearer, so let's stick with that. Worried about what happens with an ALTERNATIVE between the jump and jump label in another ALTERNATIVE? I was, but have checked the combinations in SWITCH_KERNEL_CR3_NO_STACK at entry_SYSCALL_64, and it does a good job. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05x86/kaiser: Check boottime cmdline paramsBorislav Petkov
AMD (and possibly other vendors) are not affected by the leak KAISER is protecting against. Keep the "nopti" for traditional reasons and add pti=<on|off|auto> like upstream. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05x86/kaiser: Rename and simplify X86_FEATURE_KAISER handlingBorislav Petkov
Concentrate it in arch/x86/mm/kaiser.c and use the upstream string "nopti". Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: add "nokaiser" boot option, using ALTERNATIVEHugh Dickins
Added "nokaiser" boot option: an early param like "noinvpcid". Most places now check int kaiser_enabled (#defined 0 when not CONFIG_KAISER) instead of #ifdef CONFIG_KAISER; but entry_64.S and entry_64_compat.S are using the ALTERNATIVE technique, which patches in the preferred instructions at runtime. That technique is tied to x86 cpu features, so X86_FEATURE_KAISER is fabricated. Prior to "nokaiser", Kaiser #defined _PAGE_GLOBAL 0: revert that, but be careful with both _PAGE_GLOBAL and CR4.PGE: setting them when nokaiser like when !CONFIG_KAISER, but not setting either when kaiser - neither matters on its own, but it's hard to be sure that _PAGE_GLOBAL won't get set in some obscure corner, or something add PGE into CR4. By omitting _PAGE_GLOBAL from __supported_pte_mask when kaiser_enabled, all page table setup which uses pte_pfn() masks it out of the ptes. It's slightly shameful that the same declaration versus definition of kaiser_enabled appears in not one, not two, but in three header files (asm/kaiser.h, asm/pgtable.h, asm/tlbflush.h). I felt safer that way, than with #including any of those in any of the others; and did not feel it worth an asm/kaiser_enabled.h - kernel/cpu/common.c includes them all, so we shall hear about it if they get out of synch. Cleanups while in the area: removed the silly #ifdef CONFIG_KAISER from kaiser.c; removed the unused native_get_normal_pgd(); removed the spurious reg clutter from SWITCH_*_CR3 macro stubs; corrected some comments. But more interestingly, set CR4.PSE in secondary_startup_64: the manual is clear that it does not matter whether it's 0 or 1 when 4-level-pts are enabled, but I was distracted to find cr4 different on BSP and auxiliaries - BSP alone was adding PSE, in probe_page_size_mask(). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: fix unlikely error in alloc_ldt_struct()Hugh Dickins
An error from kaiser_add_mapping() here is not at all likely, but Eric Biggers rightly points out that __free_ldt_struct() relies on new_ldt->size being initialized: move that up. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: _pgd_alloc() without __GFP_REPEAT to avoid stallsHugh Dickins
Synthetic filesystem mempressure testing has shown softlockups, with hour-long page allocation stalls, and pgd_alloc() trying for order:1 with __GFP_REPEAT in one of the backtraces each time. That's _pgd_alloc() going for a Kaiser double-pgd, using the __GFP_REPEAT common to all page table allocations, but actually having no effect on order:0 (see should_alloc_oom() and should_continue_reclaim() in this tree, but beware that ports to another tree might behave differently). Order:1 stack allocation has been working satisfactorily without __GFP_REPEAT forever, and page table allocation only asks __GFP_REPEAT for awkward occasions in a long-running process: it's not appropriate at fork or exec time, and seems to be doing much more harm than good: getting those contiguous pages under very heavy mempressure can be hard (though even without it, Kaiser does generate more mempressure). Mask out that __GFP_REPEAT inside _pgd_alloc(). Why not take it out of the PGALLOG_GFP altogether, as v4.7 commit a3a9a59d2067 ("x86: get rid of superfluous __GFP_REPEAT") did? Because I think that might make a difference to our page table memcg charging, which I'd prefer not to interfere with at this time. hughd adds: __alloc_pages_slowpath() in the 4.4.89-stable tree handles __GFP_REPEAT a little differently than in prod kernel or 3.18.72-stable, so it may not always be exactly a no-op on order:0 pages, as said above; but I think still appropriate to omit it from Kaiser or non-Kaiser pgd. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: paranoid_entry pass cr3 need to paranoid_exitHugh Dickins
Neel Natu points out that paranoid_entry() was wrong to assume that an entry that did not need swapgs would not need SWITCH_KERNEL_CR3: paranoid_entry (used for debug breakpoint, int3, double fault or MCE; though I think it's only the MCE case that is cause for concern here) can break in at an awkward time, between cr3 switch and swapgs, but its handling always needs kernel gs and kernel cr3. Easy to fix in itself, but paranoid_entry() also needs to convey to paranoid_exit() (and my reading of macro idtentry says paranoid_entry and paranoid_exit are always paired) how to restore the prior state. The swapgs state is already conveyed by %ebx (0 or 1), so extend that also to convey when SWITCH_USER_CR3 will be needed (2 or 3). (Yes, I'd much prefer that 0 meant no swapgs, whereas it's the other way round: and a convention shared with error_entry() and error_exit(), which I don't want to touch. Perhaps I should have inverted the bit for switch cr3 too, but did not.) paranoid_exit() would be straightforward, except for TRACE_IRQS: it did TRACE_IRQS_IRETQ when doing swapgs, but TRACE_IRQS_IRETQ_DEBUG when not: which is it supposed to use when SWITCH_USER_CR3 is split apart from that? As best as I can determine, commit 5963e317b1e9 ("ftrace/x86: Do not change stacks in DEBUG when calling lockdep") missed the swapgs case, and should have used TRACE_IRQS_IRETQ_DEBUG there too (the discrepancy has nothing to do with the liberal use of _NO_STACK and _UNSAFE_STACK hereabouts: TRACE_IRQS_OFF_DEBUG has just been used in all cases); discrepancy lovingly preserved across several paranoid_exit() cleanups, but I'm now removing it. Neel further indicates that to use SWITCH_USER_CR3_NO_STACK there in paranoid_exit() is now not only unnecessary but unsafe: might corrupt syscall entry's unsafe_stack_register_backup of %rax. Just use SWITCH_USER_CR3: and delete SWITCH_USER_CR3_NO_STACK altogether, before we make the mistake of using it again. hughd adds: this commit fixes an issue in the Kaiser-without-PCIDs part of the series, and ought to be moved earlier, if you decided to make a release of Kaiser-without-PCIDs. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: x86_cr3_pcid_noflush and x86_cr3_pcid_userHugh Dickins
Mostly this commit is just unshouting X86_CR3_PCID_KERN_VAR and X86_CR3_PCID_USER_VAR: we usually name variables in lower-case. But why does x86_cr3_pcid_noflush need to be __aligned(PAGE_SIZE)? Ah, it's a leftover from when kaiser_add_user_map() once complained about mapping the same page twice. Make it __read_mostly instead. (I'm a little uneasy about all the unrelated data which shares its page getting user-mapped too, but that was so before, and not a big deal: though we call it user-mapped, it's not mapped with _PAGE_USER.) And there is a little change around the two calls to do_nmi(). Previously they set the NOFLUSH bit (if PCID supported) when forcing to kernel context before do_nmi(); now they also have the NOFLUSH bit set (if PCID supported) when restoring context after: nothing done in do_nmi() should require a TLB to be flushed here. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: PCID 0 for kernel and 128 for userHugh Dickins
Why was 4 chosen for kernel PCID and 6 for user PCID? No good reason in a backport where PCIDs are only used for Kaiser. If we continue with those, then we shall need to add Andy Lutomirski's 4.13 commit 6c690ee1039b ("x86/mm: Split read_cr3() into read_cr3_pa() and __read_cr3()"), which deals with the problem of read_cr3() callers finding stray bits in the cr3 that they expected to be page-aligned; and for hibernation, his 4.14 commit f34902c5c6c0 ("x86/hibernate/64: Mask off CR3's PCID bits in the saved CR3"). But if 0 is used for kernel PCID, then there's no need to add in those commits - whenever the kernel looks, it sees 0 in the lower bits; and 0 for kernel seems an obvious choice. And I naughtily propose 128 for user PCID. Because there's a place in _SWITCH_TO_USER_CR3 where it takes note of the need for TLB FLUSH, but needs to reset that to NOFLUSH for the next occasion. Currently it does so with a "movb $(0x80)" into the high byte of the per-cpu quadword, but that will cause a machine without PCID support to crash. Now, if %al just happened to have 0x80 in it at that point, on a machine with PCID support, but 0 on a machine without PCID support... (That will go badly wrong once the pgd can be at a physical address above 2^56, but even with 5-level paging, physical goes up to 2^52.) Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: load_new_mm_cr3() let SWITCH_USER_CR3 flush userHugh Dickins
We have many machines (Westmere, Sandybridge, Ivybridge) supporting PCID but not INVPCID: on these load_new_mm_cr3() simply crashed. Flushing user context inside load_new_mm_cr3() without the use of invpcid is difficult: momentarily switch from kernel to user context and back to do so? I'm not sure whether that can be safely done at all, and would risk polluting user context with kernel internals, and kernel context with stale user externals. Instead, follow the hint in the comment that was there: change X86_CR3_PCID_USER_VAR to be a per-cpu variable, then load_new_mm_cr3() can leave a note in it, for SWITCH_USER_CR3 on return to userspace to flush user context TLB, instead of default X86_CR3_PCID_USER_NOFLUSH. Which works well enough that there's no need to do it this way only when invpcid is unsupported: it's a good alternative to invpcid here. But there's a couple of inlines in asm/tlbflush.h that need to do the same trick, so it's best to localize all this per-cpu business in mm/kaiser.c: moving that part of the initialization from setup_pcid() to kaiser_setup_pcid(); with kaiser_flush_tlb_on_return_to_user() the function for noting an X86_CR3_PCID_USER_FLUSH. And let's keep a KAISER_SHADOW_PGD_OFFSET in there, to avoid the extra OR on exit. I did try to make the feature tests in asm/tlbflush.h more consistent with each other: there seem to be far too many ways of performing such tests, and I don't have a good grasp of their differences. At first I converted them all to be static_cpu_has(): but that proved to be a mistake, as the comment in __native_flush_tlb_single() hints; so then I reversed and made them all this_cpu_has(). Probably all gratuitous change, but that's the way it's working at present. I am slightly bothered by the way non-per-cpu X86_CR3_PCID_KERN_VAR gets re-initialized by each cpu (before and after these changes): no problem when (as usual) all cpus on a machine have the same features, but in principle incorrect. However, my experiment to per-cpu-ify that one did not end well... Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: enhanced by kernel and user PCIDsDave Hansen
Merged performance improvements to Kaiser, using distinct kernel and user Process Context Identifiers to minimize the TLB flushing. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: vmstat show NR_KAISERTABLE as nr_overheadHugh Dickins
The kaiser update made an interesting choice, never to free any shadow page tables. Contention on global spinlock was worrying, particularly with it held across page table scans when freeing. Something had to be done: I was going to add refcounting; but simply never to free them is an appealing choice, minimizing contention without complicating the code (the more a page table is found already, the less the spinlock is used). But leaking pages in this way is also a worry: can we get away with it? At the very least, we need a count to show how bad it actually gets: in principle, one might end up wasting about 1/256 of memory that way (1/512 for when direct-mapped pages have to be user-mapped, plus 1/512 for when they are user-mapped from the vmalloc area on another occasion (but we don't have vmalloc'ed stacks, so only large ldts are vmalloc'ed). Add per-cpu stat NR_KAISERTABLE: including 256 at startup for the shared pgd entries, and 1 for each intermediate page table added thereafter for user-mapping - but leave out the 1 per mm, for its shadow pgd, because that distracts from the monotonic increase. Shown in /proc/vmstat as nr_overhead (0 if kaiser not enabled). In practice, it doesn't look so bad so far: more like 1/12000 after nine hours of gtests below; and movable pageblock segregation should tend to cluster the kaiser tables into a subset of the address space (if not, they will be bad for compaction too). But production may tell a different story: keep an eye on this number, and bring back lighter freeing if it gets out of control (maybe a shrinker). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: delete KAISER_REAL_SWITCH optionHugh Dickins
We fail to see what CONFIG_KAISER_REAL_SWITCH is for: it seems to be left over from early development, and now just obscures tricky parts of the code. Delete it before adding PCIDs, or nokaiser boot option. (Or if there is some good reason to keep the option, then it needs a help text - and a "depends on KAISER", so that all those without KAISER are not asked the question.) Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: name that 0x1000 KAISER_SHADOW_PGD_OFFSETHugh Dickins
There's a 0x1000 in various places, which looks better with a name. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: cleanups while trying for gold linkHugh Dickins
While trying to get our gold link to work, four cleanups: matched the gdt_page declaration to its definition; in fiddling unsuccessfully with PERCPU_INPUT(), lined up backslashes; lined up the backslashes according to convention in percpu-defs.h; deleted the unused irq_stack_pointer addition to irq_stack_union. Sad to report that aligning backslashes does not appear to help gold align to 8192: but while these did not help, they are worth keeping. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: kaiser_remove_mapping() move along the pgdHugh Dickins
When removing the bogus comment from kaiser_remove_mapping(), I really ought to have checked the extent of its bogosity: as Neel points out, there is nothing to stop unmap_pud_range_nofree() from continuing beyond the end of a pud (and starting in the wrong position on the next). Fix kaiser_remove_mapping() to constrain the extent and advance pgd pointer correctly: use pgd_addr_end() macro as used throughout base mm (but don't assume page-rounded start and size in this case). But this bug was very unlikely to trigger in this backport: since any buddy allocation is contained within a single pud extent, and we are not using vmapped stacks (and are only mapping one page of stack anyway): the only way to hit this bug here would be when freeing a large modified ldt. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: tidied up kaiser_add/remove_mapping slightlyHugh Dickins
Yes, unmap_pud_range_nofree()'s declaration ought to be in a header file really, but I'm not sure we want to use it anyway: so for now just declare it inside kaiser_remove_mapping(). And there doesn't seem to be such a thing as unmap_p4d_range(), even in a 5-level paging tree. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: tidied up asm/kaiser.h somewhatHugh Dickins
Mainly deleting a surfeit of blank lines, and reflowing header comment. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: ENOMEM if kaiser_pagetable_walk() NULLHugh Dickins
kaiser_add_user_map() took no notice when kaiser_pagetable_walk() failed. And avoid its might_sleep() when atomic (though atomic at present unused). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: fix perf crashesHugh Dickins
Avoid perf crashes: place debug_store in the user-mapped per-cpu area instead of allocating, and use page allocator plus kaiser_add_mapping() to keep the BTS and PEBS buffers user-mapped (that is, present in the user mapping, though visible only to kernel and hardware). The PEBS fixup buffer does not need this treatment. The need for a user-mapped struct debug_store showed up before doing any conscious perf testing: in a couple of kernel paging oopses on Westmere, implicating the debug_store offset of the per-cpu area. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: fix regs to do_nmi() ifndef CONFIG_KAISERHugh Dickins
pjt has observed that nmi's second (nmi_from_kernel) call to do_nmi() adjusted the %rdi regs arg, rightly when CONFIG_KAISER, but wrongly when not CONFIG_KAISER. Although the minimal change is to add an #ifdef CONFIG_KAISER around the addq line, that looks cluttered, and I prefer how the first call to do_nmi() handled it: prepare args in %rdi and %rsi before getting into the CONFIG_KAISER block, since it does not touch them at all. And while we're here, place the "#ifdef CONFIG_KAISER" that follows each, to enclose the "Unconditionally restore CR3" comment: matching how the "Unconditionally use kernel CR3" comment above is enclosed. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: KAISER depends on SMPHugh Dickins
It is absurd that KAISER should depend on SMP, but apparently nobody has tried a UP build before: which breaks on implicit declaration of function 'per_cpu_offset' in arch/x86/mm/kaiser.c. Now, you would expect that to be trivially fixed up; but looking at the System.map when that block is #ifdef'ed out of kaiser_init(), I see that in a UP build __per_cpu_user_mapped_end is precisely at __per_cpu_user_mapped_start, and the items carefully gathered into that section for user-mapping on SMP, dispersed elsewhere on UP. So, some other kind of section assignment will be needed on UP, but implementing that is not a priority: just make KAISER depend on SMP for now. Also inserted a blank line before the option, tidied up the brief Kconfig help message, and added an "If unsure, Y". Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: fix build and FIXME in alloc_ldt_struct()Hugh Dickins
Include linux/kaiser.h instead of asm/kaiser.h to build ldt.c without CONFIG_KAISER. kaiser_add_mapping() does already return an error code, so fix the FIXME. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: stack map PAGE_SIZE at THREAD_SIZE-PAGE_SIZEHugh Dickins
Kaiser only needs to map one page of the stack; and kernel/fork.c did not build on powerpc (no __PAGE_KERNEL). It's all cleaner if linux/kaiser.h provides kaiser_map_thread_stack() and kaiser_unmap_thread_stack() wrappers around asm/kaiser.h's kaiser_add_mapping() and kaiser_remove_mapping(). And use linux/kaiser.h in init/main.c to avoid the #ifdefs there. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: do not set _PAGE_NX on pgd_noneHugh Dickins
native_pgd_clear() uses native_set_pgd(), so native_set_pgd() must avoid setting the _PAGE_NX bit on an otherwise pgd_none() entry: usually that just generated a warning on exit, but sometimes more mysterious and damaging failures (our production machines could not complete booting). The original fix to this just avoided adding _PAGE_NX to an empty entry; but eventually more problems surfaced with kexec, and EFI mapping expected to be a problem too. So now instead change native_set_pgd() to update shadow only if _PAGE_USER: A few places (kernel/machine_kexec_64.c, platform/efi/efi_64.c for sure) use set_pgd() to set up a temporary internal virtual address space, with physical pages remapped at what Kaiser regards as userspace addresses: Kaiser then assumes a shadow pgd follows, which it will try to corrupt. This appears to be responsible for the recent kexec and kdump failures; though it's unclear how those did not manifest as a problem before. Ah, the shadow pgd will only be assumed to "follow" if the requested pgd is on an even-numbered page: so I suppose it was going wrong 50% of the time all along. What we need is a flag to set_pgd(), to tell it we're dealing with userspace. Er, isn't that what the pgd's _PAGE_USER bit is saying? Add a test for that. But we cannot do the same for pgd_clear() (which may be called to clear corrupted entries - set aside the question of "corrupt in which pgd?" until later), so there just rely on pgd_clear() not being called in the problematic cases - with a WARN_ON_ONCE() which should fire half the time if it is. But this is getting too big for an inline function: move it into arch/x86/mm/kaiser.c (which then demands a boot/compressed mod); and de-void and de-space native_get_shadow/normal_pgd() while here. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: merged updateDave Hansen
Merged fixes and cleanups, rebased to 4.4.89 tree (no 5-level paging). Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05KAISER: Kernel Address IsolationRichard Fellner
This patch introduces our implementation of KAISER (Kernel Address Isolation to have Side-channels Efficiently Removed), a kernel isolation technique to close hardware side channels on kernel address information. More information about the patch can be found on: https://github.com/IAIK/KAISER From: Richard Fellner <richard.fellner@student.tugraz.at> From: Daniel Gruss <daniel.gruss@iaik.tugraz.at> X-Subject: [RFC, PATCH] x86_64: KAISER - do not map kernel in user mode Date: Thu, 4 May 2017 14:26:50 +0200 Link: http://marc.info/?l=linux-kernel&m=149390087310405&w=2 Kaiser-4.10-SHA1: c4b1831d44c6144d3762ccc72f0c4e71a0c713e5 To: <linux-kernel@vger.kernel.org> To: <kernel-hardening@lists.openwall.com> Cc: <clementine.maurice@iaik.tugraz.at> Cc: <moritz.lipp@iaik.tugraz.at> Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at> Cc: Richard Fellner <richard.fellner@student.tugraz.at> Cc: Ingo Molnar <mingo@kernel.org> Cc: <kirill.shutemov@linux.intel.com> Cc: <anders.fogh@gdata-adan.de> After several recent works [1,2,3] KASLR on x86_64 was basically considered dead by many researchers. We have been working on an efficient but effective fix for this problem and found that not mapping the kernel space when running in user mode is the solution to this problem [4] (the corresponding paper [5] will be presented at ESSoS17). With this RFC patch we allow anybody to configure their kernel with the flag CONFIG_KAISER to add our defense mechanism. If there are any questions we would love to answer them. We also appreciate any comments! Cheers, Daniel (+ the KAISER team from Graz University of Technology) [1] http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf [2] https://www.blackhat.com/docs/us-16/materials/us-16-Fogh-Using-Undocumented-CPU-Behaviour-To-See-Into-Kernel-Mode-And-Break-KASLR-In-The-Process.pdf [3] https://www.blackhat.com/docs/us-16/materials/us-16-Jang-Breaking-Kernel-Address-Space-Layout-Randomization-KASLR-With-Intel-TSX.pdf [4] https://github.com/IAIK/KAISER [5] https://gruss.cc/files/kaiser.pdf [patch based also on https://raw.githubusercontent.com/IAIK/KAISER/master/KAISER/0001-KAISER-Kernel-Address-Isolation.patch] Signed-off-by: Richard Fellner <richard.fellner@student.tugraz.at> Signed-off-by: Moritz Lipp <moritz.lipp@iaik.tugraz.at> Signed-off-by: Daniel Gruss <daniel.gruss@iaik.tugraz.at> Signed-off-by: Michael Schwarz <michael.schwarz@iaik.tugraz.at> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05x86/boot: Add early cmdline parsing for options with argumentsTom Lendacky
commit e505371dd83963caae1a37ead9524e8d997341be upstream. Add a cmdline_find_option() function to look for cmdline options that take arguments. The argument is returned in a supplied buffer and the argument length (regardless of whether it fits in the supplied buffer) is returned, with -1 indicating not found. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Dave Young <dyoung@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Toshimitsu Kani <toshi.kani@hpe.com> Cc: kasan-dev@googlegroups.com Cc: kvm@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-efi@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/36b5f97492a9745dce27682305f990fc20e5cf8a.1500319216.git.thomas.lendacky@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02Linux 4.4.109v4.4.109Greg Kroah-Hartman
2018-01-02mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UPAndy Lutomirski
commit 5dd0b16cdaff9b94da06074d5888b03235c0bf17 upstream. This fixes CONFIG_SMP=n, CONFIG_DEBUG_TLBFLUSH=y without introducing further #ifdef soup. Caught by a Kbuild bot randconfig build. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: ce4a4e565f52 ("x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code") Link: http://lkml.kernel.org/r/76da9a3cc4415996f2ad2c905b93414add322021.1496673616.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02n_tty: fix EXTPROC vs ICANON interaction with TIOCINQ (aka FIONREAD)Linus Torvalds
commit 966031f340185eddd05affcf72b740549f056348 upstream. We added support for EXTPROC back in 2010 in commit 26df6d13406d ("tty: Add EXTPROC support for LINEMODE") and the intent was to allow it to override some (all?) ICANON behavior. Quoting from that original commit message: There is a new bit in the termios local flag word, EXTPROC. When this bit is set, several aspects of the terminal driver are disabled. Input line editing, character echo, and mapping of signals are all disabled. This allows the telnetd to turn off these functions when in linemode, but still keep track of what state the user wants the terminal to be in. but the problem turns out that "several aspects of the terminal driver are disabled" is a bit ambiguous, and you can really confuse the n_tty layer by setting EXTPROC and then causing some of the ICANON invariants to no longer be maintained. This fixes at least one such case (TIOCINQ) becoming unhappy because of the confusion over whether ICANON really means ICANON when EXTPROC is set. This basically makes TIOCINQ match the case of read: if EXTPROC is set, we ignore ICANON. Also, make sure to reset the ICANON state ie EXTPROC changes, not just if ICANON changes. Fixes: 26df6d13406d ("tty: Add EXTPROC support for LINEMODE") Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Reported-by: syzkaller <syzkaller@googlegroups.com> Cc: Jiri Slaby <jslaby@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02x86/smpboot: Remove stale TLB flush invocationsThomas Gleixner
commit 322f8b8b340c824aef891342b0f5795d15e11562 upstream. smpboot_setup_warm_reset_vector() and smpboot_restore_warm_reset_vector() invoke local_flush_tlb() for no obvious reason. Digging in history revealed that the original code in the 2.1 era added those because the code manipulated a swapper_pg_dir pagetable entry. The pagetable manipulation was removed long ago in the 2.3 timeframe, but the TLB flush invocations stayed around forever. Remove them along with the pointless pr_debug()s which come from the same 2.1 change. Reported-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20171230211829.586548655@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()Thomas Gleixner
commit 5d62c183f9e9df1deeea0906d099a94e8a43047a upstream. The conditions in irq_exit() to invoke tick_nohz_irq_exit() which subsequently invokes tick_nohz_stop_sched_tick() are: if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu)) If need_resched() is not set, but a timer softirq is pending then this is an indication that the softirq code punted and delegated the execution to softirqd. need_resched() is not true because the current interrupted task takes precedence over softirqd. Invoking tick_nohz_irq_exit() in this case can cause an endless loop of timer interrupts because the timer wheel contains an expired timer, but softirqs are not yet executed. So it returns an immediate expiry request, which causes the timer to fire immediately again. Lather, rinse and repeat.... Prevent that by adding a check for a pending timer soft interrupt to the conditions in tick_nohz_stop_sched_tick() which avoid calling get_next_timer_interrupt(). That keeps the tick sched timer on the tick and prevents a repetitive programming of an already expired timer. Reported-by: Sebastian Siewior <bigeasy@linutronix.d> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Anna-Maria Gleixner <anna-maria@linutronix.de> Cc: Sebastian Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272156050.2431@nanos Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02usb: xhci: Add XHCI_TRUST_TX_LENGTH for Renesas uPD720201Daniel Thompson
commit da99706689481717998d1d48edd389f339eea979 upstream. When plugging in a USB webcam I see the following message: xhci_hcd 0000:04:00.0: WARN Successful completion on short TX: needs XHCI_TRUST_TX_LENGTH quirk? handle_tx_event: 913 callbacks suppressed All is quiet again with this patch (and I've done a fair but of soak testing with the camera since). Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02USB: Fix off by one in type-specific length check of BOS SSP capabilityMathias Nyman
commit 07b9f12864d16c3a861aef4817eb1efccbc5d0e6 upstream. USB 3.1 devices are not detected as 3.1 capable since 4.15-rc3 due to a off by one in commit 81cf4a45360f ("USB: core: Add type-specific length check of BOS descriptors") It uses USB_DT_USB_SSP_CAP_SIZE() to get SSP capability size which takes the zero based SSAC as argument, not the actual count of sublink speed attributes. USB3 spec 9.6.2.5 says "The number of Sublink Speed Attributes = SSAC + 1." The type-specific length check patch was added to stable and needs to be fixed there as well Fixes: 81cf4a45360f ("USB: core: Add type-specific length check of BOS descriptors") CC: Masakazu Mokuno <masakazu.mokuno@gmail.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02usb: add RESET_RESUME for ELSA MicroLink 56KOliver Neukum
commit b9096d9f15c142574ebebe8fbb137012bb9d99c2 upstream. This modem needs this quirk to operate. It produces timeouts when resumed without reset. Signed-off-by: Oliver Neukum <oneukum@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>