path: root/arch
2017-06-14  sparc64: mm: fix copy_tsb to correctly copy huge page TSBs  (Mike Kravetz)
[ Upstream commit 654f4807624a657f364417c2a7454f0df9961734 ] When a TSB grows beyond its current capacity, a new TSB is allocated and copy_tsb is called to copy entries from the old TSB to the new. A hash shift based on page size is used to calculate the index of an entry in the TSB. copy_tsb has hard coded PAGE_SHIFT in these calculations. However, for huge page TSBs the value REAL_HPAGE_SHIFT should be used. As a result, when copy_tsb is called for a huge page TSB the entries are placed at the incorrect index in the newly allocated TSB. When doing hardware table walk, the MMU does not match these entries and we end up in the TSB miss handling code. This code will then create and write an entry to the correct index in the TSB. We take a performance hit for the table walk miss and recreation of these entries. Pass a new parameter to copy_tsb that is the page size shift to be used when copying the TSB. Suggested-by: Anthony Yznaga <anthony.yznaga@oracle.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
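As a rough illustration of the indexing described above (a sketch only, not the sparc64 assembly in copy_tsb; tsb_index, hash_shift and nentries are illustrative names):

    /* Sketch: a TSB slot is derived from the virtual address by a
     * page-size-dependent shift, then masked by the table size.  Using
     * PAGE_SHIFT here for a huge page TSB (instead of REAL_HPAGE_SHIFT)
     * places entries in the wrong slot, which is the bug described above. */
    static unsigned long tsb_index(unsigned long vaddr, unsigned int hash_shift,
                                   unsigned long nentries)
    {
            return (vaddr >> hash_shift) & (nentries - 1);
    }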
2017-06-14  sparc64: Add __multi3 for gcc 7.x and later.  (David S. Miller)
[ Upstream commit 1b4af13ff2cc6897557bb0b8d9e2fad4fa4d67aa ] Reported-by: Waldemar Brodkorb <wbx@openadk.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
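__multi3 is the libgcc helper for 128-bit (TImode) multiplication that newer gcc emits calls to on sparc64. A rough C sketch of its semantics (illustrative only; the kernel's sparc64 version is hand-written assembly, and a real libgcc-style implementation could not lean on __int128 multiplication as freely as this):

    typedef unsigned long long u64;

    static __uint128_t multi3_sketch(__uint128_t a, __uint128_t b)
    {
            u64 a_lo = (u64)a, a_hi = (u64)(a >> 64);
            u64 b_lo = (u64)b, b_hi = (u64)(b >> 64);

            /* 64x64->128 low partial product, with the two cross terms folded
             * into the high half; the a_hi*b_hi term overflows 128 bits and is
             * discarded, as in any truncating multiply. */
            __uint128_t lo = (__uint128_t)a_lo * b_lo;
            u64 hi = (u64)(lo >> 64) + a_hi * b_lo + a_lo * b_hi;

            return ((__uint128_t)hi << 64) | (u64)lo;
    }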
2017-06-07  x86/boot: Use CROSS_COMPILE prefix for readelf  (Rob Landley)
commit 3780578761921f094179c6289072a74b2228c602 upstream. The boot code Makefile contains a straight 'readelf' invocation. This causes build warnings in cross compile environments, when there is no unprefixed readelf accessible via $PATH. Add the missing $(CROSS_COMPILE) prefix. [ tglx: Rewrote changelog ] Fixes: 98f78525371b ("x86/boot: Refuse to build with data relocations") Signed-off-by: Rob Landley <rob@landley.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Paul Bolle <pebolle@tiscali.nl> Cc: "H.J. Lu" <hjl.tools@gmail.com> Link: http://lkml.kernel.org/r/ced18878-693a-9576-a024-113ef39a22c0@landley.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07  x86/MCE: Export memory_error()  (Borislav Petkov)
commit 2d1f406139ec20320bf38bcd2461aa8e358084b5 upstream. Export the function which checks whether an MCE is a memory error to other users so that we can reuse the logic. Drop the boot_cpu_data use, while at it, as mce.cpuvendor already has the CPU vendor in there. Integrate a piece from a patch from Vishal Verma <vishal.l.verma@intel.com> to export it for modules (nfit). The main reason we're exporting it is that the nfit handler nfit_handle_mce() needs to detect a memory error properly before doing its recovery actions. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Link: http://lkml.kernel.org/r/20170519093915.15413-2-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07  powerpc/spufs: Fix hash faults for kernel regions  (Jeremy Kerr)
commit d75e4919cc0b6fbcbc8d6654ef66d87a9dbf1526 upstream. Commit ac29c64089b7 ("powerpc/mm: Replace _PAGE_USER with _PAGE_PRIVILEGED") swapped _PAGE_USER for _PAGE_PRIVILEGED, and introduced check_pte_access() which denied kernel access to non-_PAGE_PRIVILEGED pages. However, it didn't add _PAGE_PRIVILEGED to the hash fault handler for spufs' kernel accesses, so the DMAs required to establish SPE memory no longer work. This change adds _PAGE_PRIVILEGED to the hash fault handler for kernel accesses. Fixes: ac29c64089b7 ("powerpc/mm: Replace _PAGE_USER with _PAGE_PRIVILEGED") Signed-off-by: Jeremy Kerr <jk@ozlabs.org> Reported-by: Sombat Tragolgosol <sombat3960@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07  powerpc: Fix booting P9 hash with CONFIG_PPC_RADIX_MMU=N  (Michael Neuling)
commit d957fb4d173647640a2b83e7c7e56a580e7fc7e7 upstream. Currently if you disable CONFIG_PPC_RADIX_MMU you'll crash on boot on a P9. This is because we still set MMU_FTR_TYPE_RADIX via ibm,pa-features, and MMU_FTR_TYPE_RADIX is what's used for code patching in much of the asm code (i.e. slb_miss_realmode). This patch fixes the problem by stopping MMU_FTR_TYPE_RADIX from being set from ibm,pa-features. We may eventually end up removing the CONFIG_PPC_RADIX_MMU option completely, but until then this fixes the issue. Fixes: 17a3dd2f5fc7 ("powerpc/mm/radix: Use firmware feature to enable Radix MMU") Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07  sparc/ftrace: Fix ftrace graph time measurement  (Liam R. Howlett)
[ Upstream commit 48078d2dac0a26f84f5f3ec704f24f7c832cce14 ] The ftrace function_graph time measurements of a given function are not accurate according to those recorded by ftrace using the function filters. This change pulls the x86_64 fix from 'commit 722b3c746953 ("ftrace/graph: Trace function entry before updating index")' into the sparc-specific prepare_ftrace_return, which stops ftrace from counting interrupted tasks in the time measurement. Example measurements for select_task_rq_fair running "hackbench 100 process 1000":

               | tracing/trace_stat/function0 | function_graph
  Before patch | 2.802 us                      | 4.255 us
  After patch  | 2.749 us                      | 3.094 us

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07  sparc: Fix -Wstringop-overflow warning  (Orlando Arias)
[ Upstream commit deba804c90642c8ed0f15ac1083663976d578f54 ] Greetings, GCC 7 introduced the -Wstringop-overflow flag to detect buffer overflows in calls to string handling functions [1][2]. Due to the way ``empty_zero_page'' is declared in arch/sparc/include/setup.h, this causes a warning to trigger at compile time in the function mem_init(), which is subsequently converted to an error. The ensuing patch fixes this issue and aligns the declaration of empty_zero_page to that of other architectures. Thank you. Cheers, Orlando. [1] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg02308.html [2] https://gcc.gnu.org/gcc-7/changes.html Signed-off-by: Orlando Arias <oarias@knights.ucf.edu> -------------------------------------------------------------------------------- Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07  sparc64: Fix mapping of 64k pages with MAP_FIXED  (Nitin Gupta)
[ Upstream commit b6c41cb050d5debc7e4eaa0a81cbdbad72588891 ] An incorrect huge page alignment check caused mmap failure for 64K pages when MAP_FIXED is used with address not aligned to HPAGE_SIZE. Orabug: 25885991 Fixes: dcd1912d21a0 ("sparc64: Add 64K page size support") Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07  bpf, arm64: fix faulty emission of map access in tail calls  (Daniel Borkmann)
[ Upstream commit d8b54110ee944de522ccd3531191f39986ec20f9 ] Shubham was recently asking on netdev why in arm64 JIT we don't multiply the index for accessing the tail call map by 8. That led me into testing out arm64 JIT wrt tail calls and it turned out I got a NULL pointer dereference on the tail call. The buggy access is at: prog = array->ptrs[index]; if (prog == NULL) goto out; [...] 00000060: d2800e0a mov x10, #0x70 // #112 00000064: f86a682a ldr x10, [x1,x10] 00000068: f862694b ldr x11, [x10,x2] 0000006c: b40000ab cbz x11, 0x00000080 [...] The code triggering the crash is f862694b. x1 at the time contains the address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning, above we load the pointer to the program at map slot 0 into x10. x10 can then be NULL if the slot is not occupied, which we later on try to access with a user given offset in x2 that is the map index. Fix this by emitting the following instead: [...] 00000060: d2800e0a mov x10, #0x70 // #112 00000064: 8b0a002a add x10, x1, x10 00000068: d37df04b lsl x11, x2, #3 0000006c: f86b694b ldr x11, [x10,x11] 00000070: b40000ab cbz x11, 0x00000084 [...] This basically adds the offset to ptrs to the base address of the bpf array we got and we later on access the map with an index * 8 offset relative to that. The tail call map itself is basically one large area with meta data at the head followed by the array of prog pointers. This makes tail calls working again, tested on Cavium ThunderX ARMv8. Fixes: ddb55992b04d ("arm64: bpf: implement bpf_tail_call() helper") Reported-by: Shubham Bansal <illusionist.neo@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  um: Fix to call read_initrd after init_bootmem  (Masami Hiramatsu)
commit 5b4236e17cc1bd9fa14b2b0c7a4ae632d41f2e20 upstream. Since read_initrd() invokes alloc_bootmem() for allocating memory to load initrd image, it must be called after init_bootmem. This makes read_initrd() called directly from setup_arch() after init_bootmem() and mem_total_pages(). Fixes: b63236972e1 ("um: Setup physical memory in setup_arch()") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  osf_wait4(): fix infoleak  (Al Viro)
commit a8c39544a6eb2093c04afd5005b6192bd0e880c6 upstream. failing sys_wait4() won't fill struct rusage... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  kvm: arm/arm64: Force reading uncached stage2 PGD  (Suzuki K Poulose)
commit 2952a6070e07ebdd5896f1f5b861acad677caded upstream. Make sure we don't use a cached value of the KVM stage2 PGD while resetting the PGD. Cc: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Reviewed-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  kvm: arm/arm64: Fix use after free of stage2 page table  (Suzuki K Poulose)
commit 0c428a6a9256fcd66817e12db32a50b405ed2e5c upstream. We yield the kvm->mmu_lock occasionally while performing an operation (e.g., unmap or permission changes) on a large area of stage2 mappings. However this could possibly cause another thread to clear and free up the stage2 page tables while we were waiting to regain the lock, and thus the original thread could end up accessing memory that was freed. This patch fixes the problem by making sure that the stage2 pagetable is still valid after we regain the lock. The fact that mmu_notifier->release() could be called twice (via __mmu_notifier_release and mmu_notifier_unregister) enhances the possibility of hitting this race where there are two threads trying to unmap the entire guest shadow pages. While at it, clean up the redundant checks around cond_resched_lock in stage2_wp_range(), as cond_resched_lock already does the same checks. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: andreyknvl@google.com Cc: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Reviewed-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  kvm: arm/arm64: Fix race in resetting stage2 PGD  (Suzuki K Poulose)
commit 6c0d706b563af732adb094c5bf807437e8963e84 upstream. In kvm_free_stage2_pgd() we check the stage2 PGD before holding the lock and proceed to take the lock if it is valid. And we unmap the page tables, followed by releasing the lock. We reset the PGD only after dropping this lock, which could cause a race condition where another thread waiting on or even holding the lock, could potentially see that the PGD is still valid and proceed to perform a stage2 operation and later encounter a NULL PGD. [223090.242280] Unable to handle kernel NULL pointer dereference at virtual address 00000040 [223090.262330] PC is at unmap_stage2_range+0x8c/0x428 [223090.262332] LR is at kvm_unmap_hva_handler+0x2c/0x3c [223090.262531] Call trace: [223090.262533] [<ffff0000080adb78>] unmap_stage2_range+0x8c/0x428 [223090.262535] [<ffff0000080adf40>] kvm_unmap_hva_handler+0x2c/0x3c [223090.262537] [<ffff0000080ace2c>] handle_hva_to_gpa+0xb0/0x104 [223090.262539] [<ffff0000080af988>] kvm_unmap_hva+0x5c/0xbc [223090.262543] [<ffff0000080a2478>] kvm_mmu_notifier_invalidate_page+0x50/0x8c [223090.262547] [<ffff0000082274f8>] __mmu_notifier_invalidate_page+0x5c/0x84 [223090.262551] [<ffff00000820b700>] try_to_unmap_one+0x1d0/0x4a0 [223090.262553] [<ffff00000820c5c8>] rmap_walk+0x1cc/0x2e0 [223090.262555] [<ffff00000820c90c>] try_to_unmap+0x74/0xa4 [223090.262557] [<ffff000008230ce4>] migrate_pages+0x31c/0x5ac [223090.262561] [<ffff0000081f869c>] compact_zone+0x3fc/0x7ac [223090.262563] [<ffff0000081f8ae0>] compact_zone_order+0x94/0xb0 [223090.262564] [<ffff0000081f91c0>] try_to_compact_pages+0x108/0x290 [223090.262569] [<ffff0000081d5108>] __alloc_pages_direct_compact+0x70/0x1ac [223090.262571] [<ffff0000081d64a0>] __alloc_pages_nodemask+0x434/0x9f4 [223090.262572] [<ffff0000082256f0>] alloc_pages_vma+0x230/0x254 [223090.262574] [<ffff000008235e5c>] do_huge_pmd_anonymous_page+0x114/0x538 [223090.262576] [<ffff000008201bec>] handle_mm_fault+0xd40/0x17a4 [223090.262577] [<ffff0000081fb324>] __get_user_pages+0x12c/0x36c [223090.262578] [<ffff0000081fb804>] get_user_pages_unlocked+0xa4/0x1b8 [223090.262579] [<ffff0000080a3ce8>] __gfn_to_pfn_memslot+0x280/0x31c [223090.262580] [<ffff0000080a3dd0>] gfn_to_pfn_prot+0x4c/0x5c [223090.262582] [<ffff0000080af3f8>] kvm_handle_guest_abort+0x240/0x774 [223090.262584] [<ffff0000080b2bac>] handle_exit+0x11c/0x1ac [223090.262586] [<ffff0000080ab99c>] kvm_arch_vcpu_ioctl_run+0x31c/0x648 [223090.262587] [<ffff0000080a1d78>] kvm_vcpu_ioctl+0x378/0x768 [223090.262590] [<ffff00000825df5c>] do_vfs_ioctl+0x324/0x5a4 [223090.262591] [<ffff00000825e26c>] SyS_ioctl+0x90/0xa4 [223090.262595] [<ffff000008085d84>] el0_svc_naked+0x38/0x3c This patch moves the stage2 PGD manipulation under the lock. Reported-by: Alexander Graf <agraf@suse.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: Christoffer Dall <cdall@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  MIPS: Loongson-3: Select MIPS_L1_CACHE_SHIFT_6  (Huacai Chen)
commit 17c99d9421695a0e0de18bf1e7091d859e20ec1d upstream. Some newer Loongson-3 have 64 bytes cache lines, so select MIPS_L1_CACHE_SHIFT_6. Signed-off-by: Huacai Chen <chenhc@lemote.com> Cc: John Crispin <john@phrozen.org> Cc: Steven J . Hill <Steven.Hill@caviumnetworks.com> Cc: Fuxin Zhang <zhangfx@lemote.com> Cc: Zhangjin Wu <wuzhangjin@gmail.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/15755/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  metag/uaccess: Check access_ok in strncpy_from_user  (James Hogan)
commit 3a158a62da0673db918b53ac1440845a5b64fd90 upstream. The metag implementation of strncpy_from_user() doesn't validate the src pointer, which could allow reading of arbitrary kernel memory. Add a short access_ok() check to prevent that. It's still possible for it to read across the user/kernel boundary, but it will invariably reach a NUL character after only 9 bytes, leaking only a static kernel address being loaded into D0Re0 at the beginning of __start, which is acceptable for the immediate fix. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: linux-metag@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
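A minimal sketch of the shape of the fix as described (assumptions: the pre-4.13 access_ok() signature taking VERIFY_READ, and a raw __strncpy_from_user() helper as found on several architectures; not the actual metag source):

    /* Sketch: validate the user pointer before handing it to the unchecked
     * copy helper.  Checking only the first byte is the "short" check the
     * description mentions; the copy still stops at a NUL. */
    static long strncpy_from_user_sketch(char *dst, const char __user *src,
                                         long count)
    {
            if (!access_ok(VERIFY_READ, src, 1))
                    return -EFAULT;
            return __strncpy_from_user(dst, src, count);
    }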
2017-05-25  metag/uaccess: Fix access_ok()  (James Hogan)
commit 8a8b56638bcac4e64cccc88bf95a0f9f4b19a2fb upstream. The __user_bad() macro used by access_ok() has a few corner cases noticed by Al Viro where it doesn't behave correctly: - The kernel range check has off by 1 errors which permit access to the first and last byte of the kernel mapped range. - The kernel range check ends at LINCORE_BASE rather than META_MEMORY_LIMIT, which is ineffective when the kernel is in global space (an extremely uncommon configuration). There are a couple of other shortcomings here too: - Access to the whole of the other address space is permitted (i.e. the global half of the address space when the kernel is in local space). This isn't ideal as it could theoretically still contain privileged mappings set up by the bootloader. - The size argument is unused, permitting user copies which start on valid pages at the end of the user address range and cross the boundary into the kernel address space (e.g. addr = 0x3ffffff0, size > 0x10). It isn't very convenient to add size checks when disallowing certain regions, and it seems far safer to be sure and explicit about what userland is able to access, so invert the logic to allow certain regions instead, and fix the off by 1 errors and missing size checks. This also allows the get_fs() == KERNEL_DS check to be more easily optimised into the user address range case. We now have 3 such allowed regions: - The user address range (incorporating the get_fs() == KERNEL_DS check). - NULL (some kernel code expects this to work, and we'll always catch the fault anyway). - The core code memory region. Fixes: 373cd784d0fc ("metag: Memory handling") Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: linux-metag@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  arm64: entry: improve data abort handling of tagged pointers  (Kristina Martsenko)
commit 276e93279a630657fff4b086ba14c95955912dfa upstream. When handling a data abort from EL0, we currently zero the top byte of the faulting address, as we assume the address is a TTBR0 address, which may contain a non-zero address tag. However, the address may be a TTBR1 address, in which case we should not zero the top byte. This patch fixes that. The effect is that the full TTBR1 address is passed to the task's signal handler (or printed out in the kernel log). When handling a data abort from EL1, we leave the faulting address intact, as we assume it's either a TTBR1 address or a TTBR0 address with tag 0x00. This is true as far as I'm aware, we don't seem to access a tagged TTBR0 address anywhere in the kernel. Regardless, it's easy to forget about address tags, and code added in the future may not always remember to remove tags from addresses before accessing them. So add tag handling to the EL1 data abort handler as well. This also makes it consistent with the EL0 data abort handler. Fixes: d50240a5f6ce ("arm64: mm: permit use of tagged pointers at EL0") Reviewed-by: Dave Martin <Dave.Martin@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  arm64: hw_breakpoint: fix watchpoint matching for tagged pointers  (Kristina Martsenko)
commit 7dcd9dd8cebe9fa626af7e2358d03a37041a70fb upstream. When we take a watchpoint exception, the address that triggered the watchpoint is found in FAR_EL1. We compare it to the address of each configured watchpoint to see which one was hit. The configured watchpoint addresses are untagged, while the address in FAR_EL1 will have an address tag if the data access was done using a tagged address. The tag needs to be removed to compare the address to the watchpoints. Currently we don't remove it, and as a result can report the wrong watchpoint as being hit (specifically, always either the highest TTBR0 watchpoint or lowest TTBR1 watchpoint). This patch removes the tag. Fixes: d50240a5f6ce ("arm64: mm: permit use of tagged pointers at EL0") Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  arm64: traps: fix userspace cache maintenance emulation on a tagged pointer  (Kristina Martsenko)
commit 81cddd65b5c82758ea5571a25e31ff6f1f89ff02 upstream. When we emulate userspace cache maintenance in the kernel, we can currently send the task a SIGSEGV even though the maintenance was done on a valid address. This happens if the address has a non-zero address tag, and happens to not be mapped in. When we get the address from a user register, we don't currently remove the address tag before performing cache maintenance on it. If the maintenance faults, we end up in either __do_page_fault, where find_vma can't find the VMA if the address has a tag, or in do_translation_fault, where the tagged address will appear to be above TASK_SIZE. In both cases, the address is not mapped in, and the task is sent a SIGSEGV. This patch removes the tag from the address before using it. With this patch, the fault is handled correctly, the address gets mapped in, and the cache maintenance succeeds. As a second bug, if cache maintenance (correctly) fails on an invalid tagged address, the address gets passed into arm64_notify_segfault, where find_vma fails to find the VMA due to the tag, and the wrong si_code may be sent as part of the siginfo_t of the segfault. With this patch, the correct si_code is sent. Fixes: 7dd01aef0557 ("arm64: trap userspace "dc cvau" cache operation on errata-affected core") Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
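For illustration, stripping the tag amounts to sign-extending the address from bit 55, so TTBR1 (kernel) addresses keep their set top bits while TTBR0 (user) addresses have the tag cleared. A sketch (the kernel's helper for this is based on sign_extend64(); untag_addr_sketch is an illustrative name):

    static inline unsigned long untag_addr_sketch(unsigned long addr)
    {
            /* shift the tag out, then arithmetically shift copies of bit 55
             * back in (relies on gcc's arithmetic right shift of signed types) */
            return (unsigned long)((long)(addr << 8) >> 8);
    }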
2017-05-25  arm64: uaccess: ensure extension of access_ok() addr  (Mark Rutland)
commit a06040d7a791a9177581dcf7293941bd92400856 upstream. Our access_ok() simply hands its arguments over to __range_ok(), which implicitly assumes that the addr parameter is 64 bits wide. This isn't necessarily true for compat code, which might pass down a 32-bit address parameter. In these cases, we don't have a guarantee that the address has been zero extended to 64 bits, and the upper bits of the register may contain unknown values, potentially resulting in a spurious failure. Avoid this by explicitly casting the addr parameter to an unsigned long (as is done on other architectures), ensuring that the parameter is widened appropriately. Fixes: 0aea86a2176c ("arm64: User access library functions") Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
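A sketch of the pattern (simplified, illustrative names; not the arm64 __range_ok() macro): the user-supplied address is explicitly widened before the range comparison.

    /* Sketch: force the address to a full unsigned long before comparing, so
     * a 32-bit compat value cannot carry junk in the upper register bits. */
    #define range_ok_sketch(addr, size, limit)                          \
    ({                                                                  \
            unsigned long __a = (unsigned long)(addr);                  \
            unsigned long __s = (unsigned long)(size);                  \
            (__a <= (limit)) && (__s <= (limit) - __a);                 \
    })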
2017-05-25  arm64: armv8_deprecated: ensure extension of addr  (Mark Rutland)
commit 55de49f9aa17b0b2b144dd2af587177b9aadf429 upstream. Our compat swp emulation holds the compat user address in an unsigned int, which it passes to __user_swpX_asm(). When a 32-bit value is passed in a register, the upper 32 bits of the register are unknown, and we must extend the value to 64 bits before we can use it as a base address. This patch casts the address to unsigned long to ensure it has been suitably extended, avoiding the potential issue, and silencing a related warning from clang. Fixes: bd35a4adc413 ("arm64: Port SWP/SWPB emulation support from arm") Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  arm64: ensure extension of smp_store_release value  (Mark Rutland)
commit 994870bead4ab19087a79492400a5478e2906196 upstream. When an inline assembly operand's type is narrower than the register it is allocated to, the least significant bits of the register (up to the operand type's width) are valid, and any other bits are permitted to contain any arbitrary value. This aligns with the AAPCS64 parameter passing rules. Our __smp_store_release() implementation does not account for this, and implicitly assumes that operands have been zero-extended to the width of the type being stored to. Thus, we may store unknown values to memory when the value type is narrower than the pointer type (e.g. when storing a char to a long). This patch fixes the issue by casting the value operand to the same width as the pointer operand in all cases, which ensures that the value is zero-extended as we expect. We use the same union trickery as __smp_load_acquire and {READ,WRITE}_ONCE() to avoid GCC complaining that pointers are potentially cast to narrower width integers in unreachable paths. A whitespace issue at the top of __smp_store_release() is also corrected. No changes are necessary for __smp_load_acquire(). Load instructions implicitly clear any upper bits of the register, and the compiler will only consider the least significant bits of the register as valid regardless. Fixes: 47933ad41a86 ("arch: Introduce smp_load_acquire(), smp_store_release()") Fixes: 878a84d5a8a1 ("arm64: add missing data types in smp_load_acquire/smp_store_release") Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
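A sketch of the one-byte case of the union approach described above (closely modelled on the pattern the commit describes, but illustrative rather than the actual arm64 barrier.h code):

    /* Sketch: the value is first placed in a correctly typed union member and
     * then reloaded through a pointer of the exact store width, so the
     * register operand handed to the store-release instruction is
     * well-defined in all of its bits. */
    #define store_release_byte_sketch(p, v)                             \
    do {                                                                \
            union { __typeof__(*(p)) __val; char __c[1]; } __u =       \
                    { .__val = (v) };                                   \
            asm volatile("stlrb %w1, %0"                                \
                         : "=Q" (*(p))                                  \
                         : "r" (*(unsigned char *)__u.__c)              \
                         : "memory");                                   \
    } while (0)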
2017-05-25  arm64: xchg: hazard against entire exchange variable  (Mark Rutland)
commit fee960bed5e857eb126c4e56dd9ff85938356579 upstream. The inline assembly in __XCHG_CASE() uses a +Q constraint to hazard against other accesses to the memory location being exchanged. However, the pointer passed to the constraint is a u8 pointer, and thus the hazard only applies to the first byte of the location. GCC can take advantage of this, assuming that other portions of the location are unchanged, as demonstrated with the following test case: union u { unsigned long l; unsigned int i[2]; }; unsigned long update_char_hazard(union u *u) { unsigned int a, b; a = u->i[1]; asm ("str %1, %0" : "+Q" (*(char *)&u->l) : "r" (0UL)); b = u->i[1]; return a ^ b; } unsigned long update_long_hazard(union u *u) { unsigned int a, b; a = u->i[1]; asm ("str %1, %0" : "+Q" (*(long *)&u->l) : "r" (0UL)); b = u->i[1]; return a ^ b; } The linaro 15.08 GCC 5.1.1 toolchain compiles the above as follows when using -O2 or above: 0000000000000000 <update_char_hazard>: 0: d2800001 mov x1, #0x0 // #0 4: f9000001 str x1, [x0] 8: d2800000 mov x0, #0x0 // #0 c: d65f03c0 ret 0000000000000010 <update_long_hazard>: 10: b9400401 ldr w1, [x0,#4] 14: d2800002 mov x2, #0x0 // #0 18: f9000002 str x2, [x0] 1c: b9400400 ldr w0, [x0,#4] 20: 4a000020 eor w0, w1, w0 24: d65f03c0 ret This patch fixes the issue by passing an unsigned long pointer into the +Q constraint, as we do for our cmpxchg code. This may hazard against more than is necessary, but this is better than missing a necessary hazard. Fixes: 305d454aaa29 ("arm64: atomics: implement native {relaxed, acquire, release} atomics") Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  arm64: dts: hi6220: Reset the mmc hosts  (Daniel Lezcano)
commit 0fbdf9953b41c28845fe8d05007ff09634ee3000 upstream. The MMC hosts could be left in an unconsistent or uninitialized state from the firmware. Instead of assuming, the firmware did the right things, let's reset the host controllers. This change fixes a bug when the mmc2/sdio is initialized leading to a hung task: [ 242.704294] INFO: task kworker/7:1:675 blocked for more than 120 seconds. [ 242.711129] Not tainted 4.9.0-rc8-00017-gcf0251f #3 [ 242.716571] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 242.724435] kworker/7:1 D 0 675 2 0x00000000 [ 242.729973] Workqueue: events_freezable mmc_rescan [ 242.734796] Call trace: [ 242.737269] [<ffff00000808611c>] __switch_to+0xa8/0xb4 [ 242.742437] [<ffff000008d07c04>] __schedule+0x1c0/0x67c [ 242.747689] [<ffff000008d08254>] schedule+0x40/0xa0 [ 242.752594] [<ffff000008d0b284>] schedule_timeout+0x1c4/0x35c [ 242.758366] [<ffff000008d08e38>] wait_for_common+0xd0/0x15c [ 242.763964] [<ffff000008d09008>] wait_for_completion+0x28/0x34 [ 242.769825] [<ffff000008a1a9f4>] mmc_wait_for_req_done+0x40/0x124 [ 242.775949] [<ffff000008a1ab98>] mmc_wait_for_req+0xc0/0xf8 [ 242.781549] [<ffff000008a1ac3c>] mmc_wait_for_cmd+0x6c/0x84 [ 242.787149] [<ffff000008a26610>] mmc_io_rw_direct_host+0x9c/0x114 [ 242.793270] [<ffff000008a26aa0>] sdio_reset+0x34/0x7c [ 242.798347] [<ffff000008a1d46c>] mmc_rescan+0x2fc/0x360 [ ... ] Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Wei Xu <xuwei5@hisilicon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  ARM: dts: imx6sx-sdb: Remove OPP override  (Leonard Crestez)
commit d8581c7c8be172dac156a19d261f988a72ce596f upstream. The board file for imx6sx-sdb overrides cpufreq operating points to use higher voltages. This is done because the board has a shared rail for VDD_ARM_IN and VDD_SOC_IN and when using LDO bypass the shared voltage needs to be a value suitable for both ARM and SOC. This only applies to LDO bypass mode, a feature not present in upstream. When LDOs are enabled the effect is to use higher voltages than necessary for no good reason. Setting these higher voltages can make some boards fail to boot with ugly semi-random crashes reminiscent of memory corruption. These failures only happen on board rev. C, rev. B is reported to still work. Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com> Fixes: 54183bd7f766 ("ARM: imx6sx-sdb: add revb board and make it default") Signed-off-by: Shawn Guo <shawnguo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  ARM: dts: at91: sama5d3_xplained: not all ADC channels are available  (Ludovic Desroches)
commit d3df1ec06353e51fc44563d2e7e18d42811af290 upstream. Remove ADC channels that are not available by default on the sama5d3_xplained board (resistor not populated) in order to not create confusion. Signed-off-by: Ludovic Desroches <ludovic.desroches@microchip.com> Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  ARM: dts: at91: sama5d3_xplained: fix ADC vref  (Ludovic Desroches)
commit 9cdd31e5913c1f86dce7e201b086155b3f24896b upstream. The voltage reference for the ADC is not 3V but 3.3V since it is connected to VDDANA. Signed-off-by: Ludovic Desroches <ludovic.desroches@microchip.com> Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  ARM: 8670/1: V7M: Do not corrupt vector table around v7m_invalidate_l1 call  (Vladimir Murzin)
commit 6d80594936914e798b1b54b3bfe4bd68d8418966 upstream. We save/restore registers around v7m_invalidate_l1 to the address pointed to by r12, which is the vector table, so the first eight entries are overwritten with garbage. We already have the stack set up at that stage, so use it to save/restore the registers instead. Fixes: 6a8146f420be ("ARM: 8609/1: V7M: Add support for the Cortex-M7 processor") Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  ARM: 8667/3: Fix memory attribute inconsistencies when using fixmap  (Jon Medhurst)
commit b089c31c519c3906c14801b6ec483e18a5152a50 upstream. To cope with the variety in ARM architectures and configurations, the pagetable attributes for kernel memory are generated at runtime to match the system the kernel finds itself on. This calculated value is stored in pgprot_kernel. However, when early fixmap support was added for ARM (commit a5f4c561b3b1) the attributes used for mappings were hard-coded because pgprot_kernel is not set up early enough. Unfortunately, when fixmap is used after early boot this means the memory being mapped can have different attributes to existing mappings, potentially leading to unpredictable behaviour. A specific problem also exists due to the hard-coded values not including the 'shareable' attribute, which means on systems where this matters (e.g. those with multiple CPU clusters) the cache contents for a memory location can become inconsistent between CPUs. To resolve these issues we change fixmap to use the same memory attributes (from pgprot_kernel) that the rest of the kernel uses. To enable this we need to refactor the initialisation code so build_mem_type_table() is called early enough. Note, that relies on early param parsing for memory type overrides passed via the kernel command line, so we need to make sure this call is still after parse_early_params(). [ardb: keep early_fixmap_init() before param parsing, for earlycon] Fixes: a5f4c561b3b1 ("ARM: 8415/1: early fixmap support for earlycon") Tested-by: afzal mohammed <afzal.mohd.ma@gmail.com> Signed-off-by: Jon Medhurst <tixy@linaro.org> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  ARM: 8662/1: module: split core and init PLT sections  (Ard Biesheuvel)
commit b7ede5a1f5905ac394cc8e61712a13e3c5cb7b8f upstream. Since commit 35fa91eed817 ("ARM: kernel: merge core and init PLTs"), the ARM module PLT code allocates all PLT entries in a single core section, since the overhead of having a separate init PLT section is not justified by the small number of PLT entries usually required for init code. However, the core and init module regions are allocated independently, and there is a corner case where the core region may be allocated from the VMALLOC region if the dedicated module region is exhausted, but the init region, being much smaller, can still be allocated from the module region. This puts the PLT entries out of reach of the relocated branch instructions, defeating the whole purpose of PLTs. So split the core and init PLT regions, and name the latter ".init.plt" so it gets allocated along with (and sufficiently close to) the .init sections that it serves. Also, given that init PLT entries may need to be emitted for branches that target the core module, modify the logic that disregards defined symbols to only disregard symbols that are defined in the same section. Fixes: 35fa91eed817 ("ARM: kernel: merge core and init PLTs") Reported-by: Angus Clark <angus@angusclark.org> Tested-by: Angus Clark <angus@angusclark.org> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  KVM: arm: plug potential guest hardware debug leakage  (Zhichao Huang)
commit 661e6b02b5aa82db31897f36e96324b77450fd7a upstream. Hardware debugging in guests is not intercepted currently, which means that a malicious guest can bring down the entire machine by writing to the debug registers. This patch enables trapping of all debug registers, preventing the guests from accessing the debug registers. This includes access to the debug mode (DBGDSCR) in the guest world all the time, which could otherwise mess with the host state. Reads return 0 and writes are ignored (RAZ_WI). The result is the guest cannot detect any working hardware based debug support. As debug exceptions are still routed to the guest, normal debug using software based breakpoints still works. To support debugging using hardware registers we need to implement a debug register aware world switch as well as special trapping for registers that may affect the host state. Signed-off-by: Zhichao Huang <zhichao.huang@linaro.org> Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  arm: KVM: Do not use stack-protector to compile HYP code  (Marc Zyngier)
commit 501ad27c67ed0b90df465f23d33e9aed64058a47 upstream. We like living dangerously. Nothing explicitly forbids stack-protector from being used in the HYP code, while distributions routinely compile their kernel with it. We're just lucky that no code actually triggers the instrumentation. Let's not try our luck for much longer, and disable stack-protector for code living at HYP. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  arm64: KVM: Do not use stack-protector to compile EL2 code  (Marc Zyngier)
commit cde13b5dad60471886a3bccb4f4134c647c4a9dc upstream. We like living dangerously. Nothing explicitly forbids stack-protector from being used in the EL2 code, while distributions routinely compile their kernel with it. We're just lucky that no code actually triggers the instrumentation. Let's not try our luck for much longer, and disable stack-protector for code living at EL2. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  powerpc/tm: Fix FP and VMX register corruption  (Michael Neuling)
commit f48e91e87e67b56bef63393d1a02c6e22c1d7078 upstream. In commit dc3106690b20 ("powerpc: tm: Always use fp_state and vr_state to store live registers"), a section of code was removed that copied the current state to checkpointed state. That code should not have been removed. When an FP (Floating Point) unavailable is taken inside a transaction, we need to abort the transaction. This is because at the time of the tbegin, the FP state is bogus so the state stored in the checkpointed registers is incorrect. To fix this, we treclaim (to get the checkpointed GPRs) and then copy the thread_struct FP live state into the checkpointed state. We then trecheckpoint so that the FP state is correctly restored into the CPU. The copying of the FP registers from live to checkpointed is what was missing. This simplifies the logic slightly from the original patch. tm_reclaim_thread() will now always write the checkpointed FP state. Either the checkpointed FP state will be written as part of the actual treclaim (in tm.S), or it'll be a copy of the live state. Which one we use is based on MSR[FP] from userspace. Similarly for VMX. Fixes: dc3106690b20 ("powerpc: tm: Always use fp_state and vr_state to store live registers") Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: cyrilbur@gmail.com Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  powerpc/mm: Fix crash in page table dump with huge pages  (Michael Ellerman)
commit bfb9956ab4d8242f4594b5f4bee534b935384fd9 upstream. The page table dump code doesn't know about huge pages, so currently it crashes (or walks random memory, usually leading to a crash), if it finds a huge page. On Book3S we only see huge pages in the Linux page tables when we're using the P9 Radix MMU. Teaching the code to properly handle huge pages is a bit more involved, so for now just prevent the crash. Fixes: 8eb07b187000 ("powerpc/mm: Dump linux pagetables") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  powerpc/64e: Fix hang when debugging programs with relocated kernel  (LiuHailong)
commit fd615f69a18a9d4aa5ef02a1dc83f319f75da8e7 upstream. Debug interrupts can be taken during interrupt entry, since interrupt entry does not automatically turn them off. The kernel will check whether the faulting instruction is between [interrupt_base_book3e, __end_interrupts], and if so clear MSR[DE] and return. However, when the kernel is built with CONFIG_RELOCATABLE, it can't use LOAD_REG_IMMEDIATE(r14,interrupt_base_book3e) and LOAD_REG_IMMEDIATE(r15,__end_interrupts), as they ignore relocation. Thus, if the kernel is actually running at a different address than it was built at, the address comparison will fail, and the exception entry code will hang at kernel_dbg_exc. r2(toc) is also not usable here, as r2 still holds data from the interrupted context, so LOAD_REG_ADDR() doesn't work either. So we use the *name@got* to get the EV of two labels directly. Test programs test.c shows as follows: int main(int argc, char *argv[]) { if (access("/proc/sys/kernel/perf_event_paranoid", F_OK) == -1) printf("Kernel doesn't have perf_event support\n"); } Steps to reproduce the bug, for example: 1) ./gdb ./test 2) (gdb) b access 3) (gdb) r 4) (gdb) s Signed-off-by: Liu Hailong <liu.hailong6@zte.com.cn> Signed-off-by: Jiang Xuexin <jiang.xuexin@zte.com.cn> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Reviewed-by: Liu Song <liu.song11@zte.com.cn> Reviewed-by: Huang Jian <huang.jian@zte.com.cn> [scottwood: cleaned up commit message, and specified bad behavior as a hang rather than an oops to correspond to mainline kernel behavior] Fixes: 1cb6e0649248 ("powerpc/book3e: support CONFIG_RELOCATABLE") Signed-off-by: Scott Wood <oss@buserror.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  powerpc/powernv: Fix TCE kill on NVLink2  (Alistair Popple)
commit 6b3d12a948d27977816a15eb48409a298902a548 upstream. Commit 616badd2fb49 ("powerpc/powernv: Use OPAL call for TCE kill on NVLink2") forced all TCE kills to go via the OPAL call for NVLink2. However the PHB3 implementation of TCE kill was still being called directly from some functions which in some circumstances caused a machine check. This patch adds an equivalent IODA2 version of the function which uses the correct invalidation method depending on PHB model and changes all external callers to use it instead. Fixes: 616badd2fb49 ("powerpc/powernv: Use OPAL call for TCE kill on NVLink2") Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  powerpc/iommu: Do not call PageTransHuge() on tail pages  (Alexey Kardashevskiy)
commit e889e96e98e8da97bd39e46b7253615eabe14397 upstream. The CMA pages migration code does not support compound pages at the moment, so it performs a few tests before proceeding to actual page migration. One of the tests - PageTransHuge() - has VM_BUG_ON_PAGE(PageTail()) as it is designed to be called on head pages only. Since we also test for PageCompound(), and it covers both PageTail() and PageHead(), we can simplify the check by leaving just PageCompound() and therefore avoid a possible VM_BUG_ON_PAGE. Fixes: 2e5bbb5461f1 ("KVM: PPC: Book3S HV: Migrate pinned pages out of CMA") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  powerpc/sysfs: Fix reference leak of cpu device_nodes present at boot  (Tyrel Datwyler)
commit e76ca27790a514590af782f83f6eae49e0ccf8c9 upstream. For CPUs present at boot each logical CPU acquires a reference to the associated device node of the core. This happens in register_cpu() which is called by topology_init(). The result of this is that we end up with a reference held by each thread of the core. However, these references are never freed if the CPU core is DLPAR removed. This patch fixes the reference leaks by acquiring and releasing the references in the CPU hotplug callbacks un/register_cpu_online(). With this patch symmetric reference counting is observed with both CPUs present at boot, and those DLPAR added after boot. Fixes: f86e4718f24b ("driver/core: cpu: initialize of_node in cpu's device struture") Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  powerpc/pseries: Fix of_node_put() underflow during DLPAR remove  (Tyrel Datwyler)
commit 68baf692c435339e6295cb470ea5545cbc28160e upstream. Historically struct device_node references were tracked using a kref embedded as a struct field. Commit 75b57ecf9d1d ("of: Make device nodes kobjects so they show up in sysfs") (Mar 2014) refactored device_nodes to be kobjects such that the device tree could be more simply exposed to userspace using sysfs. Commit 0829f6d1f69e ("of: device_node kobject lifecycle fixes") (Mar 2014) followed up these changes to better control the kobject lifecycle and in particular the reference counting via of_node_get(), of_node_put(), and of_node_init(). A result of this second commit was that it introduced an of_node_put() call when a dynamic node is detached, in of_node_remove(), that removes the initial kobj reference created by of_node_init(). Traditionally, as the original dynamic device node user, the pseries code had assumed responsibility for releasing this final reference in its platform-specific DLPAR detach code. This patch fixes a refcount underflow introduced by commit 0829f6d1f6, and recently exposed by the upstreaming of the refcount API. Messages like the following are no longer seen in the kernel log with this patch following DLPAR remove operations of cpus and pci devices:

  rpadlpar_io: slot PHB 72 removed
  refcount_t: underflow; use-after-free.
  ------------[ cut here ]------------
  WARNING: CPU: 5 PID: 3335 at lib/refcount.c:128 refcount_sub_and_test+0xf4/0x110

Fixes: 0829f6d1f69e ("of: device_node kobject lifecycle fixes") Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com> [mpe: Make change log commit references more verbose] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  powerpc/book3s/mce: Move add_taint() later in virtual mode  (Mahesh Salgaonkar)
commit d93b0ac01a9ce276ec39644be47001873d3d183c upstream. machine_check_early() gets called in real mode. The very first time add_taint() is called, it prints a warning, which ends up making an OPAL call (via the OPAL_CALL wrapper) to write it to the console. If we get our very first machine check while we are in OPAL, we are doomed: OPAL_CALL overwrites the PACASAVEDMSR in r13, and in this case, when we are done with MCE handling, the original OPAL call will use this new MSR on its way back to opal_return. This usually leads to unexpected behaviour or a kernel panic. Instead, move the add_taint() call later, into virtual mode, where it is safe to call. This is broken with the current FW level. We have been lucky so far not to take the very first MCE while in OPAL, but it is easily reproducible on Mambo. Fixes: 27ea2c420cad ("powerpc: Set the correct kernel taint on machine check errors.") Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  powerpc/eeh: Avoid use after free in eeh_handle_special_event()  (Russell Currey)
commit daeba2956f32f91f3493788ff6ee02fb1b2f02fa upstream. eeh_handle_special_event() is called when an EEH event is detected but can't be narrowed down to a specific PE. This function looks through every PE to find one in an erroneous state, then calls the regular event handler eeh_handle_normal_event() once it knows which PE has an error. However, if eeh_handle_normal_event() found that the PE cannot possibly be recovered, it will free it, rendering the passed PE stale. This leads to a use after free in eeh_handle_special_event() as it attempts to clear the "recovering" state on the PE after eeh_handle_normal_event() returns. Thus, make sure the PE is valid when attempting to clear state in eeh_handle_special_event(). Fixes: 8a6b1bc70dbb ("powerpc/eeh: EEH core to handle special event") Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Russell Currey <ruscur@russell.cc> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  powerpc/mm: Ensure IRQs are off in switch_mm()  (David Gibson)
commit 9765ad134a00a01cbcc69c78ff6defbfad209bc5 upstream. powerpc expects IRQs to already be (soft) disabled when switch_mm() is called, as made clear in the commit message of 9c1e105238c4 ("powerpc: Allow perf_counters to access user memory at interrupt time"). Aside from any race conditions that might exist between switch_mm() and an IRQ, there is also an unconditional hard_irq_disable() in switch_slb(). If that isn't followed at some point by an IRQ enable then interrupts will remain disabled until we return to userspace. It is true that when switch_mm() is called from the scheduler IRQs are off, but not when it's called by use_mm(). Looking closer we see that last year in commit f98db6013c55 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler") this was made more explicit by the addition of switch_mm_irqs_off() which is now called by the scheduler, vs switch_mm() which is used by use_mm(). Arguably it is a bug in use_mm() to call switch_mm() in a different context than it expects, but fixing that will take time. This was discovered recently when vhost started throwing warnings such as: BUG: sleeping function called from invalid context at kernel/mutex.c:578 in_atomic(): 0, irqs_disabled(): 1, pid: 10768, name: vhost-10760 no locks held by vhost-10760/10768. irq event stamp: 10 hardirqs last enabled at (9): _raw_spin_unlock_irq+0x40/0x80 hardirqs last disabled at (10): switch_slb+0x2e4/0x490 softirqs last enabled at (0): copy_process+0x5e8/0x1260 softirqs last disabled at (0): (null) Call Trace: show_stack+0x88/0x390 (unreliable) dump_stack+0x30/0x44 __might_sleep+0x1c4/0x2d0 mutex_lock_nested+0x74/0x5c0 cgroup_attach_task_all+0x5c/0x180 vhost_attach_cgroups_work+0x58/0x80 [vhost] vhost_worker+0x24c/0x3d0 [vhost] kthread+0xec/0x100 ret_from_kernel_thread+0x5c/0xd4 Prior to commit 04b96e5528ca ("vhost: lockless enqueuing") (Aug 2016) the vhost_worker() would do a spin_unlock_irq() not long after calling use_mm(), which had the effect of reenabling IRQs. Since that commit removed the locking in vhost_worker() the body of the vhost_worker() loop now runs with interrupts off causing the warnings. This patch addresses the problem by making the powerpc code mirror the x86 code, ie. we disable interrupts in switch_mm(), and optimise the scheduler case by defining switch_mm_irqs_off(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [mpe: Flesh out/rewrite change log, add stable] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
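The pattern described amounts to something like the following sketch (simplified and illustrative; not the actual powerpc mmu_context code):

    /* Sketch: the scheduler calls switch_mm_irqs_off() with interrupts
     * already disabled, while other callers (e.g. use_mm()) go through
     * switch_mm(), which now disables interrupts itself around the body. */
    static inline void switch_mm_irqs_off(struct mm_struct *prev,
                                          struct mm_struct *next,
                                          struct task_struct *tsk)
    {
            /* ... the real context-switch work happens here ... */
    }

    static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
                                 struct task_struct *tsk)
    {
            unsigned long flags;

            local_irq_save(flags);
            switch_mm_irqs_off(prev, next, tsk);
            local_irq_restore(flags);
    }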
2017-05-25  s390/cputime: fix incorrect system time  (Martin Schwidefsky)
commit 07a63cbe8bcb6ba72fb989dcab1ec55ec6c36c7e upstream. git commit c5328901aa1db134 "[S390] entry[64].S improvements" removed the update of the exit_timer lowcore field from the critical section cleanup of the .Lsysc_restore/.Lsysc_done and .Lio_restore/.Lio_done blocks. If the PSW is updated by the critical section cleanup to point to user space again, the interrupt entry code will do a vtime calculation after the cleanup completed with an exit_timer value which has *not* been updated. Due to this, incorrect system time deltas are calculated. If an interrupt occurred with an old PSW between .Lsysc_restore/.Lsysc_done or .Lio_restore/.Lio_done, update __LC_EXIT_TIMER with the system entry time of the interrupt. Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25  s390/kdump: Add final note  (Michael Holzheu)
commit dcc00b79fc3d076832f7240de8870f492629b171 upstream. Since linux v3.14 with commit 38dfac843cb6d7be1 ("vmcore: prevent PT_NOTE p_memsz overflow during header update") on s390 we get the following message in the kdump kernel:

  Warning: Exceeded p_memsz, dropping PT_NOTE entry n_namesz=0x6b6b6b6b, n_descsz=0x6b6b6b6b

The reason for this is that we don't create a final zero note in the ELF header, which the proc/vmcore code uses to find out the end of the notes section (see also kernel/kexec_core.c:final_note()). It still worked on s390 by chance because we (most of the time?) have the byte pattern 0x6b6b6b6b after the notes section, which also makes the notes parsing code stop in update_note_header_size_elf64(), because 0x6b6b6b6b is interpreted as the note size:

  if ((real_sz + sz) > max_sz) {
          pr_warn("Warning: Exceeded p_memsz, dropping P ...);
          break;
  }

So fix this and add the missing final note to the ELF header. We don't have to adjust the memory size for the ELF header ("alloc_size") because the new ELF note still fits into the 0x1000 base memory. Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
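For reference, a terminating note is just an all-zero Elf64_Nhdr appended after the last real note; a sketch (nt_final_sketch is an illustrative name, not necessarily the s390 helper):

    /* Sketch: append the empty "final" note that tells consumers such as
     * /proc/vmcore where the notes section ends. */
    static void *nt_final_sketch(void *buf)
    {
            Elf64_Nhdr *note = buf;

            note->n_namesz = 0;
            note->n_descsz = 0;
            note->n_type   = 0;
            return (char *)buf + sizeof(*note);
    }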
2017-05-25  x86: fix 32-bit case of __get_user_asm_u64()  (Linus Torvalds)
commit 33c9e9729033387ef0521324c62e7eba529294af upstream. The code to fetch a 64-bit value from user space was entirely buggered, and has been since the code was merged in early 2016 in commit b2f680380ddf ("x86/mm/32: Add support for 64-bit __get_user() on 32-bit kernels"). Happily the buggered routine is almost certainly entirely unused, since the normal way to access user space memory is just with the non-inlined "get_user()", and the inlined version didn't even historically exist. The normal "get_user()" case is handled by external hand-written asm in arch/x86/lib/getuser.S that doesn't have either of these issues. There were two independent bugs in __get_user_asm_u64(): - it still did the STAC/CLAC user space access marking, even though that is now done by the wrapper macros, see commit 11f1a4b9755f ("x86: reorganize SMAP handling in user space accesses"). This didn't result in a semantic error, it just means that the inlined optimized version was hugely less efficient than the allegedly slower standard version, since the CLAC/STAC overhead is quite high on modern Intel CPU's. - the double register %eax/%edx was marked as an output, but the %eax part of it was touched early in the asm, and could thus clobber other inputs to the asm that gcc didn't expect it to touch. In particular, that meant that the generated code could look like this: mov (%eax),%eax mov 0x4(%eax),%edx where the load of %edx obviously was _supposed_ to be from the 32-bit word that followed the source of %eax, but because %eax was overwritten by the first instruction, the source of %edx was basically random garbage. The fixes are trivial: remove the extraneous STAC/CLAC entries, and mark the 64-bit output as early-clobber to let gcc know that no inputs should alias with the output register. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
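A minimal sketch of the early-clobber point on 32-bit x86 (an illustrative function, not the kernel's __get_user_asm_u64() macro, and without the exception-table fault handling):

    /* Sketch: "=&A" makes the %eax:%edx pair an early-clobber output, so GCC
     * will not allocate the source pointer to either register; without the
     * '&', the first mov below could overwrite the pointer before the second
     * mov uses it -- exactly the second bug described above. */
    static inline unsigned long long get_u64_sketch(const unsigned long long *src)
    {
            unsigned long long val;

            asm ("movl (%1), %%eax\n\t"
                 "movl 4(%1), %%edx"
                 : "=&A" (val)
                 : "r" (src)
                 : "memory");
            return val;
    }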
2017-05-25  KVM: X86: Fix read out-of-bounds vulnerability in kvm pio emulation  (Wanpeng Li)
commit cbfc6c9184ce71b52df4b1d82af5afc81a709178 upstream. Huawei folks reported a read out-of-bounds vulnerability in kvm pio emulation. - "inb" instruction to access PIT Mod/Command register (ioport 0x43, write only, a read should be ignored) in guest can get a random number. - "rep insb" instruction to access PIT register port 0x43 can control memcpy() in emulator_pio_in_emulated() to copy max 0x400 bytes but only read 1 bytes, which will disclose the unimportant kernel memory in host but no crash. The similar test program below can reproduce the read out-of-bounds vulnerability: void hexdump(void *mem, unsigned int len) { unsigned int i, j; for(i = 0; i < len + ((len % HEXDUMP_COLS) ? (HEXDUMP_COLS - len % HEXDUMP_COLS) : 0); i++) { /* print offset */ if(i % HEXDUMP_COLS == 0) { printf("0x%06x: ", i); } /* print hex data */ if(i < len) { printf("%02x ", 0xFF & ((char*)mem)[i]); } else /* end of block, just aligning for ASCII dump */ { printf(" "); } /* print ASCII dump */ if(i % HEXDUMP_COLS == (HEXDUMP_COLS - 1)) { for(j = i - (HEXDUMP_COLS - 1); j <= i; j++) { if(j >= len) /* end of block, not really printing */ { putchar(' '); } else if(isprint(((char*)mem)[j])) /* printable char */ { putchar(0xFF & ((char*)mem)[j]); } else /* other char */ { putchar('.'); } } putchar('\n'); } } } int main(void) { int i; if (iopl(3)) { err(1, "set iopl unsuccessfully\n"); return -1; } static char buf[0x40]; /* test ioport 0x40,0x41,0x42,0x43,0x44,0x45 */ memset(buf, 0xab, sizeof(buf)); asm volatile("push %rdi;"); asm volatile("mov %0, %%rdi;"::"q"(buf)); asm volatile ("mov $0x40, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("mov $0x41, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("mov $0x42, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("mov $0x43, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("mov $0x44, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("mov $0x45, %rdx;"); asm volatile ("in %dx,%al;"); asm volatile ("stosb;"); asm volatile ("pop %rdi;"); hexdump(buf, 0x40); printf("\n"); /* ins port 0x40 */ memset(buf, 0xab, sizeof(buf)); asm volatile("push %rdi;"); asm volatile("mov %0, %%rdi;"::"q"(buf)); asm volatile ("mov $0x20, %rcx;"); asm volatile ("mov $0x40, %rdx;"); asm volatile ("rep insb;"); asm volatile ("pop %rdi;"); hexdump(buf, 0x40); printf("\n"); /* ins port 0x43 */ memset(buf, 0xab, sizeof(buf)); asm volatile("push %rdi;"); asm volatile("mov %0, %%rdi;"::"q"(buf)); asm volatile ("mov $0x20, %rcx;"); asm volatile ("mov $0x43, %rdx;"); asm volatile ("rep insb;"); asm volatile ("pop %rdi;"); hexdump(buf, 0x40); printf("\n"); return 0; } The vcpu->arch.pio_data buffer is used by both in/out instrutions emulation w/o clear after using which results in some random datas are left over in the buffer. Guest reads port 0x43 will be ignored since it is write only, however, the function kernel_pio() can't distigush this ignore from successfully reads data from device's ioport. There is no new data fill the buffer from port 0x43, however, emulator_pio_in_emulated() will copy the stale data in the buffer to the guest unconditionally. This patch fixes it by clearing the buffer before in instruction emulation to avoid to grant guest the stale data in the buffer. In addition, string I/O is not supported for in kernel device. So there is no iteration to read ioport %RCX times for string I/O. 
The function kernel_pio() just reads one round and then unconditionally copies "io size * %RCX" bytes to the guest; in effect it copies one round of ioport data together with whatever stale data was left over in the vcpu->arch.pio_data buffer. This patch fixes that by introducing string I/O support for in-kernel devices, so that the correct ioport data is handed to the guest.

Before the patch:

0x000000: fe 38 93 93 ff ff ab ab .8......
0x000008: ab ab ab ab ab ab ab ab ........
0x000010: ab ab ab ab ab ab ab ab ........
0x000018: ab ab ab ab ab ab ab ab ........
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........

0x000000: f6 00 00 00 00 00 00 00 ........
0x000008: 00 00 00 00 00 00 00 00 ........
0x000010: 00 00 00 00 4d 51 30 30 ....MQ00
0x000018: 30 30 20 33 20 20 20 20 00 3
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........

0x000000: f6 00 00 00 00 00 00 00 ........
0x000008: 00 00 00 00 00 00 00 00 ........
0x000010: 00 00 00 00 4d 51 30 30 ....MQ00
0x000018: 30 30 20 33 20 20 20 20 00 3
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........

After the patch:

0x000000: 1e 02 f8 00 ff ff ab ab ........
0x000008: ab ab ab ab ab ab ab ab ........
0x000010: ab ab ab ab ab ab ab ab ........
0x000018: ab ab ab ab ab ab ab ab ........
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........

0x000000: d2 e2 d2 df d2 db d2 d7 ........
0x000008: d2 d3 d2 cf d2 cb d2 c7 ........
0x000010: d2 c4 d2 c0 d2 bc d2 b8 ........
0x000018: d2 b4 d2 b0 d2 ac d2 a8 ........
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........

0x000000: 00 00 00 00 00 00 00 00 ........
0x000008: 00 00 00 00 00 00 00 00 ........
0x000010: 00 00 00 00 00 00 00 00 ........
0x000018: 00 00 00 00 00 00 00 00 ........
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........

Reported-by: Moguofang <moguofang@huawei.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Moguofang <moguofang@huawei.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
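A minimal sketch of the two fixes described above, not the upstream diff; the helper names (kvm_io_bus_read()/kvm_io_bus_write(), KVM_PIO_BUS) and the vcpu->arch.pio fields follow the KVM code of that era, but the exact signatures should be treated as assumptions:

/* Before emulating an "in", zero the shared buffer so a port whose read
 * is ignored (e.g. the write-only 0x43) cannot hand stale data back. */
memset(vcpu->arch.pio_data, 0, size * count);

/* Walk the buffer one element at a time so string ("rep ins/outs") I/O
 * to an in-kernel device reads or writes every element, instead of doing
 * a single round and copying size * count bytes back unconditionally. */
static int kernel_pio(struct kvm_vcpu *vcpu, void *pd)
{
	int r = 0, i;

	for (i = 0; i < vcpu->arch.pio.count; i++) {
		if (vcpu->arch.pio.in)
			r = kvm_io_bus_read(vcpu, KVM_PIO_BUS,
					    vcpu->arch.pio.port,
					    vcpu->arch.pio.size, pd);
		else
			r = kvm_io_bus_write(vcpu, KVM_PIO_BUS,
					     vcpu->arch.pio.port,
					     vcpu->arch.pio.size, pd);
		if (r)
			break;
		pd += vcpu->arch.pio.size;
	}
	return r;
}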
2017-05-25KVM: x86: Fix potential preemption when get the current kvmclock timestampWanpeng Li
commit e2c2206a18993bc9f62393d49c7b2066c3845b25 upstream.

 BUG: using __this_cpu_read() in preemptible [00000000] code: qemu-system-x86/2809
 caller is __this_cpu_preempt_check+0x13/0x20
 CPU: 2 PID: 2809 Comm: qemu-system-x86 Not tainted 4.11.0+ #13
 Call Trace:
  dump_stack+0x99/0xce
  check_preemption_disabled+0xf5/0x100
  __this_cpu_preempt_check+0x13/0x20
  get_kvmclock_ns+0x6f/0x110 [kvm]
  get_time_ref_counter+0x5d/0x80 [kvm]
  kvm_hv_process_stimers+0x2a1/0x8a0 [kvm]
  ? kvm_hv_process_stimers+0x2a1/0x8a0 [kvm]
  ? kvm_arch_vcpu_ioctl_run+0xac9/0x1ce0 [kvm]
  kvm_arch_vcpu_ioctl_run+0x5bf/0x1ce0 [kvm]
  kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
  ? kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
  ? __fget+0xf3/0x210
  do_vfs_ioctl+0xa4/0x700
  ? __fget+0x114/0x210
  SyS_ioctl+0x79/0x90
  entry_SYSCALL_64_fastpath+0x23/0xc2
 RIP: 0033:0x7f9d164ed357
  ? __this_cpu_preempt_check+0x13/0x20

This can be reproduced by running kvm-unit-tests/hyperv_stimer.flat with CONFIG_PREEMPT and CONFIG_DEBUG_PREEMPT enabled.

Safe access to per-CPU data requires a couple of constraints: the thread working with the data cannot be preempted, and it cannot be migrated to another CPU while it manipulates per-CPU variables. If the thread is preempted, the thread that replaces it could try to work with the same variables; migration to another CPU could also cause confusion. However, preemption is not disabled when the host per-CPU TSC rate is read to calculate the current kvmclock timestamp. This patch fixes it by using a get_cpu()/put_cpu() pair to guarantee that neither __this_cpu_read() nor rdtsc() is preempted.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
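A minimal sketch of the approach described above, not the upstream diff; cpu_tsc_khz is KVM's per-CPU TSC frequency variable, while kvmclock_ns_sketch() and tsc_to_ns() are hypothetical names standing in for the real kvmclock computation:

static u64 kvmclock_ns_sketch(void)
{
	unsigned long tsc_khz;
	u64 ns;

	/* get_cpu() disables preemption, so the __this_cpu_read() and
	 * rdtsc() below cannot be split by preemption or a migration
	 * to another CPU. */
	get_cpu();
	tsc_khz = __this_cpu_read(cpu_tsc_khz);
	ns = tsc_to_ns(rdtsc(), tsc_khz);	/* hypothetical conversion helper */
	put_cpu();

	return ns;
}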