summaryrefslogtreecommitdiff
path: root/arch/powerpc/mm
AgeCommit message (Collapse)Author
2020-05-26powerpc/8xx: Drop special handling of Linear and IMMR mappings in I/D TLB ↵Christophe Leroy
handlers Up to now, linear and IMMR mappings are managed via huge TLB entries through specific code directly in TLB miss handlers. This implies some patching of the TLB miss handlers at startup, and a lot of dedicated code. Remove all this specific dedicated code. For now we are back to normal handling via standard 4k pages. In the next patches, linear memory mapping and IMMR mapping will be managed through huge pages. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/221b7e3ead80a5969629938c023f8cfe45fdd2fb.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26powerpc/8xx: Always pin TLBs at startup.Christophe Leroy
At startup, map 32 Mbytes of memory through 4 pages of 8M, and PIN them inconditionnaly. They need to be pinned because KASAN is using page tables early and the TLBs might be dynamically replaced otherwise. Remove RSV4I flag after installing mappings unless CONFIG_PIN_TLB_XXXX is selected. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/b27c5767d18053b59f7eefddc189fcc3acf7b9c2.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26powerpc/8xx: Don't set IMMR map anymore at bootChristophe Leroy
Only early debug requires IMMR to be mapped early. No need to set it up and pin it in assembly. Map it through page tables at udbg init when necessary. If CONFIG_PIN_TLB_IMMR is selected, pin it once we don't need the 32 Mb pinned RAM anymore. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/13c1e8539fdf363d3146f4884e5c3c76c6c308b5.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26powerpc/8xx: Only 8M pages are hugepte pages nowChristophe Leroy
512k pages are now standard pages, so only 8M pages are hugepte. No more handling of normal page tables through hugepd allocation and freeing, and hugepte helpers can also be simplified. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/2c6135d57fb76eebf70673fbac3dc9e740767879.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26powerpc/8xx: Manage 512k huge pages as standard pages.Christophe Leroy
At the time being, 512k huge pages are handled through hugepd page tables. The PMD entry is flagged as a hugepd pointer and it means that only 512k hugepages can be managed in that 4M block. However, the hugepd table has the same size as a normal page table, and 512k entries can therefore be nested with normal pages. On the 8xx, TLB loading is performed by software and allthough the page tables are organised to match the L1 and L2 level defined by the HW, all TLB entries have both L1 and L2 independent entries. It means that even if two TLB entries are associated with the same PMD entry, they can be loaded with different values in L1 part. The L1 entry contains the page size (PS field): - 00 for 4k and 16 pages - 01 for 512k pages - 11 for 8M pages By adding a flag for hugepages in the PTE (_PAGE_HUGE) and copying it into the lower bit of PS, we can then manage 512k pages with normal page tables: - PMD entry has PS=11 for 8M pages - PMD entry has PS=00 for other pages. As a PMD entry covers 4M areas, a PMD will either point to a hugepd table having a single entry to an 8M page, or the PMD will point to a standard page table which will have either entries to 4k or 16k or 512k pages. For 512k pages, as the L1 entry will not know it is a 512k page before the PTE is read, there will be 128 entries in the PTE as if it was 4k pages. But when loading the TLB, it will be flagged as a 512k page. Note that we can't use pmd_ptr() in asm/nohash/32/pgtable.h because it is not defined yet. In ITLB miss, we keep the possibility to opt it out as when kernel text is pinned and no user hugepages are used, we can save several instruction by not using r11. In DTLB miss, that's just one instruction so it's not worth bothering with it. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/002819e8e166bf81d24b24782d98de7c40905d8f.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26powerpc/mm: Reduce hugepd size for 8M hugepages on 8xxChristophe Leroy
Commit 55c8fc3f4930 ("powerpc/8xx: reintroduce 16K pages with HW assistance") redefined pte_t as a struct of 4 pte_basic_t, because in 16K pages mode there are four identical entries in the page table. But hugepd entries for 8M pages require only one entry of size pte_basic_t. So there is no point in creating a cache for 4 entries page tables. Calculate PTE_T_ORDER using the size of pte_basic_t instead of pte_t. Define specific huge_pte helpers (set_huge_pte_at(), huge_pte_clear(), huge_ptep_set_wrprotect()) to write the pte in a single entry instead of using set_pte_at() which writes 4 identical entries in 16k pages mode. Also make sure that __ptep_set_access_flags() properly handle the huge_pte case. Define set_pte_filter() inline otherwise GCC doesn't inline it anymore because it is now used twice, and that gives a pretty suboptimal code because of pte_t being a struct of 4 entries. Those functions are also used for 512k pages which only require one entry as well allthough replicating it four times was harmless as 512k pages entries are spread every 128 bytes in the table. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/43050d1a0c2d6e1541cab9c1126fc80bc7015ebd.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26powerpc/mm: Fix conditions to perform MMU specific management by blocks on ↵Christophe Leroy
PPC32. Setting init mem to NX shall depend on sinittext being mapped by block, not on stext being mapped by block. Setting text and rodata to RO shall depend on stext being mapped by block, not on sinittext being mapped by block. Fixes: 63b2bc619565 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX") Cc: stable@vger.kernel.org Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/7d565fb8f51b18a3d98445a830b2f6548cb2da2a.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26powerpc/mm: Allocate static page tables for fixmapChristophe Leroy
Allocate static page tables for the fixmap area. This allows setting mappings through page tables before memblock is ready. That's needed to use early_ioremap() early and to use standard page mappings with fixmap. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/4f4b1412d34de6801b8e925cb88fc69d056ff536.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26powerpc/32s: Don't warn when mapping RO data ROX.Christophe Leroy
Mapping RO data as ROX is not an issue since that data cannot be modified to introduce an exploit. PPC64 accepts to have RO data mapped ROX, as a trade off between kernel size and strictness of protection. On PPC32, kernel size is even more critical as amount of memory is usually small. Depending on the number of available IBATs, the last IBATs might overflow the end of text. Only warn if it crosses the end of RO data. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/6499f8eeb2a36330e5c9fc1cee9a79374875bd54.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26powerpc/ptdump: Handle hugepd at PGD levelChristophe Leroy
The 8xx is about to map kernel linear space and IMMR using huge pages. In order to display those pages properly, ptdump needs to handle hugepd tables at PGD level. For the time being do it only at PGD level. Further patches may add handling of hugepd tables at lower level for other platforms when needed in the future. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/630728289158dcfeb06b14d40ed7c4c4e7148cf1.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26powerpc/ptdump: Properly handle non standard page sizeChristophe Leroy
In order to properly display information regardless of the page size, it is necessary to take into account real page size. Fixes: cabe8138b23c ("powerpc: dump as a single line areas mapping a single physical page.") Cc: stable@vger.kernel.org Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/a53b2a0ffd042a8d85464bf90d55bc5b970e00a1.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26powerpc/ptdump: Standardise display of BAT flagsChristophe Leroy
Display BAT flags the same way as page flags: rwx and wimg Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/a07585f353c167b8db9597d83f992a5cb4fbf4c4.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26powerpc/ptdump: Display size of BATsChristophe Leroy
Display the size of areas mapped with BATs. For that, the size display for pages is refactorised. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/acf764eee231f0358e66ca9e819f052804055acc.1589866984.git.christophe.leroy@csgroup.eu
2020-05-26powerpc/ptdump: Add _PAGE_COHERENT flagChristophe Leroy
For platforms using shared.c (4xx, Book3e, Book3s/32), also handle the _PAGE_COHERENT flag which corresponds to the M bit of the WIMG flags. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> [mpe: Make it more verbose, use "coherent" rather than "m"] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/324c3d860717e8e91fca3bb6c0f8b23e1644a404.1589866984.git.christophe.leroy@csgroup.eu
2020-05-20powerpc/kasan: Declare kasan_init_region() weakChristophe Leroy
In order to alloc sub-arches to alloc KASAN regions using optimised methods (Huge pages on 8xx, BATs on BOOK3S, ...), declare kasan_init_region() weak. Also make kasan_init_shadow_page_tables() accessible from outside, so that it can be called from the specific kasan_init_region() functions if needed. And populate remaining KASAN address space only once performed the region mapping, to allow 8xx to allocate hugepd instead of standard page tables for mapping via 8M hugepages. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/3c1ce419fa1b5a4171b92d7fb16455ca17e1b96d.1589866984.git.christophe.leroy@csgroup.eu
2020-05-20powerpc/kasan: Refactor update of early shadow mappingsChristophe Leroy
kasan_remap_early_shadow_ro() and kasan_unmap_early_shadow_vmalloc() are both updating the early shadow mapping: the first one sets the mapping read-only while the other clears the mapping. Refactor and create kasan_update_early_region() Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/8c496c0828de2608c7c940c45525d177e91b6f1b.1589866984.git.christophe.leroy@csgroup.eu
2020-05-20powerpc/kasan: Remove unnecessary page table lockingChristophe Leroy
Commit 45ff3c559585 ("powerpc/kasan: Fix parallel loading of modules.") added spinlocks to manage parallele module loading. Since then commit 47febbeeec44 ("powerpc/32: Force KASAN_VMALLOC for modules") converted the module loading to KASAN_VMALLOC. The spinlocking has then become unneeded and can be removed to simplify kasan_init_shadow_page_tables() Also remove inclusion of linux/moduleloader.h and linux/vmalloc.h which are not needed anymore since the removal of modules management. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/81a4d3aee8b82bc1355595935c8f4ad9d3b22a83.1589866984.git.christophe.leroy@csgroup.eu
2020-05-20powerpc/kasan: Fix shadow pages allocation failureChristophe Leroy
Doing kasan pages allocation in MMU_init is too early, kernel doesn't have access yet to the entire memory space and memblock_alloc() fails when the kernel is a bit big. Do it from kasan_init() instead. Fixes: 2edb16efc899 ("powerpc/32: Add KASAN support") Cc: stable@vger.kernel.org Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/c24163ee5d5f8cdf52fefa45055ceb35435b8f15.1589866984.git.christophe.leroy@csgroup.eu
2020-05-20powerpc/kasan: Fix error detection on memory allocationChristophe Leroy
In case (k_start & PAGE_MASK) doesn't equal (kstart), 'va' will never be NULL allthough 'block' is NULL Check the return of memblock_alloc() directly instead of the resulting address in the loop. Fixes: 509cd3f2b473 ("powerpc/32: Simplify KASAN init") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/7cb8ca82042bfc45a5cfe726c921cd7e7eeb12a3.1589866984.git.christophe.leroy@csgroup.eu
2020-05-20powerpc/64s/hash: Add stress_slb kernel boot option to increase SLB faultsNicholas Piggin
This option increases the number of SLB misses by limiting the number of kernel SLB entries, and increased flushing of cached lookaside information. This helps stress test difficult to hit paths in the kernel. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Relocate the code into arch/powerpc/mm, s/torture/stress/] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200511125825.3081305-1-mpe@ellerman.id.au
2020-05-20powerpc/book3s64/radix/tlb: Determine hugepage flush correctlyAneesh Kumar K.V
With a 64K page size flush with start and end: (start, end) = (721f680d0000, 721f680e0000) results in: (hstart, hend) = (721f68200000, 721f68000000) ie. hstart is above hend, which indicates no huge page flush is needed. However the current logic incorrectly sets hflush = true in this case, because hstart != hend. That causes us to call __tlbie_va_range() passing hstart/hend, to do a huge page flush even though we don't need to. __tlbie_va_range() will skip the actual tlbie operation for start > end. But it will still end up calling fixup_tlbie_va_range() and doing the TLB fixups in there, which is harmless but unnecessary work. Reported-by: Bharata B Rao <bharata@linux.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Drop else case, hflush is already false, flesh out change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200513030616.152288-1-aneesh.kumar@linux.ibm.com
2020-05-20Revert "powerpc/32s: reorder Linux PTE bits to better match Hash PTE bits."Christophe Leroy
This reverts commit 697ece78f8f749aeea40f2711389901f0974017a. The implementation of SWAP on powerpc requires page protection bits to not be one of the least significant PTE bits. Until the SWAP implementation is changed and this requirement voids, we have to keep at least _PAGE_RW outside of the 3 last bits. For now, revert to previous PTE bits order. A further rework may come later. Fixes: 697ece78f8f7 ("powerpc/32s: reorder Linux PTE bits to better match Hash PTE bits.") Reported-by: Rui Salvaterra <rsalvaterra@gmail.com> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/b34706f8de87f84d135abb5f3ede6b6f16fb1f41.1589969799.git.christophe.leroy@csgroup.eu
2020-05-19powerpc: Add a probe_user_read_inst() functionJordan Niethe
Introduce a probe_user_read_inst() function to use in cases where probe_user_read() is used for getting an instruction. This will be more useful for prefixed instructions. Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Reviewed-by: Alistair Popple <alistair@popple.id.au> [mpe: Don't write to *inst on error, fold in __user annotations] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200506034050.24806-14-jniethe5@gmail.com
2020-05-19powerpc: Use a datatype for instructionsJordan Niethe
Currently unsigned ints are used to represent instructions on powerpc. This has worked well as instructions have always been 4 byte words. However, ISA v3.1 introduces some changes to instructions that mean this scheme will no longer work as well. This change is Prefixed Instructions. A prefixed instruction is made up of a word prefix followed by a word suffix to make an 8 byte double word instruction. No matter the endianness of the system the prefix always comes first. Prefixed instructions are only planned for powerpc64. Introduce a ppc_inst type to represent both prefixed and word instructions on powerpc64 while keeping it possible to exclusively have word instructions on powerpc32. Signed-off-by: Jordan Niethe <jniethe5@gmail.com> [mpe: Fix compile error in emulate_spe()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200506034050.24806-12-jniethe5@gmail.com
2020-05-19powerpc: Use a function for getting the instruction op codeJordan Niethe
In preparation for using a data type for instructions that can not be directly used with the '>>' operator use a function for getting the op code of an instruction. Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Reviewed-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200506034050.24806-9-jniethe5@gmail.com
2020-05-19powerpc: Use an accessor for instructionsJordan Niethe
In preparation for introducing a more complicated instruction type to accommodate prefixed instructions use an accessor for getting an instruction as a u32. Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200506034050.24806-8-jniethe5@gmail.com
2020-05-19powerpc: Use a macro for creating instructions from u32sJordan Niethe
In preparation for instructions having a more complex data type start using a macro, ppc_inst(), for making an instruction out of a u32. A macro is used so that instructions can be used as initializer elements. Currently this does nothing, but it will allow for creating a data type that can represent prefixed instructions. Signed-off-by: Jordan Niethe <jniethe5@gmail.com> [mpe: Change include guard to _ASM_POWERPC_INST_H] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Alistair Popple <alistair@popple.id.au> Link: https://lore.kernel.org/r/20200506034050.24806-7-jniethe5@gmail.com
2020-05-15powerpc/mm: Replace zero-length array with flexible-arrayGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200507185755.GA15014@embeddedor
2020-05-11powerpc: Replace _ALIGN_UP() by ALIGN()Christophe Leroy
_ALIGN_UP() is specific to powerpc ALIGN() is generic and does the same Replace _ALIGN_UP() by ALIGN() Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Joel Stanley <joel@jms.id.au> Link: https://lore.kernel.org/r/8a6d7e45f7904c73a0af539642d3962e2a3c7268.1587407777.git.christophe.leroy@c-s.fr
2020-05-11powerpc: Replace _ALIGN_DOWN() by ALIGN_DOWN()Christophe Leroy
_ALIGN_DOWN() is specific to powerpc ALIGN_DOWN() is generic and does the same Replace _ALIGN_DOWN() by ALIGN_DOWN() Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Joel Stanley <joel@jms.id.au> Link: https://lore.kernel.org/r/3911a86d6b5bfa7ad88cd7c82416fbe6bb47e793.1587407777.git.christophe.leroy@c-s.fr
2020-05-05powerpc/mm/book3s64: Fix MADV_DONTNEED and parallel page fault raceAneesh Kumar K.V
MADV_DONTNEED holds mmap_sem in read mode and that implies a parallel page fault is possible and the kernel can end up with a level 1 PTE entry (THP entry) converted to a level 0 PTE entry without flushing the THP TLB entry. Most architectures including POWER have issues with kernel instantiating a level 0 PTE entry while holding level 1 TLB entries. The code sequence I am looking at is down_read(mmap_sem) down_read(mmap_sem) zap_pmd_range() zap_huge_pmd() pmd lock held pmd_cleared table details added to mmu_gather pmd_unlock() insert a level 0 PTE entry() tlb_finish_mmu(). Fix this by forcing a tlb flush before releasing pmd lock if this is not a fullmm invalidate. We can safely skip this invalidate for task exit case (fullmm invalidate) because in that case we are sure there can be no parallel fault handlers. This do change the Qemu guest RAM del/unplug time as below 128 core, 496GB guest: Without patch: munmap start: timer = 196449 ms, PID=6681 munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms With patch: munmap start: timer = 196345 ms, PID=6879 munmap finish: timer = 196714 ms, PID=6879 - delta = 369ms Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-23-aneesh.kumar@linux.ibm.com
2020-05-05powerpc/mm/book3s64: Avoid sending IPI on clearing PMDAneesh Kumar K.V
Now that all the lockless page table walk is careful w.r.t the PTE address returned, we can now revert commit: 13bd817bb884 ("powerpc/thp: Serialize pmd clear against a linux page table walk.") We also drop the equivalent IPI from other pte updates routines. We still keep IPI in hash pmdp collapse and that is to take care of parallel hash page table insert. The radix pmdp collapse flush can possibly be removed once I am sure generic code doesn't have the any expectations around parallel gup walk. This speeds up Qemu guest RAM del/unplug time as below 128 core, 496GB guest: Without patch: munmap start: timer = 13162 ms, PID=7684 munmap finish: timer = 95312 ms, PID=7684 - delta = 82150 ms With patch: munmap start: timer = 196449 ms, PID=6681 munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-21-aneesh.kumar@linux.ibm.com
2020-05-05powerpc/book3s64/hash: Use the pte_t address from the callerAneesh Kumar K.V
Don't fetch the pte value using lockless page table walk. Instead use the value from the caller. hash_preload is called with ptl lock held. So it is safe to use the pte_t address directly. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-6-aneesh.kumar@linux.ibm.com
2020-05-05powerpc/hash64: Restrict page table lookup using init_mm with ↵Aneesh Kumar K.V
__flush_hash_table_range This is only used with init_mm currently. Walking init_mm is much simpler because we don't need to handle concurrent page table like other mm_context Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-5-aneesh.kumar@linux.ibm.com
2020-05-05powerpc/mm/hash64: use _PAGE_PTE when checking for pte_presentAneesh Kumar K.V
This makes the pte_present check stricter by checking for additional _PAGE_PTE bit. A level 1 pte pointer (THP pte) can be switched to a pointer to level 0 pte page table page by following two operations. 1) THP split. 2) madvise(MADV_DONTNEED) in parallel to page fault. A lockless page table walk need to make sure we can handle such changes gracefully. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-4-aneesh.kumar@linux.ibm.com
2020-05-05powerpc/pkeys: Check vma before returning key fault error to the userAneesh Kumar K.V
If multiple threads in userspace keep changing the protection keys mapping a range, there can be a scenario where kernel takes a key fault but the pkey value found in the siginfo struct is a permissive one. This can confuse the userspace as shown in the below test case. /* use this to control the number of test iterations */ static void pkeyreg_set(int pkey, unsigned long rights) { unsigned long reg, shift; shift = (NR_PKEYS - pkey - 1) * PKEY_BITS_PER_PKEY; asm volatile("mfspr %0, 0xd" : "=r"(reg)); reg &= ~(((unsigned long) PKEY_BITS_MASK) << shift); reg |= (rights & PKEY_BITS_MASK) << shift; asm volatile("mtspr 0xd, %0" : : "r"(reg)); } static unsigned long pkeyreg_get(void) { unsigned long reg; asm volatile("mfspr %0, 0xd" : "=r"(reg)); return reg; } static int sys_pkey_mprotect(void *addr, size_t len, int prot, int pkey) { return syscall(SYS_pkey_mprotect, addr, len, prot, pkey); } static int sys_pkey_alloc(unsigned long flags, unsigned long access_rights) { return syscall(SYS_pkey_alloc, flags, access_rights); } static int sys_pkey_free(int pkey) { return syscall(SYS_pkey_free, pkey); } static int faulting_pkey; static int permissive_pkey; static pthread_barrier_t pkey_set_barrier; static pthread_barrier_t mprotect_barrier; static void pkey_handle_fault(int signum, siginfo_t *sinfo, void *ctx) { unsigned long pkeyreg; /* FIXME: printf is not signal-safe but for the current purpose, it gets the job done. */ printf("pkey: exp = %d, got = %d\n", faulting_pkey, sinfo->si_pkey); fflush(stdout); assert(sinfo->si_code == SEGV_PKUERR); assert(sinfo->si_pkey == faulting_pkey); /* clear pkey permissions to let the faulting instruction continue */ pkeyreg_set(faulting_pkey, 0x0); } static void *do_mprotect_fault(void *p) { unsigned long rights, pkeyreg, pgsize; unsigned int i; void *region; int pkey; srand(time(NULL)); pgsize = sysconf(_SC_PAGESIZE); rights = PKEY_DISABLE_WRITE; region = p; /* allocate key, no permissions */ assert((pkey = sys_pkey_alloc(0, PKEY_DISABLE_ACCESS)) > 0); pkeyreg_set(4, 0x0); /* cache the pkey here as the faulting pkey for future reference in the signal handler */ faulting_pkey = pkey; printf("%s: faulting pkey = %d\n", __func__, faulting_pkey); /* try to allocate, mprotect and free pkeys repeatedly */ for (i = 0; i < NUM_ITERATIONS; i++) { /* sync up with the other thread here */ pthread_barrier_wait(&pkey_set_barrier); /* make sure that the pkey used by the non-faulting thread is made permissive for this thread's context too so that no faults are triggered because it still might have been set to a restrictive value */ // pkeyreg_set(permissive_pkey, 0x0); /* sync up with the other thread here */ pthread_barrier_wait(&mprotect_barrier); /* perform mprotect */ assert(!sys_pkey_mprotect(region, pgsize, PROT_READ | PROT_WRITE, pkey)); /* choose a random byte from the protected region and attempt to write to it, this will generate a fault */ *((char *) region + (rand() % pgsize)) = rand(); /* restore pkey permissions as the signal handler may have cleared the bit out for the sake of continuing */ pkeyreg_set(pkey, PKEY_DISABLE_WRITE); } /* free pkey */ sys_pkey_free(pkey); return NULL; } static void *do_mprotect_nofault(void *p) { unsigned long pgsize; unsigned int i, j; void *region; int pkey; pgsize = sysconf(_SC_PAGESIZE); region = p; /* try to allocate, mprotect and free pkeys repeatedly */ for (i = 0; i < NUM_ITERATIONS; i++) { /* allocate pkey, all permissions */ assert((pkey = sys_pkey_alloc(0, 0)) > 0); permissive_pkey = pkey; /* sync up with the other thread here */ pthread_barrier_wait(&pkey_set_barrier); pthread_barrier_wait(&mprotect_barrier); /* perform mprotect on the common page, no faults will be triggered as this is most permissive */ assert(!sys_pkey_mprotect(region, pgsize, PROT_READ | PROT_WRITE, pkey)); /* free pkey */ assert(!sys_pkey_free(pkey)); } return NULL; } int main(int argc, char **argv) { pthread_t fault_thread, nofault_thread; unsigned long pgsize; struct sigaction act; pthread_attr_t attr; cpu_set_t fault_cpuset, nofault_cpuset; unsigned int i; void *region; /* allocate memory region to protect */ pgsize = sysconf(_SC_PAGESIZE); assert(region = memalign(pgsize, pgsize)); CPU_ZERO(&fault_cpuset); CPU_SET(0, &fault_cpuset); CPU_ZERO(&nofault_cpuset); CPU_SET(8, &nofault_cpuset); assert(!pthread_attr_init(&attr)); /* setup sigsegv signal handler */ act.sa_handler = 0; act.sa_sigaction = pkey_handle_fault; assert(!sigprocmask(SIG_SETMASK, 0, &act.sa_mask)); act.sa_flags = SA_SIGINFO; act.sa_restorer = 0; assert(!sigaction(SIGSEGV, &act, NULL)); /* setup barrier for the two threads */ pthread_barrier_init(&pkey_set_barrier, NULL, 2); pthread_barrier_init(&mprotect_barrier, NULL, 2); /* setup and start threads */ assert(!pthread_create(&fault_thread, &attr, &do_mprotect_fault, region)); assert(!pthread_setaffinity_np(fault_thread, sizeof(cpu_set_t), &fault_cpuset)); assert(!pthread_create(&nofault_thread, &attr, &do_mprotect_nofault, region)); assert(!pthread_setaffinity_np(nofault_thread, sizeof(cpu_set_t), &nofault_cpuset)); /* cleanup */ assert(!pthread_attr_destroy(&attr)); assert(!pthread_join(fault_thread, NULL)); assert(!pthread_join(nofault_thread, NULL)); assert(!pthread_barrier_destroy(&pkey_set_barrier)); assert(!pthread_barrier_destroy(&mprotect_barrier)); free(region); puts("PASS"); return EXIT_SUCCESS; } The above test can result the below failure without this patch. pkey: exp = 3, got = 3 pkey: exp = 3, got = 4 a.out: pkey-siginfo-race.c:100: pkey_handle_fault: Assertion `sinfo->si_pkey == faulting_pkey' failed. Aborted Check for vma access before considering this a key fault. If vma pkey allow access retry the acess again. Test case is written by Sandipan Das <sandipan@linux.ibm.com> hence added SOB from him. Signed-off-by: Sandipan Das <sandipan@linux.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-3-aneesh.kumar@linux.ibm.com
2020-05-05powerpc/pkeys: Avoid using lockless page table walkAneesh Kumar K.V
Fetch pkey from vma instead of linux page table. Also document the fact that in some cases the pkey returned in siginfo won't be the same as the one we took keyfault on. Even with linux page table walk, we can end up in a similar scenario. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-2-aneesh.kumar@linux.ibm.com
2020-04-22powerpc/8xx: Fix STRICT_KERNEL_RWX startup test failureChristophe Leroy
WRITE_RO lkdtm test works. But when selecting CONFIG_DEBUG_RODATA_TEST, the kernel reports rodata_test: test data was not read only This is because when rodata test runs, there are still old entries in TLB. Flush TLB after setting kernel pages RO or NX. Fixes: d5f17ee96447 ("powerpc/8xx: don't disable large TLBs with CONFIG_STRICT_KERNEL_RWX") Cc: stable@vger.kernel.org # v5.1+ Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/485caac75f195f18c11eb077b0031fdd2bb7fb9e.1587361039.git.christophe.leroy@c-s.fr
2020-04-10mm/memory_hotplug: add pgprot_t to mhp_paramsLogan Gunthorpe
devm_memremap_pages() is currently used by the PCI P2PDMA code to create struct page mappings for IO memory. At present, these mappings are created with PAGE_KERNEL which implies setting the PAT bits to be WB. However, on x86, an mtrr register will typically override this and force the cache type to be UC-. In the case firmware doesn't set this register it is effectively WB and will typically result in a machine check exception when it's accessed. Other arches are not currently likely to function correctly seeing they don't have any MTRR registers to fall back on. To solve this, provide a way to specify the pgprot value explicitly to arch_add_memory(). Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a simple change to pass the pgprot_t down to their respective functions which set up the page tables. For x86_32, set the page tables explicitly using _set_memory_prot() (seeing they are already mapped). For ia64, s390 and sh, reject anything but PAGE_KERNEL settings -- this should be fine, for now, seeing these architectures don't support ZONE_DEVICE. A check in __add_pages() is also added to ensure the pgprot parameter was set for all arches. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Eric Badger <ebadger@gigaio.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/20200306170846.9333-7-logang@deltatee.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-10powerpc/mm: thread pgprot_t through create_section_mapping()Logan Gunthorpe
In prepartion to support a pgprot_t argument for arch_add_memory(). Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Eric Badger <ebadger@gigaio.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/20200306170846.9333-6-logang@deltatee.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-10mm/memory_hotplug: rename mhp_restrictions to mhp_paramsLogan Gunthorpe
The mhp_restrictions struct really doesn't specify anything resembling a restriction anymore so rename it to be mhp_params as it is a list of extended parameters. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Eric Badger <ebadger@gigaio.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/20200306170846.9333-3-logang@deltatee.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-10mm/vma: introduce VM_ACCESS_FLAGSAnshuman Khandual
There are many places where all basic VMA access flags (read, write, exec) are initialized or checked against as a group. One such example is during page fault. Existing vma_is_accessible() wrapper already creates the notion of VMA accessibility as a group access permissions. Hence lets just create VM_ACCESS_FLAGS (VM_READ|VM_WRITE|VM_EXEC) which will not only reduce code duplication but also extend the VMA accessibility concept in general. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Salter <msalter@redhat.com> Cc: Nick Hu <nickhu@andestech.com> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rob Springer <rspringer@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Link: http://lkml.kernel.org/r/1583391014-8170-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-08Merge tag 'libnvdimm-for-5.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm and dax updates from Dan Williams: "There were multiple touches outside of drivers/nvdimm/ this round to add cross arch compatibility to the devm_memremap_pages() interface, enhance numa information for persistent memory ranges, and add a zero_page_range() dax operation. This cycle I switched from the patchwork api to Konstantin's b4 script for collecting tags (from x86, PowerPC, filesystem, and device-mapper folks), and everything looks to have gone ok there. This has all appeared in -next with no reported issues. Summary: - Add support for region alignment configuration and enforcement to fix compatibility across architectures and PowerPC page size configurations. - Introduce 'zero_page_range' as a dax operation. This facilitates filesystem-dax operation without a block-device. - Introduce phys_to_target_node() to facilitate drivers that want to know resulting numa node if a given reserved address range was onlined. - Advertise a persistence-domain for of_pmem and papr_scm. The persistence domain indicates where cpu-store cycles need to reach in the platform-memory subsystem before the platform will consider them power-fail protected. - Promote numa_map_to_online_node() to a cross-kernel generic facility. - Save x86 numa information to allow for node-id lookups for reserved memory ranges, deploy that capability for the e820-pmem driver. - Pick up some miscellaneous minor fixes, that missed v5.6-final, including a some smatch reports in the ioctl path and some unit test compilation fixups. - Fixup some flexible-array declarations" * tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits) dax: Move mandatory ->zero_page_range() check in alloc_dax() dax,iomap: Add helper dax_iomap_zero() to zero a range dax: Use new dax zero page method for zeroing a page dm,dax: Add dax zero_page_range operation s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver dax, pmem: Add a dax operation zero_page_range pmem: Add functions for reading/writing page to/from pmem libnvdimm: Update persistence domain value for of_pmem and papr_scm device tools/test/nvdimm: Fix out of tree build libnvdimm/region: Fix build error libnvdimm/region: Replace zero-length array with flexible-array member libnvdimm/label: Replace zero-length array with flexible-array member ACPI: NFIT: Replace zero-length array with flexible-array member libnvdimm/region: Introduce an 'align' attribute libnvdimm/region: Introduce NDD_LABELING libnvdimm/namespace: Enforce memremap_compat_align() libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid libnvdimm: Out of bounds read in __nd_ioctl() acpi/nfit: improve bounds checking for 'func' mm/memremap_pages: Introduce memremap_compat_align() ...
2020-04-07mm/vma: make vma_is_accessible() available for general useAnshuman Khandual
Lets move vma_is_accessible() helper to include/linux/mm.h which makes it available for general use. While here, this replaces all remaining open encodings for VMA access check with vma_is_accessible(). Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Guo Ren <guoren@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Guo Ren <guoren@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Burton <paulburton@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Nick Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-05Merge tag 'powerpc-5.7-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Slightly late as I had to rebase mid-week to insert a bug fix: - A large series from Nick for 64-bit to further rework our exception vectors, and rewrite portions of the syscall entry/exit and interrupt return in C. The result is much easier to follow code that is also faster in general. - Cleanup of our ptrace code to split various parts out that had become badly intertwined with #ifdefs over the years. - Changes to our NUMA setup under the PowerVM hypervisor which should hopefully avoid non-sensical topologies which can lead to warnings from the workqueue code and other problems. - MAINTAINERS updates to remove some of our old orphan entries and update the status of others. - Quite a few other small changes and fixes all over the map. Thanks to: Abdul Haleem, afzal mohammed, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Balamuruhan S, Cédric Le Goater, Chen Zhou, Christophe JAILLET, Christophe Leroy, Christoph Hellwig, Clement Courbet, Daniel Axtens, David Gibson, Douglas Miller, Fabiano Rosas, Fangrui Song, Ganesh Goudar, Gautham R. Shenoy, Greg Kroah-Hartman, Greg Kurz, Gustavo Luiz Duarte, Hari Bathini, Ilie Halip, Jan Kara, Joe Lawrence, Joe Perches, Kajol Jain, Larry Finger, Laurentiu Tudor, Leonardo Bras, Libor Pechacek, Madhavan Srinivasan, Mahesh Salgaonkar, Masahiro Yamada, Masami Hiramatsu, Mauricio Faria de Oliveira, Michael Neuling, Michal Suchanek, Mike Rapoport, Nageswara R Sastry, Nathan Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Po-Hsu Lin, Pratik Rajesh Sampat, Rasmus Villemoes, Ravi Bangoria, Roman Bolshakov, Sam Bobroff, Sandipan Das, Santosh S, Sedat Dilek, Segher Boessenkool, Shilpasri G Bhat, Sourabh Jain, Srikar Dronamraju, Stephen Rothwell, Tyrel Datwyler, Vaibhav Jain, YueHaibing" * tag 'powerpc-5.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (158 commits) powerpc: Make setjmp/longjmp signature standard powerpc/cputable: Remove unnecessary copy of cpu_spec->oprofile_type powerpc: Suppress .eh_frame generation powerpc: Drop -fno-dwarf2-cfi-asm powerpc/32: drop unused ISA_DMA_THRESHOLD powerpc/powernv: Add documentation for the opal sensor_groups sysfs interfaces selftests/powerpc: Fix try-run when source tree is not writable powerpc/vmlinux.lds: Explicitly retain .gnu.hash powerpc/ptrace: move ptrace_triggered() into hw_breakpoint.c powerpc/ptrace: create ppc_gethwdinfo() powerpc/ptrace: create ptrace_get_debugreg() powerpc/ptrace: split out ADV_DEBUG_REGS related functions. powerpc/ptrace: move register viewing functions out of ptrace.c powerpc/ptrace: split out TRANSACTIONAL_MEM related functions. powerpc/ptrace: split out SPE related functions. powerpc/ptrace: split out ALTIVEC related functions. powerpc/ptrace: split out VSX related functions. powerpc/ptrace: drop PARAMETER_SAVE_AREA_OFFSET powerpc/ptrace: drop unnecessary #ifdefs CONFIG_PPC64 powerpc/ptrace: remove unused header includes ...
2020-04-02mm: allow VM_FAULT_RETRY for multiple timesPeter Xu
The idea comes from a discussion between Linus and Andrea [1]. Before this patch we only allow a page fault to retry once. We achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing handle_mm_fault() the second time. This was majorly used to avoid unexpected starvation of the system by looping over forever to handle the page fault on a single page. However that should hardly happen, and after all for each code path to return a VM_FAULT_RETRY we'll first wait for a condition (during which time we should possibly yield the cpu) to happen before VM_FAULT_RETRY is really returned. This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY flag when we receive VM_FAULT_RETRY. It means that the page fault handler now can retry the page fault for multiple times if necessary without the need to generate another page fault event. Meanwhile we still keep the FAULT_FLAG_TRIED flag so page fault handler can still identify whether a page fault is the first attempt or not. Then we'll have these combinations of fault flags (only considering ALLOW_RETRY flag and TRIED flag): - ALLOW_RETRY and !TRIED: this means the page fault allows to retry, and this is the first try - ALLOW_RETRY and TRIED: this means the page fault allows to retry, and this is not the first try - !ALLOW_RETRY and !TRIED: this means the page fault does not allow to retry at all - !ALLOW_RETRY and TRIED: this is forbidden and should never be used In existing code we have multiple places that has taken special care of the first condition above by checking against (fault_flags & FAULT_FLAG_ALLOW_RETRY). This patch introduces a simple helper to detect the first retry of a page fault by checking against both (fault_flags & FAULT_FLAG_ALLOW_RETRY) and !(fault_flag & FAULT_FLAG_TRIED) because now even the 2nd try will have the ALLOW_RETRY set, then use that helper in all existing special paths. One example is in __lock_page_or_retry(), now we'll drop the mmap_sem only in the first attempt of page fault and we'll keep it in follow up retries, so old locking behavior will be retained. This will be a nice enhancement for current code [2] at the same time a supporting material for the future userfaultfd-writeprotect work, since in that work there will always be an explicit userfault writeprotect retry for protected pages, and if that cannot resolve the page fault (e.g., when userfaultfd-writeprotect is used in conjunction with swapped pages) then we'll possibly need a 3rd retry of the page fault. It might also benefit other potential users who will have similar requirement like userfault write-protection. GUP code is not touched yet and will be covered in follow up patch. Please read the thread below for more information. [1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/ [2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/ Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Brian Geffon <bgeffon@google.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm: introduce FAULT_FLAG_DEFAULTPeter Xu
Although there're tons of arch-specific page fault handlers, most of them are still sharing the same initial value of the page fault flags. Say, merely all of the page fault handlers would allow the fault to be retried, and they also allow the fault to respond to SIGKILL. Let's define a default value for the fault flags to replace those initial page fault flags that were copied over. With this, it'll be far easier to introduce new fault flag that can be used by all the architectures instead of touching all the archs. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Brian Geffon <bgeffon@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220160238.9694-1-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02powerpc/mm: use helper fault_signal_pending()Peter Xu
Let powerpc code to use the new helper, by moving the signal handling earlier before the retry logic. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Brian Geffon <bgeffon@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220160222.9422-1-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02mm/vma: make vma_is_foreign() available for general useAnshuman Khandual
Idea of a foreign VMA with respect to the present context is very generic. But currently there are two identical definitions for this in powerpc and x86 platforms. Lets consolidate those redundant definitions while making vma_is_foreign() available for general use later. This should not cause any functional change. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Link: http://lkml.kernel.org/r/1582782965-3274-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-25powerpc/32s: reorder Linux PTE bits to better match Hash PTE bits.Christophe Leroy
Reorder Linux PTE bits to (almost) match Hash PTE bits. RW Kernel : PP = 00 RO Kernel : PP = 00 RW User : PP = 01 RO User : PP = 11 So naturally, we should have _PAGE_USER = 0x001 _PAGE_RW = 0x002 Today 0x001 and 0x002 and _PAGE_PRESENT and _PAGE_HASHPTE which both are software only bits. Switch _PAGE_USER and _PAGE_PRESET Switch _PAGE_RW and _PAGE_HASHPTE This allows to remove a few insns. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/c4d6c18a7f8d9d3b899bc492f55fbc40ef38896a.1583861325.git.christophe.leroy@c-s.fr