summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2016-01-10Merge branch 'for-4.5/block-dax' into for-4.5/libnvdimmlibnvdimm-for-4.5Dan Williams
2016-01-09block: kill disk_{check|set|clear|alloc}_badblocksDan Williams
These actions are completely managed by a block driver or can use the badblocks api directly. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09libnvdimm, pmem: nvdimm_read_bytes() badblocks supportDan Williams
Support badblock checking in all the pmem read paths that do not go through the block layer. This protects info block reads (btt or pfn) as well as data reads to a pmem namespace via a btt instance. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09pmem, dax: disable dax in the presence of bad blocksDan Williams
Longer term teach dax to punch "error" holes in mapping requests and deliver SIGBUS to applications that consume a bad pmem page. For now, simply disable the dax performance optimization in the presence of known errors. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09pmem: fail io-requests to known bad blocksDan Williams
Check the sectors specified in a read bio to see if they hit a known bad block, and return an error code pmem_do_bvec(). Note that the ->rw_page() is not in a position to return errors. For now, copy the same layering violation present in zram_rw_page() to avoid crashes of the form: kernel BUG at mm/filemap.c:822! [..] Call Trace: [<ffffffff811c540e>] page_endio+0x1e/0x60 [<ffffffff81290d29>] mpage_end_io+0x39/0x60 [<ffffffff8141c4ef>] bio_endio+0x3f/0x60 [<ffffffffa005c491>] pmem_make_request+0x111/0x230 [nd_pmem] ...i.e. unlock a page that was already unlocked via pmem_rw_page() => page_endio(). Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09libnvdimm: convert to statically allocated badblocksDan Williams
If a device will ever have badblocks it should always have a badblocks instance available. So, similar to md, embed a badblocks instance in pmem_device. This reduces pointer chasing in the i/o fast path, and simplifies the init path. Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09libnvdimm: don't fail init for full badblocks listDan Williams
If the badblocks list runs out of space it simply means that software is unable to intercept all errors. This is no different than the latent discovery of new badblocks case and should not be an initialization failure condition. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09block, badblocks: introduce devm_init_badblocksDan Williams
Provide a devres interface for initializing a badblocks instance. The pmem driver has several scenarios where it will be beneficial to have this structure automatically freed when the device is disabled / fails probe. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09block: clarify badblocks lifetimeDan Williams
The badblocks list attached to a gendisk is allocated by the driver which equates to the driver owning the lifetime of the object. Do not automatically free it in del_gendisk(). This is in preparation for expanding the use of badblocks in libnvdimm drivers and introducing devm_init_badblocks(). Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09badblocks: rename badblocks_free to badblocks_exitDan Williams
For symmetry with badblocks_init() make it clear that this path only destroys incremental allocations of a badblocks instance, and does not free the badblocks instance itself. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.hDan Williams
nd-core.h is private to the libnvdimm core internals and should not be used by drivers. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09libnvdimm: Add a poison list and export badblocksVishal Verma
During region creation, perform Address Range Scrubs (ARS) for the SPA (System Physical Address) ranges to retrieve known poison locations from firmware. Add a new data structure 'nd_poison' which is used as a list in nvdimm_bus to store these poison locations. When creating a pmem namespace, if there is any known poison associated with its physical address space, convert the poison ranges to bad sectors that are exposed using the badblocks interface. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09nfit_test: Enable DSMs for all test NFITsDan Williams
In preparation for getting a poison list using ARS DSMs, enable DSMs for all manufactured NFITs supplied by the test framework. Also, supply valid response data for ars_status. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09md: convert to use the generic badblocks codeVishal Verma
Retain badblocks as part of rdev, but use the accessor functions from include/linux/badblocks for all manipulation. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09block: Add badblock management for gendisksVishal Verma
NVDIMM devices, which can behave more like DRAM rather than block devices, may develop bad cache lines, or 'poison'. A block device exposed by the pmem driver can then consume poison via a read (or write), and cause a machine check. On platforms without machine check recovery features, this would mean a crash. The block device maintaining a runtime list of all known sectors that have poison can directly avoid this, and also provide a path forward to enable proper handling/recovery for DAX faults on such a device. Use the new badblock management interfaces to add a badblocks list to gendisks. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09badblocks: Add core badblock management codeVishal Verma
Take the core badblocks implementation from md, and make it generally available. This follows the same style as kernel implementations of linked lists, rb-trees etc, where you can have a structure that can be embedded anywhere, and accessor functions to manipulate the data. The only changes in this copy of the code are ones to generalize function/variable names from md-specific ones. Also add init and free functions. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09block: fix del_gendisk() vs blkdev_ioctl crashDan Williams
When tearing down a block device early in its lifetime, userspace may still be performing discovery actions like blkdev_ioctl() to re-read partitions. The nvdimm_revalidate_disk() implementation depends on disk->driverfs_dev to be valid at entry. However, it is set to NULL in del_gendisk() and fatally this is happening *before* the disk device is deleted from userspace view. There's no reason for del_gendisk() to clear ->driverfs_dev. That device is the parent of the disk. It is guaranteed to not be freed until the disk, as a child, drops its ->parent reference. We could also fix this issue locally in nvdimm_revalidate_disk() by using disk_to_dev(disk)->parent, but lets fix it globally since ->driverfs_dev follows the lifetime of the parent. Longer term we should probably just add a @parent parameter to add_disk(), and stop carrying this pointer in the gendisk. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffffa00340a8>] nvdimm_revalidate_disk+0x18/0x90 [libnvdimm] CPU: 2 PID: 538 Comm: systemd-udevd Tainted: G O 4.4.0-rc5 #2257 [..] Call Trace: [<ffffffff8143e5c7>] rescan_partitions+0x87/0x2c0 [<ffffffff810f37f9>] ? __lock_is_held+0x49/0x70 [<ffffffff81438c62>] __blkdev_reread_part+0x72/0xb0 [<ffffffff81438cc5>] blkdev_reread_part+0x25/0x40 [<ffffffff8143982d>] blkdev_ioctl+0x4fd/0x9c0 [<ffffffff811246c9>] ? current_kernel_time64+0x69/0xd0 [<ffffffff812916dd>] block_ioctl+0x3d/0x50 [<ffffffff81264c38>] do_vfs_ioctl+0x308/0x560 [<ffffffff8115dbd1>] ? __audit_syscall_entry+0xb1/0x100 [<ffffffff810031d6>] ? do_audit_syscall_entry+0x66/0x70 [<ffffffff81264f09>] SyS_ioctl+0x79/0x90 [<ffffffff81902672>] entry_SYSCALL_64_fastpath+0x12/0x76 Reported-by: Robert Hu <robert.hu@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09block: enable dax for raw block devicesDan Williams
If an application wants exclusive access to all of the persistent memory provided by an NVDIMM namespace it can use this raw-block-dax facility to forgo establishing a filesystem. This capability is targeted primarily to hypervisors wanting to provision persistent memory for guests. It can be disabled / enabled dynamically via the new BLKDAXSET ioctl. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: kbuild test robot <fengguang.wu@intel.com> Reviewed-by: Jan Kara <jack@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09block: introduce bdev_file_inode()Dan Williams
Similar to the file_inode() helper, provide a helper to lookup the inode for a raw block device itself. Cc: Al Viro <viro@zeniv.linux.org.uk> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09restrict /dev/mem to idle io memory rangesDan Williams
This effectively promotes IORESOURCE_BUSY to IORESOURCE_EXCLUSIVE semantics by default. If userspace really believes it is safe to access the memory region it can also perform the extra step of disabling an active driver. This protects device address ranges with read side effects and otherwise directs userspace to use the driver. Persistent memory presents a large "mistake surface" to /dev/mem as now accidental writes can corrupt a filesystem. In general if a device driver is busily using a memory region it already informs other parts of the kernel to not touch it via request_mem_region(). /dev/mem should honor the same safety restriction by default. Debugging a device driver from userspace becomes more difficult with this enabled. Any application using /dev/mem or mmap of sysfs pci resources will now need to perform the extra step of either: 1/ Disabling the driver, for example: echo <device id> > /dev/bus/<parent bus>/drivers/<driver name>/unbind 2/ Rebooting with "iomem=relaxed" on the command line 3/ Recompiling with CONFIG_IO_STRICT_DEVMEM=n Traditional users of /dev/mem like dosemu are unaffected because the first 1MB of memory is not subject to the IO_STRICT_DEVMEM restriction. Legacy X configurations use /dev/mem to talk to graphics hardware, but that functionality has since moved to kernel graphics drivers. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Ingo Molnar <mingo@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09arch: consolidate CONFIG_STRICT_DEVM in lib/Kconfig.debugDan Williams
Let all the archs that implement devmem_is_allowed() opt-in to a common definition of CONFIG_STRICT_DEVM in lib/Kconfig.debug. Cc: Kees Cook <keescook@chromium.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "David S. Miller" <davem@davemloft.net> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> [heiko: drop 'default y' for s390] Acked-by: Ingo Molnar <mingo@redhat.com> Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-05libnvdimm: fix namespace object confusion in is_uuid_busy()Dan Williams
When btt devices were re-worked to be child devices of regions this routine was overlooked. It mistakenly attempts to_nd_namespace_pmem() or to_nd_namespace_blk() conversions on btt and pfn devices. By luck to date we have happened to be hitting valid memory leading to a uuid miscompare, but a recent change to struct nd_namespace_common causes: BUG: unable to handle kernel NULL pointer dereference at 0000000000000001 IP: [<ffffffff814610dc>] memcmp+0xc/0x40 [..] Call Trace: [<ffffffffa0028631>] is_uuid_busy+0xc1/0x2a0 [libnvdimm] [<ffffffffa0028570>] ? to_nd_blk_region+0x50/0x50 [libnvdimm] [<ffffffff8158c9c0>] device_for_each_child+0x50/0x90 Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-24libnvdimm, pfn: move 'memory mode' indication to sysfsDan Williams
'Memory mode' is defined as the capability of a DAX mapping to be the source/target of DMA and other "direct I/O" scenarios. While it currently requires allocating 'struct page' for each page frame of persistent memory in the namespace it will not always be the case. Work continues on reducing the kernel's dependency on 'struct page'. Let's not maintain a suffix that is expected to lose meaning over time. In other words a future 'raw mode' pmem namespace may be as capable as today's 'memory mode' namespace. Undo the encoding of the mode in the device name and leave it to other tooling to determine the mode of the namespace from its attributes. Reported-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-24tools/testing/libnvdimm: cleanup mock resource lookupDan Williams
Push the locking around get_nfit_res() into get_nfit_res(). Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-24libnvdimm, pfn: fix nd_pfn_validate() return value handlingDan Williams
The -ENODEV case indicates that the info-block needs to established. All other return codes cause nd_pfn_init() to abort. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-15libnvdimm, pfn: enable pfn sysfs interface unit testingDan Williams
The unit test infrastructure uses CMA and real memory to emulate nvdimm resources. The call to devm_memremap_pages() can simply be mocked in the same manner as memremap and we mock phys_to_pfn_t() to clear PFN_MAP since these resources are not registered with in the pgmap_radix. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-13Linux 4.4-rc5v4.4-rc5Linus Torvalds
2015-12-13sched/wait: Fix the signal handling fixPeter Zijlstra
Jan Stancek reported that I wrecked things for him by fixing things for Vladimir :/ His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which should not be possible, however my previous patch made this possible by unconditionally checking signal_pending(). We cannot use current->state as was done previously, because the instruction after the store to that variable it can be changed. We must instead pass the initial state along and use that. Fixes: 68985633bccb ("sched/wait: Fix signal handling in bit wait helpers") Reported-by: Jan Stancek <jstancek@redhat.com> Reported-by: Chris Mason <clm@fb.com> Tested-by: Jan Stancek <jstancek@redhat.com> Tested-by: Vladimir Murzin <vladimir.murzin@arm.com> Tested-by: Chris Mason <clm@fb.com> Reviewed-by: Paul Turner <pjt@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: tglx@linutronix.de Cc: Oleg Nesterov <oleg@redhat.com> Cc: hpa@zytor.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-13Merge tag 'nfs-for-4.4-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfix from Trond Myklebust: "SUNRPC: Fix a NFSv4.1 callback channel regression" * tag 'nfs-for-4.4-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: Fix callback channel
2015-12-13Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixlets from Thomas Gleixner: "Two trivial fixes which add missing header fileas and forward declarations so the code will compile even when the magic include chains are different" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3: Add missing include for barrier.h irqchip/gic-v3: Add missing struct device_node declaration
2015-12-13Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A single fix to unbreak a clocksource driver which has more than 32bit counter width" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: Mmio: remove artificial 32bit limitation
2015-12-13Merge tag 'char-misc-4.4-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull fpga driver fixes from Greg KH: "Only two small fpga driver fixes here, both have been in linux-next for a while, and resolve some reported issues" * tag 'char-misc-4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: fpga manager: Fix firmware resource leak on error fpga manager: remove label
2015-12-13Merge tag 'staging-4.4-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging driver fixes from Greg KH: "Here are a few staging and IIO driver fixes for 4.4-rc5. All of them resolve reported problems and have been in linux-next for a while. Nothing major here, just small fixes where needed" * tag 'staging-4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: staging: lustre: echo_copy.._lsm() dereferences userland pointers directly iio: adc: spmi-vadc: add missing of_node_put iio: fix some warning messages iio: light: apds9960: correct ->last_busy count iio: lidar: return -EINVAL on invalid signal staging: iio: dummy: complete IIO events delivery to userspace
2015-12-13Merge tag 'usb-4.4-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB driver fixes from Greg KH: "Here are a number of small USB fixes for 4.4-rc5. All of them have been in linux-next. The majority are gadget and phy issues, with a few new quirks and device ids added as well" * tag 'usb-4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (32 commits) USB: add quirk for devices with broken LPM xhci: fix usb2 resume timing and races. usb: musb: fail with error when no DMA controller set usb: gadget: uvc: fix permissions of configfs attributes usb: musb: core: Fix pm runtime for deferred probe usb: phy: msm: fix a possible NULL dereference USB: host: ohci-at91: fix a crash in ohci_hcd_at91_overcurrent_irq usb: Quiet down false peer failure messages usb: xhci: fix config fail of FS hub behind a HS hub with MTT xhci: Fix memory leak in xhci_pme_acpi_rtd3_enable() usb: Use the USB_SS_MULT() macro to decode burst multiplier for log message USB: whci-hcd: add check for dma mapping error usb: core : hub: Fix BOS 'NULL pointer' kernel panic USB: quirks: Apply ALWAYS_POLL to all ELAN devices usb-storage: Fix scsi-sd failure "Invalid field in cdb" for USB adapter JMicron USB: quirks: Fix another ELAN touchscreen usb: dwc3: gadget: don't prestart interrupt endpoints USB: serial: Another Infineon flash loader USB ID USB: cdc_acm: Ignore Infineon Flash Loader utility USB: cp210x: Remove CP2110 ID from compatibility list ...
2015-12-13libnvdimm, pfn: fix pfn seed creationDan Williams
Similar to btt, plant a new pfn seed when the existing one is activated. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-13libnvdimm, pfn: add parent uuid validation Dan Williams
Track and check the uuid of the namespace hosting a pfn instance. This forces the pfn info block to be invalidated if the namespace is re-configured with a different uuid. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-12Merge tag 'fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull ARM SoC fixes from Arnd Bergmann: "Here are a bunch of small bug fixes for various ARM platforms, nothing really sticks out this week, most of either fixes bugs in code that was just added in 4.4, or that has been broken for many years without anyone noticing. at91/sama5d2: - fix sama5de hardware setup of sd/mmc interface - proper selection of pinctrl drivers. PIO4 is necessary for sama5d2 berlin: - fix incorrect clock input for SDIO exynos: - Fix potential NULL pointer dereference in Exynos PMU driver. imx: - Fix vf610 SAI clock configuration bug which is discovered by the newly added master mode support in SAI audio driver. - Fix buggy L2 cache latency values in vf610 device trees, which may cause system hang when cpu runs at a higher frequency. ixp4xx: - fix prototypes for readl/writel functions ls2080a: - use little-endian register access for GPIO and SDHCI omap: - Fix clock source for ARM TWD and global timers on am437x - Always select REGULATOR_FIXED_VOLTAGE for omap2+ instead of when MACH_OMAP3_PANDORA is selected - Fix SPI DMA handles for dm816x as only some were mapped - Fix up mbox cells for dm816x to make mailbox usable pxa: - use PWM lookup table for all ezx machines s3c24xx: - Remove incorrect __init annotation from s3c24xx cpufreq driver structures. versatile: - fix PCI IRQ mapping on Versatile PB" * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: ls2080a/dts: Add little endian property for GPIO IP block dt-bindings: define little-endian property for QorIQ GPIO ARM64: dts: ls2080a: fix eSDHC endianness ARM: dts: vf610: use reset values for L2 cache latencies ARM: pxa: use PWM lookup table for all machines ARM: dts: berlin: add 2nd clock for BG2Q sdhci0 and sdhci1 ARM: dts: berlin: correct BG2Q's sdhci2 2nd clock ARM: dts: am4372: fix clock source for arm twd and global timers ARM: at91: fix pinctrl driver selection ARM: at91/dt: add always-on to 1.8V regulator ARM: dts: vf610: fix clock definition for SAI2 ARM: imx: clk-vf610: fix SAI clock tree ARM: ixp4xx: fix read{b,w,l} return types irqchip/versatile-fpga: Fix PCI IRQ mapping on Versatile PB ARM: OMAP2+: enable REGULATOR_FIXED_VOLTAGE ARM: dts: add dm816x missing spi DT dma handles ARM: dts: add dm816x missing #mbox-cells cpufreq: s3c24xx: Do not mark s3c2410_plls_add as __init ARM: EXYNOS: Fix potential NULL pointer access in exynos_sys_powerdown_conf
2015-12-12libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZEDan Williams
When setting aside capacity for struct page it must be aligned to the largest mapping size that is to be made available via DAX. Make the alignment configurable to enable support for 1GiB page-size mappings. The offset for PFN_MODE_RAM may now be larger than SZ_8K, so fixup the offset check in nvdimm_namespace_attach_pfn(). Reported-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-12Merge tag 'powerpc-4.4-4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - opal-irqchip: Fix double endian conversion from Alistair Popple - cxl: Set endianess of kernel contexts from Frederic Barrat - sbc8641: drop bogus PHY IRQ entries from DTS file from Paul Gortmaker - Revert "powerpc/eeh: Don't unfreeze PHB PE after reset" from Andrew Donnellan * tag 'powerpc-4.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: Revert "powerpc/eeh: Don't unfreeze PHB PE after reset" powerpc/sbc8641: drop bogus PHY IRQ entries from DTS file cxl: Set endianess of kernel contexts powerpc/opal-irqchip: Fix double endian conversion
2015-12-12Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "17 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: MIPS: fix DMA contiguous allocation sh64: fix __NR_fgetxattr ocfs2: fix SGID not inherited issue mm/oom_kill.c: avoid attempting to kill init sharing same memory drivers/base/memory.c: prohibit offlining of memory blocks with missing sections tmpfs: fix shmem_evict_inode() warnings on i_blocks mm/hugetlb.c: fix resv map memory leak for placeholder entries mm: hugetlb: call huge_pte_alloc() only if ptep is null kernel: remove stop_machine() Kconfig dependency mm: kmemleak: mark kmemleak_init prototype as __init mm: fix kerneldoc on mem_cgroup_replace_page osd fs: __r4w_get_page rely on PageUptodate for uptodate MAINTAINERS: make Vladimir co-maintainer of the memory controller mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress mm: fix swapped Movable and Reclaimable in /proc/pagetypeinfo memcg: fix memory.high target mm: hugetlb: fix hugepage memory leak caused by wrong reserve count
2015-12-12Merge branch 'parisc-4.4-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull parisc fixes from Helge Deller: "Fix the boot crash on Mako machines with Huge Pages, prevent a panic with SATA controllers (and others) by correctly calculating the IOMMU space, hook up the mlock2 syscall and drop unneeded code in the parisc pci code" * 'parisc-4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Disable huge pages on Mako machines parisc: Wire up mlock2 syscall parisc: Remove unused pcibios_init_bus() parisc iommu: fix panic due to trying to allocate too large region
2015-12-12Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer fixes from Jens Axboe: "A set of fixes for the current series. This contains: - A bunch of fixes for lightnvm, should be the last round for this series. From Matias and Wenwei. - A writeback detach inode fix from Ilya, also marked for stable. - A block (though it says SCSI) fix for an OOPS in SCSI runtime power management. - Module init error path fixes for null_blk from Minfei" * 'for-linus' of git://git.kernel.dk/linux-block: null_blk: Fix error path in module initialization lightnvm: do not compile in debugging by default lightnvm: prevent gennvm module unload on use lightnvm: fix media mgr registration lightnvm: replace req queue with nvmdev for lld lightnvm: comments on constants lightnvm: check mm before use lightnvm: refactor spin_unlock in gennvm_get_blk lightnvm: put blks when luns configure failed lightnvm: use flags in rrpc_get_blk block: detach bdev inode from its wb in __blkdev_put() SCSI: Fix NULL pointer dereference in runtime PM
2015-12-12Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Catalin Marinas: - Update the linker script to use L1_CACHE_BYTES instead of hard-coded 64. We recently changed L1_CACHE_BYTES to 128 - Improve race condition reporting on set_pte_at() and change the BUG to WARN_ONCE. With hardware update of the accessed/dirty state, we need to ensure that set_pte_at() does not inadvertently override hardware updated state. The patch also makes the checks ignore !pte_valid() new entries * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: Improve error reporting on set_pte_at() checks arm64: update linker script to increased L1_CACHE_BYTES value
2015-12-12MIPS: fix DMA contiguous allocationQais Yousef
Recent changes to how GFP_ATOMIC is defined seems to have broken the condition to use mips_alloc_from_contiguous() in mips_dma_alloc_coherent(). I couldn't bottom out the exact change but I think it's this commit d0164adc89f6 ("mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd"). GFP_ATOMIC has multiple bits set and the check for !(gfp & GFP_ATOMIC) isn't enough. The reason behind this condition is to check whether we can potentially do a sleeping memory allocation. Use gfpflags_allow_blocking() instead which should be more robust. Signed-off-by: Qais Yousef <qais.yousef@imgtec.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12sh64: fix __NR_fgetxattrDmitry V. Levin
According to arch/sh/kernel/syscalls_64.S and common sense, __NR_fgetxattr has to be defined to 259, but it doesn't. Instead, it's defined to 269, which is of course used by another syscall, __NR_sched_setaffinity in this case. This bug was found by strace test suite. Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Acked-by: Geert Uytterhoeven <geert+renesas@glider.be> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12ocfs2: fix SGID not inherited issueJunxiao Bi
Commit 8f1eb48758aa ("ocfs2: fix umask ignored issue") introduced an issue, SGID of sub dir was not inherited from its parents dir. It is because SGID is set into "inode->i_mode" in ocfs2_get_init_inode(), but is overwritten by "mode" which don't have SGID set later. Fixes: 8f1eb48758aa ("ocfs2: fix umask ignored issue") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Acked-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12mm/oom_kill.c: avoid attempting to kill init sharing same memoryChen Jie
It's possible that an oom killed victim shares an ->mm with the init process and thus oom_kill_process() would end up trying to kill init as well. This has been shown in practice: Out of memory: Kill process 9134 (init) score 3 or sacrifice child Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB Kill process 1 (init) sharing same memory ... Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 And this will result in a kernel panic. If a process is forked by init and selected for oom kill while still sharing init_mm, then it's likely this system is in a recoverable state. However, it's better not to try to kill init and allow the machine to panic due to unkillable processes. [rientjes@google.com: rewrote changelog] [akpm@linux-foundation.org: fix inverted test, per Ben] Signed-off-by: Chen Jie <chenjie6@huawei.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12drivers/base/memory.c: prohibit offlining of memory blocks with missing sectionsSeth Jennings
Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64 systems") and 982792c782ef ("x86, mm: probe memory block size for generic x86 64bit") introduced large block sizes for x86. This made it possible to have multiple sections per memory block where previously, there was a only every one section per block. Since blocks consist of contiguous ranges of section, there can be holes in the blocks where sections are not present. If one attempts to offline such a block, a crash occurs since the code is not designed to deal with this. This patch is a quick fix to gaurd against the crash by not allowing blocks with non-present sections to be offlined. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=107781 Signed-off-by: Seth Jennings <sjennings@variantweb.net> Reported-by: Andrew Banman <abanman@sgi.com> Cc: Daniel J Blueman <daniel@numascale.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Greg KH <greg@kroah.com> Cc: Russ Anderson <rja@sgi.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12tmpfs: fix shmem_evict_inode() warnings on i_blocksHugh Dickins
Dmitry Vyukov provides a little program, autogenerated by syzkaller, which races a fault on a mapping of a sparse memfd object, against truncation of that object below the fault address: run repeatedly for a few minutes, it reliably generates shmem_evict_inode()'s WARN_ON(inode->i_blocks). (But there's nothing specific to memfd here, nor to the fstat which it happened to use to generate the fault: though that looked suspicious, since a shmem_recalc_inode() had been added there recently. The same problem can be reproduced with open+unlink in place of memfd_create, and with fstatfs in place of fstat.) v3.7 commit 0f3c42f522dc ("tmpfs: change final i_blocks BUG to WARNING") explains one cause of such a warning (a race with shmem_writepage to swap), and possible solutions; but we never took it further, and this syzkaller incident turns out to have a different cause. shmem_getpage_gfp()'s error recovery, when a freshly allocated page is then found to be beyond eof, looks plausible - decrementing the alloced count that was just before incremented - but in fact can go wrong, if a racing thread (the truncator, for example) gets its shmem_recalc_inode() in just after our delete_from_page_cache(). delete_from_page_cache() decrements nrpages, that shmem_recalc_inode() will balance the books by decrementing alloced itself, then our decrement of alloced take it one too low: leading to the WARNING when the object is finally evicted. Once the new page has been exposed in the page cache, shmem_getpage_gfp() must leave it to shmem_recalc_inode() itself to get the accounting right in all cases (and not fall through from "trunc:" to "decused:"). Adjust that error recovery block; and the reinitialization of info and sbinfo can be removed too. While we're here, fix shmem_writepage() to avoid the original issue: it will be safe against a racing shmem_recalc_inode(), if it merely increments swapped before the shmem_delete_from_page_cache() which decrements nrpages (but it must then do its own shmem_recalc_inode() before that, while still in balance, instead of after). (Aside: why do we shmem_recalc_inode() here in the swap path? Because its raison d'etre is to cope with clean sparse shmem pages being reclaimed behind our back: so here when swapping is a good place to look for that case.) But I've not now managed to reproduce this bug, even without the patch. I don't see why I didn't do that earlier: perhaps inhibited by the preference to eliminate shmem_recalc_inode() altogether. Driven by this incident, I do now have a patch to do so at last; but still want to sit on it for a bit, there's a couple of questions yet to be resolved. Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12mm/hugetlb.c: fix resv map memory leak for placeholder entriesMike Kravetz
Dmitry Vyukov reported the following memory leak unreferenced object 0xffff88002eaafd88 (size 32): comm "a.out", pid 5063, jiffies 4295774645 (age 15.810s) hex dump (first 32 bytes): 28 e9 4e 63 00 88 ff ff 28 e9 4e 63 00 88 ff ff (.Nc....(.Nc.... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: kmalloc include/linux/slab.h:458 region_chg+0x2d4/0x6b0 mm/hugetlb.c:398 __vma_reservation_common+0x2c3/0x390 mm/hugetlb.c:1791 vma_needs_reservation mm/hugetlb.c:1813 alloc_huge_page+0x19e/0xc70 mm/hugetlb.c:1845 hugetlb_no_page mm/hugetlb.c:3543 hugetlb_fault+0x7a1/0x1250 mm/hugetlb.c:3717 follow_hugetlb_page+0x339/0xc70 mm/hugetlb.c:3880 __get_user_pages+0x542/0xf30 mm/gup.c:497 populate_vma_page_range+0xde/0x110 mm/gup.c:919 __mm_populate+0x1c7/0x310 mm/gup.c:969 do_mlock+0x291/0x360 mm/mlock.c:637 SYSC_mlock2 mm/mlock.c:658 SyS_mlock2+0x4b/0x70 mm/mlock.c:648 Dmitry identified a potential memory leak in the routine region_chg, where a region descriptor is not free'ed on an error path. However, the root cause for the above memory leak resides in region_del. In this specific case, a "placeholder" entry is created in region_chg. The associated page allocation fails, and the placeholder entry is left in the reserve map. This is "by design" as the entry should be deleted when the map is released. The bug is in the region_del routine which is used to delete entries within a specific range (and when the map is released). region_del did not handle the case where a placeholder entry exactly matched the start of the range range to be deleted. In this case, the entry would not be deleted and leaked. The fix is to take these special placeholder entries into account in region_del. The region_chg error path leak is also fixed. Fixes: feba16e25a57 ("mm/hugetlb: add region_del() to delete a specific range of entries") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: <stable@vger.kernel.org> [4.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>