path: root/fs
Age | Commit message | Author
2013-11-05 | tree-wide: use reinit_completion instead of INIT_COMPLETION | Wolfram Sang
Use this new function to make the code more comprehensible, since we are reinitializing the completion, not initializing it. Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Acked-by: Linus Walleij <linus.walleij@linaro.org> (personally at LCE13) Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | seq_file: remove "%n" usage from seq_file users | Tetsuo Handa
All seq_printf() users of "%n" use it to calculate the padding size; convert them to use the seq_setwidth() / seq_pad() pair. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Joe Perches <joe@perches.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | seq_file: introduce seq_setwidth() and seq_pad() | Tetsuo Handa
Several users want to know how many bytes seq_*() wrote, for alignment purposes. Currently they use the %n format to find out, because seq_*() returns 0 on success. This patch introduces seq_setwidth() and seq_pad() so that they can align output without using the %n format. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Joe Perches <joe@perches.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
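The set-width-then-pad idea described above can be sketched in user space. This is only an illustration: `struct sketch_buf` and the `sketch_*` functions are hypothetical stand-ins for `struct seq_file`, `seq_setwidth()` and `seq_pad()`, not the kernel code itself.

```c
#include <stddef.h>

/* Hypothetical user-space stand-in for struct seq_file. */
struct sketch_buf {
    char data[128];
    size_t count;      /* bytes written so far */
    size_t pad_until;  /* position that sketch_pad() fills up to */
};

/* Append one character, always keeping the buffer NUL-terminated. */
void sketch_putc(struct sketch_buf *b, char c)
{
    if (b->count + 1 < sizeof(b->data)) {
        b->data[b->count++] = c;
        b->data[b->count] = '\0';
    }
}

void sketch_puts(struct sketch_buf *b, const char *s)
{
    while (*s)
        sketch_putc(b, *s++);
}

/* Remember where the field that starts at the current position must end. */
void sketch_setwidth(struct sketch_buf *b, size_t width)
{
    b->pad_until = b->count + width;
}

/* Fill with spaces up to the remembered position, then emit c if non-zero. */
void sketch_pad(struct sketch_buf *b, char c)
{
    while (b->count < b->pad_until && b->count + 1 < sizeof(b->data))
        sketch_putc(b, ' ');
    if (c)
        sketch_putc(b, c);
}
```

A caller would invoke `sketch_setwidth()` before writing a field and `sketch_pad()` after it, so the byte count never has to be read back through "%n".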
2013-11-05 | mm, hugetlb: convert hugetlbfs to use split pmd lock | Kirill A. Shutemov
Hugetlb supports multiple page sizes. We use split lock only for PMD level, but not for PUD. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Alex Thorlton <athorlton@sgi.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Robin Holt <robinmholt@gmail.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | mm, thp: change pmd_trans_huge_lock() to return taken lock | Kirill A. Shutemov
With split ptlock it's important to know which lock pmd_trans_huge_lock() took. This patch adds one more parameter to the function to return the lock. In most places migration to new api is trivial. Exception is move_huge_pmd(): we need to take two locks if pmd tables are different. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Alex Thorlton <athorlton@sgi.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Robin Holt <robinmholt@gmail.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | mm: convert mm->nr_ptes to atomic_long_t | Kirill A. Shutemov
With split page table lock for PMD level we can't hold mm->page_table_lock while updating nr_ptes. Let's convert it to atomic_long_t to avoid races. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Alex Thorlton <athorlton@sgi.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Robin Holt <robinmholt@gmail.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
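A user-space analogue of the lock-free counter, using C11 atomics in place of the kernel's `atomic_long_t` helpers (`atomic_long_inc()`, `atomic_long_dec()`, `atomic_long_read()`), might look like the sketch below. `struct sketch_mm` and the function names are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the one field of struct mm_struct discussed here. */
struct sketch_mm {
    atomic_long nr_ptes;
};

/* A page table was allocated: no page_table_lock needed for the update. */
void sketch_pte_alloc(struct sketch_mm *mm)
{
    atomic_fetch_add(&mm->nr_ptes, 1);
}

/* A page table was freed. */
void sketch_pte_free(struct sketch_mm *mm)
{
    atomic_fetch_sub(&mm->nr_ptes, 1);
}

long sketch_nr_ptes(struct sketch_mm *mm)
{
    return atomic_load(&mm->nr_ptes);
}
```

Because each update is a single atomic read-modify-write, two threads holding different split PMD locks can adjust the counter concurrently without losing an increment.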
2013-11-05 | Merge branch 'akpm-current/current' | Stephen Rothwell
Conflicts: arch/x86/mm/init.c drivers/net/wireless/rt2x00/rt2800pci.c fs/bio-integrity.c mm/bounce.c
2013-11-05 | devpts: plug the memory leak in kill_sb | Ilija Hadzic
When devpts is unmounted, there may be a no-longer-used IDR tree hanging off the superblock we are about to kill. This needs to be cleaned up before destroying the SB. The leak is usually not a big deal because unmounting devpts is typically done when shutting down the whole machine. However, shutting down an LXC container instead of a physical machine exposes the problem (the garbage is detectable with kmemleak). Signed-off-by: Ilija Hadzic <ihadzic@research.bell-labs.com> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | exec/ptrace: fix get_dumpable() incorrect tests | Kees Cook
The get_dumpable() return value is not boolean. Most users of the function actually want to be testing for non-SUID_DUMP_USER(1) rather than SUID_DUMP_DISABLE(0). The SUID_DUMP_ROOT(2) is also considered a protected state. Almost all places did this correctly, excepting the two places fixed in this patch.
Wrong logic:
    if (dumpable == SUID_DUMP_DISABLE) { /* be protective */ }
        or
    if (dumpable == 0) { /* be protective */ }
        or
    if (!dumpable) { /* be protective */ }
Correct logic:
    if (dumpable != SUID_DUMP_USER) { /* be protective */ }
        or
    if (dumpable != 1) { /* be protective */ }
Without this patch, if the system had set the sysctl fs/suid_dumpable=2, a user was able to ptrace attach to processes that had dropped privileges to that user. (This may have been partially mitigated if Yama was enabled.) The macros have been moved into the file that declares get/set_dumpable(), which means things like the ia64 code can see them too.
CVE-2013-2929
Reported-by: Vasily Kulikov <segoon@openwall.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: "Luck, Tony" <tony.luck@intel.com> Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
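The difference between the two tests can be checked with a small stand-alone sketch. The macro values are the ones given in the commit message; the two helper functions are hypothetical names used only for illustration.

```c
#include <stdbool.h>

/* Values as listed in the commit message. */
#define SUID_DUMP_DISABLE 0
#define SUID_DUMP_USER    1
#define SUID_DUMP_ROOT    2

/* Wrong: treats SUID_DUMP_ROOT (2) as unprotected. */
bool sketch_protect_wrong(int dumpable)
{
    return dumpable == SUID_DUMP_DISABLE;
}

/* Correct: anything other than SUID_DUMP_USER is a protected state. */
bool sketch_protect(int dumpable)
{
    return dumpable != SUID_DUMP_USER;
}
```

With `dumpable == SUID_DUMP_ROOT`, the first helper returns false and the second returns true, which is the case the patch addresses.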
2013-11-05 | procfs: clean up proc_reg_get_unmapped_area for 80-column limit | HATAYAMA Daisuke
Clean up proc_reg_get_unmapped_area due to its 80-column limit violation. Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | fat: permit to return phy block number by fibmap in fallocated region | Namjae Jeon
Make the fibmap call return the proper physical block number for any offset request in the fallocated range. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | fat: fallback to buffered write in case of fallocatded region on direct IO | Namjae Jeon
For a normal direct IO write, seeking to a location beyond the file size makes the write fall back to a buffered write to fill that region. Similarly, make a write into a fallocated region fall back to a buffered write. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | fat: zero out seek range on _fat_get_block | Namjae Jeon
For normal buffered write operations, a write to an offset greater than the file size does a cont_expand_zero up to that offset. In fallocated regions the blocks are already allocated, so zero out the buffers for those blocks up to the seeked offset. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | fat: add fat_fallocate operation | Namjae Jeon
Implement preallocation via the fallocate syscall on VFAT partitions. This patch is based on an earlier patch of the same name, which had the issues detailed below and was not accepted. See https://lkml.org/lkml/2007/12/22/130.
a) The preallocated space was not persistent when the FALLOC_FL_KEEP_SIZE flag was set. The clusters are deallocated at evict time.
b) There was no need to zero out the clusters when the flag was set. Instead of doing an expanding truncate, just allocate clusters and add them to the fat chain. This reduces preallocation time.
Compatibility with Windows: There are no issues when FALLOC_FL_KEEP_SIZE is not set, because it just does an expanding truncate. Thus reading from the preallocated area on Windows returns null until data is written to it. When a file with an area preallocated using FALLOC_FL_KEEP_SIZE was written to on Windows, the Windows driver freed the preallocated clusters and allocated new clusters for the new data. The freed clusters are reflected in the free space available for the partition, which can be seen in the Volume properties. The Windows chkdsk tool also does not report any errors on a disk containing files with preallocated space. There is also no issue with the Linux fat fsck, because it discards preallocated clusters at repair time. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | fat: add i_disksize to represent uninitialized size | Namjae Jeon
Add i_disksize to represent the uninitialized allocated size; mmu_private represents the initialized allocated size. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | hfsplus-implement-attributes-file-creation-functionality-v2 | Vyacheslav Dubeyko
* SETOFFSET macro was removed (Andrew Morton).
* Record offsets logic was reworked in hfsplus_init_header_node() (Al Viro).
* AttributesFile's header node's bitmap is initialized by memset() (Al Viro).
* Logic of checking attributes tree state before AttributesFile creation was reworked for HFSPLUS_CREATING_ATTR_TREE case (Andrew Morton).
* Setting error code was added after read_mapping_page() call in hfsplus_create_attributes_file() method (Andrew Morton).
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hin-Tak Leung <htl10@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | hfsplus: implement attributes file creation functionality | Vyacheslav Dubeyko
Implement creation of the AttributesFile metadata file on an HFS+ volume when it is absent. Mounting an HFS+ volume tries to open the AttributesFile's B-tree. If the volume has no AttributesFile, the pointer to the AttributesFile's B-tree stays NULL. When the absence of the AttributesFile is then discovered at the beginning of an xattr creation operation, the AttributesFile is created. Creation succeeds if (2 * clump) free blocks are available on the HFS+ volume. Otherwise, the creation operation ends with an error (-ENOSPC). Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Hin-Tak Leung <htl10@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | hfsplus-implement-attributes-files-header-node-initialization-code-v2 | Vyacheslav Dubeyko
* SETOFFSET macro was removed (Andrew Morton).
* Record offsets logic was reworked in hfsplus_init_header_node() (Al Viro).
* AttributesFile's header node's bitmap is initialized by memset() (Al Viro).
* Logic of checking attributes tree state before AttributesFile creation was reworked for HFSPLUS_CREATING_ATTR_TREE case (Andrew Morton).
* Setting error code was added after read_mapping_page() call in hfsplus_create_attributes_file() method (Andrew Morton).
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hin-Tak Leung <htl10@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | hfsplus: implement attributes file's header node initialization code | Vyacheslav Dubeyko
Implement functionality of AttributesFile's header node initialization. Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hin-Tak Leung <htl10@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | hfsplus: add metadata file's clump size calculation functionality | Vyacheslav Dubeyko
There are situations in which an HFS+ volume has been created without an AttributesFile. This can happen when an old mkfs.hfs utility was used, or when the volume was created without the need for xattrs in mind. For example, Mac OS X 10.4 (Tiger) doesn't create the AttributesFile during the mkfs phase. It is also a very common situation for users who created HFS+ volumes under Linux. As a result, xattrs and POSIX ACLs on the HFS+ volume are unavailable to such users. This patchset implements AttributesFile creation on an HFS+ volume when this metadata file is found to be absent during an xattr creation operation. This patch: Add calculation of a metadata file's clump size. AttributesFile creation needs the clump size to be set. This value will be used when the AttributesFile is extended. This code is adapted from the newfs_hfs utility of the diskdev_cmds package http://opensource.apple.com/tarballs/diskdev_cmds/. Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hin-Tak Leung <htl10@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | autofs4: translate pids to the right namespace for the daemon | Miklos Szeredi
The PID and the TGID of the process triggering the mount are sent to the daemon. Currently the global pid values are sent (ones valid in the initial pid namespace) but this is wrong if the autofs daemon itself is not running in the initial pid namespace. So send the pid values that are valid in the namespace of the autofs daemon. The namespace to use is taken from the oz_pgrp pid pointer, which was set at mount time to the mounting process' pid namespace. If the pid translation fails (the triggering process is in an unrelated pid namespace) then the automount fails with ENOENT. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Ian Kent <raven@themaw.net> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
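The translation step can be modelled in user space as a lookup of the number a process has in a given namespace, failing with ENOENT when the process is not visible there. The types and function names below are hypothetical; the kernel's own `pid_nr_ns()` likewise returns 0 when a pid has no number in the requested namespace.

```c
#include <errno.h>

#define SKETCH_MAX_LEVELS 4

/* One (namespace, number) pair; namespaces are modelled as plain ints. */
struct sketch_upid {
    int ns;
    int nr;
};

/* A process has one number per namespace level it is visible in. */
struct sketch_pid {
    int levels;
    struct sketch_upid numbers[SKETCH_MAX_LEVELS];
};

/* Number of pid as seen from ns, or 0 if pid is not visible in ns. */
int sketch_pid_nr_ns(const struct sketch_pid *pid, int ns)
{
    for (int i = 0; i < pid->levels; i++)
        if (pid->numbers[i].ns == ns)
            return pid->numbers[i].nr;
    return 0;
}

/* Translate for the daemon's namespace; fail like the automount would. */
int sketch_translate_for_daemon(const struct sketch_pid *pid, int daemon_ns,
                                int *out)
{
    int nr = sketch_pid_nr_ns(pid, daemon_ns);

    if (!nr)
        return -ENOENT;
    *out = nr;
    return 0;
}
```

A process known as 4321 in namespace 0 and as 12 in namespace 7 would be reported to a daemon in namespace 7 as 12, and a daemon in an unrelated namespace would get -ENOENT.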
2013-11-05 | autofs4: allow autofs to work outside the initial PID namespace | Sukadev Bhattiprolu
Enable autofs4 to work in a "container". oz_pgrp is converted from pid_t to struct pid and this is stored at mount time based on the "pgrp=" option or, if the option is missing, the current pgrp. The "pgrp=" option is interpreted in the PID namespace of the current process. This option is flawed in that it doesn't carry the namespace information, so it should be deprecated. AFAICS the autofs daemon always sends the current pgrp, which is the default anyway. The oz_pgrp is also set from the AUTOFS_DEV_IOCTL_SETPIPEFD_CMD ioctl. This ioctl sets oz_pgrp to the current pgrp. It is not allowed to change the pid namespace. oz_pgrp is used mainly to determine whether the process traversing the autofs mount tree is the autofs daemon itself or not. This function now compares the pid pointers instead of the pid_t values. One other use of oz_pgrp is in autofs4_show_options. There it shows the virtual pid number (i.e. the one that is valid inside the PID namespace of the calling process). For debugging printk, oz_pgrp is converted to the value in the initial pid namespace. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Ian Kent <raven@themaw.net> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | binfmt_elf.c: use get_random_int() to fix entropy depleting | Jeff Liu
Entropy is quickly depleted under normal operations like ls(1), cat(1), etc. from 2.6.30 to current mainline, for instance:
    $ cat /proc/sys/kernel/random/entropy_avail
    3428
    $ cat /proc/sys/kernel/random/entropy_avail
    2911
    $ cat /proc/sys/kernel/random/entropy_avail
    2620
We observed that this problem has been occurring since 2.6.30 with fs/binfmt_elf.c: create_elf_tables()->get_random_bytes(), introduced by f06295b44c296c8f ("ELF: implement AT_RANDOM for glibc PRNG seeding").
    /*
     * Generate 16 random bytes for userspace PRNG seeding.
     */
    get_random_bytes(k_rand_bytes, sizeof(k_rand_bytes));
The patch introduces a wrapper around get_random_int() which has lower overhead than calling get_random_bytes() directly. With this patch applied:
    $ cat /proc/sys/kernel/random/entropy_avail
    2731
    $ cat /proc/sys/kernel/random/entropy_avail
    2802
    $ cat /proc/sys/kernel/random/entropy_avail
    2878
Analyzed by John Sobecki. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <aedilger@gmail.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Arnd Bergmann <arnn@arndb.de> Cc: John Sobecki <john.sobecki@oracle.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
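The shape of such a wrapper, filling a byte buffer from repeated calls to an integer generator, can be sketched as follows. This is an assumption about the general technique, not the patch itself: `sketch_fill_bytes()` and `sketch_demo_source()` are hypothetical, and the demo source is a fixed pattern rather than a random generator.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Any function that yields one 32-bit value per call. */
typedef uint32_t (*sketch_u32_source)(void);

/* Deterministic demo source so the behaviour is easy to inspect. */
uint32_t sketch_demo_source(void)
{
    return 0x01010101u;
}

/* Fill buf with nbytes bytes, drawing one 32-bit value per chunk. */
void sketch_fill_bytes(unsigned char *buf, size_t nbytes,
                       sketch_u32_source next)
{
    while (nbytes) {
        uint32_t r = next();
        size_t chunk = nbytes < sizeof(r) ? nbytes : sizeof(r);

        /* Copy only as many bytes as are still needed. */
        memcpy(buf, &r, chunk);
        buf += chunk;
        nbytes -= chunk;
    }
}
```

For the 16-byte AT_RANDOM seed mentioned above, a loop of this shape would draw four 32-bit values.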
2013-11-05 | epoll-do-not-take-global-epmutex-for-simple-topologies-fix | Andrew Morton
- use `bool' for boolean variables
- remove unneeded/undesirable cast of void*
- add missed ep_scan_ready_list() kerneldoc
Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Eric Wong <normalperson@yhbt.net> Cc: Jason Baron <jbaron@akamai.com> Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Nelson Elhage <nelhage@nelhage.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | epoll: do not take global 'epmutex' for simple topologies | Jason Baron
When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached directly to a wakeup source, we do not need to take the global 'epmutex', unless the epoll file descriptor is nested. The purpose of taking the 'epmutex' on add is to prevent complex topologies such as loops and deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD operations. However, for the simple case of an epoll file descriptor attached directly to a wakeup source (with no nesting), we do not need to hold the 'epmutex'. This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb performance: "On the 16 socket run the performance went from 35k jOPS to 125k jOPS. In addition the benchmark when from scaling well on 10 sockets to scaling well on just over 40 sockets. ... Currently the benchmark stops scaling at around 40-44 sockets but it seems like I found a second unrelated bottleneck." Signed-off-by: Jason Baron <jbaron@akamai.com> Tested-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Eric Wong <normalperson@yhbt.net> Cc: Nelson Elhage <nelhage@nelhage.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | epoll: remove the on_list check for 'struct epitem' | Jason Baron
By removing the 'int on_list' field from 'struct epitem', we avoid hitting the BUILD_BUG_ON() for 'struct epitem' being larger than 128 bytes. In file included from include/linux/init.h:4:0, from fs/eventpoll.c:14: fs/eventpoll.c: In function 'eventpoll_init': include/linux/compiler.h:321:20: error: call to '__compiletime_assert_2137' declared with attribute error: BUILD_BUG_ON failed: sizeof(void *) <= 8 && sizeof(struct epitem) > 128 prefix ## suffix(); \ The check to make sure that the 'struct epitem' was actually linked via epi->fllink was added to avoid having the list removal primitives called twice for the same 'struct epitem'. However, the double call possibility was removed by 'Subject: epoll: optimize EPOLL_CTL_DEL using rcu'. There, the call to 'list_del_init()' in eventpoll_release_file() was removed (we now rely on the list delete happening entirely in 'ep_remove()', which is called from eventpoll_release_file()). There is also the question as to whether multiple ep_remove() calls could happen concurrently. This can not happen since EPOLL_CTL_DEL can't race with eventpoll_release_file() or ep_free() - it has to do an fget() to proceed. Further, eventpoll_release_file() can not race with ep_free(), since they both acquire the 'epmutex'. Signed-off-by: Jason Baron <jbaron@akamai.com> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Eric Wong <normalperson@yhbt.net> Cc: Nelson Elhage <nelhage@nelhage.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | epoll: optimize EPOLL_CTL_DEL using rcu | Jason Baron
Nathan Zimmer found that once we get beyond 10 cpus, the scalability of SPECjbb falls over due to the contention on the global 'epmutex', which is taken on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations. Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path by using rcu to guard against any concurrent traversals. Patch #2 removes the 'epmutex' lock from EPOLL_CTL_ADD operations for simple topologies, i.e. when adding a link from an epoll file descriptor to a wakeup source, where the epoll file descriptor is not nested. This patch (of 2): Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by converting the file->f_ep_links list into an rcu one. In this way, we can traverse the epoll network on the add path in parallel with deletes. Since deletes can't create loops or worse wakeup paths, this is safe. This patch, in combination with the patch "epoll: Do not take global 'epmutex' for simple topologies", shows a dramatic performance improvement in scalability for SPECjbb. Signed-off-by: Jason Baron <jbaron@akamai.com> Tested-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Eric Wong <normalperson@yhbt.net> Cc: Nelson Elhage <nelhage@nelhage.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | cramfs: mark as obsolete | Michael Opdenacker
Who needs cramfs when you have squashfs? At least, we should warn people that cramfs is obsolete. Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | writeback-do-not-sync-data-dirtied-after-sync-start-fix-3 | Jan Kara
Fixup sync_inodes_sb() comment Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | writeback-do-not-sync-data-dirtied-after-sync-start-fix-2.txt | Jan Kara
On Wed 09-10-13 14:21:25, Andrew Morton wrote:
> On Wed, 9 Oct 2013 17:03:25 +0200 Jan Kara <jack@suse.cz> wrote:
>
> > From: Jan Kara <jack@suse.cz>
> > Date: Wed, 9 Oct 2013 15:41:50 +0200
> > Subject: [PATCH] writeback: Use older_than_this_is_set instead of magic
> >  older_than_this == 0
> >
> > Currently we use 0 as a special value of work->older_than_this to
> > indicate that wb_writeback() should set work->older_that_this to current
> > time. This works but it is a bit magic. So use a special flag in
> > work_struct for that.
>
> OK.
>
> > - if (!work->older_than_this)
> > + if (!work->older_than_this_is_set)
> >  work->older_than_this = jiffies;
>
> It would be logical although presumably unneeded to set
> older_than_this_is_set here?
Yes. Updated.
> > Also fixup writeback from workqueue rescuer to include all inodes.
>
> There's nothing in the patch which matches this sentence?
The sentence is about the hunk below. writeback_inodes_wb() is special in that it directly calls queue_io() (everything else goes through wb_writeback()) and my previous patch thus resulted in using 0 as an older_than_this value => likely we wouldn't queue any inodes for writeback. I've added WARN_ON_ONCE into move_expired_inodes() to increase a chance of catching such mistakes in future (although in this particular case it wouldn't really help because writeback_inodes_wb() gets hardly ever called).
Currently we use 0 as a special value of work->older_than_this to indicate that wb_writeback() should set work->older_than_this to current time. This works but it is a bit magic. So use a special flag in work_struct for that. Also fixup writeback from workqueue rescuer (writeback_inodes_wb()) to include all inodes. Currently it would use 0 as an older_than_this value thus queue_io() would likely not queue any inodes for writeback.
Signed-off-by: Jan Kara <jack@suse.cz> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Dave Chinner <dchinner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | writeback: use older_than_this_is_set instead of magic older_than_this == 0 | Jan Kara
Currently we use 0 as a special value of work->older_than_this to indicate that wb_writeback() should set work->older_than_this to current time. This works but it is a bit magic. So use a special flag in work_struct for that. Also fixup writeback from workqueue rescuer to include all inodes. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Dave Chinner <dchinner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
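The change from a magic zero to an explicit flag can be sketched as below. `struct sketch_work` and `sketch_prepare()` are hypothetical user-space stand-ins, with `now` playing the role of jiffies.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the writeback work item. */
struct sketch_work {
    unsigned long older_than_this;
    bool older_than_this_is_set;  /* explicit flag instead of "value == 0" */
};

/* Default the cutoff to "now" only when the caller did not set one. */
void sketch_prepare(struct sketch_work *work, unsigned long now)
{
    if (!work->older_than_this_is_set) {
        work->older_than_this = now;
        work->older_than_this_is_set = true;
    }
}
```

With the flag, a caller that deliberately sets a cutoff of 0 keeps it, whereas the magic-zero scheme could not tell that case apart from "not set".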
2013-11-05 | writeback: do not sync data dirtied after sync start | Jan Kara
When there are processes heavily creating small files while sync(2) is running, it can easily happen that quite a few new files are created between the WB_SYNC_NONE and WB_SYNC_ALL passes of sync(2). That can happen especially if there are several busy filesystems (remember that sync traverses filesystems sequentially and waits in the WB_SYNC_ALL phase on one fs before starting it on another fs). Because the WB_SYNC_ALL pass is slow (e.g. it causes a transaction commit and cache flush for each inode in ext3), the resulting sync(2) times are rather large. The following script reproduces the problem:
    function run_writers
    {
        for (( i = 0; i < 10; i++ )); do
            mkdir $1/dir$i
            for (( j = 0; j < 40000; j++ )); do
                dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
            done &
        done
    }
    for dir in "$@"; do
        run_writers $dir
    done
    sleep 40
    time sync
======
Fix the problem by disregarding inodes dirtied after sync(2) was called in the WB_SYNC_ALL pass. To allow for this, sync_inodes_sb() now takes a time stamp of when sync started, which is used for setting up work for flusher threads. To give some numbers, when the above script is run on two ext4 filesystems on a simple SATA drive, the average sync time from 10 runs is 267.549 seconds with standard deviation 104.799426. With the patched kernel, the average sync time from 10 runs is 2.995 seconds with standard deviation 0.096. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
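The filtering rule, skip anything dirtied after sync started, can be illustrated with a simplified model. The struct and function are hypothetical, and plain integer comparison stands in for the kernel's jiffies-based time comparisons.

```c
#include <stddef.h>

/* Hypothetical inode model: when it was dirtied and whether it was written. */
struct sketch_inode {
    unsigned long dirtied_when;
    int written;
};

/*
 * Write back only inodes dirtied at or before sync_start and return how
 * many were written; later ones are left for the next writeback pass.
 */
size_t sketch_sync(struct sketch_inode *inodes, size_t n,
                   unsigned long sync_start)
{
    size_t flushed = 0;

    for (size_t i = 0; i < n; i++) {
        if (inodes[i].dirtied_when > sync_start)
            continue;  /* dirtied after sync(2) was called */
        inodes[i].written = 1;
        flushed++;
    }
    return flushed;
}
```

In this model the amount of work is bounded by what was dirty when sync began, so concurrent writers cannot keep extending the slow pass.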
2013-11-05 | /proc/pid/smaps: show VM_SOFTDIRTY flag in VmFlags line | Naoya Horiguchi
This flag shows that the VMA is "newly created" and thus represents "dirty" in the task's VM. You can clear it by "echo 4 > /proc/pid/clear_refs". Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | mm, mempolicy: make mpol_to_str robust and always succeed | David Rientjes
mpol_to_str() should not fail. Currently, it either fails because the string buffer is too small or because a string hasn't been defined for a mempolicy mode. If a new mempolicy mode is introduced and no string is defined for it, just warn and return "unknown". If the buffer is too small, just truncate the string and return, the same behavior as snprintf(). This also fixes a bug where there was no NUL-byte termination when doing *p++ = '=' and *p++ = ':' and maxlen has been reached. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Chen Gang <gang.chen@asianux.com> Cc: Rik van Riel <riel@redhat.com> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
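The "never fail, truncate like snprintf()" behaviour can be sketched in user space as follows. The mode table and function name are hypothetical and only illustrate the fallback to "unknown" and the guaranteed termination; they are not the kernel's mempolicy code.

```c
#include <stddef.h>
#include <stdio.h>

/* Illustrative mode names; the table contents are an assumption. */
static const char *const sketch_modes[] = {
    "default", "prefer", "bind", "interleave",
};

/*
 * Always succeeds: unknown modes map to "unknown", and a short buffer
 * yields a truncated but NUL-terminated string (if maxlen > 0).
 */
void sketch_mode_to_str(char *buf, size_t maxlen, unsigned int mode)
{
    const char *name = "unknown";

    if (mode < sizeof(sketch_modes) / sizeof(sketch_modes[0]))
        name = sketch_modes[mode];
    snprintf(buf, maxlen, "%s", name);
}
```

Because snprintf() writes at most `maxlen - 1` characters plus the terminator, the caller never sees an unterminated buffer, which is the class of bug the commit message describes.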
2013-11-05 | mm: use pgdat_end_pfn() to simplify the code in others | Xishi Qiu
Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn + pgdat->node_spanned_pages". Simplify the code, no functional change. Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05 | posix_acl: uninlining | Andrew Morton
Uninline vast tracts of nested inline functions in include/linux/posix_acl.h. This reduces the text+data+bss size of x86_64 allyesconfig vmlinux by 8026 bytes. The patch also regularises the positioning of the EXPORT_SYMBOLs in posix_acl.c. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: J. Bruce Fields <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Tested-by: Benny Halevy <bhalevy@primarydata.com> Cc: Benny Halevy <bhalevy@panasas.com> Cc: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05anon_inodefs: forbid open via /procOleg Nesterov
open("/proc/pid/$anon-fd") should fail: we can't create the new file with the correct f_op etc. Currently this creates a bogus file with the empty anon_inode_fops; this is harmless but still wrong and misleading. Add anon_inode_fops->anon_open() which simply returns ENXIO like sock_no_open() does in this case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
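The idea of an open callback that always refuses can be modelled outside the kernel. The following is a sketch under assumptions: struct demo_file_ops and demo_do_open() are hypothetical, heavily reduced stand-ins for the VFS structures, not the real interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_file;   /* opaque stand-in for struct file */

/* Hypothetical, much-reduced analogue of struct file_operations. */
struct demo_file_ops {
    int (*open)(struct demo_file *file);
};

/* Refuse every open: a correctly set-up file cannot be built here. */
static int demo_anon_open(struct demo_file *file)
{
    (void)file;
    return -ENXIO;
}

static const struct demo_file_ops demo_anon_inode_fops = {
    .open = demo_anon_open,
};

/* Models a generic open path: a missing ->open means "nothing to do". */
static int demo_do_open(const struct demo_file_ops *ops, struct demo_file *file)
{
    assert(ops != NULL);
    return ops->open ? ops->open(file) : 0;
}
```

With an empty operations table the open quietly succeeds, which mirrors the bogus-file case described above; installing the callback turns that into an explicit error.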
2013-11-05fs/bio-integrity.c: remove duplicated codeGu Zheng
Most of the code in bio_integrity_verify and bio_integrity_generate is the same, so introduce a common function bio_integrity_generate_verify() to remove the duplicated code. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
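The refactoring pattern (one shared walk selected by a flag, with the two old entry points as thin wrappers) might look like the sketch below. Everything here is an assumption for illustration: the XOR checksum, the 8-byte "sector" and all demo_ names are hypothetical and unrelated to the real block-layer integrity code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SECTOR 8   /* tiny "sector" size, for illustration only */

/* Toy checksum standing in for a real integrity tag. */
static uint8_t demo_csum(const uint8_t *p, size_t n)
{
    uint8_t c = 0;

    while (n--)
        c ^= *p++;
    return c;
}

/*
 * Shared walk over all sectors. With generate != 0 a tag is stored per
 * sector; with generate == 0 each sector is compared with its stored tag.
 * Returns 0 on success, or -1 at the first mismatch when verifying.
 */
static int demo_generate_verify(const uint8_t *data, size_t sectors,
                                uint8_t *tags, int generate)
{
    assert(data != NULL && tags != NULL);

    for (size_t i = 0; i < sectors; i++) {
        uint8_t c = demo_csum(data + i * DEMO_SECTOR, DEMO_SECTOR);

        if (generate)
            tags[i] = c;
        else if (tags[i] != c)
            return -1;
    }
    return 0;
}

/* The two former entry points become thin wrappers. */
static int demo_generate(const uint8_t *data, size_t sectors, uint8_t *tags)
{
    return demo_generate_verify(data, sectors, tags, 1);
}

static int demo_verify(const uint8_t *data, size_t sectors, uint8_t *tags)
{
    return demo_generate_verify(data, sectors, tags, 0);
}
```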
2013-11-05ocfs2: update inode size after zeroing the holeJunxiao Bi
fs-writeback will release dirty pages without page lock whose offsets are beyond the inode size; the release happens in block_write_full_page_endio(). If the inode size is not updated, dirty pages in file holes may be released before they are flushed to disk, so the file holes will contain some non-zero data, which causes md5sum errors on sparse files. To reproduce the bug, find a big sparse file with many holes, such as a VM image file, whose actual size is bigger than the available memory so that writeback runs more frequently. Tar it with the -S option, then repeatedly untar it and check its md5sum until you get a wrong md5sum. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Younger Liu <younger.liu@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05ocfs2: fix issue that ocfs2_setattr() does not deal with new_i_size==i_sizeYounger Liu
The issue scenario is as follows: - Create a small file and fallocate a large disk space for the file with the FALLOC_FL_KEEP_SIZE option. - ftruncate the file back to the original size again, but the disk free space is not changed back. This is a real bug that is fixed in this patch. In order to solve the issue above, we modified ocfs2_setattr(): if attr->ia_size != i_size_read(inode), it calls ocfs2_truncate_file() and truncates disk space to attr->ia_size. Signed-off-by: Younger Liu <younger.liu@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Tested-by: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Reviewed-by: Jensen <shencanquan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05ocfs2: llseek requires ocfs2 inode lock for the file in SEEK_ENDJensen
llseek requires the ocfs2 inode lock to update the file size for SEEK_END, because the file size may be updated on another node. This bug can be reproduced with the following scenario: first, we dd a test fileA whose size is 10k. On NodeA: 1) open the test fileA, lseek to the end of the file and print the position. 2) close the test fileA. On NodeB: 1) open the test fileA, append 5k of data to the test fileA. 2) lseek to the end of the file and print the position. 3) close the file. First we run test program1 on NodeA; the result is 10k. Then we run test program2 on NodeB; the result is 15k. Finally, we run test program1 on NodeA again; the result is 10k. After applying this patch, the result of the third step is 15k. Signed-off-by: Jensen <shencanquan@huawei.com> Cc: Jie Liu <jeff.liu@oracle.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05ocfs2: should call ocfs2_journal_access_di() before ocfs2_delete_entry() in ocfs2_orphan_del()Younger Liu
While deleting a file from the orphan dir in ocfs2_orphan_del(), it calls ocfs2_delete_entry() before ocfs2_journal_access_di(). If ocfs2_delete_entry() succeeded and ocfs2_journal_access_di() failed, there would be an inconsistency: the file is deleted from the orphan dir, but the orphan dir dinode is not updated. So we need to call ocfs2_journal_access_di() before ocfs2_delete_entry(). Signed-off-by: Younger Liu <younger.liu@huawei.com> Reviewed-by: Jensen <shencanquan@huawei.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
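The ordering argument above (do the step that can fail before the step that mutates state) can be shown with a toy model. This is a sketch only: struct demo_dir and both functions are hypothetical, and the failure is injected through a parameter rather than coming from a real journal.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy orphan directory: an entry count plus a "journal access granted" flag. */
struct demo_dir {
    int entries;
    int journaled;
};

/* Stand-in for obtaining journal write access; fail != 0 injects an error. */
static int demo_journal_access(struct demo_dir *d, int fail)
{
    if (fail)
        return -EIO;
    d->journaled = 1;
    return 0;
}

/* Ask for journal access first, so a failure leaves the directory untouched. */
static int demo_orphan_del(struct demo_dir *d, int journal_fails)
{
    int status;

    assert(d != NULL);

    status = demo_journal_access(d, journal_fails);
    if (status < 0)
        return status;      /* nothing has been modified yet */

    d->entries--;           /* the actual deletion happens last */
    return 0;
}
```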
2013-11-05ocfs2: use the new DLM operation callbacks while requesting new lockspaceGoldwyn Rodrigues
Attempt to use the new DLM operations. If it is not supported, use the traditional ocfs2_controld. To exchange ocfs2 versioning, we use the LVB of the version dlm lock. It first attempts to take the lock in EX mode (non-blocking). If successful (which means it is the first mount), it writes the version number and downconverts to PR lock. If it is unsuccessful, it reads the version from the lock. If this becomes the standard (with o2cb as well), it could simplify userspace tools to check if the filesystem is mounted on other nodes. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
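The version exchange described above (try EX without blocking, publish and downconvert on success, otherwise read) can be sketched with a toy lock. All names are hypothetical and the model is deliberately simplified: there is no real DLM here, and a PR request is assumed to be grantable immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy cluster-wide lock with a value block carrying the version. */
struct demo_version_lock {
    int ex_holders;    /* 0 or 1 */
    int pr_holders;
    int lvb_version;   /* the "lock value block" payload */
};

/* Non-blocking EX request: succeeds only if nobody holds the lock. */
static bool demo_try_ex(struct demo_version_lock *l)
{
    if (l->ex_holders || l->pr_holders)
        return false;
    l->ex_holders = 1;
    return true;
}

/* Returns the version in effect for the cluster, as seen by this mounter. */
static int demo_negotiate_version(struct demo_version_lock *l, int my_version)
{
    assert(l != NULL);

    if (demo_try_ex(l)) {
        /* first mount: publish our version, then downconvert EX -> PR */
        l->lvb_version = my_version;
        l->ex_holders = 0;
        l->pr_holders++;
        return my_version;
    }

    /* somebody mounted before us: take PR and read their version */
    l->pr_holders++;
    return l->lvb_version;
}
```

Because every mounter ends up holding PR, a later EX attempt fails for as long as any node has the filesystem mounted, which is what would let userspace tools detect mounts on other nodes.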
2013-11-05ocfs2: framework for version LVBGoldwyn Rodrigues
Use the native DLM locks for version control negotiation. Most of the framework is taken from gfs2/lock_dlm.c Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05ocfs2: pass ocfs2_cluster_connection to ocfs2_this_nodeGoldwyn Rodrigues
This is done to differentiate between using and not using controld and use the connection information accordingly. We need to be backward compatible. So, we use a new enum ocfs2_connection_type to identify when controld is used and when it is not. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05ocfs2: shift allocation ocfs2_live_connection to user_connect()Goldwyn Rodrigues
We perform this because the DLM recovery callbacks will require the ocfs2_live_connection structure to record the node information when dlm_new_lockspace() is updated. [AKPM] rc initialization is not required because it is assigned in case of errors. It will be cleared by the compiler anyway. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05ocfs2: add DLM recovery callbacksGoldwyn Rodrigues
These are the callbacks called by the fs/dlm code when the membership changes. If there is a failure while calling any of these, the DLM creates a new membership and relays it to the rest of the nodes. recover_prep() is called when DLM understands a node is down. recover_slot() is called once all nodes have acknowledged recover_prep and recovery can begin. recover_done() is called once the recovery is complete. It returns the new membership. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
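The calling sequence of the three callbacks can be modelled as follows. This is a sketch under assumptions: struct demo_recover_ops uses simplified, hypothetical signatures (the real fs/dlm callback signatures differ), and demo_run_recovery() stands in for the lock manager driving a single recovery cycle.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified callback table. */
struct demo_recover_ops {
    void (*recover_prep)(void *arg);
    void (*recover_slot)(void *arg, int failed_node);
    void (*recover_done)(void *arg, const int *members, int num_members);
};

/* What the filesystem side remembers across a recovery cycle. */
struct demo_fs_state {
    int blocked;       /* set while recovery is in progress */
    int recovered;     /* last node recovered, -1 if none */
    int num_members;   /* size of the current membership */
};

static void demo_prep(void *arg)
{
    ((struct demo_fs_state *)arg)->blocked = 1;
}

static void demo_slot(void *arg, int failed_node)
{
    ((struct demo_fs_state *)arg)->recovered = failed_node;
}

static void demo_done(void *arg, const int *members, int num_members)
{
    struct demo_fs_state *s = arg;

    (void)members;      /* a real user would record the member list */
    s->num_members = num_members;
    s->blocked = 0;
}

static const struct demo_recover_ops demo_ops = { demo_prep, demo_slot, demo_done };

/* One recovery cycle for a single failed node: prep, slot, done. */
static void demo_run_recovery(const struct demo_recover_ops *ops, void *arg,
                              int failed_node, const int *members, int num_members)
{
    assert(ops && ops->recover_prep && ops->recover_slot && ops->recover_done);

    ops->recover_prep(arg);
    ops->recover_slot(arg, failed_node);
    ops->recover_done(arg, members, num_members);
}
```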
2013-11-05ocfs2: add clustername to cluster connectionGoldwyn Rodrigues
This is an effort to remove ocfs2_controld.pcmk and bring ocfs2 DLM handling up to date with respect to DLM (>=4.0.1) and corosync (2.3.x). AFAIK, cman is also being phased out for a unified corosync cluster stack. fs/dlm performs all the functions with respect to fencing and node management and provides the APIs to do so for ocfs2. For all future references, DLM stands for fs/dlm code. The advantages are: + No need to run an additional userspace daemon (ocfs2_controld) + No controld device handling and controld protocol + Shifting responsibilities of node management to DLM layer For backward compatibility, we are keeping the controld handling code. Once enough time has passed we can remove a significant portion of the code. This feature requires modification in the userspace ocfs2-tools. The changes can be found at: https://github.com/goldwynr/ocfs2-tools branch: nocontrold Currently, not many checks are present in the userspace code, but that would change soon. These changes were developed on linux-stable 3.11.y. However, the changes are applicable to the current upstream as well. If you wish to give the entire kernel a spin, the link is: https://github.com/goldwynr/linux-stable branch: nocontrold This patch (of 6): Add clustername to cluster connection. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2013-11-05ocfs2: fix possible double free in ocfs2_write_begin_nolockXue jiufei
When ocfs2_write_cluster_by_desc() fails in ocfs2_write_begin_nolock() because of ENOSPC, it goes to out_quota, freeing data_ac(meta_ac). Then it calls ocfs2_try_to_free_truncate_log() to free space. If enough space is freed, it will try to write again. Unfortunately, if some error happens before ocfs2_lock_allocators(), it goes to out and frees data_ac(meta_ac) again. Signed-off-by: joyce <xuejiufei@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
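A common way to make such a retry path safe is to clear the pointer when freeing it, so that the shared exit label becomes a no-op on the second pass. The sketch below illustrates that pattern only; it is an assumption that the patch works this way, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_alloc_ctx { int reserved; };   /* stand-in for an allocation context */

static int demo_frees;   /* counts real frees so the behavior can be observed */

/* Free *ac (if set) and clear the caller's pointer; a second call is a no-op. */
static void demo_free_ctx(struct demo_alloc_ctx **ac)
{
    assert(ac != NULL);
    if (*ac) {
        free(*ac);
        *ac = NULL;
        demo_frees++;
    }
}

/*
 * Models a write attempt that is retried once after ENOSPC.
 * fail_early_on_retry makes the retry bail out before any new context
 * is allocated, which is the path that used to free twice.
 * Returns 1 if a retry took place, 0 otherwise.
 */
static int demo_write_begin(int fail_early_on_retry)
{
    struct demo_alloc_ctx *data_ac = NULL;
    int tried = 0;

retry:
    if (tried && fail_early_on_retry)
        goto out;                  /* error before (re)allocation */

    data_ac = malloc(sizeof(*data_ac));
    if (!data_ac)
        goto out;

    if (!tried) {
        /* pretend the write hit ENOSPC: drop the context and retry */
        demo_free_ctx(&data_ac);
        tried = 1;
        goto retry;
    }

out:
    demo_free_ctx(&data_ac);       /* safe: the pointer is NULL if already freed */
    return tried;
}
```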
2013-11-05ocfs2: add missing errno in ocfs2_ioctl_move_extents()Younger Liu
If the file is not a regular file or is not writeable, it should return errno (EPERM). This patch is based on 85a258b70d ("ocfs2: fix error handling in ocfs2_ioctl_move_extents()"). Signed-off-by: Younger Liu <younger.liu@huawei.com> Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>