summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2014-11-14xfs: track bulkstat progress by aginoDave Chinner
commit 002758992693ae63c04122603ea9261a0a58d728 upstream. The bulkstat main loop progress is tracked by the "lastino" variable, which is a full 64 bit inode. However, the loop actually works on agno/agino pairs, and so there's a significant disconnect between the rest of the loop and the main cursor. Convert this to use the agino, and pass the agino into the chunk formatting function and convert it too. This gets rid of the inconsistency in the loop processing, and finally makes it simple for us to skip inodes at any point in the loop simply by incrementing the agino cursor. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14xfs: bulkstat error handling is brokenDave Chinner
commit febe3cbe38b0bc0a925906dc90e8d59048851f87 upstream. The error propagation is a horror - xfs_bulkstat() returns a rval variable which is only set if there are formatter errors. Any sort of btree walk error or corruption will cause the bulkstat walk to terminate but will not pass an error back to userspace. Worse is the fact that formatter errors will also be ignored if any inodes were correctly formatted into the user buffer. Hence bulkstat can fail badly yet still report success to userspace. This causes significant issues with xfsdump not dumping everything in the filesystem yet reporting success. It's not until a restore fails that there is any indication that the dump was bad and tha bulkstat failed. This patch now triggers xfsdump to fail with bulkstat errors rather than silently missing files in the dump. This now causes bulkstat to fail when the lastino cookie does not fall inside an existing inode chunk. The pre-3.17 code tolerated that error by allowing the code to move to the next inode chunk as the agino target is guaranteed to fall into the next btree record. With the fixes up to this point in the series, xfsdump now passes on the troublesome filesystem image that exposes all these bugs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14xfs: bulkstat main loop logic is a messDave Chinner
commit 6e57c542cb7e0e580eb53ae76a77875c7d92b4b1 upstream. There are a bunch of variables tha tare more wildy scoped than they need to be, obfuscated user buffer checks and tortured "next inode" tracking. This all needs cleaning up to expose the real issues that need fixing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14xfs: bulkstat chunk-formatter has issuesDave Chinner
commit 2b831ac6bc87d3cbcbb1a8816827b6923403e461 upstream. The loop construct has issues: - clustidx is completely unused, so remove it. - the loop tries to be smart by terminating when the "freecount" tells it that all inodes are free. Just drop it as in most cases we have to scan all inodes in the chunk anyway. - move the "user buffer left" condition check to the only point where we consume space int eh user buffer. - move the initialisation of agino out of the loop, leaving just a simple loop control logic using the clusteridx. Also, double handling of the user buffer variables leads to problems tracking the current state - use the cursor variables directly rather than keeping local copies and then having to update the cursor before returning. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14xfs: bulkstat chunk formatting cursor is brokenDave Chinner
commit bf4a5af20d25ecc8876978ad34b8db83b4235f3c upstream. The xfs_bulkstat_agichunk formatting cursor takes buffer values from the main loop and passes them via the structure to the chunk formatter, and the writes the changed values back into the main loop local variables. Unfortunately, this complex dance is full of corner cases that aren't handled correctly. The biggest problem is that it is double handling the information in both the main loop and the chunk formatting function, leading to inconsistent updates and endless loops where progress is not made. To fix this, push the struct xfs_bulkstat_agichunk outwards to be the primary holder of user buffer information. this removes the double handling in the main loop. Also, pass the last inode processed by the chunk formatter as a separate parameter as it purely an output variable and is not related to the user buffer consumption cursor. Finally, the chunk formatting code is not shared by anyone, so make it local to xfs_itable.c. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14xfs: bulkstat btree walk doesn't terminateDave Chinner
commit afa947cb52a8e73fe71915a0b0af6fcf98dfbe1a upstream. The bulkstat code has several different ways of detecting the end of an AG when doing a walk. They are not consistently detected, and the code that checks for the end of AG conditions is not consistently coded. Hence the are conditions where the walk code can get stuck in an endless loop making no progress and not triggering any termination conditions. Convert all the "tmp/i" status return codes from btree operations to a common name (stat) and apply end-of-ag detection to these operations consistently. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14xfs: Check error during inode btree iteration in xfs_bulkstat()Jan Kara
commit 7a19dee116c8fae7ba7a778043c245194289f5a2 upstream. xfs_bulkstat() doesn't check error return from xfs_btree_increment(). In case of specific fs corruption that could result in xfs_bulkstat() entering an infinite loop because we would be looping over the same chunk over and over again. Fix the problem by checking the return value and terminating the loop properly. Coverity-id: 1231338 Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Jie Liu <jeff.u.liu@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14xfs: bulkstat doesn't release AGI buffer on errorDave Chinner
commit a6bbce54efa9145dbcf3029c885549f7ebc40a3b upstream. The recent refactoring of the bulkstat code left a small landmine in the code. If a inobt read fails, then the tree walk is aborted and returns without releasing the AGI buffer or freeing the cursor. This can lead to a subsequent bulkstat call hanging trying to grab the AGI buffer again. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanupChris Mason
commit 6e5aafb27419f32575b27ef9d6a31e5d54661aca upstream. If we hit any errors in btrfs_lookup_csums_range, we'll loop through all the csums we allocate and free them. But the code was using list_entry incorrectly, and ended up trying to free the on-stack list_head instead. This bug came from commit 0678b6185 btrfs: Don't BUG_ON kzalloc error in btrfs_lookup_csums_range() Signed-off-by: Chris Mason <clm@fb.com> Reported-by: Erik Berg <btrfs@slipsprogrammoer.no> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14fix breakage in o2net_send_tcp_msg()Al Viro
commit 7e8631e8b9d4e9f698c09c7e7309c96249180ff9 upstream. uninitialized msghdr. Broken in "ocfs2: don't open-code kernel_recvmsg()" by me ;-/ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14quota: Properly return errors from dquot_writeback_dquots()Jan Kara
commit 474d2605d119479e5aa050f738632e63589d4bb5 upstream. Due to a switched left and right side of an assignment, dquot_writeback_dquots() never returned error. This could result in errors during quota writeback to not be reported to userspace properly. Fix it. Coverity-id: 1226884 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext3: Don't check quota format when there are no quota filesJan Kara
commit 7938db449bbc55bbeb164bec7af406212e7e98f1 upstream. The check whether quota format is set even though there are no quota files with journalled quota is pointless and it actually makes it impossible to turn off journalled quotas (as there's no way to unset journalled quota format). Just remove the check. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14nfsd4: fix crash on unknown operation numberJ. Bruce Fields
commit 51904b08072a8bf2b9ed74d1bd7a5300a614471d upstream. Unknown operation numbers are caught in nfsd4_decode_compound() which sets op->opnum to OP_ILLEGAL and op->status to nfserr_op_illegal. The error causes the main loop in nfsd4_proc_compound() to skip most processing. But nfsd4_proc_compound also peeks ahead at the next operation in one case and doesn't take similar precautions there. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14nfsd4: fix response size estimation for OP_SEQUENCEJ. Bruce Fields
commit d1d84c9626bb3a519863b3ffc40d347166f9fb83 upstream. We added this new estimator function but forgot to hook it up. The effect is that NFSv4.1 (and greater) won't do zero-copy reads. The estimate was also wrong by 8 bytes. Fixes: ccae70a9ee41 "nfsd4: estimate sequence response size" Reported-by: Chuck Lever <chucklever@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14fix inode leaks on d_splice_alias() failure exitsAl Viro
commit 51486b900ee92856b977eacfc5bfbe6565028070 upstream. d_splice_alias() callers expect it to either stash the inode reference into a new alias, or drop the inode reference. That makes it possible to just return d_splice_alias() result from ->lookup() instance, without any extra housekeeping required. Unfortunately, that should include the failure exits. If d_splice_alias() returns an error, it leaves the dentry it has been given negative and thus it *must* drop the inode reference. Easily fixed, but it goes way back and will need backporting. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: prevent bugon on race between write/fcntlDmitry Monakhov
commit a41537e69b4aa43f0fea02498c2595a81267383b upstream. O_DIRECT flags can be toggeled via fcntl(F_SETFL). But this value checked twice inside ext4_file_write_iter() and __generic_file_write() which result in BUG_ON inside ext4_direct_IO. Let's initialize iocb->private unconditionally. TESTCASE: xfstest:generic/036 https://patchwork.ozlabs.org/patch/402445/ #TYPICAL STACK TRACE: kernel BUG at fs/ext4/inode.c:2960! invalid opcode: 0000 [#1] SMP Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod CPU: 6 PID: 5505 Comm: aio-dio-fcntl-r Not tainted 3.17.0-rc2-00176-gff5c017 #161 Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011 task: ffff88080e95a7c0 ti: ffff88080f908000 task.ti: ffff88080f908000 RIP: 0010:[<ffffffff811fabf2>] [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0 RSP: 0018:ffff88080f90bb58 EFLAGS: 00010246 RAX: 0000000000000400 RBX: ffff88080fdb2a28 RCX: 00000000a802c818 RDX: 0000040000080000 RSI: ffff88080d8aeb80 RDI: 0000000000000001 RBP: ffff88080f90bbc8 R08: 0000000000000000 R09: 0000000000001581 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88080d8aeb80 R13: ffff88080f90bbf8 R14: ffff88080fdb28c8 R15: ffff88080fdb2a28 FS: 00007f23b2055700(0000) GS:ffff880818400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f23b2045000 CR3: 000000080cedf000 CR4: 00000000000407e0 Stack: ffff88080f90bb98 0000000000000000 7ffffffffffffffe ffff88080fdb2c30 0000000000000200 0000000000000200 0000000000000001 0000000000000200 ffff88080f90bbc8 ffff88080fdb2c30 ffff88080f90be08 0000000000000200 Call Trace: [<ffffffff8112ca9d>] generic_file_direct_write+0xed/0x180 [<ffffffff8112f2b2>] __generic_file_write_iter+0x222/0x370 [<ffffffff811f495b>] ext4_file_write_iter+0x34b/0x400 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810abd94>] ? __lock_acquire+0x274/0x700 [<ffffffff811f4610>] ? ext4_unwritten_wait+0xb0/0xb0 [<ffffffff811bd756>] aio_run_iocb+0x286/0x410 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190 [<ffffffff811bc05b>] ? lookup_ioctx+0x4b/0xf0 [<ffffffff811bde3b>] do_io_submit+0x55b/0x740 [<ffffffff811bdcaa>] ? do_io_submit+0x3ca/0x740 [<ffffffff811be030>] SyS_io_submit+0x10/0x20 [<ffffffff815ce192>] system_call_fastpath+0x16/0x1b Code: 01 48 8b 80 f0 01 00 00 48 8b 18 49 8b 45 10 0f 85 f1 01 00 00 48 03 45 c8 48 3b 43 48 0f 8f e3 01 00 00 49 83 7c 24 18 00 75 04 <0f> 0b eb fe f0 ff 83 ec 01 00 00 49 8b 44 24 18 8b 00 85 c0 89 RIP [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0 RSP <ffff88080f90bb58> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: enable journal checksum when metadata checksum feature enabledDarrick J. Wong
commit 98c1a7593fa355fda7f5a5940c8bf5326ca964ba upstream. If metadata checksumming is turned on for the FS, we need to tell the journal to use checksumming too. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: fix overflow when updating superblock backups after resizeJan Kara
commit 9378c6768e4fca48971e7b6a9075bc006eda981d upstream. When there are no meta block groups update_backups() will compute the backup block in 32-bit arithmetics thus possibly overflowing the block number and corrupting the filesystem. OTOH filesystems without meta block groups larger than 16 TB should be rare. Fix the problem by doing the counting in 64-bit arithmetics. Coverity-id: 741252 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: fix oops when loading block bitmap failedJan Kara
commit 599a9b77ab289d85c2d5c8607624efbe1f552b0f upstream. When we fail to load block bitmap in __ext4_new_inode() we will dereference NULL pointer in ext4_journal_get_write_access(). So check for error from ext4_read_block_bitmap(). Coverity-id: 989065 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: check s_chksum_driver when looking for bg csum presenceDarrick J. Wong
commit 813d32f91333e4c33d5a19b67167c4bae42dae75 upstream. Convert the ext4_has_group_desc_csum predicate to look for a checksum driver instead of the metadata_csum flag and change the bg checksum calculation function to look for GDT_CSUM before taking the crc16 path. Without this patch, if we mount with ^uninit_bg,^metadata_csum and later metadata_csum gets turned on by accident, the block group checksum functions will incorrectly assume that checksumming is enabled (metadata_csum) but that crc16 should be used (!s_chksum_driver). This is totally wrong, so fix the predicate and the checksum formula selection. (Granted, if the metadata_csum feature bit gets enabled on a live FS then something underhanded is going on, but we could at least avoid writing garbage into the on-disk fields.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: move error report out of atomic context in ext4_init_block_bitmap()Dmitry Monakhov
commit aef4885ae14f1df75b58395c5314d71f613d26d9 upstream. Error report likely result in IO so it is bad idea to do it from atomic context. This patch should fix following issue: BUG: sleeping function called from invalid context at include/linux/buffer_head.h:349 in_atomic(): 1, irqs_disabled(): 0, pid: 137, name: kworker/u128:1 5 locks held by kworker/u128:1/137: #0: ("writeback"){......}, at: [<ffffffff81085618>] process_one_work+0x228/0x4d0 #1: ((&(&wb->dwork)->work)){......}, at: [<ffffffff81085618>] process_one_work+0x228/0x4d0 #2: (jbd2_handle){......}, at: [<ffffffff81242622>] start_this_handle+0x712/0x7b0 #3: (&ei->i_data_sem){......}, at: [<ffffffff811fa387>] ext4_map_blocks+0x297/0x430 #4: (&(&bgl->locks[i].lock)->rlock){......}, at: [<ffffffff811f3180>] ext4_read_block_bitmap_nowait+0x5d0/0x630 CPU: 3 PID: 137 Comm: kworker/u128:1 Not tainted 3.17.0-rc2-00184-g82752e4 #165 Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011 Workqueue: writeback bdi_writeback_workfn (flush-1:0) 0000000000000411 ffff880813777288 ffffffff815c7fdc ffff880813777288 ffff880813a8bba0 ffff8808137772a8 ffffffff8108fb30 ffff880803e01e38 ffff880803e01e38 ffff8808137772c8 ffffffff811a8d53 ffff88080ecc6000 Call Trace: [<ffffffff815c7fdc>] dump_stack+0x51/0x6d [<ffffffff8108fb30>] __might_sleep+0xf0/0x100 [<ffffffff811a8d53>] __sync_dirty_buffer+0x43/0xe0 [<ffffffff811a8e03>] sync_dirty_buffer+0x13/0x20 [<ffffffff8120f581>] ext4_commit_super+0x1d1/0x230 [<ffffffff8120fa03>] save_error_info+0x23/0x30 [<ffffffff8120fd06>] __ext4_error+0xb6/0xd0 [<ffffffff8120f260>] ? ext4_group_desc_csum+0x140/0x190 [<ffffffff811f2d8c>] ext4_read_block_bitmap_nowait+0x1dc/0x630 [<ffffffff8122e23a>] ext4_mb_init_cache+0x21a/0x8f0 [<ffffffff8113ae95>] ? lru_cache_add+0x55/0x60 [<ffffffff8112e16c>] ? add_to_page_cache_lru+0x6c/0x80 [<ffffffff8122eaa0>] ext4_mb_init_group+0x190/0x280 [<ffffffff8122ec51>] ext4_mb_good_group+0xc1/0x190 [<ffffffff8123309a>] ext4_mb_regular_allocator+0x17a/0x410 [<ffffffff8122c821>] ? ext4_mb_use_preallocated+0x31/0x380 [<ffffffff81233535>] ? ext4_mb_new_blocks+0x205/0x8e0 [<ffffffff8116ed5c>] ? kmem_cache_alloc+0xfc/0x180 [<ffffffff812335b0>] ext4_mb_new_blocks+0x280/0x8e0 [<ffffffff8116f2c4>] ? __kmalloc+0x144/0x1c0 [<ffffffff81221797>] ? ext4_find_extent+0x97/0x320 [<ffffffff812257f4>] ext4_ext_map_blocks+0xbc4/0x1050 [<ffffffff811fa387>] ? ext4_map_blocks+0x297/0x430 [<ffffffff811fa3ab>] ext4_map_blocks+0x2bb/0x430 [<ffffffff81200e43>] ? ext4_init_io_end+0x23/0x50 [<ffffffff811feb44>] ext4_writepages+0x564/0xaf0 [<ffffffff815cde3b>] ? _raw_spin_unlock+0x2b/0x40 [<ffffffff810ac7bd>] ? lock_release_non_nested+0x2fd/0x3c0 [<ffffffff811a009e>] ? writeback_sb_inodes+0x10e/0x490 [<ffffffff811a009e>] ? writeback_sb_inodes+0x10e/0x490 [<ffffffff811377e3>] do_writepages+0x23/0x40 [<ffffffff8119c8ce>] __writeback_single_inode+0x9e/0x280 [<ffffffff811a026b>] writeback_sb_inodes+0x2db/0x490 [<ffffffff811a0664>] wb_writeback+0x174/0x2d0 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190 [<ffffffff811a0863>] wb_do_writeback+0xa3/0x200 [<ffffffff811a0a40>] bdi_writeback_workfn+0x80/0x230 [<ffffffff81085618>] ? process_one_work+0x228/0x4d0 [<ffffffff810856cd>] process_one_work+0x2dd/0x4d0 [<ffffffff81085618>] ? process_one_work+0x228/0x4d0 [<ffffffff81085c1d>] worker_thread+0x35d/0x460 [<ffffffff810858c0>] ? process_one_work+0x4d0/0x4d0 [<ffffffff810858c0>] ? process_one_work+0x4d0/0x4d0 [<ffffffff8108a885>] kthread+0xf5/0x100 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff8108a790>] ? __init_kthread_worker+0x70/0x70 [<ffffffff815ce2ac>] ret_from_fork+0x7c/0xb0 [<ffffffff8108a790>] ? __init_kthread_work Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: Replace open coded mdata csum feature to helper functionDmitry Monakhov
commit 9aa5d32ba269bec0e7eaba2697a986a7b0bc8528 upstream. Besides the fact that this replacement improves code readability it also protects from errors caused direct EXT4_S(sb)->s_es manipulation which may result attempt to use uninitialized csum machinery. #Testcase_BEGIN IMG=/dev/ram0 MNT=/mnt mkfs.ext4 $IMG mount $IMG $MNT #Enable feature directly on disk, on mounted fs tune2fs -O metadata_csum $IMG # Provoke metadata update, likey result in OOPS touch $MNT/test umount $MNT #Testcase_END # Replacement script @@ expression E; @@ - EXT4_HAS_RO_COMPAT_FEATURE(E, EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) + ext4_has_metadata_csum(E) https://bugzilla.kernel.org/show_bug.cgi?id=82201 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: fix reservation overflow in ext4_da_write_beginEric Sandeen
commit 0ff8947fc5f700172b37cbca811a38eb9cb81e08 upstream. Delalloc write journal reservations only reserve 1 credit, to update the inode if necessary. However, it may happen once in a filesystem's lifetime that a file will cross the 2G threshold, and require the LARGE_FILE feature to be set in the superblock as well, if it was not set already. This overruns the transaction reservation, and can be demonstrated simply on any ext4 filesystem without the LARGE_FILE feature already set: dd if=/dev/zero of=testfile bs=1 seek=2147483646 count=1 \ conv=notrunc of=testfile sync dd if=/dev/zero of=testfile bs=1 seek=2147483647 count=1 \ conv=notrunc of=testfile leads to: EXT4-fs: ext4_do_update_inode:4296: aborting transaction: error 28 in __ext4_handle_dirty_super EXT4-fs error (device loop0) in ext4_do_update_inode:4301: error 28 EXT4-fs error (device loop0) in ext4_reserve_inode_write:4757: Readonly filesystem EXT4-fs error (device loop0) in ext4_dirty_inode:4876: error 28 EXT4-fs error (device loop0) in ext4_da_write_end:2685: error 28 Adjust the number of credits based on whether the flag is already set, and whether the current write may extend past the LARGE_FILE limit. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: don't orphan or truncate the boot loader inodeTheodore Ts'o
commit e2bfb088fac03c0f621886a04cffc7faa2b49b1d upstream. The boot loader inode (inode #5) should never be visible in the directory hierarchy, but it's possible if the file system is corrupted that there will be a directory entry that points at inode #5. In order to avoid accidentally trashing it, when such a directory inode is opened, the inode will be marked as a bad inode, so that it's not possible to modify (or read) the inode from userspace. Unfortunately, when we unlink this (invalid/illegal) directory entry, we will put the bad inode on the ophan list, and then when try to unlink the directory, we don't actually remove the bad inode from the orphan list before freeing in-memory inode structure. This means the in-memory orphan list is corrupted, leading to a kernel oops. In addition, avoid truncating a bad inode in ext4_destroy_inode(), since truncating the boot loader inode is not a smart thing to do. Reported-by: Sami Liedes <sami.liedes@iki.fi> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: add ext4_iget_normal() which is to be used for dir tree lookupsTheodore Ts'o
commit f4bb2981024fc91b23b4d09a8817c415396dbabb upstream. If there is a corrupted file system which has directory entries that point at reserved, metadata inodes, prohibit them from being used by treating them the same way we treat Boot Loader inodes --- that is, mark them to be bad inodes. This prohibits them from being opened, deleted, or modified via chmod, chown, utimes, etc. In particular, this prevents a corrupted file system which has a directory entry which points at the journal inode from being deleted and its blocks released, after which point Much Hilarity Ensues. Reported-by: Sami Liedes <sami.liedes@iki.fi> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: grab missed write_count for EXT4_IOC_SWAP_BOOTDmitry Monakhov
commit 3e67cfad22230ebed85c56cbe413876f33fea82b upstream. Otherwise this provokes complain like follows: WARNING: CPU: 12 PID: 5795 at fs/ext4/ext4_jbd2.c:48 ext4_journal_check_start+0x4e/0xa0() Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod CPU: 12 PID: 5795 Comm: python Not tainted 3.17.0-rc2-00175-gae5344f #158 Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011 0000000000000030 ffff8808116cfd28 ffffffff815c7dfc 0000000000000030 0000000000000000 ffff8808116cfd68 ffffffff8106ce8c ffff8808116cfdc8 ffff880813b16000 ffff880806ad6ae8 ffffffff81202008 0000000000000000 Call Trace: [<ffffffff815c7dfc>] dump_stack+0x51/0x6d [<ffffffff8106ce8c>] warn_slowpath_common+0x8c/0xc0 [<ffffffff81202008>] ? ext4_ioctl+0x9e8/0xeb0 [<ffffffff8106ceda>] warn_slowpath_null+0x1a/0x20 [<ffffffff8122867e>] ext4_journal_check_start+0x4e/0xa0 [<ffffffff81228c10>] __ext4_journal_start_sb+0x90/0x110 [<ffffffff81202008>] ext4_ioctl+0x9e8/0xeb0 [<ffffffff8107b0bd>] ? ptrace_stop+0x24d/0x2f0 [<ffffffff81088530>] ? alloc_pid+0x480/0x480 [<ffffffff8107b1f2>] ? ptrace_do_notify+0x92/0xb0 [<ffffffff81186545>] do_vfs_ioctl+0x4e5/0x550 [<ffffffff815cdbcb>] ? _raw_spin_unlock_irq+0x2b/0x40 [<ffffffff81186603>] SyS_ioctl+0x53/0x80 [<ffffffff815ce2ce>] tracesys+0xd0/0xd5 Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: fix mmap data corruption when blocksize < pagesizeJan Kara
commit d6320cbfc92910a3e5f10c42d98c231c98db4f60 upstream. Use truncate_isize_extended() when hole is being created in a file so that ->page_mkwrite() will get called for the partial tail page if it is mmaped (see the first patch in the series for details). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: don't check quota format when there are no quota filesJan Kara
commit 279bf6d390933d5353ab298fcc306c391a961469 upstream. The check whether quota format is set even though there are no quota files with journalled quota is pointless and it actually makes it impossible to turn off journalled quotas (as there's no way to unset journalled quota format). Just remove the check. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: check EA value offset when loadingDarrick J. Wong
commit a0626e75954078cfacddb00a4545dde821170bc5 upstream. When loading extended attributes, check each entry's value offset to make sure it doesn't collide with the entries. Without this check it is easy to crash the kernel by mounting a malicious FS containing a file with an EA wherein e_value_offs = 0 and e_value_size > 0 and then deleting the EA, which corrupts the name list. (See the f_ea_value_crash test's FS image in e2fsprogs for an example.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14jbd2: free bh when descriptor block checksum failsDarrick J. Wong
commit 064d83892e9ba547f7d4eae22cbca066d95210ce upstream. Free the buffer head if the journal descriptor block fails checksum verification. This is the jbd2 port of the e2fsprogs patch "e2fsck: free bh on csum verify error in do_one_pass". Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14pstore: Fix duplicate {console,ftrace}-efi entriesValdis Kletnieks
commit d4bf205da618bbd0b038e404d646f14e76915718 upstream. The pstore filesystem still creates duplicate filename/inode pairs for some pstore types. Add the id to the filename to prevent that. Before patch: [/sys/fs/pstore] ls -li total 0 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi After: [/sys/fs/pstore] ls -li total 0 1232 -r--r--r--. 1 root root 148 Sep 29 17:09 console-efi-141202499100000 1231 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi-141202499200000 1230 -r--r--r--. 1 root root 148 Sep 29 17:44 console-efi-141202705400000 1229 -r--r--r--. 1 root root 67 Sep 29 17:44 console-efi-141202705500000 1228 -r--r--r--. 1 root root 67 Sep 29 20:42 console-efi-141203772600000 1227 -r--r--r--. 1 root root 148 Sep 29 23:42 console-efi-141204854900000 1226 -r--r--r--. 1 root root 67 Sep 29 23:42 console-efi-141204855000000 1225 -r--r--r--. 1 root root 148 Sep 29 23:59 console-efi-141204954200000 1224 -r--r--r--. 1 root root 67 Sep 29 23:59 console-efi-141204954400000 Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14mnt: Prevent pivot_root from creating a loop in the mount treeEric W. Biederman
commit 0d0826019e529f21c84687521d03f60cd241ca7d upstream. Andy Lutomirski recently demonstrated that when chroot is used to set the root path below the path for the new ``root'' passed to pivot_root the pivot_root system call succeeds and leaks mounts. In examining the code I see that starting with a new root that is below the current root in the mount tree will result in a loop in the mount tree after the mounts are detached and then reattached to one another. Resulting in all kinds of ugliness including a leak of that mounts involved in the leak of the mount loop. Prevent this problem by ensuring that the new mount is reachable from the current root of the mount tree. [Added stable cc. Fixes CVE-2014-7970. --Andy] Reported-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/87bnpmihks.fsf@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14kill wbuf_queued/wbuf_dwork_lockAl Viro
commit 99358a1ca53e8e6ce09423500191396f0e6584d2 upstream. schedule_delayed_work() happening when the work is already pending is a cheap no-op. Don't bother with ->wbuf_queued logics - it's both broken (cancelling ->wbuf_dwork leaves it set, as spotted by Jeff Harris) and pointless. It's cheaper to let schedule_delayed_work() handle that case. Reported-by: Jeff Harris <jefftharris@gmail.com> Tested-by: Jeff Harris <jefftharris@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14missing data dependency barrier in prepend_name()Al Viro
commit 6d13f69444bd3d4888e43f7756449748f5a98bad upstream. AFAICS, prepend_name() is broken on SMP alpha. Disclaimer: I don't have SMP alpha boxen to reproduce it on. However, it really looks like the race is real. CPU1: d_path() on /mnt/ramfs/<255-character>/foo CPU2: mv /mnt/ramfs/<255-character> /mnt/ramfs/<63-character> CPU2 does d_alloc(), which allocates an external name, stores the name there including terminating NUL, does smp_wmb() and stores its address in dentry->d_name.name. It proceeds to d_add(dentry, NULL) and d_move() old dentry over to that. ->d_name.name value ends up in that dentry. In the meanwhile, CPU1 gets to prepend_name() for that dentry. It fetches ->d_name.name and ->d_name.len; the former ends up pointing to new name (64-byte kmalloc'ed array), the latter - 255 (length of the old name). Nothing to force the ordering there, and normally that would be OK, since we'd run into the terminating NUL and stop. Except that it's alpha, and we'd need a data dependency barrier to guarantee that we see that store of NUL __d_alloc() has done. In a similar situation dentry_cmp() would survive; it does explicit smp_read_barrier_depends() after fetching ->d_name.name. prepend_name() doesn't and it risks walking past the end of kmalloc'ed object and possibly oops due to taking a page fault in kernel mode. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14vfs: fix data corruption when blocksize < pagesize for mmaped dataJan Kara
commit 90a8020278c1598fafd071736a0846b38510309c upstream. ->page_mkwrite() is used by filesystems to allocate blocks under a page which is becoming writeably mmapped in some process' address space. This allows a filesystem to return a page fault if there is not enough space available, user exceeds quota or similar problem happens, rather than silently discarding data later when writepage is called. However VFS fails to call ->page_mkwrite() in all the cases where filesystems need it when blocksize < pagesize. For example when blocksize = 1024, pagesize = 4096 the following is problematic: ftruncate(fd, 0); pwrite(fd, buf, 1024, 0); map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0); map[0] = 'a'; ----> page_mkwrite() for index 0 is called ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */ mremap(map, 1024, 10000, 0); map[4095] = 'a'; ----> no page_mkwrite() called At the moment ->page_mkwrite() is called, filesystem can allocate only one block for the page because i_size == 1024. Otherwise it would create blocks beyond i_size which is generally undesirable. But later at ->writepage() time, we also need to store data at offset 4095 but we don't have block allocated for it. This patch introduces a helper function filesystems can use to have ->page_mkwrite() called at all the necessary moments. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14UBIFS: fix free log space calculationArtem Bityutskiy
commit ba29e721eb2df6df8f33c1f248388bb037a47914 upstream. Hu (hujianyang <hujianyang@huawei.com>) discovered an issue in the 'empty_log_bytes()' function, which calculates how many bytes are left in the log: " If 'c->lhead_lnum + 1 == c->ltail_lnum' and 'c->lhead_offs == c->leb_size', 'h' would equalent to 't' and 'empty_log_bytes()' would return 'c->log_bytes' instead of 0. " At this point it is not clear what would be the consequences of this, and whether this may lead to any problems, but this patch addresses the issue just in case. Tested-by: hujianyang <hujianyang@huawei.com> Reported-by: hujianyang <hujianyang@huawei.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14UBIFS: fix a race conditionArtem Bityutskiy
commit 052c28073ff26f771d44ef33952a41d18dadd255 upstream. Hu (hujianyang@huawei.com) discovered a race condition which may lead to a situation when UBIFS is unable to mount the file-system after an unclean reboot. The problem is theoretical, though. In UBIFS, we have the log, which basically a set of LEBs in a certain area. The log has the tail and the head. Every time user writes data to the file-system, the UBIFS journal grows, and the log grows as well, because we append new reference nodes to the head of the log. So the head moves forward all the time, while the log tail stays at the same position. At any time, the UBIFS master node points to the tail of the log. When we mount the file-system, we scan the log, and we always start from its tail, because this is where the master node points to. The only occasion when the tail of the log changes is the commit operation. The commit operation has 2 phases - "commit start" and "commit end". The former is relatively short, and does not involve much I/O. During this phase we mostly just build various in-memory lists of the things which have to be written to the flash media during "commit end" phase. During the commit start phase, what we do is we "clean" the log. Indeed, the commit operation will index all the data in the journal, so the entire journal "disappears", and therefore the data in the log become unneeded. So we just move the head of the log to the next LEB, and write the CS node there. This LEB will be the tail of the new log when the commit operation finishes. When the "commit start" phase finishes, users may write more data to the file-system, in parallel with the ongoing "commit end" operation. At this point the log tail was not changed yet, it is the same as it had been before we started the commit. The log head keeps moving forward, though. The commit operation now needs to write the new master node, and the new master node should point to the new log tail. After this the LEBs between the old log tail and the new log tail can be unmapped and re-used again. And here is the possible problem. We do 2 operations: (a) We first update the log tail position in memory (see 'ubifs_log_end_commit()'). (b) And then we write the master node (see the big lock of code in 'do_commit()'). But nothing prevents the log head from moving forward between (a) and (b), and the log head may "wrap" now to the old log tail. And when the "wrap" happens, the contends of the log tail gets erased. Now a power cut happens and we are in trouble. We end up with the old master node pointing to the old tail, which was erased. And replay fails because it expects the master node to point to the correct log tail at all times. This patch merges the abovementioned (a) and (b) operations by moving the master node change code to the 'ubifs_log_end_commit()' function, so that it runs with the log mutex locked, which will prevent the log from being changed benween operations (a) and (b). Reported-by: hujianyang <hujianyang@huawei.com> Tested-by: hujianyang <hujianyang@huawei.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14fs: allow open(dir, O_TMPFILE|..., 0) with mode 0Eric Rannaud
commit 69a91c237ab0ebe4e9fdeaf6d0090c85275594ec upstream. The man page for open(2) indicates that when O_CREAT is specified, the 'mode' argument applies only to future accesses to the file: Note that this mode applies only to future accesses of the newly created file; the open() call that creates a read-only file may well return a read/write file descriptor. The man page for open(2) implies that 'mode' is treated identically by O_CREAT and O_TMPFILE. O_TMPFILE, however, behaves differently: int fd = open("/tmp", O_TMPFILE | O_RDWR, 0); assert(fd == -1); assert(errno == EACCES); int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600); assert(fd > 0); For O_CREAT, do_last() sets acc_mode to MAY_OPEN only: if (*opened & FILE_CREATED) { /* Don't check for write permission, don't truncate */ open_flag &= ~O_TRUNC; will_truncate = false; acc_mode = MAY_OPEN; path_to_nameidata(path, nd); goto finish_open_created; } But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to may_open(). This patch lines up the behavior of O_TMPFILE with O_CREAT. After the inode is created, may_open() is called with acc_mode = MAY_OPEN, in do_tmpfile(). A different, but related glibc bug revealed the discrepancy: https://sourceware.org/bugzilla/show_bug.cgi?id=17523 The glibc lazily loads the 'mode' argument of open() and openat() using va_arg() only if O_CREAT is present in 'flags' (to support both the 2 argument and the 3 argument forms of open; same idea for openat()). However, the glibc ignores the 'mode' argument if O_TMPFILE is in 'flags'. On x86_64, for open(), it magically works anyway, as 'mode' is in RDX when entering open(), and is still in RDX on SYSCALL, which is where the kernel looks for the 3rd argument of a syscall. But openat() is not quite so lucky: 'mode' is in RCX when entering the glibc wrapper for openat(), while the kernel looks for the 4th argument of a syscall in R10. Indeed, the syscall calling convention differs from the regular calling convention in this respect on x86_64. So the kernel sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and fails with EACCES. Signed-off-by: Eric Rannaud <e@nanocritical.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14fs: Fix theoretical division by 0 in super_cache_scan().Tetsuo Handa
commit 475d0db742e3755c6b267f48577ff7cbb7dfda0d upstream. total_objects could be 0 and is used as a denom. While total_objects is a "long", total_objects == 0 unlikely happens for 3.12 and later kernels because 32-bit architectures would not be able to hold (1 << 32) objects. However, total_objects == 0 may happen for kernels between 3.1 and 3.11 because total_objects in prune_super() was an "int" and (e.g.) x86_64 architecture might be able to hold (1 << 32) objects. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14fs: make cont_expand_zero interruptibleMikulas Patocka
commit c2ca0fcd202863b14bd041a7fece2e789926c225 upstream. This patch makes it possible to kill a process looping in cont_expand_zero. A process may spend a lot of time in this function, so it is desirable to be able to kill it. It happened to me that I wanted to copy a piece data from the disk to a file. By mistake, I used the "seek" parameter to dd instead of "skip". Due to the "seek" parameter, dd attempted to extend the file and became stuck doing so - the only possibility was to reset the machine or wait many hours until the filesystem runs out of space and cont_expand_zero fails. We need this patch to be able to terminate the process. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14lockd: Try to reconnect if statd has movedBenjamin Coddington
commit 173b3afceebe76fa2205b2c8808682d5b541fe3c upstream. If rpc.statd is restarted, upcalls to monitor hosts can fail with ECONNREFUSED. In that case force a lookup of statd's new port and retry the upcall. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30ima: pass 'opened' flag to identify newly created filesDmitry Kasatkin
commit 3034a146820c26fe6da66a45f6340fe87fe0983a upstream. Empty files and missing xattrs do not guarantee that a file was just created. This patch passes FILE_CREATED flag to IMA to reliably identify new files. Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com> Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30fanotify: enable close-on-exec on events' fd when requested in fanotify_init()Yann Droneaud
commit 0b37e097a648aa71d4db1ad108001e95b69a2da4 upstream. According to commit 80af258867648 ("fanotify: groups can specify their f_flags for new fd"), file descriptors created as part of file access notification events inherit flags from the event_f_flags argument passed to syscall fanotify_init(2)[1]. Unfortunately O_CLOEXEC is currently silently ignored. Indeed, event_f_flags are only given to dentry_open(), which only seems to care about O_ACCMODE and O_PATH in do_dentry_open(), O_DIRECT in open_check_o_direct() and O_LARGEFILE in generic_file_open(). It's a pity, since, according to some lookup on various search engines and http://codesearch.debian.net/, there's already some userspace code which use O_CLOEXEC: - in systemd's readahead[2]: fanotify_fd = fanotify_init(FAN_CLOEXEC|FAN_NONBLOCK, O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_NOATIME); - in clsync[3]: #define FANOTIFY_EVFLAGS (O_LARGEFILE|O_RDONLY|O_CLOEXEC) int fanotify_d = fanotify_init(FANOTIFY_FLAGS, FANOTIFY_EVFLAGS); - in examples [4] from "Filesystem monitoring in the Linux kernel" article[5] by Aleksander Morgado: if ((fanotify_fd = fanotify_init (FAN_CLOEXEC, O_RDONLY | O_CLOEXEC | O_LARGEFILE)) < 0) Additionally, since commit 48149e9d3a7e ("fanotify: check file flags passed in fanotify_init"). having O_CLOEXEC as part of fanotify_init() second argument is expressly allowed. So it seems expected to set close-on-exec flag on the file descriptors if userspace is allowed to request it with O_CLOEXEC. But Andrew Morton raised[6] the concern that enabling now close-on-exec might break existing applications which ask for O_CLOEXEC but expect the file descriptor to be inherited across exec(). In the other hand, as reported by Mihai Dontu[7] close-on-exec on the file descriptor returned as part of file access notify can break applications due to deadlock. So close-on-exec is needed for most applications. More, applications asking for close-on-exec are likely expecting it to be enabled, relying on O_CLOEXEC being effective. If not, it might weaken their security, as noted by Jan Kara[8]. So this patch replaces call to macro get_unused_fd() by a call to function get_unused_fd_flags() with event_f_flags value as argument. This way O_CLOEXEC flag in the second argument of fanotify_init(2) syscall is interpreted and close-on-exec get enabled when requested. [1] http://man7.org/linux/man-pages/man2/fanotify_init.2.html [2] http://cgit.freedesktop.org/systemd/systemd/tree/src/readahead/readahead-collect.c?id=v208#n294 [3] https://github.com/xaionaro/clsync/blob/v0.2.1/sync.c#L1631 https://github.com/xaionaro/clsync/blob/v0.2.1/configuration.h#L38 [4] http://www.lanedo.com/~aleksander/fanotify/fanotify-example.c [5] http://www.lanedo.com/2013/filesystem-monitoring-linux-kernel/ [6] http://lkml.kernel.org/r/20141001153621.65e9258e65a6167bf2e4cb50@linux-foundation.org [7] http://lkml.kernel.org/r/20141002095046.3715eb69@mdontu-l [8] http://lkml.kernel.org/r/20141002104410.GB19748@quack.suse.cz Link: http://lkml.kernel.org/r/cover.1411562410.git.ydroneaud@opteya.com Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed by: Heinrich Schuchardt <xypron.glpk@gmx.de> Tested-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Mihai Don\u021bu <mihai.dontu@gmail.com> Cc: Pádraig Brady <P@draigBrady.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Jan Kara <jack@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Michael Kerrisk-manpages <mtk.manpages@gmail.com> Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de> Cc: Richard Guy Briggs <rgb@redhat.com> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30xfs: fix agno increment in xfs_inumbers() loopEric Sandeen
commit a8b1ee8bafc765ebf029d03c5479a69aebff9693 upstream. caused a regression in xfs_inumbers, which in turn broke xfsdump, causing incomplete dumps. The loop in xfs_inumbers() needs to fill the user-supplied buffers, and iterates via xfs_btree_increment, reading new ags as needed. But the first time through the loop, if xfs_btree_increment() succeeds, we continue, which triggers the ++agno at the bottom of the loop, and we skip to soon to the next ag - without the proper setup under next_ag to read the next ag. Fix this by removing the agno increment from the loop conditional, and only increment agno if we have actually hit the code under the next_ag: target. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30xfs: ensure WB_SYNC_ALL writeback handles partial pages correctlyDave Chinner
commit 0d085a529b427d97710e6a41f8a4f23e1757cd12 upstream. XFS has been having trouble with stray delayed allocation extents beyond EOF for a long time. Recent changes to the collapse range code has triggered erroneous EBUSY errors on page invalidtion for block size smaller than page size filesystems. These have been caused by dirty buffers beyond EOF on a partial page which do not get written to disk during a sync. The issue is that write-ahead in xfs_cluster_write() finds such a partial page and handles it by leaving the page dirty but pushing it into a writeback state. This used to work just fine, as the write_cache_pages() code would then find the dirty partial page in the next mapping tree lookup as the dirty tag is still set. Unfortunately, when we moved to a mark and sweep approach to writeback to fix other writeback sync issues, we broken this. THe act of marking the page as under writeback now clears the TOWRITE tag in the radix tree, even though the page is still dirty. This causes the TOWRITE tag to be cleared, and hence the next lookup on the mapping tree does not find the dirty partial page and so doesn't try to write it again. This same writeback bug was found recently in ext4 and fixed in commit 1c8349a ("ext4: fix data integrity sync in ordered mode") without communication to the wider filesystem community. We can use exactly the same fix here so the TOWRITE flag is not cleared on partial page writes. cc: stable@vger.kernel.org # dependent on 1c8349a17137b93f0a83f276c764a6df1b9a116e Root-cause-found-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30udf: Fix loading of special inodesJan Kara
commit 6174c2eb8ecef271159bdcde460ce8af54d8f72f upstream. Some UDF media have special inodes (like VAT or metadata partition inodes) whose link_count is 0. Thus commit 4071b9136223 (udf: Properly detect stale inodes) broke loading these inodes because udf_iget() started returning -ESTALE for them. Since we still need to properly detect stale inodes queried by NFS, create two variants of udf_iget() - one which is used for looking up special inodes (which ignores link_count == 0) and one which is used for other cases which return ESTALE when link_count == 0. Fixes: 4071b913622316970d0e1919f7d82b4403fec5f2 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30ecryptfs: avoid to access NULL pointer when write metadata in xattrChao Yu
commit 35425ea2492175fd39f6116481fe98b2b3ddd4ca upstream. Christopher Head 2014-06-28 05:26:20 UTC described: "I tried to reproduce this on 3.12.21. Instead, when I do "echo hello > foo" in an ecryptfs mount with ecryptfs_xattr specified, I get a kernel crash: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8110eb39>] fsstack_copy_attr_all+0x2/0x61 PGD d7840067 PUD b2c3c067 PMD 0 Oops: 0002 [#1] SMP Modules linked in: nvidia(PO) CPU: 3 PID: 3566 Comm: bash Tainted: P O 3.12.21-gentoo-r1 #2 Hardware name: ASUSTek Computer Inc. G60JX/G60JX, BIOS 206 03/15/2010 task: ffff8801948944c0 ti: ffff8800bad70000 task.ti: ffff8800bad70000 RIP: 0010:[<ffffffff8110eb39>] [<ffffffff8110eb39>] fsstack_copy_attr_all+0x2/0x61 RSP: 0018:ffff8800bad71c10 EFLAGS: 00010246 RAX: 00000000000181a4 RBX: ffff880198648480 RCX: 0000000000000000 RDX: 0000000000000004 RSI: ffff880172010450 RDI: 0000000000000000 RBP: ffff880198490e40 R08: 0000000000000000 R09: 0000000000000000 R10: ffff880172010450 R11: ffffea0002c51e80 R12: 0000000000002000 R13: 000000000000001a R14: 0000000000000000 R15: ffff880198490e40 FS: 00007ff224caa700(0000) GS:ffff88019fcc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000000bb07f000 CR4: 00000000000007e0 Stack: ffffffff811826e8 ffff8800a39d8000 0000000000000000 000000000000001a ffff8800a01d0000 ffff8800a39d8000 ffffffff81185fd5 ffffffff81082c2c 00000001a39d8000 53d0abbc98490e40 0000000000000037 ffff8800a39d8220 Call Trace: [<ffffffff811826e8>] ? ecryptfs_setxattr+0x40/0x52 [<ffffffff81185fd5>] ? ecryptfs_write_metadata+0x1b3/0x223 [<ffffffff81082c2c>] ? should_resched+0x5/0x23 [<ffffffff8118322b>] ? ecryptfs_initialize_file+0xaf/0xd4 [<ffffffff81183344>] ? ecryptfs_create+0xf4/0x142 [<ffffffff810f8c0d>] ? vfs_create+0x48/0x71 [<ffffffff810f9c86>] ? do_last.isra.68+0x559/0x952 [<ffffffff810f7ce7>] ? link_path_walk+0xbd/0x458 [<ffffffff810fa2a3>] ? path_openat+0x224/0x472 [<ffffffff810fa7bd>] ? do_filp_open+0x2b/0x6f [<ffffffff81103606>] ? __alloc_fd+0xd6/0xe7 [<ffffffff810ee6ab>] ? do_sys_open+0x65/0xe9 [<ffffffff8157d022>] ? system_call_fastpath+0x16/0x1b RIP [<ffffffff8110eb39>] fsstack_copy_attr_all+0x2/0x61 RSP <ffff8800bad71c10> CR2: 0000000000000000 ---[ end trace df9dba5f1ddb8565 ]---" If we create a file when we mount with ecryptfs_xattr_metadata option, we will encounter a crash in this path: ->ecryptfs_create ->ecryptfs_initialize_file ->ecryptfs_write_metadata ->ecryptfs_write_metadata_to_xattr ->ecryptfs_setxattr ->fsstack_copy_attr_all It's because our dentry->d_inode used in fsstack_copy_attr_all is NULL, and it will be initialized when ecryptfs_initialize_file finish. So we should skip copying attr from lower inode when the value of ->d_inode is invalid. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30NFSv4.1/pnfs: replace broken pnfs_put_lseg_asyncTrond Myklebust
commit 6543f803670530f6aa93790d9fa116d8395a537d upstream. You cannot call pnfs_put_lseg_async() more than once per lseg, so it is really an inappropriate way to deal with a refcount issue. Instead, replace it with a function that decrements the refcount, and puts the final 'free' operation (which is incompatible with locks) on the workqueue. Cc: Weston Andros Adamson <dros@primarydata.com> Fixes: e6cf82d1830f: pnfs: add pnfs_put_lseg_async Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30NFS: Fix a bogus warning in nfs_generic_pgioTrond Myklebust
commit b8fb9c30f25e45dab5d2cd310ab6913b6861d00f upstream. It is OK for pageused == pagecount in the loop, as long as we don't add another entry to the *pages array. Move the test so that it only triggers in that case. Reported-by: Steve Dickson <SteveD@redhat.com> Fixes: bba5c1887a92 (nfs: disallow duplicate pages in pgio page vectors) Cc: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30NFS: Fix an uninitialised pointer Oops in the writeback error pathTrond Myklebust
commit 3caa0c6ed754d91b15266abf222498edbef982bd upstream. SteveD reports the following Oops: RIP: 0010:[<ffffffffa053461d>] [<ffffffffa053461d>] __put_nfs_open_context+0x1d/0x100 [nfs] RSP: 0018:ffff880fed687b90 EFLAGS: 00010286 RAX: 0000000000000024 RBX: 0000000000000000 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff880fed687bc0 R08: 0000000000000092 R09: 000000000000047a R10: 0000000000000000 R11: ffff880fed6878d6 R12: ffff880fed687d20 R13: ffff880fed687d20 R14: 0000000000000070 R15: ffffea000aa33ec0 FS: 00007fce290f0740(0000) GS:ffff8807ffc60000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000070 CR3: 00000007f2e79000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: 0000000000000000 ffff880036c5e510 ffff880fed687d20 ffff880fed687d20 ffff880036c5e200 ffffea000aa33ec0 ffff880fed687bd0 ffffffffa0534710 ffff880fed687be8 ffffffffa053d5f0 ffff880036c5e200 ffff880fed687c08 Call Trace: [<ffffffffa0534710>] put_nfs_open_context+0x10/0x20 [nfs] [<ffffffffa053d5f0>] nfs_pgio_data_destroy+0x20/0x40 [nfs] [<ffffffffa053d672>] nfs_pgio_error+0x22/0x40 [nfs] [<ffffffffa053d8f4>] nfs_generic_pgio+0x74/0x2e0 [nfs] [<ffffffffa06b18c3>] pnfs_generic_pg_writepages+0x63/0x210 [nfsv4] [<ffffffffa053d579>] nfs_pageio_doio+0x19/0x50 [nfs] [<ffffffffa053eb84>] nfs_pageio_complete+0x24/0x30 [nfs] [<ffffffffa053cb25>] nfs_direct_write_schedule_iovec+0x115/0x1f0 [nfs] [<ffffffffa053675f>] ? nfs_get_lock_context+0x4f/0x120 [nfs] [<ffffffffa053d252>] nfs_file_direct_write+0x262/0x420 [nfs] [<ffffffffa0532d91>] nfs_file_write+0x131/0x1d0 [nfs] [<ffffffffa0532c60>] ? nfs_need_sync_write.isra.17+0x40/0x40 [nfs] [<ffffffff812127b8>] do_io_submit+0x3b8/0x840 [<ffffffff81212c50>] SyS_io_submit+0x10/0x20 [<ffffffff81610f29>] system_call_fastpath+0x16/0x1b This is due to the calls to nfs_pgio_error() in nfs_generic_pgio(), which happen before the nfs_pgio_header's open context is referenced in nfs_pgio_rpcsetup(). Reported-by: Steve Dickson <SteveD@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>