summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2016-03-25ocfs2: extend enough credits for freeing one truncate record while replaying ↵Xue jiufei
truncate records Now function ocfs2_replay_truncate_records() first modifies tl_used, then calls ocfs2_extend_trans() to extend transactions for gd and alloc inode used for freeing clusters. jbd2_journal_restart() may be called and it may happen that tl_used in truncate log is decreased but the clusters are not freed, which means these clusters are lost. So we should avoid extending transactions in these two operations. Signed-off-by: joyce.xue <xuejiufei@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Acked-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: extend transaction for ocfs2_remove_rightmost_path() and ↵Xue jiufei
ocfs2_update_edge_lengths() before to avoid inconsistency between inode and et I found that jbd2_journal_restart() is called in some places without keeping things consistently before. However, jbd2_journal_restart() may commit the handle's transaction and restart another one. If the first transaction is committed successfully while another not, it may cause filesystem inconsistency or read only. This is an effort to fix this kind of problems. This patch (of 3): The following functions will be called while truncating an extent: ocfs2_remove_btree_range -> ocfs2_start_trans -> ocfs2_remove_extent -> ocfs2_truncate_rec -> ocfs2_extend_rotate_transaction -> jbd2_journal_restart if jbd2_journal_extend fail -> ocfs2_rotate_tree_left -> ocfs2_remove_rightmost_path -> ocfs2_extend_rotate_transaction -> ocfs2_unlink_subtree -> ocfs2_update_edge_lengths -> ocfs2_extend_trans -> jbd2_journal_restart if jbd2_journal_extend fail -> ocfs2_et_update_clusters -> ocfs2_commit_trans jbd2_journal_restart() may be called and it may happened that the buffers dirtied in ocfs2_truncate_rec() are committed while buffers dirtied in ocfs2_et_update_clusters() are not, the total clusters on extent tree and i_clusters in ocfs2_dinode is inconsistency. So the clusters got from ocfs2_dinode is incorrect, and it also cause read-only problem when call ocfs2_commit_truncate() with the error message: "Inode %llu has empty extent block at %llu". We should extend enough credits for function ocfs2_remove_rightmost_path and ocfs2_update_edge_lengths to avoid this inconsistency. Signed-off-by: joyce.xue <xuejiufei@huawei.com> Acked-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2/dlm: move lock to the tail of grant queue while doing in-place convertxuejiufei
We have found a bug when two nodes doing umount one after another. 1) Node 1 migrate a lockres that has 3 locks in grant queue such as N2(PR)<->N3(NL)<->N4(PR) to N2. After migration, lvb of the lock N3(NL) and N4(PR) are empty on node 2 because migration target do not copy lvb to these two lock. 2) Node 3 want to convert to PR, it can be granted in __dlmconvert_master(), and the order of these locks is unchanged. The lvb of the lock N3(PR) on node 2 is copyed from lockres in function dlm_update_lvb() while the lvb of lock N4(PR) is still empty. 3) Node 2 want to leave domain, it will migrate this lockres to node 3. Then node 2 will trigger the BUG in dlm_prepare_lvb_for_migration() when adding the lock N4(PR) to mres with the following message because the lvb of mres is already copied from lock N3(PR), but the lvb of lock N4(PR) is empty. "Mismatched lvb in lock cookie=%u:%llu, name=%.*s, node=%u" [akpm@linux-foundation.org: tweak comment] Signed-off-by: xuejiufei <xuejiufei@huawei.com> Acked-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: solve a problem of crossing the boundary in updating backupsjiangyiwen
In update_backups() there exists a problem of crossing the boundary as follows: we assume that lun will be resized to 1TB(cluster_size is 32kb), it will include 0~33554431 cluster, in update_backups func, it will backup super block in location of 1TB which is the 33554432th cluster, so the phenomenon of crossing the boundary happens. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: fix occurring deadlock by changing ocfs2_wq from global to localjiangyiwen
This patch fixes a deadlock, as follows: Node 1 Node 2 Node 3 1)volume a and b are only mount vol a only mount vol b mounted 2) start to mount b start to mount a 3) check hb of Node 3 check hb of Node 2 in vol a, qs_holds++ in vol b, qs_holds++ 4) -------------------- all nodes' network down -------------------- 5) progress of mount b the same situation as failed, and then call Node 2 ocfs2_dismount_volume. but the process is hung, since there is a work in ocfs2_wq cannot beo completed. This work is about vol a, because ocfs2_wq is global wq. BTW, this work which is scheduled in ocfs2_wq is ocfs2_orphan_scan_work, and the context in this work needs to take inode lock of orphan_dir, because lockres owner are Node 1 and all nodes' nework has been down at the same time, so it can't get the inode lock. 6) Why can't this node be fenced when network disconnected? Because the process of mount is hung what caused qs_holds is not equal 0. Because all works in the ocfs2_wq are relative to the super block. The solution is to change the ocfs2_wq from global to local. In other words, move it into struct ocfs2_super. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_listJoseph Qi
When master handles convert request, it queues ast first and then returns status. This may happen that the ast is sent before the request status because the above two messages are sent by two threads. And right after the ast is sent, if master down, it may trigger BUG in dlm_move_lockres_to_recovery_list in the requested node because ast handler moves it to grant list without clear lock->convert_pending. So remove BUG_ON statement and check if the ast is processed in dlmconvert_remote. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Tariq Saeed <tariq.x.saeed@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2/dlm: fix race between convert and recoveryJoseph Qi
There is a race window between dlmconvert_remote and dlm_move_lockres_to_recovery_list, which will cause a lock with OCFS2_LOCK_BUSY in grant list, thus system hangs. dlmconvert_remote { spin_lock(&res->spinlock); list_move_tail(&lock->list, &res->converting); lock->convert_pending = 1; spin_unlock(&res->spinlock); status = dlm_send_remote_convert_request(); >>>>>> race window, master has queued ast and return DLM_NORMAL, and then down before sending ast. this node detects master down and calls dlm_move_lockres_to_recovery_list, which will revert the lock to grant list. Then OCFS2_LOCK_BUSY won't be cleared as new master won't send ast any more because it thinks already be authorized. spin_lock(&res->spinlock); lock->convert_pending = 0; if (status != DLM_NORMAL) dlm_revert_pending_convert(res, lock); spin_unlock(&res->spinlock); } In this case, check if res->state has DLM_LOCK_RES_RECOVERING bit set (res is still in recovering) or res master changed (new master has finished recovery), reset the status to DLM_RECOVERING, then it will retry convert. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Tariq Saeed <tariq.x.saeed@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: fix a deadlock issue in ocfs2_dio_end_io_write()Ryan Ding
The code should call ocfs2_free_alloc_context() to free meta_ac & data_ac before calling ocfs2_run_deallocs(). Because ocfs2_run_deallocs() will acquire the system inode's i_mutex hold by meta_ac. So try to release the lock before ocfs2_run_deallocs(). Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.") Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Acked-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: fix disk file size and memory file size mismatchRyan Ding
When doing append direct write in an already allocated cluster, and fast path in ocfs2_dio_get_block() is triggered, function ocfs2_dio_end_io_write() will be skipped as there is no context allocated. As a result, the disk file size will not be changed as it should be. The solution is to skip fast path when we are about to change file size. Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.") Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Acked-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: take ip_alloc_sem in ocfs2_dio_get_block & ocfs2_dio_end_io_writeRyan Ding
Take ip_alloc_sem to prevent concurrent access to extent tree, which may cause the extent tree in an unstable state. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: fix ip_unaligned_aio deadlock with dio work queueRyan Ding
In the current implementation of unaligned aio+dio, lock order behave as follow: in user process context: -> call io_submit() -> get i_mutex <== window1 -> get ip_unaligned_aio -> submit direct io to block device -> release i_mutex -> io_submit() return in dio work queue context(the work queue is created in __blockdev_direct_IO): -> release ip_unaligned_aio <== window2 -> get i_mutex -> clear unwritten flag & change i_size -> release i_mutex There is a limitation to the thread number of dio work queue. 256 at default. If all 256 thread are in the above 'window2' stage, and there is a user process in the 'window1' stage, the system will became deadlock. Since the user process hold i_mutex to wait ip_unaligned_aio lock, while there is a direct bio hold ip_unaligned_aio mutex who is waiting for a dio work queue thread to be schedule. But all the dio work queue thread is waiting for i_mutex lock in 'window2'. This case only happened in a test which send a large number(more than 256) of aio at one io_submit() call. My design is to remove ip_unaligned_aio lock. Change it to a sync io instead. Just like ip_unaligned_aio lock, serialize the unaligned aio dio. [akpm@linux-foundation.org: remove OCFS2_IOCB_UNALIGNED_IO, per Junxiao Bi] Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: code clean up for direct ioRyan Ding
Clean up ocfs2_file_write_iter & ocfs2_prepare_inode_for_write: * remove append dio check: it will be checked in ocfs2_direct_IO() * remove file hole check: file hole is supported for now * remove inline data check: it will be checked in ocfs2_direct_IO() * remove the full_coherence check when append dio: we will get the inode_lock in ocfs2_dio_get_block, there is no need to fall back to buffer io to ensure the coherence semantics. Now the drop dio procedure is gone. :) [akpm@linux-foundation.org: remove unused label] Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: fix sparse file & data ordering issue in direct ioRyan Ding
There are mainly three issues in the direct io code path after commit 24c40b329e03 ("ocfs2: implement ocfs2_direct_IO_write"): * Does not support sparse file. * Does not support data ordering. eg: when write to a file hole, it will alloc extent first. If system crashed before io finished, data will corrupt. * Potential risk when doing aio+dio. The -EIOCBQUEUED return value is likely to be ignored by ocfs2_direct_IO_write(). To resolve above problems, re-design direct io code with following ideas: * Use buffer io to fill in holes. And this will make better performance also. * Clear unwritten after direct write finished. So we can make sure meta data changes after data write to disk. (Unwritten extent is invisible to user, from user's view, meta data is not changed when allocate an unwritten extent.) * Clear ocfs2_direct_IO_write(). Do all ending work in end_io. This patch has passed fs,dio,ltp-aiodio.part1,ltp-aiodio.part2,ltp-aiodio.part4 test cases of ltp. For performance improvement, see following test result: ocfs2 cluster size 1MB, ocfs2 volume is mounted on /mnt/. The original way: + rm /mnt/test.img -f + dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB) copied, 1707.83 s, 2.5 MB/s + rm /mnt/test.img -f + dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct 16384+0 records in 16384+0 records out 4294967296 bytes (4.3 GB) copied, 582.705 s, 7.4 MB/s After this patch: + rm /mnt/test.img -f + dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB) copied, 64.6412 s, 66.4 MB/s + rm /mnt/test.img -f + dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct 16384+0 records in 16384+0 records out 4294967296 bytes (4.3 GB) copied, 34.7611 s, 124 MB/s Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: record UNWRITTEN extents when populate write descRyan Ding
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. There is still one issue in the direct write procedure. phase 1: alloc extent with UNWRITTEN flag phase 2: submit direct data to disk, add zero page to page cache phase 3: clear UNWRITTEN flag when data has been written to disk When there are 2 direct write A(0~3KB),B(4~7KB) writing to the same cluster 0~7KB (cluster size 8KB). Write request A arrive phase 2 first, it will zero the region (4~7KB). Before request A enter to phase 3, request B arrive phase 2, it will zero region (0~3KB). This is just like request B steps request A. To resolve this issue, we should let request B knows this cluster is already under zero, to prevent it from steps the previous write request. This patch will add function ocfs2_unwritten_check() to do this job. It will record all clusters that are under direct write(it will be recorded in the 'ip_unwritten_list' member of inode info), and prevent the later direct write writing to the same cluster to do the zero work again. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: return the physical address in ocfs2_write_clusterRyan Ding
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. Direct io needs to get the physical address from write_begin, to map the user page. This patch is to change the arg 'phys' of ocfs2_write_cluster to a pointer, so it can be retrieved to write_begin. And we can retrieve it to the direct io procedure. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: do not change i_size in write_end for direct ioRyan Ding
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. Append direct io do not change i_size in get block phase. It only move to orphan when starting write. After data is written to disk, it will delete itself from orphan and update i_size. So skip i_size change section in write_begin for direct io. And when there is no extents alloc, no meta data changes needed for direct io (since write_begin start trans for 2 reason: alloc extents & change i_size. Now none of them needed). So we can skip start trans procedure. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: test target page before change itRyan Ding
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. Direct io data will not appear in buffer. The w_target_page member will not be filled by direct io. So avoid to use it when it's NULL. Unlinke buffer io and mmap, direct io will call write_begin with more than 1 page a time. So the target_index is not sufficient to describe the actual data. change it to a range start at target_index, end in end_index. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: use c_new to indicate newly allocated extentsRyan Ding
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. There is a problem in ocfs2's direct io implement: if system crashed after extents allocated, and before data return, we will get a extent with dirty data on disk. This problem violate the journal=order semantics, which means meta changes take effect after data written to disk. To resolve this issue, direct write can use the UNWRITTEN flag to describe a extent during direct data writeback. The direct write procedure should act in the following order: phase 1: alloc extent with UNWRITTEN flag phase 2: submit direct data to disk, add zero page to page cache phase 3: clear UNWRITTEN flag when data has been written to disk This patch is to change the 'c_unwritten' member of ocfs2_write_cluster_desc to 'c_clear_unwritten'. Means whether to clear the unwritten flag. It do not care if a extent is allocated or not. And use 'c_new' to specify a newly allocated extent. So the direct io procedure can use c_clear_unwritten to control the UNWRITTEN bit on extent. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: add ocfs2_write_type_t type to identify the caller of writeRyan Ding
Patchset: fix ocfs2 direct io code patch to support sparse file and data ordering semantics The idea is to use buffer io(more precisely use the interface ocfs2_write_begin_nolock & ocfs2_write_end_nolock) to do the zero work beyond block size. And clear UNWRITTEN flag until direct io data has been written to disk, which can prevent data corruption when system crashed during direct write. And we will also archive a better performance: eg. dd direct write new file with block size 4KB: before this patchset: 2.5 MB/s after this patchset: 66.4 MB/s This patch (of 8): To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock. Remove unused args filp & flags. Add new arg type. The type is one of buffer/direct/mmap. Indicate 3 way to perform write. buffer/mmap type has implemented. direct type will be implemented later. Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25ocfs2: o2hb: fix double free bugJunxiao Bi
This is a regression issue and caused the following kernel panic when do ocfs2 multiple test. BUG: unable to handle kernel paging request at 00000002000800c0 IP: [<ffffffff81192978>] kmem_cache_alloc+0x78/0x160 PGD 7bbe5067 PUD 0 Oops: 0000 [#1] SMP Modules linked in: ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi xen_kbdfront xen_netfront xen_fbfront xen_blkfront CPU: 2 PID: 4044 Comm: mpirun Not tainted 4.5.0-rc5-next-20160225 #1 Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014 task: ffff88007a521a80 ti: ffff88007aed0000 task.ti: ffff88007aed0000 RIP: 0010:[<ffffffff81192978>] [<ffffffff81192978>] kmem_cache_alloc+0x78/0x160 RSP: 0018:ffff88007aed3a48 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000001991 RDX: 0000000000001990 RSI: 00000000024000c0 RDI: 000000000001b330 RBP: ffff88007aed3a98 R08: ffff88007d29b330 R09: 00000002000800c0 R10: 0000000c51376d87 R11: ffff8800792cac38 R12: ffff88007cc30f00 R13: 00000000024000c0 R14: ffffffff811b053f R15: ffff88007aed3ce7 FS: 0000000000000000(0000) GS:ffff88007d280000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000002000800c0 CR3: 000000007aeb2000 CR4: 00000000000406e0 Call Trace: __d_alloc+0x2f/0x1a0 d_alloc+0x17/0x80 lookup_dcache+0x8a/0xc0 path_openat+0x3c3/0x1210 do_filp_open+0x80/0xe0 do_sys_open+0x110/0x200 SyS_open+0x19/0x20 do_syscall_64+0x72/0x230 entry_SYSCALL64_slow_path+0x25/0x25 Code: 05 e6 77 e7 7e 4d 8b 08 49 8b 40 10 4d 85 c9 0f 84 dd 00 00 00 48 85 c0 0f 84 d4 00 00 00 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 1c 01 4c 89 c8 65 48 0f c7 0f 0f 94 c0 3c 01 75 b6 49 63 RIP kmem_cache_alloc+0x78/0x160 CR2: 00000002000800c0 ---[ end trace 823969e602e4aaac ]--- Fixes: a4a1dfa4bb8b("ocfs2/cluster: fix memory leak in o2hb_region_release") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-24Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Final round of fixes for this merge window - some of this has come up after the initial pull request, and some of it was put in a post-merge branch before the merge window. This contains: - Fix for a bad check for an error on dma mapping in the mtip32xx driver, from Alexey Khoroshilov. - A set of fixes for lightnvm, from Javier, Matias, and Wenwei. - An NVMe completion record corruption fix from Marta, ensuring that we read things in the right order. - Two writeback fixes from Tejun, marked for stable@ as well. - A blk-mq sw queue iterator fix from Thomas, fixing an oops for sparse CPU maps. They hit this in the hot plug/unplug rework" * 'for-linus' of git://git.kernel.dk/linux-block: nvme: avoid cqe corruption when update at the same time as read writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list() blk-mq: Use proper cpumask iterator mtip32xx: fix checks for dma mapping errors lightnvm: do not load L2P table if not supported lightnvm: do not reserve lun on l2p loading nvme: lightnvm: return ppa completion status lightnvm: add a bitmap of luns lightnvm: specify target's logical address area null_blk: add lightnvm null_blk device to the nullb_list
2016-03-24Merge tag 'for-linus-20160324' of git://git.infradead.org/linux-mtdLinus Torvalds
Pull MTD updates from Brian Norris: "NAND: - Add sunxi_nand randomizer support - begin refactoring NAND ecclayout structs - fix pxa3xx_nand dmaengine usage - brcmnand: fix support for v7.1 controller - add Qualcomm NAND controller driver SPI NOR: - add new ls1021a, ls2080a support to Freescale QuadSPI - add new flash ID entries - support bottom-block protection for Winbond flash - support Status Register Write Protect - remove broken QPI support for Micron SPI flash JFFS2: - improve post-mount CRC scan efficiency General: - refactor bcm63xxpart parser, to later extend for NAND - add writebuf size parameter to mtdram Other minor code quality improvements" * tag 'for-linus-20160324' of git://git.infradead.org/linux-mtd: (72 commits) mtd: nand: remove kerneldoc for removed function parameter mtd: nand: Qualcomm NAND controller driver dt/bindings: qcom_nandc: Add DT bindings mtd: nand: don't select chip in nand_chip's block_bad op mtd: spi-nor: support lock/unlock for a few Winbond chips mtd: spi-nor: add TB (Top/Bottom) protect support mtd: spi-nor: add SPI_NOR_HAS_LOCK flag mtd: spi-nor: use BIT() for flash_info flags mtd: spi-nor: disallow further writes to SR if WP# is low mtd: spi-nor: make lock/unlock bounds checks more obvious and robust mtd: spi-nor: silently drop lock/unlock for already locked/unlocked region mtd: spi-nor: wait for SR_WIP to clear on initial unlock mtd: nand: simplify nand_bch_init() usage mtd: mtdswap: remove useless if (!mtd->ecclayout) test mtd: create an mtd_oobavail() helper and make use of it mtd: kill the ecclayout->oobavail field mtd: nand: check status before reporting timeout mtd: bcm63xxpart: give width specifier an 'int', not 'size_t' mtd: mtdram: Add parameter for setting writebuf size mtd: nand: pxa3xx_nand: kill unused field 'drcmr_cmd' ...
2016-03-24Merge tag 'upstream-4.6-rc1' of git://git.infradead.org/linux-ubifsLinus Torvalds
Pull UBI/UBIFS updates from Richard Weinberger: "This contains cleanups and a maintainer update for UBI and UBIFS" * tag 'upstream-4.6-rc1' of git://git.infradead.org/linux-ubifs: ubifs: Remove unused header MAINTAINERS: Update UBIFS entry mtd: ubi: Add logging functions ubi_msg, ubi_warn and ubi_err ubifs: Add logging functions for ubifs_msg, ubifs_err and ubifs_warn
2016-03-24Merge tag 'nfsd-4.6-1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull more nfsd updates from Bruce Fields: "Apologies for the previous request, which omitted the top 8 commits from my for-next branch (including the SCSI layout commits). Thanks to Trond for spotting my error!" This actually includes the new layout types, so here's that part of the pull message repeated: "Support for a new pnfs layout type from Christoph Hellwig. The new layout type is a variant of the block layout which uses SCSI features to offer improved fencing and device identification. Note this pull request also includes the client side of SCSI layout, with Trond's permission" * tag 'nfsd-4.6-1' of git://linux-nfs.org/~bfields/linux: nfsd: use short read as well as i_size to set eof nfsd: better layoutupdate bounds-checking nfsd: block and scsi layout drivers need to depend on CONFIG_BLOCK nfsd: add SCSI layout support nfsd: move some blocklayout code nfsd: add a new config option for the block layout driver nfs/blocklayout: add SCSI layout support nfs4.h: add SCSI layout definitions
2016-03-24Merge tag 'nfsd-4.6' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd updates from Bruce Fields: "Various bugfixes, a RDMA update from Chuck Lever, and support for a new pnfs layout type from Christoph Hellwig. The new layout type is a variant of the block layout which uses SCSI features to offer improved fencing and device identification. (Also: note this pull request also includes the client side of SCSI layout, with Trond's permission.)" * tag 'nfsd-4.6' of git://linux-nfs.org/~bfields/linux: sunrpc/cache: drop reference when sunrpc_cache_pipe_upcall() detects a race nfsd: recover: fix memory leak nfsd: fix deadlock secinfo+readdir compound nfsd4: resfh unused in nfsd4_secinfo svcrdma: Use new CQ API for RPC-over-RDMA server send CQs svcrdma: Use new CQ API for RPC-over-RDMA server receive CQs svcrdma: Remove close_out exit path svcrdma: Hook up the logic to return ERR_CHUNK svcrdma: Use correct XID in error replies svcrdma: Make RDMA_ERROR messages work rpcrdma: Add RPCRDMA_HDRLEN_ERR svcrdma: svc_rdma_post_recv() should close connection on error svcrdma: Close connection when a send error occurs nfsd: Lower NFSv4.1 callback message size limit svcrdma: Do not send Write chunk XDR pad with inline content svcrdma: Do not write xdr_buf::tail in a Write chunk svcrdma: Find client-provided write and reply chunks once per reply nfsd: Update NFS server comments related to RDMA support nfsd: Fix a memory leak when meeting unsupported state_protect_how4 nfsd4: fix bad bounds checking
2016-03-23nfsd: use short read as well as i_size to set eofBenjamin Coddington
Use the result of a local read to determine when to set the eof flag. This allows us to return the location of the end of the file atomically at the time of the read. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> [bfields: add some documentation] Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-23Merge branch 'linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: "This fixes the following issues: API: - Fix kzalloc error path crash in ecryptfs added by skcipher conversion. Note the subject of the commit is screwed up and the correct subject is actually in the body. Drivers: - A number of fixes to the marvell cesa hashing code. - Remove bogus nested irqsave that clobbers the saved flags in ccp" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: marvell/cesa - forward devm_ioremap_resource() error code crypto: marvell/cesa - initialize hash states crypto: marvell/cesa - fix memory leak crypto: ccp - fix lock acquisition code eCryptfs: Use skcipher and shash
2016-03-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge third patch-bomb from Andrew Morton: - more ocfs2 changes - a few hotfixes - Andy's compat cleanups - misc fixes to fatfs, ptrace, coredump, cpumask, creds, eventfd, panic, ipmi, kgdb, profile, kfifo, ubsan, etc. - many rapidio updates: fixes, new drivers. - kcov: kernel code coverage feature. Like gcov, but not "prohibitively expensive". - extable code consolidation for various archs * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (81 commits) ia64/extable: use generic search and sort routines x86/extable: use generic search and sort routines s390/extable: use generic search and sort routines alpha/extable: use generic search and sort routines kernel/...: convert pr_warning to pr_warn drivers: dma-coherent: use memset_io for DMA_MEMORY_IO mappings drivers: dma-coherent: use MEMREMAP_WC for DMA_MEMORY_MAP memremap: add MEMREMAP_WC flag memremap: don't modify flags kernel/signal.c: add compile-time check for __ARCH_SI_PREAMBLE_SIZE mm/mprotect.c: don't imply PROT_EXEC on non-exec fs ipc/sem: make semctl setting sempid consistent ubsan: fix tree-wide -Wmaybe-uninitialized false positives kfifo: fix sparse complaints scripts/gdb: account for changes in module data structure scripts/gdb: add cmdline reader command scripts/gdb: add version command kernel: add kcov code coverage profile: hide unused functions when !CONFIG_PROC_FS hpwdt: use nmi_panic() when kernel panics in NMI handler ...
2016-03-22eventfd: document lockless access in eventfd_pollPaolo Bonzini
Since commit e22553e2a25e ("eventfd: don't take the spinlock in eventfd_poll", 2015-02-17), eventfd is reading ctx->count outside ctx->wqh.lock. However, things aren't as simple as the read barrier in eventfd_poll would suggest. In fact, the read barrier, besides lacking a comment, is not paired in any obvious manner with another read barrier, and it is pointless because it is sitting between a write (deep in poll_wait) and the read of ctx->count. The read barrier is acting just as a compiler barrier, for which we can use READ_ONCE instead. This is what the code change in this patch does. The documentation change is just as important, however. The question, posed by Andrea Arcangeli, is then why the thing is safe on architectures where spin_unlock does not imply a store-load memory barrier. The answer is that it's safe because writes of ctx->count use the same lock as poll_wait, and hence an acquire barrier implicit in poll_wait provides the necessary synchronization between eventfd_poll and callers of wake_up_locked_poll. This is sort of mentioned in the commit message with respect to eventfd_ctx_read ("eventfd_read is similar, it will do a single decrement with the lock held") but it applies to all other callers too. It's tricky enough that it should be documented in the code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Mason <clm@fb.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22fs/coredump: prevent fsuid=0 dumps into user-controlled directoriesJann Horn
This commit fixes the following security hole affecting systems where all of the following conditions are fulfilled: - The fs.suid_dumpable sysctl is set to 2. - The kernel.core_pattern sysctl's value starts with "/". (Systems where kernel.core_pattern starts with "|/" are not affected.) - Unprivileged user namespace creation is permitted. (This is true on Linux >=3.8, but some distributions disallow it by default using a distro patch.) Under these conditions, if a program executes under secure exec rules, causing it to run with the SUID_DUMP_ROOT flag, then unshares its user namespace, changes its root directory and crashes, the coredump will be written using fsuid=0 and a path derived from kernel.core_pattern - but this path is interpreted relative to the root directory of the process, allowing the attacker to control where a coredump will be written with root privileges. To fix the security issue, always interpret core_pattern for dumps that are written under SUID_DUMP_ROOT relative to the root directory of init. Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22fat: add config option to set UTF-8 mount option by defaultMaciej S. Szmigiero
FAT has long supported its own default file name encoding config setting, separate from CONFIG_NLS_DEFAULT. However, if UTF-8 encoded file names are desired FAT character set should not be set to utf8 since this would make file names case sensitive even if case insensitive matching is requested. Instead, "utf8" mount options should be provided to enable UTF-8 file names in FAT file system. Unfortunately, there was no possibility to set the default value of this option so on UTF-8 system "utf8" mount option had to be added manually to most FAT mounts. This patch adds config option to set such default value. Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22ext4: in ext4_dir_llseek, check syscall bitness directlyAndy Lutomirski
ext4 treats directory offsets differently for 32-bit and 64-bit callers. Check the caller type using in_compat_syscall, not is_compat_task. This changes behavior on SPARC slightly. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22ocfs2: check/fix inode block for online file checkGang He
Implement online check or fix inode block during reading a inode block to memory. Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22ocfs2: create/remove sysfile for online file checkGang He
Create online file check sysfile when ocfs2 mount, remove the related sysfile when ocfs2 umount. Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22ocfs2: sysfile interfaces for online file checkGang He
Implement online file check sysfile interfaces, e.g. how to create the related sysfile according to device name, how to display/handle file check request from the sysfile. Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22ocfs2: export ocfs2_kset for online file checkGang He
When there are errors in the ocfs2 filesystem, they are usually accompanied by the inode number which caused the error. This inode number would be the input to fixing the file. One of these options could be considered: A file in the sys filesytem which would accept inode numbers. This could be used to communication back what has to be fixed or is fixed. You could write: $# echo "<inode>" > /sys/fs/ocfs2/devname/filecheck/check or $# echo "<inode>" > /sys/fs/ocfs2/devname/filecheck/fix Compare with second version, I re-design filecheck sysfs interfaces, there are three sysfs files (check, fix and set) under filecheck directory (see above), sysfs will accept only one argument <inode>. Second, I adjust some code in ocfs2_filecheck_repair_inode_block() function according to upstream feedback, we cannot just add VALID_FL flag back as a inode block fix, then we will not fix this field corruption currently until having a complete solution. Compare with first version, I use strncasecmp instead of double strncmp functions. Second, update the source file contribution vendor. This patch (of 4): Export ocfs2_kset object from ocfs2_stackglue kernel module, then online file check code will create the related sysfiles under ocfs2_kset object. We're exporting this because it's built in ocfs2_stackglue.ko. Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22Merge tag 'nfs-for-4.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Features: - Add support for multiple NFSv4.1 callbacks in flight - Initial patchset for RPC multipath support - Adapt RPC/RDMA to use the new completion queue API Bugfixes and cleanups: - nfs4: nfs4_ff_layout_prepare_ds should return NULL if connection failed - Cleanups to remove nfs_inode_dio_wait and nfs4_file_fsync - Fix RPC/RDMA credit accounting - Properly handle RDMA_ERROR replies - xprtrdma: Do not wait if ib_post_send() fails - xprtrdma: Segment head and tail XDR buffers on page boundaries - xprtrdma cleanups for dprintk, physical_op_map and unused macros" * tag 'nfs-for-4.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (35 commits) nfs/blocklayout: make sure making a aligned read request nfs4: nfs4_ff_layout_prepare_ds should return NULL if connection failed nfs: remove nfs_inode_dio_wait nfs: remove nfs4_file_fsync xprtrdma: Use new CQ API for RPC-over-RDMA client send CQs xprtrdma: Use an anonymous union in struct rpcrdma_mw xprtrdma: Use new CQ API for RPC-over-RDMA client receive CQs xprtrdma: Serialize credit accounting again xprtrdma: Properly handle RDMA_ERROR replies rpcrdma: Add RPCRDMA_HDRLEN_ERR xprtrdma: Do not wait if ib_post_send() fails xprtrdma: Segment head and tail XDR buffers on page boundaries xprtrdma: Clean up dprintk format string containing a newline xprtrdma: Clean up physical_op_map() xprtrdma: Clean up unused RPCRDMA_INLINE_PAD_THRESH macro NFS add callback_ops to nfs4_proc_bind_conn_to_session_callback pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3 SUNRPC: Allow addition of new transports to a struct rpc_clnt NFSv4.1: nfs4_proc_bind_conn_to_session must iterate over all connections SUNRPC: Make NFS swap work with multipath ...
2016-03-22Merge branch 'overlayfs-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: "Various fixes and tweaks" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: cleanup unused var in rename2 ovl: rename is_merge to is_lowest ovl: fixed coding style warning ovl: Ensure upper filesystem supports d_type ovl: Warn on copy up if a process has a R/O fd open to the lower file ovl: honor flag MS_SILENT at mount ovl: verify upper dentry before unlink and rename
2016-03-22Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse update from Miklos Szeredi: "This contains direct I/O fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: return patrial success from fuse_direct_io() fuse: Add reference counting for fuse_io_priv fuse: do not use iocb after it may have been freed
2016-03-22nfsd: better layoutupdate bounds-checkingJ. Bruce Fields
You could add any multiple of 2^32/PNFS_SCSI_RANGE_SIZE to nr_iomaps and still pass this check. You'd probably still fail the following kcalloc, but best to be paranoid since this is from-the-wire data. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-21Merge branch 'for-linus-4.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "We have a good sized cleanup of our internal read ahead code, and the first series of commits from Chandan to enable PAGE_SIZE > sectorsize Otherwise, it's a normal series of cleanups and fixes, with many thanks to Dave Sterba for doing most of the patch wrangling this time" * 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (82 commits) btrfs: make sure we stay inside the bvec during __btrfs_lookup_bio_sums btrfs: Fix misspellings in comments. btrfs: Print Warning only if ENOSPC_DEBUG is enabled btrfs: scrub: silence an uninitialized variable warning btrfs: move btrfs_compression_type to compression.h btrfs: rename btrfs_print_info to btrfs_print_mod_info Btrfs: Show a warning message if one of objectid reaches its highest value Documentation: btrfs: remove usage specific information btrfs: use kbasename in btrfsic_mount Btrfs: do not collect ordered extents when logging that inode exists Btrfs: fix race when checking if we can skip fsync'ing an inode Btrfs: fix listxattrs not listing all xattrs packed in the same item Btrfs: fix deadlock between direct IO reads and buffered writes Btrfs: fix extent_same allowing destination offset beyond i_size Btrfs: fix file loss on log replay after renaming a file and fsync Btrfs: fix unreplayable log after snapshot delete + parent dir fsync Btrfs: fix lockdep deadlock warning due to dev_replace btrfs: drop unused argument in btrfs_ioctl_get_supported_features btrfs: add GET_SUPPORTED_FEATURES to the control device ioctls btrfs: change max_inline default to 2048 ...
2016-03-21Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull UDF and quota updates from Jan Kara: "This contains a rewrite of UDF handling of filename encoding to fix remaining overflow issues from Andrew Gabbasov and quota changes to support new Q_[X]GETNEXTQUOTA quotactl for VFS quota formats" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: quota: Fix possible GPF due to uninitialised pointers ext4: Make Q_GETNEXTQUOTA work for quota in hidden inodes quota: Forbid Q_GETQUOTA and Q_GETNEXTQUOTA for frozen filesystem quota: Fix possible races during quota loading ocfs2: Implement get_next_id() quota_v2: Implement get_next_id() for V2 quota format quota: Add support for ->get_nextdqblk() for VFS quota udf: Merge linux specific translation into CS0 conversion function udf: Remove struct ustr as non-needed intermediate storage udf: Use separate buffer for copying split names udf: Adjust UDF_NAME_LEN to better reflect actual restrictions udf: Join functions for UTF8 and NLS conversions udf: Parameterize output length in udf_put_filename quota: Allow Q_GETQUOTA for frozen filesystem quota: Fixup comments about return value of Q_[X]GETNEXTQUOTA
2016-03-21Merge tag 'xfs-for-linus-4.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs updates from Dave Chinner: "There's quite a lot in this request, and there's some cross-over with ext4, dax and quota code due to the nature of the changes being made. As for the rest of the XFS changes, there are lots of little things all over the place, which add up to a lot of changes in the end. The major changes are that we've reduced the size of the struct xfs_inode by ~100 bytes (gives an inode cache footprint reduction of >10%), the writepage code now only does a single set of mapping tree lockups so uses less CPU, delayed allocation reservations won't overrun under random write loads anymore, and we added compile time verification for on-disk structure sizes so we find out when a commit or platform/compiler change breaks the on disk structure as early as possible. Change summary: - error propagation for direct IO failures fixes for both XFS and ext4 - new quota interfaces and XFS implementation for iterating all the quota IDs in the filesystem - locking fixes for real-time device extent allocation - reduction of duplicate information in the xfs and vfs inode, saving roughly 100 bytes of memory per cached inode. - buffer flag cleanup - rework of the writepage code to use the generic write clustering mechanisms - several fixes for inode flag based DAX enablement - rework of remount option parsing - compile time verification of on-disk format structure sizes - delayed allocation reservation overrun fixes - lots of little error handling fixes - small memory leak fixes - enable xfsaild freezing again" * tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits) xfs: always set rvalp in xfs_dir2_node_trim_free xfs: ensure committed is initialized in xfs_trans_roll xfs: borrow indirect blocks from freed extent when available xfs: refactor delalloc indlen reservation split into helper xfs: update freeblocks counter after extent deletion xfs: debug mode forced buffered write failure xfs: remove impossible condition xfs: check sizes of XFS on-disk structures at compile time xfs: ioends require logically contiguous file offsets xfs: use named array initializers for log item dumping xfs: fix computation of inode btree maxlevels xfs: reinitialise per-AG structures if geometry changes during recovery xfs: remove xfs_trans_get_block_res xfs: fix up inode32/64 (re)mount handling xfs: fix format specifier , should be %llx and not %llu xfs: sanitize remount options xfs: convert mount option parsing to tokens xfs: fix two memory leaks in xfs_attr_list.c error paths xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared ...
2016-03-21Merge tag 'for-f2fs-4.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "New Features: - uplift filesystem encryption into fs/crypto/ - give sysfs entries to control memroy consumption Enhancements: - aio performance by preallocating blocks in ->write_iter - use writepages lock for only WB_SYNC_ALL - avoid redundant inline_data conversion - enhance forground GC - use wait_for_stable_page as possible - speed up SEEK_DATA and fiiemap Bug Fixes: - corner case in terms of -ENOSPC for inline_data - hung task caused by long latency in shrinker - corruption between atomic write and f2fs_trace_pid - avoid garbage lengths in dentries - revoke atomicly written pages if an error occurs In addition, there are various minor bug fixes and clean-ups" * tag 'for-f2fs-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (81 commits) f2fs: submit node page write bios when really required f2fs: add missing argument to f2fs_setxattr stub f2fs: fix to avoid unneeded unlock_new_inode f2fs: clean up opened code with f2fs_update_dentry f2fs: declare static functions f2fs: use cryptoapi crc32 functions f2fs: modify the readahead method in ra_node_page() f2fs crypto: sync ext4_lookup and ext4_file_open fs crypto: move per-file encryption from f2fs tree to fs/crypto f2fs: mutex can't be used by down_write_nest_lock() f2fs: recovery missing dot dentries in root directory f2fs: fix to avoid deadlock when merging inline data f2fs: introduce f2fs_flush_merged_bios for cleanup f2fs: introduce f2fs_update_data_blkaddr for cleanup f2fs crypto: fix incorrect positioning for GCing encrypted data page f2fs: fix incorrect upper bound when iterating inode mapping tree f2fs: avoid hungtask problem caused by losing wake_up f2fs: trace old block address for CoWed page f2fs: try to flush inode after merging inline data f2fs: show more info about superblock recovery ...
2016-03-21Merge branch 'for-4.6-ns' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup namespace support from Tejun Heo: "These are changes to implement namespace support for cgroup which has been pending for quite some time now. It is very straight-forward and only affects what part of cgroup hierarchies are visible. After unsharing, mounting a cgroup fs will be scoped to the cgroups the task belonged to at the time of unsharing and the cgroup paths exposed to userland would be adjusted accordingly" * 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: fix and restructure error handling in copy_cgroup_ns() cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns() Add FS_USERNS_FLAG to cgroup fs cgroup: Add documentation for cgroup namespaces cgroup: mount cgroupns-root when inside non-init cgroupns kernfs: define kernfs_node_dentry cgroup: cgroup namespace setns support cgroup: introduce cgroup namespaces sched: new clone flag CLONE_NEWCGROUP for cgroup namespace kernfs: Add API to generate relative kernfs path
2016-03-21nfs/blocklayout: make sure making a aligned read requestKinglong Mee
Only treat write goes up to the inode size as aligned request, because it always write PAGE_CACHE_SIZE, but read a dynamic size. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-03-21ovl: cleanup unused var in rename2Miklos Szeredi
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-03-21ovl: rename is_merge to is_lowestMiklos Szeredi
The 'is_merge' is an historical naming from when only a single lower layer could exist. With the introduction of multiple lower layers the meaning of this flag was changed to mean only the "lowest layer" (while all lower layers were being merged). So now 'is_merge' is inaccurate and hence renaming to 'is_lowest' Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-03-21ovl: fixed coding style warningSohom Bhattacharjee
This patch fixes a newline warning found by the checkpatch.pl tool Signed-off-by: Sohom-Bhattacharjee <soham.bhattacharjee15@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-03-21ovl: Ensure upper filesystem supports d_typeVivek Goyal
In some instances xfs has been created with ftype=0 and there if a file on lower fs is removed, overlay leaves a whiteout in upper fs but that whiteout does not get filtered out and is visible to overlayfs users. And reason it does not get filtered out because upper filesystem does not report file type of whiteout as DT_CHR during iterate_dir(). So it seems to be a requirement that upper filesystem support d_type for overlayfs to work properly. Do this check during mount and fail if d_type is not supported. Suggested-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>