From e533d8da923c77ed58940309c401584bffda75e0 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Tue, 7 Apr 2009 04:19:49 +0200 Subject: reiserfs: kill-the-BKL This patch is an attempt to remove the Bkl based locking scheme from reiserfs and is intended. It is a bit inspired from an old attempt by Peter Zijlstra: http://lkml.indiana.edu/hypermail/linux/kernel/0704.2/2174.html The bkl is heavily used in this filesystem to prevent from concurrent write accesses on the filesystem. Reiserfs makes a deep use of the specific properties of the Bkl: - It can be acqquired recursively by a same task - It is released on the schedule() calls and reacquired when schedule() returns The two properties above are a roadmap for the reiserfs write locking so it's very hard to simply replace it with a common mutex. - We need a recursive-able locking unless we want to restructure several blocks of the code. - We need to identify the sites where the bkl was implictly relaxed (schedule, wait, sync, etc...) so that we can in turn release and reacquire our new lock explicitly. Such implicit releases of the lock are often required to let other resources producer/consumer do their job or we can suffer unexpected starvations or deadlocks. So the new lock that replaces the bkl here is a per superblock mutex with a specific property: it can be acquired recursively by a same task, like the bkl. For such purpose, we integrate a lock owner and a lock depth field on the superblock information structure. The first axis on this patch is to turn reiserfs_write_(un)lock() function into a wrapper to manage this mutex. Also some explicit calls to lock_kernel() have been converted to reiserfs_write_lock() helpers. The second axis is to find the important blocking sites (schedule...(), wait_on_buffer(), sync_dirty_buffer(), etc...) and then apply an explicit release of the write lock on these locations before blocking. Then we can safely wait for those who can give us resources or those who need some. Typically this is a fight between the current writer, the reiserfs workqueue (aka the async commiter) and the pdflush threads. The third axis is a consequence of the second. The write lock is usually on top of a lock dependency chain which can include the journal lock, the flush lock or the commit lock. So it's dangerous to release and trying to reacquire the write lock while we still hold other locks. This is fine with the bkl: T1 T2 lock_kernel() mutex_lock(A) unlock_kernel() // do something lock_kernel() mutex_lock(A) -> already locked by T1 schedule() (and then unlock_kernel()) lock_kernel() mutex_unlock(A) .... This is not fine with a mutex: T1 T2 mutex_lock(write) mutex_lock(A) mutex_unlock(write) // do something mutex_lock(write) mutex_lock(A) -> already locked by T1 schedule() mutex_lock(write) -> already locked by T2 deadlock The solution in this patch is to provide a helper which releases the write lock and sleep a bit if we can't lock a mutex that depend on it. It's another simulation of the bkl behaviour. The last axis is to locate the fs callbacks that are called with the bkl held, according to Documentation/filesystem/Locking. Those are: - reiserfs_remount - reiserfs_fill_super - reiserfs_put_super Reiserfs didn't need to explicitly lock because of the context of these callbacks. But now we must take care of that with the new locking. After this patch, reiserfs suffers from a slight performance regression (for now). On UP, a high volume write with dd reports an average of 27 MB/s instead of 30 MB/s without the patch applied. Signed-off-by: Frederic Weisbecker Reviewed-by: Ingo Molnar Cc: Jeff Mahoney Cc: Peter Zijlstra Cc: Bron Gondwana Cc: Andrew Morton Cc: Linus Torvalds Cc: Alexander Viro LKML-Reference: <1239070789-13354-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar --- fs/reiserfs/Makefile | 2 +- fs/reiserfs/bitmap.c | 2 + fs/reiserfs/dir.c | 8 +++ fs/reiserfs/fix_node.c | 10 +++ fs/reiserfs/inode.c | 23 +++++-- fs/reiserfs/ioctl.c | 6 +- fs/reiserfs/journal.c | 134 ++++++++++++++++++++++++++++++++--------- fs/reiserfs/lock.c | 63 +++++++++++++++++++ fs/reiserfs/resize.c | 2 + fs/reiserfs/stree.c | 2 + fs/reiserfs/super.c | 41 ++++++++++++- include/linux/reiserfs_fs.h | 12 ++-- include/linux/reiserfs_fs_sb.h | 9 +++ 13 files changed, 271 insertions(+), 43 deletions(-) create mode 100644 fs/reiserfs/lock.c diff --git a/fs/reiserfs/Makefile b/fs/reiserfs/Makefile index 7c5ab6330dd6..6a9e30c041dd 100644 --- a/fs/reiserfs/Makefile +++ b/fs/reiserfs/Makefile @@ -7,7 +7,7 @@ obj-$(CONFIG_REISERFS_FS) += reiserfs.o reiserfs-objs := bitmap.o do_balan.o namei.o inode.o file.o dir.o fix_node.o \ super.o prints.o objectid.o lbalance.o ibalance.o stree.o \ hashes.o tail_conversion.o journal.o resize.o \ - item_ops.o ioctl.o procfs.o xattr.o + item_ops.o ioctl.o procfs.o xattr.o lock.o ifeq ($(CONFIG_REISERFS_FS_XATTR),y) reiserfs-objs += xattr_user.o xattr_trusted.o diff --git a/fs/reiserfs/bitmap.c b/fs/reiserfs/bitmap.c index e716161ab325..147033461b87 100644 --- a/fs/reiserfs/bitmap.c +++ b/fs/reiserfs/bitmap.c @@ -1256,7 +1256,9 @@ struct buffer_head *reiserfs_read_bitmap_block(struct super_block *sb, else { if (buffer_locked(bh)) { PROC_INFO_INC(sb, scan_bitmap.wait); + reiserfs_write_unlock(sb); __wait_on_buffer(bh); + reiserfs_write_lock(sb); } BUG_ON(!buffer_uptodate(bh)); BUG_ON(atomic_read(&bh->b_count) == 0); diff --git a/fs/reiserfs/dir.c b/fs/reiserfs/dir.c index 45ee3d357c70..e8b79d21fac2 100644 --- a/fs/reiserfs/dir.c +++ b/fs/reiserfs/dir.c @@ -176,14 +176,22 @@ int reiserfs_readdir_dentry(struct dentry *dentry, void *dirent, // user space buffer is swapped out. At that time // entry can move to somewhere else memcpy(local_buf, d_name, d_reclen); + + /* + * Since filldir might sleep, we can release + * the write lock here for other waiters + */ + reiserfs_write_unlock(inode->i_sb); if (filldir (dirent, local_buf, d_reclen, d_off, d_ino, DT_UNKNOWN) < 0) { + reiserfs_write_lock(inode->i_sb); if (local_buf != small_buf) { kfree(local_buf); } goto end; } + reiserfs_write_lock(inode->i_sb); if (local_buf != small_buf) { kfree(local_buf); } diff --git a/fs/reiserfs/fix_node.c b/fs/reiserfs/fix_node.c index 5e5a4e6fbaf8..bf5f2cbdb063 100644 --- a/fs/reiserfs/fix_node.c +++ b/fs/reiserfs/fix_node.c @@ -1022,7 +1022,11 @@ static int get_far_parent(struct tree_balance *tb, /* Check whether the common parent is locked. */ if (buffer_locked(*pcom_father)) { + + /* Release the write lock while the buffer is busy */ + reiserfs_write_unlock(tb->tb_sb); __wait_on_buffer(*pcom_father); + reiserfs_write_lock(tb->tb_sb); if (FILESYSTEM_CHANGED_TB(tb)) { brelse(*pcom_father); return REPEAT_SEARCH; @@ -1927,7 +1931,9 @@ static int get_direct_parent(struct tree_balance *tb, int h) return REPEAT_SEARCH; if (buffer_locked(bh)) { + reiserfs_write_unlock(tb->tb_sb); __wait_on_buffer(bh); + reiserfs_write_lock(tb->tb_sb); if (FILESYSTEM_CHANGED_TB(tb)) return REPEAT_SEARCH; } @@ -2278,7 +2284,9 @@ static int wait_tb_buffers_until_unlocked(struct tree_balance *tb) REPEAT_SEARCH : CARRY_ON; } #endif + reiserfs_write_unlock(tb->tb_sb); __wait_on_buffer(locked); + reiserfs_write_lock(tb->tb_sb); if (FILESYSTEM_CHANGED_TB(tb)) return REPEAT_SEARCH; } @@ -2349,7 +2357,9 @@ int fix_nodes(int op_mode, struct tree_balance *tb, /* if it possible in indirect_to_direct conversion */ if (buffer_locked(tbS0)) { + reiserfs_write_unlock(tb->tb_sb); __wait_on_buffer(tbS0); + reiserfs_write_lock(tb->tb_sb); if (FILESYSTEM_CHANGED_TB(tb)) return REPEAT_SEARCH; } diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c index 6fd0f47e45db..88ef0b710757 100644 --- a/fs/reiserfs/inode.c +++ b/fs/reiserfs/inode.c @@ -489,10 +489,14 @@ static int reiserfs_get_blocks_direct_io(struct inode *inode, disappeared */ if (REISERFS_I(inode)->i_flags & i_pack_on_close_mask) { int err; - lock_kernel(); + + reiserfs_write_lock(inode->i_sb); + err = reiserfs_commit_for_inode(inode); REISERFS_I(inode)->i_flags &= ~i_pack_on_close_mask; - unlock_kernel(); + + reiserfs_write_unlock(inode->i_sb); + if (err < 0) ret = err; } @@ -616,7 +620,6 @@ int reiserfs_get_block(struct inode *inode, sector_t block, loff_t new_offset = (((loff_t) block) << inode->i_sb->s_blocksize_bits) + 1; - /* bad.... */ reiserfs_write_lock(inode->i_sb); version = get_inode_item_key_version(inode); @@ -997,10 +1000,14 @@ int reiserfs_get_block(struct inode *inode, sector_t block, if (retval) goto failure; } - /* inserting indirect pointers for a hole can take a - ** long time. reschedule if needed + /* + * inserting indirect pointers for a hole can take a + * long time. reschedule if needed and also release the write + * lock for others. */ + reiserfs_write_unlock(inode->i_sb); cond_resched(); + reiserfs_write_lock(inode->i_sb); retval = search_for_position_by_key(inode->i_sb, &key, &path); if (retval == IO_ERROR) { @@ -2612,7 +2619,10 @@ int reiserfs_prepare_write(struct file *f, struct page *page, int ret; int old_ref = 0; + reiserfs_write_unlock(inode->i_sb); reiserfs_wait_on_write_block(inode->i_sb); + reiserfs_write_lock(inode->i_sb); + fix_tail_page_for_writing(page); if (reiserfs_transaction_running(inode->i_sb)) { struct reiserfs_transaction_handle *th; @@ -2762,7 +2772,10 @@ int reiserfs_commit_write(struct file *f, struct page *page, int update_sd = 0; struct reiserfs_transaction_handle *th = NULL; + reiserfs_write_unlock(inode->i_sb); reiserfs_wait_on_write_block(inode->i_sb); + reiserfs_write_lock(inode->i_sb); + if (reiserfs_transaction_running(inode->i_sb)) { th = current->journal_info; } diff --git a/fs/reiserfs/ioctl.c b/fs/reiserfs/ioctl.c index 0ccc3fdda7bf..5e40b0cd4c3d 100644 --- a/fs/reiserfs/ioctl.c +++ b/fs/reiserfs/ioctl.c @@ -141,9 +141,11 @@ long reiserfs_compat_ioctl(struct file *file, unsigned int cmd, default: return -ENOIOCTLCMD; } - lock_kernel(); + + reiserfs_write_lock(inode->i_sb); ret = reiserfs_ioctl(inode, file, cmd, (unsigned long) compat_ptr(arg)); - unlock_kernel(); + reiserfs_write_unlock(inode->i_sb); + return ret; } #endif diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c index 77f5bb746bf0..64dcabd30ace 100644 --- a/fs/reiserfs/journal.c +++ b/fs/reiserfs/journal.c @@ -429,21 +429,6 @@ static void clear_prepared_bits(struct buffer_head *bh) clear_buffer_journal_restore_dirty(bh); } -/* utility function to force a BUG if it is called without the big -** kernel lock held. caller is the string printed just before calling BUG() -*/ -void reiserfs_check_lock_depth(struct super_block *sb, char *caller) -{ -#ifdef CONFIG_SMP - if (current->lock_depth < 0) { - reiserfs_panic(sb, "journal-1", "%s called without kernel " - "lock held", caller); - } -#else - ; -#endif -} - /* return a cnode with same dev, block number and size in table, or null if not found */ static inline struct reiserfs_journal_cnode *get_journal_hash_dev(struct super_block @@ -552,11 +537,48 @@ static inline void insert_journal_hash(struct reiserfs_journal_cnode **table, journal_hash(table, cn->sb, cn->blocknr) = cn; } +/* + * Several mutexes depend on the write lock. + * However sometimes we want to relax the write lock while we hold + * these mutexes, according to the release/reacquire on schedule() + * properties of the Bkl that were used. + * Reiserfs performances and locking were based on this scheme. + * Now that the write lock is a mutex and not the bkl anymore, doing so + * may result in a deadlock: + * + * A acquire write_lock + * A acquire j_commit_mutex + * A release write_lock and wait for something + * B acquire write_lock + * B can't acquire j_commit_mutex and sleep + * A can't acquire write lock anymore + * deadlock + * + * What we do here is avoiding such deadlock by playing the same game + * than the Bkl: if we can't acquire a mutex that depends on the write lock, + * we release the write lock, wait a bit and then retry. + * + * The mutexes concerned by this hack are: + * - The commit mutex of a journal list + * - The flush mutex + * - The journal lock + */ +static inline void reiserfs_mutex_lock_safe(struct mutex *m, + struct super_block *s) +{ + while (!mutex_trylock(m)) { + reiserfs_write_unlock(s); + schedule(); + reiserfs_write_lock(s); + } +} + /* lock the current transaction */ static inline void lock_journal(struct super_block *sb) { PROC_INFO_INC(sb, journal.lock_journal); - mutex_lock(&SB_JOURNAL(sb)->j_mutex); + + reiserfs_mutex_lock_safe(&SB_JOURNAL(sb)->j_mutex, sb); } /* unlock the current transaction */ @@ -708,7 +730,9 @@ static void check_barrier_completion(struct super_block *s, disable_barrier(s); set_buffer_uptodate(bh); set_buffer_dirty(bh); + reiserfs_write_unlock(s); sync_dirty_buffer(bh); + reiserfs_write_lock(s); } } @@ -996,8 +1020,13 @@ static int reiserfs_async_progress_wait(struct super_block *s) { DEFINE_WAIT(wait); struct reiserfs_journal *j = SB_JOURNAL(s); - if (atomic_read(&j->j_async_throttle)) + + if (atomic_read(&j->j_async_throttle)) { + reiserfs_write_unlock(s); congestion_wait(WRITE, HZ / 10); + reiserfs_write_lock(s); + } + return 0; } @@ -1043,7 +1072,8 @@ static int flush_commit_list(struct super_block *s, } /* make sure nobody is trying to flush this one at the same time */ - mutex_lock(&jl->j_commit_mutex); + reiserfs_mutex_lock_safe(&jl->j_commit_mutex, s); + if (!journal_list_still_alive(s, trans_id)) { mutex_unlock(&jl->j_commit_mutex); goto put_jl; @@ -1061,12 +1091,17 @@ static int flush_commit_list(struct super_block *s, if (!list_empty(&jl->j_bh_list)) { int ret; - unlock_kernel(); + + /* + * We might sleep in numerous places inside + * write_ordered_buffers. Relax the write lock. + */ + reiserfs_write_unlock(s); ret = write_ordered_buffers(&journal->j_dirty_buffers_lock, journal, jl, &jl->j_bh_list); if (ret < 0 && retval == 0) retval = ret; - lock_kernel(); + reiserfs_write_lock(s); } BUG_ON(!list_empty(&jl->j_bh_list)); /* @@ -1114,12 +1149,19 @@ static int flush_commit_list(struct super_block *s, bn = SB_ONDISK_JOURNAL_1st_BLOCK(s) + (jl->j_start + i) % SB_ONDISK_JOURNAL_SIZE(s); tbh = journal_find_get_block(s, bn); + + reiserfs_write_unlock(s); wait_on_buffer(tbh); + reiserfs_write_lock(s); // since we're using ll_rw_blk above, it might have skipped over // a locked buffer. Double check here // - if (buffer_dirty(tbh)) /* redundant, sync_dirty_buffer() checks */ + /* redundant, sync_dirty_buffer() checks */ + if (buffer_dirty(tbh)) { + reiserfs_write_unlock(s); sync_dirty_buffer(tbh); + reiserfs_write_lock(s); + } if (unlikely(!buffer_uptodate(tbh))) { #ifdef CONFIG_REISERFS_CHECK reiserfs_warning(s, "journal-601", @@ -1143,10 +1185,15 @@ static int flush_commit_list(struct super_block *s, if (buffer_dirty(jl->j_commit_bh)) BUG(); mark_buffer_dirty(jl->j_commit_bh) ; + reiserfs_write_unlock(s); sync_dirty_buffer(jl->j_commit_bh) ; + reiserfs_write_lock(s); } - } else + } else { + reiserfs_write_unlock(s); wait_on_buffer(jl->j_commit_bh); + reiserfs_write_lock(s); + } check_barrier_completion(s, jl->j_commit_bh); @@ -1286,7 +1333,9 @@ static int _update_journal_header_block(struct super_block *sb, if (trans_id >= journal->j_last_flush_trans_id) { if (buffer_locked((journal->j_header_bh))) { + reiserfs_write_unlock(sb); wait_on_buffer((journal->j_header_bh)); + reiserfs_write_lock(sb); if (unlikely(!buffer_uptodate(journal->j_header_bh))) { #ifdef CONFIG_REISERFS_CHECK reiserfs_warning(sb, "journal-699", @@ -1312,12 +1361,16 @@ static int _update_journal_header_block(struct super_block *sb, disable_barrier(sb); goto sync; } + reiserfs_write_unlock(sb); wait_on_buffer(journal->j_header_bh); + reiserfs_write_lock(sb); check_barrier_completion(sb, journal->j_header_bh); } else { sync: set_buffer_dirty(journal->j_header_bh); + reiserfs_write_unlock(sb); sync_dirty_buffer(journal->j_header_bh); + reiserfs_write_lock(sb); } if (!buffer_uptodate(journal->j_header_bh)) { reiserfs_warning(sb, "journal-837", @@ -1409,7 +1462,7 @@ static int flush_journal_list(struct super_block *s, /* if flushall == 0, the lock is already held */ if (flushall) { - mutex_lock(&journal->j_flush_mutex); + reiserfs_mutex_lock_safe(&journal->j_flush_mutex, s); } else if (mutex_trylock(&journal->j_flush_mutex)) { BUG(); } @@ -1553,7 +1606,11 @@ static int flush_journal_list(struct super_block *s, reiserfs_panic(s, "journal-1011", "cn->bh is NULL"); } + + reiserfs_write_unlock(s); wait_on_buffer(cn->bh); + reiserfs_write_lock(s); + if (!cn->bh) { reiserfs_panic(s, "journal-1012", "cn->bh is NULL"); @@ -1973,11 +2030,19 @@ static int do_journal_release(struct reiserfs_transaction_handle *th, reiserfs_mounted_fs_count--; /* wait for all commits to finish */ cancel_delayed_work(&SB_JOURNAL(sb)->j_work); + + /* + * We must release the write lock here because + * the workqueue job (flush_async_commit) needs this lock + */ + reiserfs_write_unlock(sb); flush_workqueue(commit_wq); + if (!reiserfs_mounted_fs_count) { destroy_workqueue(commit_wq); commit_wq = NULL; } + reiserfs_write_lock(sb); free_journal_ram(sb); @@ -2243,7 +2308,11 @@ static int journal_read_transaction(struct super_block *sb, /* read in the log blocks, memcpy to the corresponding real block */ ll_rw_block(READ, get_desc_trans_len(desc), log_blocks); for (i = 0; i < get_desc_trans_len(desc); i++) { + + reiserfs_write_unlock(sb); wait_on_buffer(log_blocks[i]); + reiserfs_write_lock(sb); + if (!buffer_uptodate(log_blocks[i])) { reiserfs_warning(sb, "journal-1212", "REPLAY FAILURE fsck required! " @@ -2964,8 +3033,11 @@ static void queue_log_writer(struct super_block *s) init_waitqueue_entry(&wait, current); add_wait_queue(&journal->j_join_wait, &wait); set_current_state(TASK_UNINTERRUPTIBLE); - if (test_bit(J_WRITERS_QUEUED, &journal->j_state)) + if (test_bit(J_WRITERS_QUEUED, &journal->j_state)) { + reiserfs_write_unlock(s); schedule(); + reiserfs_write_lock(s); + } __set_current_state(TASK_RUNNING); remove_wait_queue(&journal->j_join_wait, &wait); } @@ -2982,7 +3054,9 @@ static void let_transaction_grow(struct super_block *sb, unsigned int trans_id) struct reiserfs_journal *journal = SB_JOURNAL(sb); unsigned long bcount = journal->j_bcount; while (1) { + reiserfs_write_unlock(sb); schedule_timeout_uninterruptible(1); + reiserfs_write_lock(sb); journal->j_current_jl->j_state |= LIST_COMMIT_PENDING; while ((atomic_read(&journal->j_wcount) > 0 || atomic_read(&journal->j_jlock)) && @@ -3033,7 +3107,9 @@ static int do_journal_begin_r(struct reiserfs_transaction_handle *th, if (test_bit(J_WRITERS_BLOCKED, &journal->j_state)) { unlock_journal(sb); + reiserfs_write_unlock(sb); reiserfs_wait_on_write_block(sb); + reiserfs_write_lock(sb); PROC_INFO_INC(sb, journal.journal_relock_writers); goto relock; } @@ -3506,14 +3582,14 @@ static void flush_async_commits(struct work_struct *work) struct reiserfs_journal_list *jl; struct list_head *entry; - lock_kernel(); + reiserfs_write_lock(sb); if (!list_empty(&journal->j_journal_list)) { /* last entry is the youngest, commit it and you get everything */ entry = journal->j_journal_list.prev; jl = JOURNAL_LIST_ENTRY(entry); flush_commit_list(sb, jl, 1); } - unlock_kernel(); + reiserfs_write_unlock(sb); } /* @@ -4041,7 +4117,7 @@ static int do_journal_end(struct reiserfs_transaction_handle *th, * the new transaction is fully setup, and we've already flushed the * ordered bh list */ - mutex_lock(&jl->j_commit_mutex); + reiserfs_mutex_lock_safe(&jl->j_commit_mutex, sb); /* save the transaction id in case we need to commit it later */ commit_trans_id = jl->j_trans_id; @@ -4203,10 +4279,10 @@ static int do_journal_end(struct reiserfs_transaction_handle *th, * is lost. */ if (!list_empty(&jl->j_tail_bh_list)) { - unlock_kernel(); + reiserfs_write_unlock(sb); write_ordered_buffers(&journal->j_dirty_buffers_lock, journal, jl, &jl->j_tail_bh_list); - lock_kernel(); + reiserfs_write_lock(sb); } BUG_ON(!list_empty(&jl->j_tail_bh_list)); mutex_unlock(&jl->j_commit_mutex); diff --git a/fs/reiserfs/lock.c b/fs/reiserfs/lock.c new file mode 100644 index 000000000000..cdd8d9ef048e --- /dev/null +++ b/fs/reiserfs/lock.c @@ -0,0 +1,63 @@ +#include +#include + +/* + * The previous reiserfs locking scheme was heavily based on + * the tricky properties of the Bkl: + * + * - it was acquired recursively by a same task + * - the performances relied on the release-while-schedule() property + * + * Now that we replace it by a mutex, we still want to keep the same + * recursive property to avoid big changes in the code structure. + * We use our own lock_owner here because the owner field on a mutex + * is only available in SMP or mutex debugging, also we only need this field + * for this mutex, no need for a system wide mutex facility. + * + * Also this lock is often released before a call that could block because + * reiserfs performances were partialy based on the release while schedule() + * property of the Bkl. + */ +void reiserfs_write_lock(struct super_block *s) +{ + struct reiserfs_sb_info *sb_i = REISERFS_SB(s); + + if (sb_i->lock_owner != current) { + mutex_lock(&sb_i->lock); + sb_i->lock_owner = current; + } + + /* No need to protect it, only the current task touches it */ + sb_i->lock_depth++; +} + +void reiserfs_write_unlock(struct super_block *s) +{ + struct reiserfs_sb_info *sb_i = REISERFS_SB(s); + + /* + * Are we unlocking without even holding the lock? + * Such a situation could even raise a BUG() if we don't + * want the data become corrupted + */ + WARN_ONCE(sb_i->lock_owner != current, + "Superblock write lock imbalance"); + + if (--sb_i->lock_depth == -1) { + sb_i->lock_owner = NULL; + mutex_unlock(&sb_i->lock); + } +} + +/* + * Utility function to force a BUG if it is called without the superblock + * write lock held. caller is the string printed just before calling BUG() + */ +void reiserfs_check_lock_depth(struct super_block *sb, char *caller) +{ + struct reiserfs_sb_info *sb_i = REISERFS_SB(sb); + + if (sb_i->lock_depth < 0) + reiserfs_panic(sb, "%s called without kernel lock held %d", + caller); +} diff --git a/fs/reiserfs/resize.c b/fs/reiserfs/resize.c index 238e9d9b31e0..6a7bfb311de6 100644 --- a/fs/reiserfs/resize.c +++ b/fs/reiserfs/resize.c @@ -142,7 +142,9 @@ int reiserfs_resize(struct super_block *s, unsigned long block_count_new) set_buffer_uptodate(bh); mark_buffer_dirty(bh); + reiserfs_write_unlock(s); sync_dirty_buffer(bh); + reiserfs_write_lock(s); // update bitmap_info stuff bitmap[i].free_count = sb_blocksize(sb) * 8 - 1; brelse(bh); diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c index d036ee5b1c81..6bd99a99a652 100644 --- a/fs/reiserfs/stree.c +++ b/fs/reiserfs/stree.c @@ -629,7 +629,9 @@ int search_by_key(struct super_block *sb, const struct cpu_key *key, /* Key to s search_by_key_reada(sb, reada_bh, reada_blocks, reada_count); ll_rw_block(READ, 1, &bh); + reiserfs_write_unlock(sb); wait_on_buffer(bh); + reiserfs_write_lock(sb); if (!buffer_uptodate(bh)) goto io_error; } else { diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c index 3567fb9e3fb1..bafa374d006d 100644 --- a/fs/reiserfs/super.c +++ b/fs/reiserfs/super.c @@ -468,6 +468,13 @@ static void reiserfs_put_super(struct super_block *s) struct reiserfs_transaction_handle th; th.t_trans_id = 0; + /* + * We didn't need to explicitly lock here before, because put_super + * is called with the bkl held. + * Now that we have our own lock, we must explicitly lock. + */ + reiserfs_write_lock(s); + /* change file system state to current state if it was mounted with read-write permissions */ if (!(s->s_flags & MS_RDONLY)) { if (!journal_begin(&th, s, 10)) { @@ -497,6 +504,8 @@ static void reiserfs_put_super(struct super_block *s) reiserfs_proc_info_done(s); + reiserfs_write_unlock(s); + mutex_destroy(&REISERFS_SB(s)->lock); kfree(s->s_fs_info); s->s_fs_info = NULL; @@ -1189,7 +1198,15 @@ static int reiserfs_remount(struct super_block *s, int *mount_flags, char *arg) unsigned int qfmt = 0; #ifdef CONFIG_QUOTA int i; +#endif + /* + * We used to protect using the implicitly acquired bkl here. + * Now we must explictly acquire our own lock + */ + reiserfs_write_lock(s); + +#ifdef CONFIG_QUOTA memcpy(qf_names, REISERFS_SB(s)->s_qf_names, sizeof(qf_names)); #endif @@ -1315,9 +1332,11 @@ static int reiserfs_remount(struct super_block *s, int *mount_flags, char *arg) out_ok: replace_mount_options(s, new_opts); + reiserfs_write_unlock(s); return 0; out_err: + reiserfs_write_unlock(s); kfree(new_opts); return err; } @@ -1422,7 +1441,9 @@ static int read_super_block(struct super_block *s, int offset) static int reread_meta_blocks(struct super_block *s) { ll_rw_block(READ, 1, &(SB_BUFFER_WITH_SB(s))); + reiserfs_write_unlock(s); wait_on_buffer(SB_BUFFER_WITH_SB(s)); + reiserfs_write_lock(s); if (!buffer_uptodate(SB_BUFFER_WITH_SB(s))) { reiserfs_warning(s, "reiserfs-2504", "error reading the super"); return 1; @@ -1631,7 +1652,7 @@ static int reiserfs_fill_super(struct super_block *s, void *data, int silent) sbi = kzalloc(sizeof(struct reiserfs_sb_info), GFP_KERNEL); if (!sbi) { errval = -ENOMEM; - goto error; + goto error_alloc; } s->s_fs_info = sbi; /* Set default values for options: non-aggressive tails, RO on errors */ @@ -1645,6 +1666,20 @@ static int reiserfs_fill_super(struct super_block *s, void *data, int silent) /* setup default block allocator options */ reiserfs_init_alloc_options(s); + mutex_init(&REISERFS_SB(s)->lock); + REISERFS_SB(s)->lock_depth = -1; + + /* + * This function is called with the bkl, which also was the old + * locking used here. + * do_journal_begin() will soon check if we hold the lock (ie: was the + * bkl). This is likely because do_journal_begin() has several another + * callers because at this time, it doesn't seem to be necessary to + * protect against anything. + * Anyway, let's be conservative and lock for now. + */ + reiserfs_write_lock(s); + jdev_name = NULL; if (reiserfs_parse_options (s, (char *)data, &(sbi->s_mount_opt), &blocks, &jdev_name, @@ -1870,9 +1905,13 @@ static int reiserfs_fill_super(struct super_block *s, void *data, int silent) init_waitqueue_head(&(sbi->s_wait)); spin_lock_init(&sbi->bitmap_lock); + reiserfs_write_unlock(s); + return (0); error: + reiserfs_write_unlock(s); +error_alloc: if (jinit_done) { /* kill the commit thread, free journal ram */ journal_release_error(NULL, s); } diff --git a/include/linux/reiserfs_fs.h b/include/linux/reiserfs_fs.h index 2245c78d5876..f6b7b7bb64d7 100644 --- a/include/linux/reiserfs_fs.h +++ b/include/linux/reiserfs_fs.h @@ -52,11 +52,13 @@ #define REISERFS_IOC32_GETVERSION FS_IOC32_GETVERSION #define REISERFS_IOC32_SETVERSION FS_IOC32_SETVERSION -/* Locking primitives */ -/* Right now we are still falling back to (un)lock_kernel, but eventually that - would evolve into real per-fs locks */ -#define reiserfs_write_lock( sb ) lock_kernel() -#define reiserfs_write_unlock( sb ) unlock_kernel() +/* + * Locking primitives. The write lock is a per superblock + * special mutex that has properties close to the Big Kernel Lock + * which was used in the previous locking scheme. + */ +void reiserfs_write_lock(struct super_block *s); +void reiserfs_write_unlock(struct super_block *s); struct fid; diff --git a/include/linux/reiserfs_fs_sb.h b/include/linux/reiserfs_fs_sb.h index 6473650c28f1..8a1d409eb12e 100644 --- a/include/linux/reiserfs_fs_sb.h +++ b/include/linux/reiserfs_fs_sb.h @@ -7,6 +7,8 @@ #ifdef __KERNEL__ #include #include +#include +#include #endif typedef enum { @@ -355,6 +357,13 @@ struct reiserfs_sb_info { struct reiserfs_journal *s_journal; /* pointer to journal information */ unsigned short s_mount_state; /* reiserfs state (valid, invalid) */ + /* Serialize writers access, replace the old bkl */ + struct mutex lock; + /* Owner of the lock (can be recursive) */ + struct task_struct *lock_owner; + /* Depth of the lock, start from -1 like the bkl */ + int lock_depth; + /* Comment? -Hans */ void (*end_io_handler) (struct buffer_head *, int); hashf_t s_hash_function; /* pointer to function which is used -- cgit v1.2.3 From f68c4594d769faf6e48a51b97486602bc2d9789d Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Tue, 14 Apr 2009 00:10:35 +0200 Subject: reiserfs, kill-the-BKL: fix unsafe j_flush_mutex lock Impact: fix a deadlock The j_flush_mutex is acquired safely in journal.c: if we can't take it, we free the reiserfs per superblock lock and wait a bit. But we have a remaining place in kupdate_transactions() where j_flush_mutex is still acquired traditionnaly. Thus the following scenario (warned by lockdep) can happen: A B mutex_lock(&write_lock) mutex_lock(&write_lock) mutex_lock(&j_flush_mutex) mutex_lock(&j_flush_mutex) //block mutex_unlock(&write_lock) sleep... mutex_lock(&write_lock) //deadlock Fix this by using reiserfs_mutex_lock_safe() in kupdate_transactions(). Signed-off-by: Frederic Weisbecker Cc: Alessio Igor Bogani Cc: Jeff Mahoney LKML-Reference: <1239660635-12940-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar --- fs/reiserfs/journal.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c index 64dcabd30ace..7976d7d2f1e9 100644 --- a/fs/reiserfs/journal.c +++ b/fs/reiserfs/journal.c @@ -1826,7 +1826,7 @@ static int kupdate_transactions(struct super_block *s, struct reiserfs_journal *journal = SB_JOURNAL(s); chunk.nr = 0; - mutex_lock(&journal->j_flush_mutex); + reiserfs_mutex_lock_safe(&journal->j_flush_mutex, s); if (!journal_list_still_alive(s, orig_trans_id)) { goto done; } -- cgit v1.2.3 From dd0e4071a111411e692fe346575627cb3c02c889 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Tue, 14 Apr 2009 05:34:23 +0200 Subject: kill-the-BKL/reiserfs: provide a tool to lock only once the write lock Sometimes we don't want to recursively hold the per superblock write lock because we want to be sure it is actually released when we come to sleep. This patch introduces the necessary tools for that. reiserfs_write_lock_once() does the same job than reiserfs_write_lock() except that it won't try to acquire recursively the lock if the current task already owns it. Also the lock_depth before the call of this function is returned. reiserfs_write_unlock_once() unlock only if reiserfs_write_lock_once() returned a depth equal to -1, ie: only if it actually locked. Signed-off-by: Frederic Weisbecker Cc: Alessio Igor Bogani Cc: Jeff Mahoney Cc: Alexander Beregalov Cc: Chris Mason LKML-Reference: <1239680065-25013-2-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar --- fs/reiserfs/lock.c | 26 ++++++++++++++++++++++++++ include/linux/reiserfs_fs.h | 2 ++ 2 files changed, 28 insertions(+) diff --git a/fs/reiserfs/lock.c b/fs/reiserfs/lock.c index cdd8d9ef048e..cb1bba3802dd 100644 --- a/fs/reiserfs/lock.c +++ b/fs/reiserfs/lock.c @@ -49,6 +49,32 @@ void reiserfs_write_unlock(struct super_block *s) } } +/* + * If we already own the lock, just exit and don't increase the depth. + * Useful when we don't want to lock more than once. + * + * We always return the lock_depth we had before calling + * this function. + */ +int reiserfs_write_lock_once(struct super_block *s) +{ + struct reiserfs_sb_info *sb_i = REISERFS_SB(s); + + if (sb_i->lock_owner != current) { + mutex_lock(&sb_i->lock); + sb_i->lock_owner = current; + return sb_i->lock_depth++; + } + + return sb_i->lock_depth; +} + +void reiserfs_write_unlock_once(struct super_block *s, int lock_depth) +{ + if (lock_depth == -1) + reiserfs_write_unlock(s); +} + /* * Utility function to force a BUG if it is called without the superblock * write lock held. caller is the string printed just before calling BUG() diff --git a/include/linux/reiserfs_fs.h b/include/linux/reiserfs_fs.h index f6b7b7bb64d7..6587b4ee9304 100644 --- a/include/linux/reiserfs_fs.h +++ b/include/linux/reiserfs_fs.h @@ -59,6 +59,8 @@ */ void reiserfs_write_lock(struct super_block *s); void reiserfs_write_unlock(struct super_block *s); +int reiserfs_write_lock_once(struct super_block *s); +void reiserfs_write_unlock_once(struct super_block *s, int lock_depth); struct fid; -- cgit v1.2.3 From 9668a4343305eb4e36f4a77c1795c2c35d83c36e Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Tue, 14 Apr 2009 05:34:24 +0200 Subject: kill-the-BKL/reiserfs: lock only once in reiserfs_truncate_file Impact: fix a deadlock reiserfs_truncate_file() can be called from multiple context where the write lock can be already hold or not. This function also acquire (possibly recursively) the write lock. Subsequent releases before sleeping will not actually release the lock because we may be in more than one lock depth degree. A typical case is: reiserfs_file_release { acquire_the_lock() reiserfs_truncate_file() reacquire_the_lock() journal_begin() { do_journal_begin_r() { reiserfs_wait_on_write_block() { /* * Not released because still one * depth owned */ release_lock() wait_for_event() At this stage the event never happen because the one which provides it needs the write lock. We use reiserfs_write_lock_once() here to ensure that we don't acquire the write lock recursively. Signed-off-by: Frederic Weisbecker Cc: Alessio Igor Bogani Cc: Jeff Mahoney Cc: Alexander Beregalov Cc: Chris Mason LKML-Reference: <1239680065-25013-3-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar --- fs/reiserfs/inode.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c index 88ef0b710757..153668eab469 100644 --- a/fs/reiserfs/inode.c +++ b/fs/reiserfs/inode.c @@ -2083,8 +2083,9 @@ int reiserfs_truncate_file(struct inode *inode, int update_timestamps) int error; struct buffer_head *bh = NULL; int err2; + int lock_depth; - reiserfs_write_lock(inode->i_sb); + lock_depth = reiserfs_write_lock_once(inode->i_sb); if (inode->i_size > 0) { error = grab_tail_page(inode, &page, &bh); @@ -2153,14 +2154,17 @@ int reiserfs_truncate_file(struct inode *inode, int update_timestamps) page_cache_release(page); } - reiserfs_write_unlock(inode->i_sb); + reiserfs_write_unlock_once(inode->i_sb, lock_depth); + return 0; out: if (page) { unlock_page(page); page_cache_release(page); } - reiserfs_write_unlock(inode->i_sb); + + reiserfs_write_unlock_once(inode->i_sb, lock_depth); + return error; } -- cgit v1.2.3 From ae70f4e9312bf94d9a0e5b5cef7429839cd3d9dc Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Tue, 14 Apr 2009 05:34:25 +0200 Subject: kill-the-BKL/reiserfs: only acquire the write lock once in reiserfs_dirty_inode Impact: fix a deadlock reiserfs_dirty_inode() is the super_operations::dirty_inode() callback of reiserfs. It can be called from different contexts where the write lock can be already held. But this function also grab the write lock (possibly recursively). Subsequent release of the lock before sleep will actually not release the lock if the caller of mark_inode_dirty() (which in turn calls reiserfs_dirty_inode()) already owns the lock. A typical case: reiserfs_write_end() { acquire_write_lock() mark_inode_dirty() { reiserfs_dirty_inode() { reacquire_write_lock() { journal_begin() { do_journal_begin_r() { /* * fail to release, still * one depth of lock */ release_write_lock() reiserfs_wait_on_write_block() { wait_event() The event is usually provided by something which needs the write lock but it hasn't been released. We use reiserfs_write_lock_once() here to ensure we only grab the write lock in one level. Signed-off-by: Frederic Weisbecker Cc: Frederic Weisbecker Cc: Alessio Igor Bogani Cc: Jeff Mahoney Cc: Chris Mason LKML-Reference: <1239680065-25013-4-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar --- fs/reiserfs/super.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c index bafa374d006d..b301f7ddd848 100644 --- a/fs/reiserfs/super.c +++ b/fs/reiserfs/super.c @@ -565,25 +565,28 @@ static void reiserfs_dirty_inode(struct inode *inode) struct reiserfs_transaction_handle th; int err = 0; + int lock_depth; + if (inode->i_sb->s_flags & MS_RDONLY) { reiserfs_warning(inode->i_sb, "clm-6006", "writing inode %lu on readonly FS", inode->i_ino); return; } - reiserfs_write_lock(inode->i_sb); + lock_depth = reiserfs_write_lock_once(inode->i_sb); /* this is really only used for atime updates, so they don't have ** to be included in O_SYNC or fsync */ err = journal_begin(&th, inode->i_sb, 1); - if (err) { - reiserfs_write_unlock(inode->i_sb); - return; - } + if (err) + goto out; + reiserfs_update_sd(&th, inode); journal_end(&th, inode->i_sb, 1); - reiserfs_write_unlock(inode->i_sb); + +out: + reiserfs_write_unlock_once(inode->i_sb, lock_depth); } #ifdef CONFIG_REISERFS_FS_POSIX_ACL -- cgit v1.2.3 From 8f1680b6f4d57808d691de2757bc41dc75c98209 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Thu, 30 Apr 2009 22:05:25 +0200 Subject: kill-the-BKL/reiserfs: release write lock on fs_changed() fs_changed() is a macro used by reiserfs to check whether its tree has been rebalanced. It has been designed to check parallel changes on the tree after calling a sleeping function, which released the Bkl. fs_changed() also calls cond_resched(), so that if rescheduling is needed, we are in the best place to do that, since we check if the tree has changed just after (because of the bkl release on schedule()). Even if we are not anymore using the Bkl, we still want to release the lock while we reschedule, so that other waiters for the lock can acquire it safely, because of the following __fs_changed() check. [ Impact: release the reiserfs write lock when it is not needed ] Cc: Jeff Mahoney Cc: Chris Mason Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- include/linux/reiserfs_fs.h | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/include/linux/reiserfs_fs.h b/include/linux/reiserfs_fs.h index 6587b4ee9304..397d281b5b46 100644 --- a/include/linux/reiserfs_fs.h +++ b/include/linux/reiserfs_fs.h @@ -1302,7 +1302,13 @@ static inline loff_t max_reiserfs_offset(struct inode *inode) #define get_generation(s) atomic_read (&fs_generation(s)) #define FILESYSTEM_CHANGED_TB(tb) (get_generation((tb)->tb_sb) != (tb)->fs_gen) #define __fs_changed(gen,s) (gen != get_generation (s)) -#define fs_changed(gen,s) ({cond_resched(); __fs_changed(gen, s);}) +#define fs_changed(gen,s) \ +({ \ + reiserfs_write_unlock(s); \ + cond_resched(); \ + reiserfs_write_lock(s); \ + __fs_changed(gen, s); \ +}) /***************************************************************************/ /* FIXATE NODES */ -- cgit v1.2.3 From c8565b672a1fdace7cb792867608cdd3a3528605 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Thu, 30 Apr 2009 23:04:32 +0200 Subject: kill-the-BKL/reiserfs: release the write lock before rescheduling on do_journal_end() When do_journal_end() copies data to the journal blocks buffers in memory, it reschedules if needed between each block copied and dirtyfied. We can also release the write lock at this rescheduling stage, like did the bkl implicitly. [ Impact: release the reiserfs write lock when it is not needed ] Cc: Jeff Mahoney Cc: Chris Mason Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- fs/reiserfs/journal.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c index 7976d7d2f1e9..373d0801caca 100644 --- a/fs/reiserfs/journal.c +++ b/fs/reiserfs/journal.c @@ -4232,7 +4232,9 @@ static int do_journal_end(struct reiserfs_transaction_handle *th, next = cn->next; free_cnode(sb, cn); cn = next; + reiserfs_write_unlock(sb); cond_resched(); + reiserfs_write_lock(sb); } /* we are done with both the c_bh and d_bh, but -- cgit v1.2.3 From ff826335efe2cdbcaf281d56713ccd0d0302ee79 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Thu, 30 Apr 2009 23:36:33 +0200 Subject: kill-the-BKL/reiserfs: release write lock while rescheduling on prepare_for_delete_or_cut() prepare_for_delete_or_cut() can process several types of items, including indirect items, ie: items which contain no file data but pointers to unformatted nodes scattering the datas of a file. In this case it has to zero out these pointers to block numbers of unformatted nodes and release the bitmap from these block numbers. It can take some time, so a rescheduling() is performed between each block processed. We can safely release the write lock while rescheduling(), like the bkl did, because the code checks just after if the item has moved after sleeping. [ Impact: release the reiserfs write lock when it is not needed ] Cc: Jeff Mahoney Cc: Chris Mason Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- fs/reiserfs/stree.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c index 6bd99a99a652..6ddcecb4e8ab 100644 --- a/fs/reiserfs/stree.c +++ b/fs/reiserfs/stree.c @@ -1026,7 +1026,9 @@ static char prepare_for_delete_or_cut(struct reiserfs_transaction_handle *th, st reiserfs_free_block(th, inode, block, 1); } + reiserfs_write_unlock(sb); cond_resched(); + reiserfs_write_lock(sb); if (item_moved (&s_ih, path)) { need_re_search = 1; -- cgit v1.2.3 From 7915b73fdba5fdca56590a14f18fe534e10a0d8c Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Fri, 1 May 2009 01:10:52 +0200 Subject: kill-the-BKL/reiserfs: release the write lock inside get_neighbors() get_neighbors() is used to get the left and/or right blocks against a given one in order to balance a tree. sb_bread() is used to read the buffer of these neighors blocks and while it waits for this operation, it might sleep. The bkl was released at this point, and then we can also release the write lock before calling sb_bread(). This is safe because if the filesystem is changed after this lock release, the function returns REPEAT_SEARCH (aka SCHEDULE_OCCURRED in the function header comments) in order to repeat the neighbhor research. [ Impact: release the reiserfs write lock when it is not needed ] Cc: Jeff Mahoney Cc: Chris Mason Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- fs/reiserfs/fix_node.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/fs/reiserfs/fix_node.c b/fs/reiserfs/fix_node.c index bf5f2cbdb063..3a685e3f754f 100644 --- a/fs/reiserfs/fix_node.c +++ b/fs/reiserfs/fix_node.c @@ -1971,7 +1971,9 @@ static int get_neighbors(struct tree_balance *tb, int h) tb->FL[h]) ? tb->lkey[h] : B_NR_ITEMS(tb-> FL[h]); son_number = B_N_CHILD_NUM(tb->FL[h], child_position); + reiserfs_write_unlock(sb); bh = sb_bread(sb, son_number); + reiserfs_write_lock(sb); if (!bh) return IO_ERROR; if (FILESYSTEM_CHANGED_TB(tb)) { @@ -2009,7 +2011,9 @@ static int get_neighbors(struct tree_balance *tb, int h) child_position = (bh == tb->FR[h]) ? tb->rkey[h] + 1 : 0; son_number = B_N_CHILD_NUM(tb->FR[h], child_position); + reiserfs_write_unlock(sb); bh = sb_bread(sb, son_number); + reiserfs_write_lock(sb); if (!bh) return IO_ERROR; if (FILESYSTEM_CHANGED_TB(tb)) { -- cgit v1.2.3 From d0ad9e0df4ca970516fddf65e5b5fbd4dc5a0fab Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Fri, 1 May 2009 01:44:57 +0200 Subject: kill-the-BKL/reiserfs: release the write lock inside reiserfs_read_bitmap_block() reiserfs_read_bitmap_block() uses sb_bread() to read the bitmap block. This helper might sleep. Then, when the bkl was used, it was released at this point. We can then relax the write lock too here. [ Impact: release the reiserfs write lock when it is not needed ] Cc: Jeff Mahoney Cc: Chris Mason Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- fs/reiserfs/bitmap.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/reiserfs/bitmap.c b/fs/reiserfs/bitmap.c index 147033461b87..685495707181 100644 --- a/fs/reiserfs/bitmap.c +++ b/fs/reiserfs/bitmap.c @@ -1249,7 +1249,9 @@ struct buffer_head *reiserfs_read_bitmap_block(struct super_block *sb, else if (bitmap == 0) block = (REISERFS_DISK_OFFSET_IN_BYTES >> sb->s_blocksize_bits) + 1; + reiserfs_write_unlock(sb); bh = sb_bread(sb, block); + reiserfs_write_lock(sb); if (bh == NULL) reiserfs_warning(sb, "sh-2029: %s: bitmap block (#%u) " "reading failed", __func__, block); -- cgit v1.2.3 From f6af81ff21d399947492c20a50c256f238bf3cdb Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Fri, 1 May 2009 02:27:39 +0200 Subject: kill-the-BKL/reiserfs: release the write lock on flush_commit_list() flush_commit_list() uses ll_rw_block() to commit the pending log blocks. ll_rw_block() might sleep, and the bkl was released at this point. Then we can also relax the write lock at this point. [ Impact: release the reiserfs write lock when it is not needed ] Cc: Jeff Mahoney Cc: Chris Mason Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- fs/reiserfs/journal.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c index 373d0801caca..b1ebd5af9ab7 100644 --- a/fs/reiserfs/journal.c +++ b/fs/reiserfs/journal.c @@ -1120,8 +1120,11 @@ static int flush_commit_list(struct super_block *s, SB_ONDISK_JOURNAL_SIZE(s); tbh = journal_find_get_block(s, bn); if (tbh) { - if (buffer_dirty(tbh)) - ll_rw_block(WRITE, 1, &tbh) ; + if (buffer_dirty(tbh)) { + reiserfs_write_unlock(s); + ll_rw_block(WRITE, 1, &tbh); + reiserfs_write_lock(s); + } put_bh(tbh) ; } } -- cgit v1.2.3 From 83dae927f683a16a90ddab8dd899fc9db6cf8435 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Thu, 7 May 2009 22:51:20 +0200 Subject: kill-the-BKL/reiserfs: add reiserfs_cond_resched() Usually, when we call cond_resched(), we want the write lock to be released and then reacquired once we return from scheduling. Not only does it follow the previous bkl based locking scheme, but it also let other waiters to get the lock. But if we aren't going to reschedule(), such as in !TIF_NEED_RESCHED case, it's useless to release the lock. Worse, if we release and reacquire the lock whereas it is not needed, we create useless contentions. Also if someone takes the lock while we are modifying or reading the tree, there are good chances we'll have to retry our operation, eg if the block we were seeeking has moved. So this patch introduces a helper which only unlock the write lock if we are going to schedule. [ Impact: prepare to inject less lock contention and less tree operation attempts ] Reported-by: Andi Kleen Cc: Jeff Mahoney Cc: Chris Mason Cc: Ingo Molnar Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- include/linux/reiserfs_fs.h | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/include/linux/reiserfs_fs.h b/include/linux/reiserfs_fs.h index 397d281b5b46..995bdf9f94d7 100644 --- a/include/linux/reiserfs_fs.h +++ b/include/linux/reiserfs_fs.h @@ -62,6 +62,19 @@ void reiserfs_write_unlock(struct super_block *s); int reiserfs_write_lock_once(struct super_block *s); void reiserfs_write_unlock_once(struct super_block *s, int lock_depth); +/* + * When we schedule, we usually want to also release the write lock, + * according to the previous bkl based locking scheme of reiserfs. + */ +static inline void reiserfs_cond_resched(struct super_block *s) +{ + if (need_resched()) { + reiserfs_write_unlock(s); + schedule(); + reiserfs_write_lock(s); + } +} + struct fid; /* in reading the #defines, it may help to understand that they employ -- cgit v1.2.3 From caf69d074b85601796f824c96ed7216ffc08eaa4 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Thu, 7 May 2009 23:25:29 +0200 Subject: kill-the-bkl/reiserfs: conditionaly release the write lock on fs_changed() The goal of fs_changed() is to check whether the tree changed during a schedule(). This is a BKL legacy. A recent patch added an explicit unconditional release/reacquire of the write lock around the cond_resched() called inside fs_changed. But it's wasteful to unconditionally do that, we are creating superfluous lock contention in !TIF_NEED_RESCHED case. This patch manage that by calling reiserfs_cond_resched() from fs_changed() which only releases the lock if we are going to reschedule. [ Impact: inject less lock contention and tree job retries ] Cc: Jeff Mahoney Cc: Chris Mason Cc: Ingo Molnar Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- include/linux/reiserfs_fs.h | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/include/linux/reiserfs_fs.h b/include/linux/reiserfs_fs.h index 995bdf9f94d7..39bd4ea04162 100644 --- a/include/linux/reiserfs_fs.h +++ b/include/linux/reiserfs_fs.h @@ -1317,9 +1317,7 @@ static inline loff_t max_reiserfs_offset(struct inode *inode) #define __fs_changed(gen,s) (gen != get_generation (s)) #define fs_changed(gen,s) \ ({ \ - reiserfs_write_unlock(s); \ - cond_resched(); \ - reiserfs_write_lock(s); \ + reiserfs_cond_resched(s); \ __fs_changed(gen, s); \ }) -- cgit v1.2.3 From 492d38f4fdff29d5315d49edaff54231845657df Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Thu, 7 May 2009 23:48:44 +0200 Subject: kill-the-bkl/reiserfs: lock only once on reiserfs_get_block() reiserfs_get_block() is one of these sites where the write lock might be acquired recursively. It's a particular problem because this function is called very often. It's a hot spot which needs to reschedule() periodically while converting direct items to indirect ones because it can take some time. Then if we are applying the write lock release/reacquire pattern on schedule() here, it may not produce the desired effect since we may have locked in more than one depth. The solution is to use reiserfs_write_lock_once() which won't try to reacquire the lock recursively. Then the lock will be *really* released before schedule(). Also, we only release the lock if TIF_NEED_RESCHED is set to not create wasteful numerous contentions. [ Impact: fix a too long holded lock case in reiserfs_get_block() ] Cc: Jeff Mahoney Cc: Chris Mason Cc: Ingo Molnar Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- fs/reiserfs/inode.c | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c index 153668eab469..a5f69ef788f4 100644 --- a/fs/reiserfs/inode.c +++ b/fs/reiserfs/inode.c @@ -605,6 +605,7 @@ int reiserfs_get_block(struct inode *inode, sector_t block, __le32 *item; int done; int fs_gen; + int lock_depth; struct reiserfs_transaction_handle *th = NULL; /* space reserved in transaction batch: . 3 balancings in direct->indirect conversion @@ -620,11 +621,11 @@ int reiserfs_get_block(struct inode *inode, sector_t block, loff_t new_offset = (((loff_t) block) << inode->i_sb->s_blocksize_bits) + 1; - reiserfs_write_lock(inode->i_sb); + lock_depth = reiserfs_write_lock_once(inode->i_sb); version = get_inode_item_key_version(inode); if (!file_capable(inode, block)) { - reiserfs_write_unlock(inode->i_sb); + reiserfs_write_unlock_once(inode->i_sb, lock_depth); return -EFBIG; } @@ -636,7 +637,7 @@ int reiserfs_get_block(struct inode *inode, sector_t block, /* find number of block-th logical block of the file */ ret = _get_block_create_0(inode, block, bh_result, create | GET_BLOCK_READ_DIRECT); - reiserfs_write_unlock(inode->i_sb); + reiserfs_write_unlock_once(inode->i_sb, lock_depth); return ret; } /* @@ -754,7 +755,7 @@ int reiserfs_get_block(struct inode *inode, sector_t block, if (!dangle && th) retval = reiserfs_end_persistent_transaction(th); - reiserfs_write_unlock(inode->i_sb); + reiserfs_write_unlock_once(inode->i_sb, lock_depth); /* the item was found, so new blocks were not added to the file ** there is no need to make sure the inode is updated with this @@ -1005,9 +1006,11 @@ int reiserfs_get_block(struct inode *inode, sector_t block, * long time. reschedule if needed and also release the write * lock for others. */ - reiserfs_write_unlock(inode->i_sb); - cond_resched(); - reiserfs_write_lock(inode->i_sb); + if (need_resched()) { + reiserfs_write_unlock_once(inode->i_sb, lock_depth); + schedule(); + lock_depth = reiserfs_write_lock_once(inode->i_sb); + } retval = search_for_position_by_key(inode->i_sb, &key, &path); if (retval == IO_ERROR) { @@ -1042,7 +1045,7 @@ int reiserfs_get_block(struct inode *inode, sector_t block, retval = err; } - reiserfs_write_unlock(inode->i_sb); + reiserfs_write_unlock_once(inode->i_sb, lock_depth); reiserfs_check_path(&path); return retval; } -- cgit v1.2.3 From 0b5df8a513bd31ffd33b8269d3c43cddba5374b4 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Fri, 8 May 2009 01:05:06 +0200 Subject: kill-the-bkl/reiserfs: don't hold the write recursively in reiserfs_lookup() The write lock can be acquired recursively in reiserfs_lookup(). But we may want to *really* release the lock before possible rescheduling from a reiserfs_lookup() callee. Hence we want to only acquire the lock once (ie: not recursively). [ Impact: prevent from possible false unreleased write lock on sleeping ] Cc: Jeff Mahoney Cc: Chris Mason Cc: Ingo Molnar Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- fs/reiserfs/namei.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/fs/reiserfs/namei.c b/fs/reiserfs/namei.c index 271579128634..b3973c9f0bf1 100644 --- a/fs/reiserfs/namei.c +++ b/fs/reiserfs/namei.c @@ -324,6 +324,7 @@ static struct dentry *reiserfs_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd) { int retval; + int lock_depth; struct inode *inode = NULL; struct reiserfs_dir_entry de; INITIALIZE_PATH(path_to_entry); @@ -331,7 +332,13 @@ static struct dentry *reiserfs_lookup(struct inode *dir, struct dentry *dentry, if (REISERFS_MAX_NAME(dir->i_sb->s_blocksize) < dentry->d_name.len) return ERR_PTR(-ENAMETOOLONG); - reiserfs_write_lock(dir->i_sb); + /* + * Might be called with or without the write lock, must be careful + * to not recursively hold it in case we want to release the lock + * before rescheduling. + */ + lock_depth = reiserfs_write_lock_once(dir->i_sb); + de.de_gen_number_bit_string = NULL; retval = reiserfs_find_entry(dir, dentry->d_name.name, dentry->d_name.len, @@ -341,7 +348,7 @@ static struct dentry *reiserfs_lookup(struct inode *dir, struct dentry *dentry, inode = reiserfs_iget(dir->i_sb, (struct cpu_key *)&(de.de_dir_id)); if (!inode || IS_ERR(inode)) { - reiserfs_write_unlock(dir->i_sb); + reiserfs_write_unlock_once(dir->i_sb, lock_depth); return ERR_PTR(-EACCES); } @@ -350,7 +357,7 @@ static struct dentry *reiserfs_lookup(struct inode *dir, struct dentry *dentry, if (IS_PRIVATE(dir)) inode->i_flags |= S_PRIVATE; } - reiserfs_write_unlock(dir->i_sb); + reiserfs_write_unlock_once(dir->i_sb, lock_depth); if (retval == IO_ERROR) { return ERR_PTR(-EIO); } -- cgit v1.2.3 From 37c876077d74a91f4270f4ae4e66c1181b29fcb0 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Fri, 8 May 2009 14:21:33 +0200 Subject: kill-the-bkl/reiserfs: reduce number of contentions in search_by_key() search_by_key() is a central function in reiserfs which searches the patch in the fs tree from the root to a node given its key. It is the function that is most requesting the write lock because it's a path very often used. Also we forget to release the lock while reading the next tree node, making us holding the lock in a wasteful way. Then we release the lock while reading the current node and its childs, all-in-one. It should be safe because we have a reference to these blocks and even if we read a block that will be concurrently changed, we have an fs_changed check later that will make us retry the path from the root. [ Impact: release the write lock while unused in a hot path ] Cc: Jeff Mahoney Cc: Chris Mason Cc: Ingo Molnar Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- fs/reiserfs/stree.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c index 6ddcecb4e8ab..960c9114f6d3 100644 --- a/fs/reiserfs/stree.c +++ b/fs/reiserfs/stree.c @@ -529,6 +529,14 @@ static void search_by_key_reada(struct super_block *s, for (i = 0; i < num; i++) { bh[i] = sb_getblk(s, b[i]); } + /* + * We are going to read some blocks on which we + * have a reference. It's safe, though we might be + * reading blocks concurrently changed if we release + * the lock. But it's still fine because we check later + * if the tree changed + */ + reiserfs_write_unlock(s); for (j = 0; j < i; j++) { /* * note, this needs attention if we are getting rid of the BKL @@ -626,10 +634,12 @@ int search_by_key(struct super_block *sb, const struct cpu_key *key, /* Key to s if ((bh = last_element->pe_buffer = sb_getblk(sb, block_number))) { if (!buffer_uptodate(bh) && reada_count > 1) + /* will unlock the write lock */ search_by_key_reada(sb, reada_bh, reada_blocks, reada_count); + else + reiserfs_write_unlock(sb); ll_rw_block(READ, 1, &bh); - reiserfs_write_unlock(sb); wait_on_buffer(bh); reiserfs_write_lock(sb); if (!buffer_uptodate(bh)) -- cgit v1.2.3 From 3982f3c66f33e5163c038600e90a6fb0129d3754 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Fri, 8 May 2009 14:53:52 +0200 Subject: kill-the-bkl/reiserfs: factorize the locking in reiserfs_write_end() reiserfs_write_end() is a hot path in reiserfs. We have two wasteful write lock lock/release inside that can be gathered without changing the code logic. This patch factorizes them out in a single protected section, reducing the number of contentions inside. [ Impact: reduce lock contention in a reiserfs hotpath ] Cc: Jeff Mahoney Cc: Chris Mason Cc: Ingo Molnar Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- fs/reiserfs/inode.c | 25 +++++++++++++++---------- 1 file changed, 15 insertions(+), 10 deletions(-) diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c index a5f69ef788f4..cd651767a7ea 100644 --- a/fs/reiserfs/inode.c +++ b/fs/reiserfs/inode.c @@ -2685,6 +2685,8 @@ static int reiserfs_write_end(struct file *file, struct address_space *mapping, int update_sd = 0; struct reiserfs_transaction_handle *th; unsigned start; + int lock_depth = 0; + bool locked = false; if ((unsigned long)fsdata & AOP_FLAG_CONT_EXPAND) pos ++; @@ -2711,9 +2713,11 @@ static int reiserfs_write_end(struct file *file, struct address_space *mapping, ** to do the i_size updates here. */ pos += copied; + if (pos > inode->i_size) { struct reiserfs_transaction_handle myth; - reiserfs_write_lock(inode->i_sb); + lock_depth = reiserfs_write_lock_once(inode->i_sb); + locked = true; /* If the file have grown beyond the border where it can have a tail, unmark it as needing a tail packing */ @@ -2724,10 +2728,9 @@ static int reiserfs_write_end(struct file *file, struct address_space *mapping, REISERFS_I(inode)->i_flags &= ~i_pack_on_close_mask; ret = journal_begin(&myth, inode->i_sb, 1); - if (ret) { - reiserfs_write_unlock(inode->i_sb); + if (ret) goto journal_error; - } + reiserfs_update_inode_transaction(inode); inode->i_size = pos; /* @@ -2739,34 +2742,36 @@ static int reiserfs_write_end(struct file *file, struct address_space *mapping, reiserfs_update_sd(&myth, inode); update_sd = 1; ret = journal_end(&myth, inode->i_sb, 1); - reiserfs_write_unlock(inode->i_sb); if (ret) goto journal_error; } if (th) { - reiserfs_write_lock(inode->i_sb); + if (!locked) { + lock_depth = reiserfs_write_lock_once(inode->i_sb); + locked = true; + } if (!update_sd) mark_inode_dirty(inode); ret = reiserfs_end_persistent_transaction(th); - reiserfs_write_unlock(inode->i_sb); if (ret) goto out; } out: + if (locked) + reiserfs_write_unlock_once(inode->i_sb, lock_depth); unlock_page(page); page_cache_release(page); return ret == 0 ? copied : ret; journal_error: + reiserfs_write_unlock_once(inode->i_sb, lock_depth); + locked = false; if (th) { - reiserfs_write_lock(inode->i_sb); if (!update_sd) reiserfs_update_sd(th, inode); ret = reiserfs_end_persistent_transaction(th); - reiserfs_write_unlock(inode->i_sb); } - goto out; } -- cgit v1.2.3 From 007bdbd0e3e82e70f02d1c76b4ea094d5693f516 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Fri, 8 May 2009 20:01:09 +0200 Subject: kill-the-bkl/reiserfs: use mutex_lock in reiserfs_mutex_lock_safe reiserfs_mutex_lock_safe() is a hack to avoid any dependency between an internal reiserfs mutex and the write lock, it has been proposed to follow the old bkl logic. The code does the following: while (!mutex_trylock(m)) { reiserfs_write_unlock(s); schedule(); reiserfs_write_lock(s); } It then imitate the implicit behaviour of the lock when it was a Bkl and hadn't such dependency: mutex_lock(m) { if (fastpath) let's go else { wait_for_mutex() { schedule() { unlock_kernel() reacquire_lock_kernel() } } } } The problem is that by using such explicit schedule(), we don't benefit of the adaptive mutex spinning on owner. The logic in use now is: reiserfs_write_unlock(s); mutex_lock(m); // -> possible adaptive spinning reiserfs_write_lock(s); [ Impact: restore the use of adaptive spinning mutexes in reiserfs ] Cc: Jeff Mahoney Cc: Chris Mason Cc: Ingo Molnar Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- fs/reiserfs/journal.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c index b1ebd5af9ab7..3c3e00ddc143 100644 --- a/fs/reiserfs/journal.c +++ b/fs/reiserfs/journal.c @@ -566,11 +566,9 @@ static inline void insert_journal_hash(struct reiserfs_journal_cnode **table, static inline void reiserfs_mutex_lock_safe(struct mutex *m, struct super_block *s) { - while (!mutex_trylock(m)) { - reiserfs_write_unlock(s); - schedule(); - reiserfs_write_lock(s); - } + reiserfs_write_unlock(s); + mutex_lock(m); + reiserfs_write_lock(s); } /* lock the current transaction */ -- cgit v1.2.3 From d1551b3df3d901f66033b9ea5e48c1f8fbd798b1 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Thu, 14 May 2009 02:56:39 +0200 Subject: kill-the-bkl/reiserfs: unlock only when needed in search_by_key search_by_key() is the site which most requires the lock. This is mostly because it is a very central function and also because it releases/reaqcuires the write lock at least once each time it is called. Such release/reacquire creates a lot of contention in this place and also opens more the window which let another thread changing the tree. When it happens, the current path searching over the tree must be retried from the beggining (the root) which is a wasteful and time consuming recovery. This patch factorizes two release/reacquire sequences: - reading leaf nodes blocks - reading current block The latter immediately follows the former. The whole sequence is safe as a single unlocked section because we check just after if the tree has changed during these operations. Cc: Jeff Mahoney Cc: Chris Mason Cc: Ingo Molnar Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- fs/reiserfs/stree.c | 42 ++++++++++++++++++++++++++++++++++-------- 1 file changed, 34 insertions(+), 8 deletions(-) diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c index 960c9114f6d3..6b025a42d510 100644 --- a/fs/reiserfs/stree.c +++ b/fs/reiserfs/stree.c @@ -519,12 +519,22 @@ static int is_tree_node(struct buffer_head *bh, int level) #define SEARCH_BY_KEY_READA 16 -/* The function is NOT SCHEDULE-SAFE! */ -static void search_by_key_reada(struct super_block *s, +/* + * The function is NOT SCHEDULE-SAFE! + * It might unlock the write lock if we needed to wait for a block + * to be read. Note that in this case it won't recover the lock to avoid + * high contention resulting from too much lock requests, especially + * the caller (search_by_key) will perform other schedule-unsafe + * operations just after calling this function. + * + * @return true if we have unlocked + */ +static bool search_by_key_reada(struct super_block *s, struct buffer_head **bh, b_blocknr_t *b, int num) { int i, j; + bool unlocked = false; for (i = 0; i < num; i++) { bh[i] = sb_getblk(s, b[i]); @@ -536,16 +546,21 @@ static void search_by_key_reada(struct super_block *s, * the lock. But it's still fine because we check later * if the tree changed */ - reiserfs_write_unlock(s); for (j = 0; j < i; j++) { /* * note, this needs attention if we are getting rid of the BKL * you have to make sure the prepared bit isn't set on this buffer */ - if (!buffer_uptodate(bh[j])) + if (!buffer_uptodate(bh[j])) { + if (!unlocked) { + reiserfs_write_unlock(s); + unlocked = true; + } ll_rw_block(READA, 1, bh + j); + } brelse(bh[j]); } + return unlocked; } /************************************************************************** @@ -633,15 +648,26 @@ int search_by_key(struct super_block *sb, const struct cpu_key *key, /* Key to s have a pointer to it. */ if ((bh = last_element->pe_buffer = sb_getblk(sb, block_number))) { + bool unlocked = false; + if (!buffer_uptodate(bh) && reada_count > 1) - /* will unlock the write lock */ - search_by_key_reada(sb, reada_bh, + /* may unlock the write lock */ + unlocked = search_by_key_reada(sb, reada_bh, reada_blocks, reada_count); - else + /* + * If we haven't already unlocked the write lock, + * then we need to do that here before reading + * the current block + */ + if (!buffer_uptodate(bh) && !unlocked) { reiserfs_write_unlock(sb); + unlocked = true; + } ll_rw_block(READ, 1, &bh); wait_on_buffer(bh); - reiserfs_write_lock(sb); + + if (unlocked) + reiserfs_write_lock(sb); if (!buffer_uptodate(bh)) goto io_error; } else { -- cgit v1.2.3 From a24cceed64a84ef83b2844952e1e328ea905472a Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Sat, 16 May 2009 18:12:08 +0200 Subject: kill-the-bkl/reiserfs: acquire the inode mutex safely While searching a pathname, an inode mutex can be acquired in do_lookup() which calls reiserfs_lookup() which in turn acquires the write lock. On the other side reiserfs_fill_super() can acquire the write_lock and then call reiserfs_lookup_privroot() which can acquire an inode mutex (the root of the mount point). So we theoretically risk an AB - BA lock inversion that could lead to a deadlock. As for other lock dependencies found since the bkl to mutex conversion, the fix is to use reiserfs_mutex_lock_safe() which drops the lock dependency to the write lock. [ Impact: fix a possible deadlock with reiserfs ] Cc: Jeff Mahoney Cc: Chris Mason Cc: Ingo Molnar Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- fs/reiserfs/journal.c | 34 ---------------------------------- fs/reiserfs/xattr.c | 4 ++-- include/linux/reiserfs_fs.h | 35 +++++++++++++++++++++++++++++++++++ 3 files changed, 37 insertions(+), 36 deletions(-) diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c index 3c3e00ddc143..86c1ff41476f 100644 --- a/fs/reiserfs/journal.c +++ b/fs/reiserfs/journal.c @@ -537,40 +537,6 @@ static inline void insert_journal_hash(struct reiserfs_journal_cnode **table, journal_hash(table, cn->sb, cn->blocknr) = cn; } -/* - * Several mutexes depend on the write lock. - * However sometimes we want to relax the write lock while we hold - * these mutexes, according to the release/reacquire on schedule() - * properties of the Bkl that were used. - * Reiserfs performances and locking were based on this scheme. - * Now that the write lock is a mutex and not the bkl anymore, doing so - * may result in a deadlock: - * - * A acquire write_lock - * A acquire j_commit_mutex - * A release write_lock and wait for something - * B acquire write_lock - * B can't acquire j_commit_mutex and sleep - * A can't acquire write lock anymore - * deadlock - * - * What we do here is avoiding such deadlock by playing the same game - * than the Bkl: if we can't acquire a mutex that depends on the write lock, - * we release the write lock, wait a bit and then retry. - * - * The mutexes concerned by this hack are: - * - The commit mutex of a journal list - * - The flush mutex - * - The journal lock - */ -static inline void reiserfs_mutex_lock_safe(struct mutex *m, - struct super_block *s) -{ - reiserfs_write_unlock(s); - mutex_lock(m); - reiserfs_write_lock(s); -} - /* lock the current transaction */ static inline void lock_journal(struct super_block *sb) { diff --git a/fs/reiserfs/xattr.c b/fs/reiserfs/xattr.c index 8e7deb0e6964..ee3ddd9e5315 100644 --- a/fs/reiserfs/xattr.c +++ b/fs/reiserfs/xattr.c @@ -976,7 +976,7 @@ int reiserfs_lookup_privroot(struct super_block *s) int err = 0; /* If we don't have the privroot located yet - go find it */ - mutex_lock(&s->s_root->d_inode->i_mutex); + reiserfs_mutex_lock_safe(&s->s_root->d_inode->i_mutex, s); dentry = lookup_one_len(PRIVROOT_NAME, s->s_root, strlen(PRIVROOT_NAME)); if (!IS_ERR(dentry)) { @@ -1011,7 +1011,7 @@ int reiserfs_xattr_init(struct super_block *s, int mount_flags) if (privroot->d_inode) { s->s_xattr = reiserfs_xattr_handlers; - mutex_lock(&privroot->d_inode->i_mutex); + reiserfs_mutex_lock_safe(&privroot->d_inode->i_mutex, s); if (!REISERFS_SB(s)->xattr_root) { struct dentry *dentry; dentry = lookup_one_len(XAROOT_NAME, privroot, diff --git a/include/linux/reiserfs_fs.h b/include/linux/reiserfs_fs.h index 39bd4ea04162..7665f41580dd 100644 --- a/include/linux/reiserfs_fs.h +++ b/include/linux/reiserfs_fs.h @@ -62,6 +62,41 @@ void reiserfs_write_unlock(struct super_block *s); int reiserfs_write_lock_once(struct super_block *s); void reiserfs_write_unlock_once(struct super_block *s, int lock_depth); +/* + * Several mutexes depend on the write lock. + * However sometimes we want to relax the write lock while we hold + * these mutexes, according to the release/reacquire on schedule() + * properties of the Bkl that were used. + * Reiserfs performances and locking were based on this scheme. + * Now that the write lock is a mutex and not the bkl anymore, doing so + * may result in a deadlock: + * + * A acquire write_lock + * A acquire j_commit_mutex + * A release write_lock and wait for something + * B acquire write_lock + * B can't acquire j_commit_mutex and sleep + * A can't acquire write lock anymore + * deadlock + * + * What we do here is avoiding such deadlock by playing the same game + * than the Bkl: if we can't acquire a mutex that depends on the write lock, + * we release the write lock, wait a bit and then retry. + * + * The mutexes concerned by this hack are: + * - The commit mutex of a journal list + * - The flush mutex + * - The journal lock + * - The inode mutex + */ +static inline void reiserfs_mutex_lock_safe(struct mutex *m, + struct super_block *s) +{ + reiserfs_write_unlock(s); + mutex_lock(m); + reiserfs_write_lock(s); +} + /* * When we schedule, we usually want to also release the write lock, * according to the previous bkl based locking scheme of reiserfs. -- cgit v1.2.3 From cfcb7d059be7c2441b2c3ac56e0015631396ead6 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Sat, 16 May 2009 19:10:38 +0200 Subject: kill-the-bkl/reiserfs: move the concurrent tree accesses checks per superblock When do_balance() balances the tree, a trick is performed to provide the ability for other tree writers/readers to check whether do_balance() is executing concurrently (requires CONFIG_REISERFS_CHECK). This is done to protect concurrent accesses to the tree. The trick is the following: When do_balance is called, a unique global variable called cur_tb takes a pointer to the current tree to be rebalanced. Once do_balance finishes its work, cur_tb takes the NULL value. Then, concurrent tree readers/writers just have to check the value of cur_tb to ensure do_balance isn't executing concurrently. If it is, then it proves that schedule() occured on do_balance(), which then relaxed the bkl that protected the tree. Now that the bkl has be turned into a mutex, this check is still fine even though do_balance() becomes preemptible: the write lock will not be automatically released on schedule(), so the tree is still protected. But this is only fine if we have a single reiserfs mountpoint. Indeed, because the bkl is a global lock, it didn't allowed concurrent executions between a tree reader/writer in a mount point and a do_balance() on another tree from another mountpoint. So assuming all these readers/writers weren't supposed to be reentrant, the current check now sometimes detect false positives with the current per-superblock mutex which allows this reentrancy. This patch keeps the concurrent tree accesses check but moves it per superblock, so that only trees from a same mount point are checked to be not accessed concurrently. [ Impact: fix spurious panic while running several reiserfs mount-points ] Cc: Jeff Mahoney Cc: Chris Mason Cc: Ingo Molnar Cc: Alexander Beregalov Signed-off-by: Frederic Weisbecker --- fs/reiserfs/do_balan.c | 17 +++++------------ fs/reiserfs/fix_node.c | 5 +---- fs/reiserfs/prints.c | 4 ---- fs/reiserfs/stree.c | 5 +---- include/linux/reiserfs_fs_sb.h | 11 +++++++++++ 5 files changed, 18 insertions(+), 24 deletions(-) diff --git a/fs/reiserfs/do_balan.c b/fs/reiserfs/do_balan.c index 4beb964a2a3e..2787349662fc 100644 --- a/fs/reiserfs/do_balan.c +++ b/fs/reiserfs/do_balan.c @@ -21,14 +21,6 @@ #include #include -#ifdef CONFIG_REISERFS_CHECK - -struct tree_balance *cur_tb = NULL; /* detects whether more than one - copy of tb exists as a means - of checking whether schedule - is interrupting do_balance */ -#endif - static inline void buffer_info_init_left(struct tree_balance *tb, struct buffer_info *bi) { @@ -1841,11 +1833,12 @@ static int check_before_balancing(struct tree_balance *tb) { int retval = 0; - if (cur_tb) { + if (REISERFS_SB(tb->tb_sb)->cur_tb) { reiserfs_panic(tb->tb_sb, "vs-12335", "suspect that schedule " "occurred based on cur_tb not being null at " "this point in code. do_balance cannot properly " - "handle schedule occurring while it runs."); + "handle concurrent tree accesses on a same " + "mount point."); } /* double check that buffers that we will modify are unlocked. (fix_nodes should already have @@ -1987,7 +1980,7 @@ static inline void do_balance_starts(struct tree_balance *tb) "check");*/ RFALSE(check_before_balancing(tb), "PAP-12340: locked buffers in TB"); #ifdef CONFIG_REISERFS_CHECK - cur_tb = tb; + REISERFS_SB(tb->tb_sb)->cur_tb = tb; #endif } @@ -1997,7 +1990,7 @@ static inline void do_balance_completed(struct tree_balance *tb) #ifdef CONFIG_REISERFS_CHECK check_leaf_level(tb); check_internal_levels(tb); - cur_tb = NULL; + REISERFS_SB(tb->tb_sb)->cur_tb = NULL; #endif /* reiserfs_free_block is no longer schedule safe. So, we need to diff --git a/fs/reiserfs/fix_node.c b/fs/reiserfs/fix_node.c index 3a685e3f754f..d2f31330dcae 100644 --- a/fs/reiserfs/fix_node.c +++ b/fs/reiserfs/fix_node.c @@ -563,9 +563,6 @@ static int get_num_ver(int mode, struct tree_balance *tb, int h, return needed_nodes; } -#ifdef CONFIG_REISERFS_CHECK -extern struct tree_balance *cur_tb; -#endif /* Set parameters for balancing. * Performs write of results of analysis of balancing into structure tb, @@ -2368,7 +2365,7 @@ int fix_nodes(int op_mode, struct tree_balance *tb, return REPEAT_SEARCH; } #ifdef CONFIG_REISERFS_CHECK - if (cur_tb) { + if (REISERFS_SB(tb->tb_sb)->cur_tb) { print_cur_tb("fix_nodes"); reiserfs_panic(tb->tb_sb, "PAP-8305", "there is pending do_balance"); diff --git a/fs/reiserfs/prints.c b/fs/reiserfs/prints.c index 536eacaeb710..adbc6f538515 100644 --- a/fs/reiserfs/prints.c +++ b/fs/reiserfs/prints.c @@ -349,10 +349,6 @@ void reiserfs_debug(struct super_block *s, int level, const char *fmt, ...) . */ -#ifdef CONFIG_REISERFS_CHECK -extern struct tree_balance *cur_tb; -#endif - void __reiserfs_panic(struct super_block *sb, const char *id, const char *function, const char *fmt, ...) { diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c index 6b025a42d510..5fa7118f04e1 100644 --- a/fs/reiserfs/stree.c +++ b/fs/reiserfs/stree.c @@ -222,9 +222,6 @@ static inline int bin_search(const void *key, /* Key to search for. */ return ITEM_NOT_FOUND; } -#ifdef CONFIG_REISERFS_CHECK -extern struct tree_balance *cur_tb; -#endif /* Minimal possible key. It is never in the tree. */ const struct reiserfs_key MIN_KEY = { 0, 0, {{0, 0},} }; @@ -711,7 +708,7 @@ int search_by_key(struct super_block *sb, const struct cpu_key *key, /* Key to s !key_in_buffer(search_path, key, sb), "PAP-5130: key is not in the buffer"); #ifdef CONFIG_REISERFS_CHECK - if (cur_tb) { + if (REISERFS_SB(sb)->cur_tb) { print_cur_tb("5140"); reiserfs_panic(sb, "PAP-5140", "schedule occurred in do_balance!"); diff --git a/include/linux/reiserfs_fs_sb.h b/include/linux/reiserfs_fs_sb.h index 8a1d409eb12e..2f8bb04e6430 100644 --- a/include/linux/reiserfs_fs_sb.h +++ b/include/linux/reiserfs_fs_sb.h @@ -417,6 +417,17 @@ struct reiserfs_sb_info { char *s_qf_names[MAXQUOTAS]; int s_jquota_fmt; #endif +#ifdef CONFIG_REISERFS_CHECK + + struct tree_balance *cur_tb; /* + * Detects whether more than one + * copy of tb exists per superblock + * as a means of checking whether + * do_balance is executing concurrently + * against another tree reader/writer + * on a same mount point. + */ +#endif }; /* Definitions of reiserfs on-disk properties: */ -- cgit v1.2.3