summaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)Author
2008-08-18ext4: Fix small file fragmentationAneesh Kumar K.V
For small file block allocations, mballoc uses per cpu prealloc space. Use goal block when searching for the right prealloc space. Also make sure ext4_da_writepages tries to write all the pages for small files in single attempt Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-19ext4: Initialize writeback_index to 0 when allocating a new inodeAneesh Kumar K.V
The write_cache_pages() function uses the mapping->writeback_index as the starting index to write out when range_cyclic is set. Properly initialize writeback_index so that we start the writeout at index 0. This was found when debugging the small file fragmentation on ext4. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-19ext4: make sure ext4_has_free_blocks returns 0 for ENOSPCAneesh Kumar K.V
Fix ext4_has_free_blocks() to return 0 when we don't have enough space. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-19ext4: journal credit fix for the delayed allocation's writepages() functionMingming Cao
Previous delalloc writepages implementation started a new transaction outside of a loop which called get_block() to do the block allocation. Since we didn't know exactly how many blocks would need to be allocated, the estimated journal credits required was very conservative and caused many issues. With the reworked delayed allocation, a new transaction is created for each get_block(), thus we don't need to guess how many credits for the multiple chunk of allocation. We start every transaction with enough credits for inserting a single exent. When estimate the credits for indirect blocks to allocate a chunk of blocks, we need to know the number of data blocks to allocate. We use the total number of reserved delalloc datablocks; if that is too big, for non-extent files, we need to limit the number of blocks to EXT4_MAX_TRANS_BLOCKS. Code cleanup from Aneesh. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-19ext4: Rework the ext4_da_writepages() functionAneesh Kumar K.V
With the below changes we reserve credit needed to insert only one extent resulting from a call to single get_block. This makes sure we don't take too much journal credits during writeout. We also don't limit the pages to write. That means we loop through the dirty pages building largest possible contiguous block request. Then we issue a single get_block request. We may get less block that we requested. If so we would end up not mapping some of the buffer_heads. That means those buffer_heads are still marked delay. Later in the writepage callback via __mpage_writepage we redirty those pages. We should also not limit/throttle wbc->nr_to_write in the filesystem writepages callback. That cause wrong behaviour in generic_sync_sb_inodes caused by wbc->nr_to_write being <= 0 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-19ext4: journal credits reservation fixes for DIO, fallocateMingming Cao
DIO and fallocate credit calculation is different than writepage, as they do start a new journal right for each call to ext4_get_blocks_wrap(). This patch uses the helper function in DIO and fallocate case, passing a flag indicating that the modified data are contigous thus could account less indirect/index blocks. This patch also fixed the journal credit reservation for direct I/O (DIO). Previously the estimated credits for DIO only was calculated for non-extent files, which was not enough if the file is extent-based. Also fixed was fallocate double-counting credits for modifying the the superblock. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-19ext4: journal credits reservation fixes for extent file writepageMingming Cao
This patch modified the writepage/write_begin credit calculation for extent files, to use the credits caculation helper function. The current calculation of how many index/leaf blocks should be accounted is too conservetive, it always considered the worse case, where the tree level is 5, and in the case of multiple chunk allocations, it always assumed no blocks were dirtied in common across the allocations. This path uses the accurate depth of the inode with some extras to calculate the index blocks, and also less conservative in the case of multiple allocation accounting. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-19ext4: journal credits calulation cleanup and fix for non-extent writepageMingming Cao
When considering how many journal credits are needed for modifying a chunk of data, we need to account for the super block, inode block, quota blocks and xattr block, indirect/index blocks, also, group bitmap and group descriptor blocks for new allocation (including data and indirect/index blocks). There are many places in ext4 do the calculation on their own and often missed one or two meta blocks, and often they assume single block allocation, and did not considering the multile chunk of allocation case. This patch is trying to cleanup current journal credit code, provides some common helper funtion to calculate the journal credits, to be used for writepage, writepages, DIO, fallocate, migration, defrag, and for both nonextent and extent files. This patch modified the writepage/write_begin credit caculation for nonextent files, to use the new helper function. It also fixed the problem that writepage on nonextent files did not consider the case blocksize <pagesize, thus could possibelly need multiple block allocation in a single transaction. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-19ext4: Fix bug where we return ENOSPC even though we have plenty of inodesEric Sandeen
The find_group_flex() function starts with best_flex as the parent_fbg_group, which happens to have 0 inodes free. Some of the flex groups searched have free blocks and free inodes, but the flex_freeb_ratio is < 10, so they're skipped. Then when a group is compared to the current "best" flex group, it does not have more free blocks than "best", so it is skipped as well. This continues until no flex group with free inodes is found which has a proper ratio or which has more free blocks than the "best" group, and we're left with a "best" group that has 0 inodes free, and we return -ENOSPC. We fix this by changing the logic so that if the current "best" flex group has no inodes free, and the current one does have room, it is promoted to the next "best." Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-19ext4: don't try to resize if there are no reserved gdt blocks leftJosef Bacik
When trying to resize an ext4 fs and you run out of reserved gdt blocks, you get an error that doesn't actually tell you what went wrong, it just says that the gdb it picked is not correct, which is the case since you don't have any reserved gdt blocks left. This patch adds a check to make sure you have reserved gdt blocks to use, and if not prints out a more relevant error. Signed-off-by: Josef Bacik <jbacik@redhat.com> Cc: <linux-ext4@vger.kernel.org> Cc: Andreas Dilger <adilger@sun.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-16ext4: Use ext4_discard_reservations instead of mballoc-specific callTheodore Ts'o
In ext4_ext_truncate(), we should use the more generic ext4_discard_reservations() call so we do the right thing when the filesystem is mounted with the nomballoc option. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2008-08-19ext4: Fix ext4_dx_readdir hash collision handlingTheodore Ts'o
This fixes a bug where readdir() would return a directory entry twice if there was a hash collision in an hash tree indexed directory. Signed-off-by: Eugene Dashevsky <eugene@ibrix.com> Signed-off-by: Mike Snitzer <msnitzer@ibrix.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-19ext4: Fix delalloc release block reservation for truncateMingming Cao
Ext4 will release the reserved blocks for delayed allocations when inode is truncated/unlinked. If there is no reserved block at all, we shouldn't need to do so. But current code still tries to release the reserved blocks regardless whether the counters's value is 0. Continue to do that causes the later calculation to go wrong and a kernel BUG_ON() caught that. This doesn't happen for extent-based files, as the calculation for 0 reserved blocks was right for extent based file. This patch fixed the kernel BUG() due to above reason. It adds checks for 0 to avoid unnecessary release and fix calculation for non-extent files. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-13ext4: Fix potential truncate BUG due to i_prealloc_list being non-emptyTheodore Ts'o
We need to call ext4_discard_reservation() earlier in ext4_truncate(), to avoid a BUG() in ext4_mb_return_to_preallocation(), which is called (ultimately) by ext4_free_blocks(). So we must ditch the blocks on i_prealloc_list before we start freeing the data blocks. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-19ext4: Handle unwritten extent properly with delayed allocationAneesh Kumar K.V
When using fallocate the buffer_heads are marked unwritten and unmapped. We need to map them in the writepages after a get_block. Otherwise we split the uninit extents, but never write the content to disk. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-03Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: remove write-only variables from ext4_ordered_write_end ext4: unexport jbd2_journal_update_superblock ext4: Cleanup whitespace and other miscellaneous style issues ext4: improve ext4_fill_flex_info() a bit ext4: Cleanup the block reservation code path ext4: don't assume extents can't cross block groups when truncating ext4: Fix lack of credits BUG() when deleting a badly fragmented inode ext4: Fix ext4_ext_journal_restart() ext4: fix ext4_da_write_begin error path jbd2: don't abort if flushing file data failed ext4: don't read inode block if the buffer has a write error ext4: Don't allow lg prealloc list to be grow large. ext4: Convert the usage of NR_CPUS to nr_cpu_ids. ext4: Improve error handling in mballoc ext4: lock block groups when initializing ext4: sync up block and inode bitmap reading functions ext4: Allow read/only mounts with corrupted block group checksums ext4: Fix data corruption when writing to prealloc area
2008-08-02ext4: remove write-only variables from ext4_ordered_write_endEric Sandeen
The variables 'from' and 'to' are not used anywhere. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-01[PATCH] fix races and leaks in vfs_quota_on() usersAl Viro
* new helper: vfs_quota_on_path(); equivalent of vfs_quota_on() sans the pathname resolution. * callers of vfs_quota_on() that do their own pathname resolution and checks based on it are switched to vfs_quota_on_path(); that way we avoid the races. * reiserfs leaked dentry/vfsmount references on several failure exits. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-07-28vfs: pagecache usage optimization for pagesize!=blocksizeHisashi Hifumi
When we read some part of a file through pagecache, if there is a pagecache of corresponding index but this page is not uptodate, read IO is issued and this page will be uptodate. I think this is good for pagesize == blocksize environment but there is room for improvement on pagesize != blocksize environment. Because in this case a page can have multiple buffers and even if a page is not uptodate, some buffers can be uptodate. So I suggest that when all buffers which correspond to a part of a file that we want to read are uptodate, use this pagecache and copy data from this pagecache to user buffer even if a page is not uptodate. This can reduce read IO and improve system throughput. I wrote a benchmark program and got result number with this program. This benchmark do: 1: mount and open a test file. 2: create a 512MB file. 3: close a file and umount. 4: mount and again open a test file. 5: pwrite randomly 300000 times on a test file. offset is aligned by IO size(1024bytes). 6: measure time of preading randomly 100000 times on a test file. The result was: 2.6.26 330 sec 2.6.26-patched 226 sec Arch:i386 Filesystem:ext3 Blocksize:1024 bytes Memory: 1GB On ext3/4, a file is written through buffer/block. So random read/write mixed workloads or random read after random write workloads are optimized with this patch under pagesize != blocksize environment. This test result showed this. The benchmark program is as follows: #include <stdio.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <time.h> #include <stdlib.h> #include <string.h> #include <sys/mount.h> #define LEN 1024 #define LOOP 1024*512 /* 512MB */ main(void) { unsigned long i, offset, filesize; int fd; char buf[LEN]; time_t t1, t2; if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) { perror("cannot mount\n"); exit(1); } memset(buf, 0, LEN); fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC); if (fd < 0) { perror("cannot open file\n"); exit(1); } for (i = 0; i < LOOP; i++) write(fd, buf, LEN); close(fd); if (umount("/root/test1/") < 0) { perror("cannot umount\n"); exit(1); } if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) { perror("cannot mount\n"); exit(1); } fd = open("/root/test1/testfile", O_RDWR); if (fd < 0) { perror("cannot open file\n"); exit(1); } filesize = LEN * LOOP; for (i = 0; i < 300000; i++){ offset = (random() % filesize) & (~(LEN - 1)); pwrite(fd, buf, LEN, offset); } printf("start test\n"); time(&t1); for (i = 0; i < 100000; i++){ offset = (random() % filesize) & (~(LEN - 1)); pread(fd, buf, LEN, offset); } time(&t2); printf("%ld sec\n", t2-t1); close(fd); if (umount("/root/test1/") < 0) { perror("cannot umount\n"); exit(1); } } Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jan Kara <jack@ucw.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26[PATCH] sanitize ->permission() prototypeAl Viro
* kill nameidata * argument; map the 3 bits in ->flags anybody cares about to new MAY_... ones and pass with the mask. * kill redundant gfs2_iop_permission() * sanitize ecryptfs_permission() * fix remaining places where ->permission() instances might barf on new MAY_... found in mask. The obvious next target in that direction is permission(9) folded fix for nfs_permission() breakage from Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-07-26ext4: Cleanup whitespace and other miscellaneous style issuesTheodore Ts'o
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-26SL*B: drop kmem cache argument from constructorAlexey Dobriyan
Kmem cache passed to constructor is only needed for constructors that are themselves multiplexeres. Nobody uses this "feature", nor does anybody uses passed kmem cache in non-trivial way, so pass only pointer to object. Non-trivial places are: arch/powerpc/mm/init_64.c arch/powerpc/mm/hugetlbpage.c This is flag day, yes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Matt Mackall <mpm@selenic.com> [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c] [akpm@linux-foundation.org: fix mm/slab.c] [akpm@linux-foundation.org: fix ubifs] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24ext4: improve ext4_fill_flex_info() a bitLi Zefan
- use kzalloc() instead of kmalloc() + memset() - improve a printk info Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-17ext4: Cleanup the block reservation code pathAneesh Kumar K.V
The truncate patch should not use the i_allocated_meta_blocks value. So add seperate functions to be used in the truncate and alloc path. We also need to release the meta-data block that we reserved for the blocks that we are truncating. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-08-01ext4: don't assume extents can't cross block groups when truncatingTheodore Ts'o
With the FLEX_BG layout, there is no reason why extents can't cross block groups, so make the truncate code reserve enough credits so we don't BUG if we come across such an extent. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-02ext4: Fix lack of credits BUG() when deleting a badly fragmented inodeTheodore Ts'o
The extents codepath for ext4_truncate() requests journal transaction credits in very small chunks, requesting only what is needed. This means there may not be enough credits left on the transaction handle after ext4_truncate() returns and then when ext4_delete_inode() tries finish up its work, it may not have enough transaction credits, causing a BUG() oops in the jbd2 core. Also, reserve an extra 2 blocks when starting an ext4_delete_inode() since we need to update the inode bitmap, as well as update the orphaned inode linked list. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-01ext4: Fix ext4_ext_journal_restart()Theodore Ts'o
The ext4_ext_journal_restart() is a convenience function which checks to see if the requested number of credits is present, and if so it closes the current transaction and attaches the current handle to the new transaction. Unfortunately, it wasn't proprely checking the return value from ext4_journal_extend(), so it was starting a new transaction when one was not necessary, and returning an error when all that was necessary was to restart the handle with a new transaction. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-02ext4: fix ext4_da_write_begin error pathEric Sandeen
ext4_da_write_begin needs to call journal_stop before returning, if the page allocation fails. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-26ext4: don't read inode block if the buffer has a write errorHidehiro Kawai
A transient I/O error can corrupt inode data. Here is the scenario: (1) update inode_A at the block_B (2) pdflush writes out new inode_A to the filesystem, but it results in write I/O error, at this point, BH_Uptodate flag of the buffer for block_B is cleared and BH_Write_EIO is set (3) create new inode_C which located at block_B, and __ext4_get_inode_loc() tries to read on-disk block_B because the buffer is not uptodate (4) if it can read on-disk block_B successfully, inode_A is overwritten by old data This patch makes __ext4_get_inode_loc() not read the inode block if the buffer has BH_Write_EIO flag. In this case, the buffer should have the latest information, so setting the uptodate flag to the buffer (this avoids WARN_ON_ONCE() in mark_buffer_dirty().) According to this change, we would need to test BH_Write_EIO flag for the error checking. Currently nobody checks write I/O errors on metadata buffers, but it will be done in other patches I'm working on. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: sugita <yumiko.sugita.yf@hitachi.com> Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Jan Kara <jack@ucw.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-23ext4: Don't allow lg prealloc list to be grow large.Aneesh Kumar K.V
Currently, the locality group prealloc list is freed only when there is a block allocation failure. This can result in large number of entries in the preallocation list making ext4_mb_use_preallocated() expensive. To fix this, we convert the locality group prealloc list to a hash list. The hash index is the order of number of blocks in the prealloc space with a max order of 9. When adding prealloc space to the list we make sure total entries for each order does not exceed 8. If it is more than 8 we discard few entries and make sure the we have only <= 5 entries. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-23ext4: Convert the usage of NR_CPUS to nr_cpu_ids.Aneesh Kumar K.V
NR_CPUS can be really large. We should be using nr_cpu_ids instead. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-23ext4: Improve error handling in mballocAneesh Kumar K.V
Don't call BUG_ON on file system failures. Instead use ext4_error and also handle the continue case properly. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-08-02ext4: lock block groups when initializingEric Sandeen
I noticed when filling a 1T filesystem with 4 threads using the fs_mark benchmark: fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0 that I occasionally got checksum mismatch errors: EXT4-fs error (device sdb): ext4_init_inode_bitmap: Checksum bad for group 6935 etc. I'd reliably get 4-5 of them during the run. It appears that the problem is likely a race to init the bg's when the uninit_bg feature is enabled. With the patch below, which adds sb_bgl_locking around initialization, I was able to complete several runs with no errors or warnings. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-08-02ext4: sync up block and inode bitmap reading functionsEric Sandeen
ext4_read_block_bitmap and read_inode_bitmap do essentially the same thing, and yet they are structured quite differently. I came across this difference while looking at doing bg locking during bg initialization. This patch: * removes unnecessary casts in the error messages * renames read_inode_bitmap to ext4_read_inode_bitmap * and more substantially, restructures the inode bitmap reading function to be more like the block bitmap counterpart. The change to the inode bitmap reader simplifies the locking to be applied in the next patch. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-26ext4: Allow read/only mounts with corrupted block group checksumsTheodore Ts'o
If the block group checksums are corrupted, still allow the mount to succeed, so e2fsck can have a chance to try to fix things up. Add code in the remount r/w path to make sure the block group checksums are valid before allowing the filesystem to be remounted read/write. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-02ext4: Fix data corruption when writing to prealloc areaAneesh Kumar K.V
Inserting an extent can cause a new entry in the already existing index block. That doesn't increase the depth of the instead. Instead it adds a new leaf block. Now with the new leaf block the path information corresponding to the logical block should be fetched from the new block. The old path will be pointing to the old leaf block. We need to recalucate the path information on extent insert even if depth doesn't change. Without this change, the extent merge after converting an unwritten extent to initialized extent takes the wrong extent and cause data corruption. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11ext4: do not set extents feature from the kernelEric Sandeen
We've talked for a while about getting rid of any feature- setting from the kernel; this gets rid of the code which would set the INCOMPAT_EXTENTS flag on the first file write when mounted as ext4[dev]. With this patch, if the extents feature is not already set on disk, then mounting as ext4 will fall back to noextents with a warning, and if -o extents is explicitly requested, the mount will fail, also with warning. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11ext4: Don't allow nonextenst mount option for large filesystemAneesh Kumar K.V
The block mapped inode format can address only blocks within 2**32. This causes a number of issues, the biggest of which is that the block allocator needs to be taught that certain inodes can not utilize block numbers > 2**32. So until this is fixed, it is simplest to fail mounting of file systems with more than 2**32 blocks if the -o noextents option is given. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11ext4: Enable delalloc by default.Aneesh Kumar K.V
Enable delalloc by default to ensure it gets sufficient testing and because it makes the filesystem much more efficient. Add a nodealalloc option to disable delayed allocation, and update ext4_show_options to show delayed allocation off if it is disabled. If the data=journal mount option is used, disable delayed allocation since the delalloc code doesn't support data=journal yet. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-07-11ext4: delayed allocation i_blocks fix for statMingming Cao
Right now i_blocks is not getting updated until the blocks are actually allocaed on disk. This means with delayed allocation, right after files are copied, "ls -sF" shoes the file as taking 0 blocks on disk. "du" also shows the files taking zero space, which is highly confusing to the user. Since delayed allocation already keeps track of per-inode total number of blocks that are subject to delayed allocation, this patch fix this by using that to adjust the value returned by stat(2). When real block allocation is done, the i_blocks will get updated. Since the reserved blocks for delayed allocation will be decreased, this will be keep value returned by stat(2) consistent. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11ext4: fix delalloc i_disksize early update issueMingming Cao
Ext4_da_write_end() used walk_page_buffers() with a callback function of ext4_bh_unmapped_or_delay() to check if it extended the file size without allocating any blocks (since in this case i_disksize needs to be updated). However, this is didn't work proprely because the buffer head has not been marked dirty yet --- this is done later in block_commit_write() --- which caused ext4_bh_unmapped_or_delay() to always return false. In addition, walk_page_buffers() checks all of the buffer heads covering the page, and the only buffer_head that should be checked is the one covering the end of the write. Otherwise, given a 1k blocksize filesystem and a 4k page size, the buffer head covering the first 1k stripe of the file could be unmapped (because it was a sparse file), and the second or third buffer_head covering that page could be mapped, and using walk_page_buffers() would fail in this case since it would stop at the first unmapped buffer_head and return true. The core problem is that walk_page_buffers() was intended to do work in a callback function, and a non-zero return value indicated a failure, which termined the walk of the buffer heads covering the page. It was not intended to be used with a boolean function, such as ext4_bh_unmapped_or_delay(). Add addtional fix from Aneesh to protect i_disksize update rave with truncate. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11ext4: Handle page without buffers in ext4_*_writepage()Aneesh Kumar K.V
It can happen that buffers are removed from the page before it gets marked dirty and then is passed to writepage(). In writepage() we just initialize the buffers and check whether they are mapped and non delay. If they are mapped and non delay we write the page. Otherwise we mark them dirty. With this change we don't do block allocation at all in ext4_*_write_page. writepage() can get called under many condition and with a locking order of journal_start -> lock_page, we should not try to allocate blocks in writepage() which get called after taking page lock. writepage() can get called via shrink_page_list even with a journal handle which was created for doing inode update. For example when doing ext4_da_write_begin we create a journal handle with credit 1 expecting a i_disksize update for the inode. But ext4_da_write_begin can cause shrink_page_list via _grab_page_cache. So having a valid handle via ext4_journal_current_handle is not a guarantee that we can use the handle for block allocation in writepage, since we shouldn't be using credits that had been reserved for other updates. That it could result in we running out of credits when we update inodes. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11ext4: Add ordered mode support for delallocAneesh Kumar K.V
This provides a new ordered mode implementation which gets rid of using buffer heads to enforce the ordering between metadata change with the related data chage. Instead, in the new ordering mode, it keeps track of all of the inodes touched by each transaction on a list, and when that transaction is committed, it flushes all of the dirty pages for those inodes. In addition, the new ordered mode reverses the lock ordering of the page lock and transaction lock, which provides easier support for delayed allocation. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11ext4: Invert lock ordering of page_lock and transaction start in delallocMingming Cao
With the reverse locking, we need to start a transation before taking the page lock, so in ext4_da_writepages() we need to break the write-out into chunks, and restart the journal for each chunck to ensure the write-out fits in a single transaction. Updated patch from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> which fixes delalloc sync hang with journal lock inversion, and address the performance regression issue. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-14ext4: delayed allocation ENOSPC handlingMingming Cao
This patch does block reservation for delayed allocation, to avoid ENOSPC later at page flush time. Blocks(data and metadata) are reserved at da_write_begin() time, the freeblocks counter is updated by then, and the number of reserved blocks is store in per inode counter. At the writepage time, the unused reserved meta blocks are returned back. At unlink/truncate time, reserved blocks are properly released. Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> to fix the oldallocator block reservation accounting with delalloc, added lock to guard the counters and also fix the reservation for meta blocks. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-11percpu_counter: new function percpu_counter_sum_and_setMingming Cao
Delayed allocation need to check free blocks at every write time. percpu_counter_read_positive() is not quit accurate. delayed allocation need a more accurate accounting, but using percpu_counter_sum_positive() is frequently is quite expensive. This patch added a new function to update center counter when sum per-cpu counter, to increase the accurate rate for next percpu_counter_read() and require less calling expensive percpu_counter_sum(). Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11ext4: Add delayed allocation support in data=writeback modeAlex Tomas
Updated with fixes from Mingming Cao <cmm@us.ibm.com> to unlock and release the page from page cache if the delalloc write_begin failed, and properly handle preallocated blocks. Also added a fix to clear buffer_delay in block_write_full_page() after allocating a delayed buffer. Updated with fixes from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> to update i_disksize properly and to add bmap support for delayed allocation. Updated with a fix from Valerie Clement <valerie.clement@bull.net> to avoid filesystem corruption when the filesystem is mounted with the delalloc option and blocksize < pagesize. Signed-off-by: Alex Tomas <alex@clusterfs.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-07-11ext4: Use new framework for data=ordered mode in JBD2Jan Kara
This patch makes ext4 use inode-based implementation of data=ordered mode in JBD2. It allows us to unify some data=ordered and data=writeback paths (especially writepage since we don't have to start a transaction anymore) and remove some buffer walking. Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> to fix file system hang due to corrupt jinode values. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11ext4: Fix lock inversion in ext4_ext_truncate()Jan Kara
We cannot call ext4_orphan_add() from under i_data_sem because that causes a lock ordering violation between i_data_sem and and the superblock lock. Updated with Aneesh's locking order fix Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11ext4: Invert the locking order of page_lock and transaction startJan Kara
This changes are needed to support data=ordered mode handling via inodes. This enables us to get rid of the journal heads and buffer heads for data buffers in the ordered mode. With the changes, during tranasaction commit we writeout the inode pages using the writepages()/writepage(). That implies we take page lock during transaction commit. This can cause a deadlock with the locking order page_lock -> jbd2_journal_start, since the jbd2_journal_start can wait for the journal_commit to happen and the journal_commit now needs to take the page lock. To avoid this dead lock reverse the locking order. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>