Age | Commit message (Collapse) | Author |
|
Make the fibmap call the return the proper physical block number for any
offset request in the fallocated range.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Skip new cluster allocation after checking i_disksize in _fat_get_block.
because the blocks are already allocated in fallocated region.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Implement preallocation via the fallocate syscall on VFAT partitions.
This patch is based on an earlier patch of the same name which had some
issues detailed below and did not get accepted. Refer
https://lkml.org/lkml/2007/12/22/130.
a) The preallocated space was not persistent when the
FALLOC_FL_KEEP_SIZE flag was set. It will deallocate cluster at evict
time.
b) There was no need to zero out the clusters when the flag was set
Instead of doing an expanding truncate, just allocate clusters and add
them to the fat chain. This reduces preallocation time.
Compatibility with windows:
There are no issues when FALLOC_FL_KEEP_SIZE is not set because it just
does an expanding truncate. Thus reading from the preallocated area on
windows returns null until data is written to it.
When a file with preallocated area using the FALLOC_FL_KEEP_SIZE was
written to on windows, the windows driver freed-up the preallocated
clusters and allocated new clusters for the new data. The freed up
clusters gets reflected in the free space available for the partition
which can be seen from the Volume properties.
The windows chkdsk tool also does not report any errors on a disk
containing files with preallocated space.
And there is also no issue using linux fat fsck. because discard
preallocated clusters at repair time.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This patchset provides support for doing fallocate operation on FAT
filesystem.
This patch (of 5):
Add i_disksize to represent uninitialized allocated size. And mmu_private
represent initialized allocated size. i_disksize - is always kept cluster
size aligned mmu_private - is normally block size aligned
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
For VMAs that don't want write notifications, PTEs created for read faults
have their write bit set. If the read fault happens after VM_SOFTDIRTY is
cleared, then the PTE's softdirty bit will remain clear after subsequent
writes.
Here's a simple code snippet to demonstrate the bug:
char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
assert(*m == '\0'); /* new PTE allows write access */
assert(!soft_dirty(x));
*m = 'x'; /* should dirty the page */
assert(soft_dirty(x)); /* fails */
With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write notifications
are disabled when VM_SOFTDIRTY is set.
As a side effect of enabling and disabling write notifications with care,
this patch fixes a bug in mprotect where vm_page_prot bits set by drivers
were zapped on mprotect. An analogous bug was fixed in mmap by
c9d0bf241451a3ab7d ("mm: uncached vma support with writenotify").
Signed-off-by: Peter Feiner <pfeiner@google.com>
Reported-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It's very common for the buffer heads in the lru to have different block
numbers. By comparing the blocknr before the bdev and size we can reduce
the cost of searching in the very common case where all the entries have
the same bdev and size.
In quick hot cache cycle counting tests on a single fs workstation this
cut the cost of a miss by about 20%.
A diff of the disassembly shows the reordering of the bdev and blocknr
comparisons. This is in such a tiny loop that skipping one comparison is
a meaningful portion of the total work being done:
1628: 83 c1 01 add $0x1,%ecx
162b: 83 f9 08 cmp $0x8,%ecx
162e: 74 60 je 1690 <__find_get_block+0xa0>
1630: 89 c8 mov %ecx,%eax
1632: 65 4c 8b 04 c5 00 00 mov %gs:0x0(,%rax,8),%r8
1639: 00 00
163b: 4d 85 c0 test %r8,%r8
163e: 4c 89 c3 mov %r8,%rbx
1641: 74 e5 je 1628 <__find_get_block+0x38>
- 1643: 4d 3b 68 30 cmp 0x30(%r8),%r13
+ 1643: 4d 3b 68 18 cmp 0x18(%r8),%r13
1647: 75 df jne 1628 <__find_get_block+0x38>
- 1649: 4d 3b 60 18 cmp 0x18(%r8),%r12
+ 1649: 4d 3b 60 30 cmp 0x30(%r8),%r12
164d: 75 d9 jne 1628 <__find_get_block+0x38>
164f: 49 39 50 20 cmp %rdx,0x20(%r8)
1653: 75 d3 jne 1628 <__find_get_block+0x38>
Signed-off-by: Zach Brown <zab@zabbo.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The kernel used to contain two functions for length-delimited,
case-insensitive string comparison, strnicmp with correct semantics and a
slightly buggy strncasecmp. The latter is the POSIX name, so strnicmp was
renamed to strncasecmp, and strnicmp made into a wrapper for the new
strncasecmp to avoid breaking existing users.
To allow the compat wrapper strnicmp to be removed at some point in the
future, and to avoid the extra indirection cost, do
s/strnicmp/strncasecmp/g.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The kernel used to contain two functions for length-delimited,
case-insensitive string comparison, strnicmp with correct semantics and a
slightly buggy strncasecmp. The latter is the POSIX name, so strnicmp was
renamed to strncasecmp, and strnicmp made into a wrapper for the new
strncasecmp to avoid breaking existing users.
To allow the compat wrapper strnicmp to be removed at some point in the
future, and to avoid the extra indirection cost, do
s/strnicmp/strncasecmp/g.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The kernel used to contain two functions for length-delimited,
case-insensitive string comparison, strnicmp with correct semantics and a
slightly buggy strncasecmp. The latter is the POSIX name, so strnicmp was
renamed to strncasecmp, and strnicmp made into a wrapper for the new
strncasecmp to avoid breaking existing users.
To allow the compat wrapper strnicmp to be removed at some point in the
future, and to avoid the extra indirection cost, do
s/strnicmp/strncasecmp/g.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Conflicts:
arch/arm64/mm/dma-mapping.c
drivers/clk/Kconfig
mm/cma.c
|
|
|
|
|
|
Conflicts:
fs/nfs/client.c
|
|
Conflicts:
arch/s390/include/asm/cputime.h
arch/s390/kernel/irq.c
arch/s390/kernel/processor.c
arch/s390/kernel/vtime.c
kernel/irq_work.c
|
|
Conflicts:
arch/x86/kernel/entry_64.S
arch/x86/kernel/ptrace.c
fs/nfs/blocklayout/blocklayoutdev.c
fs/nfs/blocklayout/blocklayoutdm.c
|
|
Conflicts:
block/blk-core.c
|
|
Conflicts:
drivers/usb/gadget/function/f_fs.c
|
|
|
|
|
|
|
|
|
|
Conflicts:
fs/ext4/super.c
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris.
Mostly ima, selinux, smack and key handling updates.
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
integrity: do zero padding of the key id
KEYS: output last portion of fingerprint in /proc/keys
KEYS: strip 'id:' from ca_keyid
KEYS: use swapped SKID for performing partial matching
KEYS: Restore partial ID matching functionality for asymmetric keys
X.509: If available, use the raw subjKeyId to form the key description
KEYS: handle error code encoded in pointer
selinux: normalize audit log formatting
selinux: cleanup error reporting in selinux_nlmsg_perm()
KEYS: Check hex2bin()'s return when generating an asymmetric key ID
ima: detect violations for mmaped files
ima: fix race condition on ima_rdwr_violation_check and process_measurement
ima: added ima_policy_flag variable
ima: return an error code from ima_add_boot_aggregate()
ima: provide 'ima_appraise=log' kernel option
ima: move keyring initialization to ima_init()
PKCS#7: Handle PKCS#7 messages that contain no X.509 certs
PKCS#7: Better handling of unsupported crypto
KEYS: Overhaul key identification when searching for asymmetric keys
KEYS: Implement binary asymmetric key ID handling
...
|
|
In patch 'ext4: refactor ext4_move_extents code base', Dmitry Monakhov has
refactored ext4_move_extents' implementation, but forgot to update the
corresponding comments, this patch will try to delete some useless comments.
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Delalloc write journal reservations only reserve 1 credit,
to update the inode if necessary. However, it may happen
once in a filesystem's lifetime that a file will cross
the 2G threshold, and require the LARGE_FILE feature to
be set in the superblock as well, if it was not set already.
This overruns the transaction reservation, and can be
demonstrated simply on any ext4 filesystem without the LARGE_FILE
feature already set:
dd if=/dev/zero of=testfile bs=1 seek=2147483646 count=1 \
conv=notrunc of=testfile
sync
dd if=/dev/zero of=testfile bs=1 seek=2147483647 count=1 \
conv=notrunc of=testfile
leads to:
EXT4-fs: ext4_do_update_inode:4296: aborting transaction: error 28 in __ext4_handle_dirty_super
EXT4-fs error (device loop0) in ext4_do_update_inode:4301: error 28
EXT4-fs error (device loop0) in ext4_reserve_inode_write:4757: Readonly filesystem
EXT4-fs error (device loop0) in ext4_dirty_inode:4876: error 28
EXT4-fs error (device loop0) in ext4_da_write_end:2685: error 28
Adjust the number of credits based on whether the flag is
already set, and whether the current write may extend past the
LARGE_FILE limit.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
|
|
Pull file locking related changes from Jeff Layton:
"This release is a little more busy for file locking changes than the
last:
- a set of patches from Kinglong Mee to fix the lockowner handling in
knfsd
- a pile of cleanups to the internal file lease API. This should get
us a bit closer to allowing for setlease methods that can block.
There are some dependencies between mine and Bruce's trees this cycle,
and I based my tree on top of the requisite patches in Bruce's tree"
* tag 'locks-v3.18-1' of git://git.samba.org/jlayton/linux: (26 commits)
locks: fix fcntl_setlease/getlease return when !CONFIG_FILE_LOCKING
locks: flock_make_lock should return a struct file_lock (or PTR_ERR)
locks: set fl_owner for leases to filp instead of current->files
locks: give lm_break a return value
locks: __break_lease cleanup in preparation of allowing direct removal of leases
locks: remove i_have_this_lease check from __break_lease
locks: move freeing of leases outside of i_lock
locks: move i_lock acquisition into generic_*_lease handlers
locks: define a lm_setup handler for leases
locks: plumb a "priv" pointer into the setlease routines
nfsd: don't keep a pointer to the lease in nfs4_file
locks: clean up vfs_setlease kerneldoc comments
locks: generic_delete_lease doesn't need a file_lock at all
nfsd: fix potential lease memory leak in nfs4_setlease
locks: close potential race in lease_get_mtime
security: make security_file_set_fowner, f_setown and __f_setown void return
locks: consolidate "nolease" routines
locks: remove lock_may_read and lock_may_write
lockd: rip out deferred lock handling from testlock codepath
NFSD: Get reference of lockowner when coping file_lock
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
"The largest set of changes here come from Miao Xie. He's cleaning up
and improving read recovery/repair for raid, and has a number of
related fixes.
I've merged another set of fsync fixes from Filipe, and he's also
improved the way we handle metadata write errors to make sure we force
the FS readonly if things go wrong.
Otherwise we have a collection of fixes and cleanups. Dave Sterba
gets a cookie for removing the most lines (thanks Dave)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (139 commits)
btrfs: Fix compile error when CONFIG_SECURITY is not set.
Btrfs: fix compiles when CONFIG_BTRFS_FS_RUN_SANITY_TESTS is off
btrfs: Make btrfs handle security mount options internally to avoid losing security label.
Btrfs: send, don't delay dir move if there's a new parent inode
btrfs: add more superblock checks
Btrfs: fix race in WAIT_SYNC ioctl
Btrfs: be aware of btree inode write errors to avoid fs corruption
Btrfs: remove redundant btrfs_verify_qgroup_counts declaration.
btrfs: fix shadow warning on cmp
Btrfs: fix compilation errors under DEBUG
Btrfs: fix crash of btrfs_release_extent_buffer_page
Btrfs: add missing end_page_writeback on submit_extent_page failure
btrfs: Fix the wrong condition judgment about subset extent map
Btrfs: fix build_backref_tree issue with multiple shared blocks
Btrfs: cleanup error handling in build_backref_tree
btrfs: move checks for DUMMY_ROOT into a helper
btrfs: new define for the inline extent data start
btrfs: kill extent_buffer_page helper
btrfs: drop constant param from btrfs_release_extent_buffer_page
btrfs: hide typecast to definition of BTRFS_SEND_TRANS_STUB
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF and quota updates from Jan Kara:
"A few UDF fixes and also a few patches which are preparing filesystems
for support of project quotas in VFS"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Fix loading of special inodes
ocfs2: Back out change to use OCFS2_MAXQUOTAS in ocfs2_setattr()
udf: remove redundant sys_tz declaration
ocfs2: Don't use MAXQUOTAS value
reiserfs: Don't use MAXQUOTAS value
ext3: Don't use MAXQUOTAS value
udf: Fix race between write(2) and close(2)
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
Pull eCryptfs updates from Tyler Hicks:
"Minor code cleanups and a fix for when eCryptfs metadata is stored in
xattrs"
* tag 'ecryptfs-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
ecryptfs: remove unneeded buggy code in ecryptfs_do_create()
ecryptfs: avoid to access NULL pointer when write metadata in xattr
ecryptfs: remove unnecessary break after goto
ecryptfs: Remove unnecessary include of syscall.h in keystore.c
fs/ecryptfs/messaging.c: remove null test before kfree
ecryptfs: Drop cast
Use %pd in eCryptFS
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw
Pull gfs2 updates from Steven Whitehouse:
"This time we have a couple of bug fixes, one relating to bad i_goal
values which are now ignored (i_goal is basically a hint so it is safe
to so this) and another relating to the saving of the dirent location
during rename.
There is one performance improvement, which is an optimisation in
rgblk_free so that multiple block deallocations will now be more
efficient, and one clean up patch to use _RET_IP_ rather than writing
it out longhand"
* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: use _RET_IP_ instead of (unsigned long)__builtin_return_address(0)
GFS2: Use gfs2_rbm_incr in rgblk_free
GFS2: Make rename not save dirent location
GFS2: fix bad inode i_goal values during block allocation
|
|
Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch instead allocates the appropriate amount of
memory using a char array using the SHASH_DESC_ON_STACK macro.
The new code can be compiled with both gcc and clang.
Signed-off-by: Vinícius Tinti <viniciustinti@gmail.com>
Reviewed-by: Jan-Simon Möller <dl9pf@gmx.de>
Reviewed-by: Mark Charlebois <charlebm@gmail.com>
Signed-off-by: Behan Webster <behanw@converseincode.com>
Acked-by: Chris Mason <clm@fb.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:
"A lot of activities on percpu front. Notable changes are...
- percpu allocator now can take @gfp. If @gfp doesn't contain
GFP_KERNEL, it tries to allocate from what's already available to
the allocator and a work item tries to keep the reserve around
certain level so that these atomic allocations usually succeed.
This will replace the ad-hoc percpu memory pool used by
blk-throttle and also be used by the planned blkcg support for
writeback IOs.
Please note that I noticed a bug in how @gfp is interpreted while
preparing this pull request and applied the fix 6ae833c7fe0c
("percpu: fix how @gfp is interpreted by the percpu allocator")
just now.
- percpu_ref now uses longs for percpu and global counters instead of
ints. It leads to more sparse packing of the percpu counters on
64bit machines but the overhead should be negligible and this
allows using percpu_ref for refcnting pages and in-memory objects
directly.
- The switching between percpu and single counter modes of a
percpu_ref is made independent of putting the base ref and a
percpu_ref can now optionally be initialized in single or killed
mode. This allows avoiding percpu shutdown latency for cases where
the refcounted objects may be synchronously created and destroyed
in rapid succession with only a fraction of them reaching fully
operational status (SCSI probing does this when combined with
blk-mq support). It's also planned to be used to implement forced
single mode to detect underflow more timely for debugging.
There's a separate branch percpu/for-3.18-consistent-ops which cleans
up the duplicate percpu accessors. That branch causes a number of
conflicts with s390 and other trees. I'll send a separate pull
request w/ resolutions once other branches are merged"
* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
percpu: fix how @gfp is interpreted by the percpu allocator
blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
percpu_ref: add PERCPU_REF_INIT_* flags
percpu_ref: decouple switching to percpu mode and reinit
percpu_ref: decouple switching to atomic mode and killing
percpu_ref: add PCPU_REF_DEAD
percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
percpu_ref: replace pcpu_ prefix with percpu_
percpu_ref: minor code and comment updates
percpu_ref: relocate percpu_ref_reinit()
Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
Revert "percpu: free percpu allocation info for uniprocessor system"
percpu-refcount: make percpu_ref based on longs instead of ints
percpu-refcount: improve WARN messages
percpu: fix locking regression in the failure path of pcpu_alloc()
percpu-refcount: add @gfp to percpu_ref_init()
proportions: add @gfp to init functions
percpu_counter: add @gfp to percpu_counter_init()
percpu_counter: make percpu_counters_lock irq-safe
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"Nothing too interesting. Just a handful of cleanup patches"
* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
Revert "cgroup: remove redundant variable in cgroup_mount()"
cgroup: remove redundant variable in cgroup_mount()
cgroup: fix missing unlock in cgroup_release_agent()
cgroup: remove CGRP_RELEASABLE flag
perf/cgroup: Remove perf_put_cgroup()
cgroup: remove redundant check in cgroup_ino()
cpuset: simplify proc_cpuset_show()
cgroup: simplify proc_cgroup_show()
cgroup: use a per-cgroup work for release agent
cgroup: remove bogus comments
cgroup: remove redundant code in cgroup_rmdir()
cgroup: remove some useless forward declarations
cgroup: fix a typo in comment.
|
|
Increase the buffer-head per-CPU LRU size to allow efficient filesystem
operations that access many blocks for each transaction. For example,
creating a file in a large ext4 directory with quota enabled will access
multiple buffer heads and will overflow the LRU at the default 8-block LRU
size:
* parent directory inode table block (ctime, nlinks for subdirs)
* new inode bitmap
* inode table block
* 2 quota blocks
* directory leaf block (not reused, but pollutes one cache entry)
* 2 levels htree blocks (only one is reused, other pollutes cache)
* 2 levels indirect/index blocks (only one is reused)
The buffer-head per-CPU LRU size is raised to 16, as it shows in metadata
performance benchmarks up to 10% gain for create, 4% for lookup and 7% for
destroy.
Signed-off-by: Liang Zhen <liang.zhen@intel.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Sebastien Buisson <sebastien.buisson@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Always mark pages with PageBalloon even if balloon compaction is disabled
and expose this mark in /proc/kpageflags as KPF_BALLOON.
Also this patch adds three counters into /proc/vmstat: "balloon_inflate",
"balloon_deflate" and "balloon_migrate". They accumulate balloon
activity. Current size of balloon is (balloon_inflate - balloon_deflate)
pages.
All generic balloon code now gathered under option CONFIG_MEMORY_BALLOON.
It should be selected by ballooning driver which wants use this feature.
Currently virtio-balloon is the only user.
Signed-off-by: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If a /proc/pid/pagemap read spans a [VMA, an unmapped region, then a
VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported
as softdirty. Here's a program to demonstrate the bug:
int main() {
const uint64_t PAGEMAP_SOFTDIRTY = 1ul << 55;
uint64_t pme[3];
int fd = open("/proc/self/pagemap", O_RDONLY);;
char *m = mmap(NULL, 3 * getpagesize(), PROT_READ,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
munmap(m + getpagesize(), getpagesize());
pread(fd, pme, 24, (unsigned long) m / getpagesize() * 8);
assert(pme[0] & PAGEMAP_SOFTDIRTY); /* passes */
assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails */
assert(pme[2] & PAGEMAP_SOFTDIRTY); /* passes */
return 0;
}
(Note that all pages in new VMAs are softdirty until cleared).
Tested:
Used the program given above. I'm going to include this code in
a selftest in the future.
[n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning]
Signed-off-by: Peter Feiner <pfeiner@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix a deadlock problem caused by direct memory reclaim in o2net_wq. The
situation is as follows:
1) Receive a connect message from another node, node queues a
work_struct o2net_listen_work.
2) o2net_wq processes this work and call the following functions:
o2net_wq
-> o2net_accept_one
-> sock_create_lite
-> sock_alloc()
-> kmem_cache_alloc with GFP_KERNEL
-> ____cache_alloc_node
->__alloc_pages_nodemask
-> do_try_to_free_pages
-> shrink_slab
-> evict
-> ocfs2_evict_inode
-> ocfs2_drop_lock
-> dlmunlock
-> o2net_send_message_vec
then o2net_wq wait for the unlock reply from master.
3) tcp layer received the reply, call o2net_data_ready() and queue
sc_rx_work, waiting o2net_wq to process this work.
4) o2net_wq is a single thread workqueue, it process the work one by
one. Right now it is still doing o2net_listen_work and cannot handle
sc_rx_work. so we deadlock.
Junxiao Bi's patch "mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set"
(http://ozlabs.org/~akpm/mmots/broken-out/mm-clear-__gfp_fs-when-pf_memalloc_noio-is-set.patch)
clears __GFP_FS in memalloc_noio_flags() besides __GFP_IO. We use
memalloc_noio_save() to set process flag PF_MEMALLOC_NOIO so that all
allocations done by this process are done as if GFP_NOIO was specified.
We are not reentering filesystem while doing memory reclaim.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
9e7814404b77 "hold task->mempolicy while numa_maps scans." fixed the
race with the exiting task but this is not enough.
The current code assumes that get_vma_policy(task) should either see
task->mempolicy == NULL or it should be equal to ->task_mempolicy saved
by hold_task_mempolicy(), so we can never race with __mpol_put(). But
this can only work if we can't race with do_set_mempolicy(), and thus
we can't race with another do_set_mempolicy() or do_exit() after that.
However, do_set_mempolicy()->down_write(mmap_sem) can not prevent this
race. This task can exec, change it's ->mm, and call do_set_mempolicy()
after that; in this case they take 2 different locks.
Change hold_task_mempolicy() to use get_task_policy(), it never returns
NULL, and change show_numa_map() to use __get_vma_policy() or fall back
to proc_priv->task_mempolicy.
Note: this is the minimal fix, we will cleanup this code later. I think
hold_task_mempolicy() and release_task_mempolicy() should die, we can
move this logic into show_numa_map(). Or we can move get_task_policy()
outside of ->mmap_sem and !CONFIG_NUMA code at least.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Sequential read from a block device is expected to be equal or faster than
from the file on a filesystem. But it is not correct due to the lack of
effective readpages() in the address space operations for block device.
This implements readpages() operation for block device by using
mpage_readpages() which can create multipage BIOs instead of BIOs for each
page and reduce system CPU time consumption.
Install 1GB of RAM disk storage:
# modprobe scsi_debug dev_size_mb=1024 delay=0
Sequential read from file on a filesystem:
# mkfs.ext4 /dev/$DEV
# mount /dev/$DEV /mnt
# fio --name=t --size=512m --rw=read --filename=/mnt/file
...
read : io=524288KB, bw=2133.4MB/s, iops=546133, runt= 240msec
Sequential read from a block device:
# fio --name=t --size=512m --rw=read --filename=/dev/$DEV
...
(Without this commit)
read : io=524288KB, bw=1700.2MB/s, iops=435455, runt= 301msec
(With this commit)
read : io=524288KB, bw=2160.4MB/s, iops=553046, runt= 237msec
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add guard_bio_eod() check for mpage code in order to allow us to do IO
even on the odd last sectors of a device, even if the block size is some
multiple of the physical sector size.
Using mpage_readpages() for block device requires this guard check.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patchset implements readpages() operation for block device by using
mpage_readpages() which can create multipage BIOs instead of BIOs for each
page and reduce system CPU time consumption.
This patch (of 3):
guard_bh_eod() is used in submit_bh() to allow us to do IO even on the odd
last sectors of a device, even if the block size is some multiple of the
physical sector size. This makes guard_bh_eod() more generic and renames
it guard_bio_eod() so that we can use it without struct buffer_head
argument.
The reason for this change is that using mpage_readpages() for block
device requires to add this guard check in mpage code.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
On some ARCHs modules range is eauql to vmalloc range. E.g on i686
"#define MODULES_VADDR VMALLOC_START"
"#define MODULES_END VMALLOC_END"
This will cause 2 duplicate program segments in /proc/kcore, and no flag
to indicate they are different. This is confusing. And usually people
who need check the elf header or read the content of kcore will check
memory ranges. Two program segments which are the same are unnecessary.
So check if the modules range is equal to vmalloc range. If so, just skip
adding the modules range.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
- Rename vm_is_stack() to task_of_stack() and change it to return
"struct task_struct *" rather than the global (and thus wrong in
general) pid_t.
- Add the new pid_of_stack() helper which calls task_of_stack() and
uses the right namespace to report the correct pid_t.
Unfortunately we need to define this helper twice, in task_mmu.c
and in task_nommu.c. perhaps it makes sense to add fs/proc/util.c
and move at least pid_of_stack/task_of_stack there to avoid the
code duplication.
- Change show_map_vma() and show_numa_map() to use the new helper.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
m_start() can use get_proc_task() instead, and "struct inode *"
provides more potentially useful info, see the next changes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
I do not know if CONFIG_PREEMPT/SMP is possible without CONFIG_MMU
but the usage of task->mm in m_stop(). The task can exit/exec before
we take mmap_sem, in this case m_stop() can hit NULL or unlock the
wrong rw_semaphore.
Also, this code uses priv->task != NULL to decide whether we need
up_read/mmput. This is correct, but we will probably kill priv->task.
Change m_start/m_stop to rely on IS_ERR_OR_NULL() like task_mmu.c does.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Copy-and-paste the changes from "fs/proc/task_mmu.c: shift mm_access()
from m_start() to proc_maps_open()" into task_nommu.c.
Change maps_open() to initialize priv->mm using proc_mem_open(), m_start()
can rely on atomic_inc_not_zero(mm_users) like task_mmu.c does.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Cleanup and preparation. maps_open() can use __seq_open_private()
like proc_maps_open() does.
[akpm@linux-foundation.org: deuglify]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Change the main loop in m_start() to update m->version. Mostly for
consistency, but this can help to avoid the same loop if the very
1st ->show() fails due to seq_overflow().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add the "last_addr" optimization back. Like before, every ->show()
method checks !seq_overflow() and sets m->version = vma->vm_start.
However, it also checks that m_next_vma(vma) != NULL, otherwise it
sets m->version = -1 for the lockless "EOF" fast-path in m_start().
m_start() can simply do find_vma() + m_next_vma() if last_addr is
not zero, the code looks clear and simple and this case is clearly
separated from "scan vmas" path.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|