summaryrefslogtreecommitdiff
path: root/security
AgeCommit message (Collapse)Author
2011-03-01Merge remote-tracking branch 'tip/auto-latest'Stephen Rothwell
Conflicts: arch/x86/kernel/acpi/sleep.c
2011-03-01Merge remote-tracking branch 'selinux/master'Stephen Rothwell
2011-03-01Merge remote-tracking branch 'security-testing/next'Stephen Rothwell
2011-02-25Revert "selinux: simplify ioctl checking"Eric Paris
This reverts commit 242631c49d4cf39642741d6627750151b058233b. Conflicts: security/selinux/hooks.c SELinux used to recognize certain individual ioctls and check permissions based on the knowledge of the individual ioctl. In commit 242631c49d4cf396 the SELinux code stopped trying to understand individual ioctls and to instead looked at the ioctl access bits to determine in we should check read or write for that operation. This same suggestion was made to SMACK (and I believe copied into TOMOYO). But this suggestion is total rubbish. The ioctl access bits are actually the access requirements for the structure being passed into the ioctl, and are completely unrelated to the operation of the ioctl or the object the ioctl is being performed upon. Take FS_IOC_FIEMAP as an example. FS_IOC_FIEMAP is defined as: FS_IOC_FIEMAP _IOWR('f', 11, struct fiemap) So it has access bits R and W. What this really means is that the kernel is going to both read and write to the struct fiemap. It has nothing at all to do with the operations that this ioctl might perform on the file itself! Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
2011-02-25selinux: drop unused packet flow permissionsEric Paris
These permissions are not used and can be dropped in the kernel definitions. Suggested-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
2011-02-25selinux: Fix packet forwarding checks on postroutingSteffen Klassert
The IPSKB_FORWARDED and IP6SKB_FORWARDED flags are used only in the multicast forwarding case to indicate that a packet looped back after forward. So these flags are not a good indicator for packet forwarding. A better indicator is the incoming interface. If we have no socket context, but an incoming interface and we see the packet in the ip postroute hook, the packet is going to be forwarded. With this patch we use the incoming interface as an indicator on packet forwarding. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Paul Moore <paul.moore@hp.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2011-02-25selinux: Fix wrong checks for selinux_policycap_netpeerSteffen Klassert
selinux_sock_rcv_skb_compat and selinux_ip_postroute_compat are just called if selinux_policycap_netpeer is not set. However in these functions we check if selinux_policycap_netpeer is set. This leads to some dead code and to the fact that selinux_xfrm_postroute_last is never executed. This patch removes the dead code and the checks for selinux_policycap_netpeer in the compatibility functions. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Paul Moore <paul.moore@hp.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2011-02-25selinux: Fix check for xfrm selinux context algorithmSteffen Klassert
selinux_xfrm_sec_ctx_alloc accidentally checks the xfrm domain of interpretation against the selinux context algorithm. This patch fixes this by checking ctx_alg against the selinux context algorithm. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Paul Moore <paul.moore@hp.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2011-02-23ima: remove unnecessary call to ima_must_measureMimi Zohar
The original ima_must_measure() function based its results on cached iint information, which required an iint be allocated for all files. Currently, an iint is allocated only for files in policy. As a result, for those files in policy, ima_must_measure() is now called twice: once to determine if the inode is in the measurement policy and, the second time, to determine if it needs to be measured/re-measured. The second call to ima_must_measure() unnecessarily checks to see if the file is in policy. As we already know the file is in policy, this patch removes the second unnecessary call to ima_must_measure(), removes the vestige iint parameter, and just checks the iint directly to determine if the inode has been measured or needs to be measured/re-measured. Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: Eric Paris <eparis@redhat.com>
2011-02-22xfrm: Mark flowi arg to security_xfrm_state_pol_flow_match() const.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-15Merge branch 'linus' into auto-latestIngo Molnar
2011-02-11security: add cred argument to security_capable()Chris Wright
Expand security_capable() to include cred, so that it can be usable in a wider range of call sites. Signed-off-by: Chris Wright <chrisw@sous-sol.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: James Morris <jmorris@namei.org>
2011-02-10IMA: remove IMA imbalance checkingMimi Zohar
Now that i_readcount is maintained by the VFS layer, remove the imbalance checking in IMA. Cleans up the IMA code nicely. Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: Eric Paris <eparis@redhat.com>
2011-02-10IMA: maintain i_readcount in the VFS layerMimi Zohar
ima_counts_get() updated the readcount and invalidated the PCR, as necessary. Only update the i_readcount in the VFS layer. Move the PCR invalidation checks to ima_file_check(), where it belongs. Maintaining the i_readcount in the VFS layer, will allow other subsystems to use i_readcount. Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: Eric Paris <eparis@redhat.com>
2011-02-10IMA: convert i_readcount to atomicMimi Zohar
Convert the inode's i_readcount from an unsigned int to atomic. Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: Eric Paris <eparis@redhat.com>
2011-02-09Smack: correct final mmap check comparisonCasey Schaufler
The mmap policy enforcement checks the access of the SMACK64MMAP subject against the current subject incorrectly. The check as written works correctly only if the access rules involved have the same access. This is the common case, so initial testing did not find a problem. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
2011-02-09security:smack: kill unused SMACK_LIST_MAX, MAY_ANY and MAY_ANYWRITEShan Wei
Kill unused macros of SMACK_LIST_MAX, MAY_ANY and MAY_ANYWRITE. v2: As Casey Schaufler's advice, also remove MAY_ANY. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
2011-02-09Merge branch 'linus' into auto-latestIngo Molnar
2011-02-09Smack: correct behavior in the mmap hookCasey Schaufler
The mmap policy enforcement was not properly handling the interaction between the global and local rule lists. Instead of going through one and then the other, which missed the important case where a rule specified that there should be no access, combine the access limitations where there is a rule in each list. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: James Morris <jmorris@namei.org>
2011-02-07CRED: Fix BUG() upon security_cred_alloc_blank() failureTetsuo Handa
In cred_alloc_blank() since 2.6.32, abort_creds(new) is called with new->security == NULL and new->magic == 0 when security_cred_alloc_blank() returns an error. As a result, BUG() will be triggered if SELinux is enabled or CONFIG_DEBUG_CREDENTIALS=y. If CONFIG_DEBUG_CREDENTIALS=y, BUG() is called from __invalid_creds() because cred->magic == 0. Failing that, BUG() is called from selinux_cred_free() because selinux_cred_free() is not expecting cred->security == NULL. This does not affect smack_cred_free(), tomoyo_cred_free() or apparmor_cred_free(). Fix these bugs by (1) Set new->magic before calling security_cred_alloc_blank(). (2) Handle null cred->security in creds_are_invalid() and selinux_cred_free(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-02time: Correct the *settime* parametersRichard Cochran
Both settimeofday() and clock_settime() promise with a 'const' attribute not to alter the arguments passed in. This patch adds the missing 'const' attribute into the various kernel functions implementing these calls. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Acked-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20110201134417.545698637@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-02-01security: remove unused security_sysctl hookLucian Adrian Grijincu
The only user for this hook was selinux. sysctl routes every call through /proc/sys/. Selinux and other security modules use the file system checks for sysctl too, so no need for this hook any more. Signed-off-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2011-02-01security/selinux: fix /proc/sys/ labelingLucian Adrian Grijincu
This fixes an old (2007) selinux regression: filesystem labeling for /proc/sys returned -r--r--r-- unknown /proc/sys/fs/file-nr instead of -r--r--r-- system_u:object_r:sysctl_fs_t:s0 /proc/sys/fs/file-nr Events that lead to breaking of /proc/sys/ selinux labeling: 1) sysctl was reimplemented to route all calls through /proc/sys/ commit 77b14db502cb85a031fe8fde6c85d52f3e0acb63 [PATCH] sysctl: reimplement the sysctl proc support 2) proc_dir_entry was removed from ctl_table: commit 3fbfa98112fc3962c416452a0baf2214381030e6 [PATCH] sysctl: remove the proc_dir_entry member for the sysctl tables 3) selinux still walked the proc_dir_entry tree to apply labeling. Because ctl_tables don't have a proc_dir_entry, we did not label /proc/sys/ inodes any more. To achieve this the /proc/sys/ inodes were marked private and private inodes were ignored by selinux. commit bbaca6c2e7ef0f663bc31be4dad7cf530f6c4962 [PATCH] selinux: enhance selinux to always ignore private inodes commit 86a71dbd3e81e8870d0f0e56b87875f57e58222b [PATCH] sysctl: hide the sysctl proc inodes from selinux Access control checks have been done by means of a special sysctl hook that was called for read/write accesses to any /proc/sys/ entry. We don't have to do this because, instead of walking the proc_dir_entry tree we can walk the dentry tree (as done in this patch). With this patch: * we don't mark /proc/sys/ inodes as private * we don't need the sysclt security hook * we walk the dentry tree to find the path to the inode. We have to strip the PID in /proc/PID/ entries that have a proc_dir_entry because selinux does not know how to label paths like '/1/net/rpc/nfsd.fh' (and defaults to 'proc_t' labeling). Selinux does know of '/net/rpc/nfsd.fh' (and applies the 'sysctl_rpc_t' label). PID stripping from the path was done implicitly in the previous code because the proc_dir_entry tree had the root in '/net' in the example from above. The dentry tree has the root in '/1'. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2011-02-01SELinux: Use dentry name in new object labelingEric Paris
Currently SELinux has rules which label new objects according to 3 criteria. The label of the process creating the object, the label of the parent directory, and the type of object (reg, dir, char, block, etc.) This patch adds a 4th criteria, the dentry name, thus we can distinguish between creating a file in an etc_t directory called shadow and one called motd. There is no file globbing, regex parsing, or anything mystical. Either the policy exactly (strcmp) matches the dentry name of the object or it doesn't. This patch has no changes from today if policy does not implement the new rules. Signed-off-by: Eric Paris <eparis@redhat.com>
2011-02-01fs/vfs/security: pass last path component to LSM on inode creationEric Paris
SELinux would like to implement a new labeling behavior of newly created inodes. We currently label new inodes based on the parent and the creating process. This new behavior would also take into account the name of the new object when deciding the new label. This is not the (supposed) full path, just the last component of the path. This is very useful because creating /etc/shadow is different than creating /etc/passwd but the kernel hooks are unable to differentiate these operations. We currently require that userspace realize it is doing some difficult operation like that and than userspace jumps through SELinux hoops to get things set up correctly. This patch does not implement new behavior, that is obviously contained in a seperate SELinux patch, but it does pass the needed name down to the correct LSM hook. If no such name exists it is fine to pass NULL. Signed-off-by: Eric Paris <eparis@redhat.com>
2011-01-26KEYS: Fix __key_link_end() quota fixup on errorDavid Howells
Fix __key_link_end()'s attempt to fix up the quota if an error occurs. There are two erroneous cases: Firstly, we always decrease the quota if the preallocated replacement keyring needs cleaning up, irrespective of whether or not we should (we may have replaced a pointer rather than adding another pointer). Secondly, we never clean up the quota if we added a pointer without the keyring storage being extended (we allocate multiple pointers at a time, even if we're not going to use them all immediately). We handle this by setting the bottom bit of the preallocation pointer in __key_link_begin() to indicate that the quota needs fixing up, which is then passed to __key_link() (which clears the whole thing) and __key_link_end(). Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-24selinux: return -ENOMEM when memory allocation failsDavidlohr Bueso
Return -ENOMEM when memory allocation fails in cond_init_bool_indexes, correctly propagating error code to caller. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: James Morris <jmorris@namei.org>
2011-01-24trusted keys: Fix a memory leak in trusted_update().Jesper Juhl
One failure path in security/keys/trusted.c::trusted_update() does not free 'new_p' while the others do. This patch makes sure we also free it in the remaining path (if datablob_parse() returns different from Opt_update). Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: James Morris <jmorris@namei.org>
2011-01-24CacheFiles: Add calls to path-based security hooksDavid Howells
Add calls to path-based security hooks into CacheFiles as, unlike inode-based security, these aren't implicit in the vfs_mkdir() and similar calls. Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2011-01-24security:selinux: kill unused MAX_AVTAB_HASH_MASK and ebitmap_startbitShan Wei
Kill unused MAX_AVTAB_HASH_MASK and ebitmap_startbit. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: James Morris <jmorris@namei.org>
2011-01-24encrypted-keys: rename encrypted_defined files to encryptedMimi Zohar
Rename encrypted_defined.c and encrypted_defined.h files to encrypted.c and encrypted.h, respectively. Based on request from David Howells. Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2011-01-24trusted-keys: rename trusted_defined files to trustedMimi Zohar
Rename trusted_defined.c and trusted_defined.h files to trusted.c and trusted.h, respectively. Based on request from David Howells. Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2011-01-21KEYS: Fix up comments in key management codeDavid Howells
Fix up comments in the key management code. No functional changes. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-21KEYS: Do some style cleanup in the key management code.David Howells
Do a bit of a style clean up in the key management code. No functional changes. Done using: perl -p -i -e 's!^/[*]*/\n!!' security/keys/*.c perl -p -i -e 's!} /[*] end [a-z0-9_]*[(][)] [*]/\n!}\n!' security/keys/*.c sed -i -s -e ": next" -e N -e 's/^\n[}]$/}/' -e t -e P -e 's/^.*\n//' -e "b next" security/keys/*.c To remove /*****/ lines, remove comments on the closing brace of a function to name the function and remove blank lines before the closing brace of a function. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-19trusted-keys: avoid scattring va_end()Tetsuo Handa
We can avoid scattering va_end() within the va_start(); for (;;) { } va_end(); loop, assuming that crypto_shash_init()/crypto_shash_update() return 0 on success and negative value otherwise. Make TSS_authhmac()/TSS_checkhmac1()/TSS_checkhmac2() similar to TSS_rawhmac() by removing "va_end()/goto" from the loop. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Jesper Juhl <jj@chaosbits.net> Acked-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2011-01-19trusted-keys: check for NULL before using itTetsuo Handa
TSS_rawhmac() checks for data != NULL before using it. We should do the same thing for TSS_authhmac(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Jesper Juhl <jj@chaosbits.net> Acked-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2011-01-19trusted-keys: another free memory bugfixTetsuo Handa
TSS_rawhmac() forgot to call va_end()/kfree() when data == NULL and forgot to call va_end() when crypto_shash_update() < 0. Fix these bugs by escaping from the loop using "break" (rather than "return"/"goto") in order to make sure that va_end()/kfree() are always called. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Jesper Juhl <jj@chaosbits.net> Acked-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2011-01-17Subject: [PATCH] Smack: mmap controls for library containmentCasey Schaufler
In the embedded world there are often situations where libraries are updated from a variety of sources, for a variety of reasons, and with any number of security characteristics. These differences might include privilege required for a given library provided interface to function properly, as occurs from time to time in graphics libraries. There are also cases where it is important to limit use of libraries based on the provider of the library and the security aware application may make choices based on that criteria. These issues are addressed by providing an additional Smack label that may optionally be assigned to an object, the SMACK64MMAP attribute. An mmap operation is allowed if there is no such attribute. If there is a SMACK64MMAP attribute the mmap is permitted only if a subject with that label has all of the access permitted a subject with the current task label. Security aware applications may from time to time wish to reduce their "privilege" to avoid accidental use of privilege. One case where this arises is the environment in which multiple sources provide libraries to perform the same functions. An application may know that it should eschew services made available from a particular vendor, or of a particular version. In support of this a secondary list of Smack rules has been added that is local to the task. This list is consulted only in the case where the global list has approved access. It can only further restrict access. Unlike the global last, if no entry is found on the local list access is granted. An application can add entries to its own list by writing to /smack/load-self. The changes appear large as they involve refactoring the list handling to accomodate there being more than one rule list. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
2011-01-14trusted-keys: free memory bugfixMimi Zohar
Add missing kfree(td) in tpm_seal() before the return, freeing td on error paths as well. Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: David Safford <safford@watson.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: James Morris <jmorris@namei.org>
2011-01-13Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) Documentation/trace/events.txt: Remove obsolete sched_signal_send. writeback: fix global_dirty_limits comment runtime -> real-time ppc: fix comment typo singal -> signal drivers: fix comment typo diable -> disable. m68k: fix comment typo diable -> disable. wireless: comment typo fix diable -> disable. media: comment typo fix diable -> disable. remove doc for obsolete dynamic-printk kernel-parameter remove extraneous 'is' from Documentation/iostats.txt Fix spelling milisec -> ms in snd_ps3 module parameter description Fix spelling mistakes in comments Revert conflicting V4L changes i7core_edac: fix typos in comments mm/rmap.c: fix comment sound, ca0106: Fix assignment to 'channel'. hrtimer: fix a typo in comment init/Kconfig: fix typo anon_inodes: fix wrong function name in comment fix comment typos concerning "consistent" poll: fix a typo in comment ... Fix up trivial conflicts in: - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c) - fs/ext4/ext4.h Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.
2011-01-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (30 commits) MAINTAINERS: Add tomoyo-dev-en ML. SELinux: define permissions for DCB netlink messages encrypted-keys: style and other cleanup encrypted-keys: verify datablob size before converting to binary trusted-keys: kzalloc and other cleanup trusted-keys: additional TSS return code and other error handling syslog: check cap_syslog when dmesg_restrict Smack: Transmute labels on specified directories selinux: cache sidtab_context_to_sid results SELinux: do not compute transition labels on mountpoint labeled filesystems This patch adds a new security attribute to Smack called SMACK64EXEC. It defines label that is used while task is running. SELinux: merge policydb_index_classes and policydb_index_others selinux: convert part of the sym_val_to_name array to use flex_array selinux: convert type_val_to_struct to flex_array flex_array: fix flex_array_put_ptr macro to be valid C SELinux: do not set automatic i_ino in selinuxfs selinux: rework security_netlbl_secattr_to_sid SELinux: standardize return code handling in selinuxfs.c SELinux: standardize return code handling in selinuxfs.c SELinux: standardize return code handling in policydb.c ...
2011-01-10headers: kobject.h reduxAlexey Dobriyan
Remove kobject.h from files which don't need it, notably, sched.h and fs.h. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-10headers: path.h reduxAlexey Dobriyan
Remove path.h from sched.h and other files. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-10Merge branch 'master' of git://git.infradead.org/users/eparis/selinux into nextJames Morris
2011-01-10Merge branch 'master' into nextJames Morris
Conflicts: security/smack/smack_lsm.c Verified and added fix by Stephen Rothwell <sfr@canb.auug.org.au> Ok'd by Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: James Morris <jmorris@namei.org>
2011-01-07Merge branch 'vfs-scale-working' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin * 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin: (57 commits) fs: scale mntget/mntput fs: rename vfsmount counter helpers fs: implement faster dentry memcmp fs: prefetch inode data in dcache lookup fs: improve scalability of pseudo filesystems fs: dcache per-inode inode alias locking fs: dcache per-bucket dcache hash locking bit_spinlock: add required includes kernel: add bl_list xfs: provide simple rcu-walk ACL implementation btrfs: provide simple rcu-walk ACL implementation ext2,3,4: provide simple rcu-walk ACL implementation fs: provide simple rcu-walk generic_check_acl implementation fs: provide rcu-walk aware permission i_ops fs: rcu-walk aware d_revalidate method fs: cache optimise dentry and inode for rcu-walk fs: dcache reduce branches in lookup path fs: dcache remove d_mounted fs: fs_struct use seqlock fs: rcu-walk for path lookup ...
2011-01-07fs: rcu-walk for path lookupNick Piggin
Perform common cases of path lookups without any stores or locking in the ancestor dentry elements. This is called rcu-walk, as opposed to the current algorithm which is a refcount based walk, or ref-walk. This results in far fewer atomic operations on every path element, significantly improving path lookup performance. It also avoids cacheline bouncing on common dentries, significantly improving scalability. The overall design is like this: * LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk. * Take the RCU lock for the entire path walk, starting with the acquiring of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are not required for dentry persistence. * synchronize_rcu is called when unregistering a filesystem, so we can access d_ops and i_ops during rcu-walk. * Similarly take the vfsmount lock for the entire path walk. So now mnt refcounts are not required for persistence. Also we are free to perform mount lookups, and to assume dentry mount points and mount roots are stable up and down the path. * Have a per-dentry seqlock to protect the dentry name, parent, and inode, so we can load this tuple atomically, and also check whether any of its members have changed. * Dentry lookups (based on parent, candidate string tuple) recheck the parent sequence after the child is found in case anything changed in the parent during the path walk. * inode is also RCU protected so we can load d_inode and use the inode for limited things. * i_mode, i_uid, i_gid can be tested for exec permissions during path walk. * i_op can be loaded. When we reach the destination dentry, we lock it, recheck lookup sequence, and increment its refcount and mountpoint refcount. RCU and vfsmount locks are dropped. This is termed "dropping rcu-walk". If the dentry refcount does not match, we can not drop rcu-walk gracefully at the current point in the lokup, so instead return -ECHILD (for want of a better errno). This signals the path walking code to re-do the entire lookup with a ref-walk. Aside from the final dentry, there are other situations that may be encounted where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take a reference on the last good dentry) and continue with a ref-walk. Again, if we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup using ref-walk. But it is very important that we can continue with ref-walk for most cases, particularly to avoid the overhead of double lookups, and to gain the scalability advantages on common path elements (like cwd and root). The cases where rcu-walk cannot continue are: * NULL dentry (ie. any uncached path element) * parent with d_inode->i_op->permission or ACLs * dentries with d_revalidate * Following links In future patches, permission checks and d_revalidate become rcu-walk aware. It may be possible eventually to make following links rcu-walk aware. Uncached path elements will always require dropping to ref-walk mode, at the very least because i_mutex needs to be grabbed, and objects allocated. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache rationalise dget variantsNick Piggin
dget_locked was a shortcut to avoid the lazy lru manipulation when we already held dcache_lock (lru manipulation was relatively cheap at that point). However, how that the lru lock is an innermost one, we never hold it at any caller, so the lock cost can now be avoided. We already have well working lazy dcache LRU, so it should be fine to defer LRU manipulations to scan time. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache remove dcache_lockNick Piggin
dcache_lock no longer protects anything. remove it. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache scale subdirsNick Piggin
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't using dcache_lock for these anyway (eg. using i_mutex). Note: if we change the locking rule in future so that ->d_child protection is provided only with ->d_parent->d_lock, it may allow us to reduce some locking. But it would be an exception to an otherwise regular locking scheme, so we'd have to see some good results. Probably not worthwhile. Signed-off-by: Nick Piggin <npiggin@kernel.dk>