From c6e9d6f38894798696f23c8084ca7edbf16ee895 Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Thu, 17 Jul 2014 04:13:05 -0400 Subject: random: introduce getrandom(2) system call The getrandom(2) system call was requested by the LibreSSL Portable developers. It is analoguous to the getentropy(2) system call in OpenBSD. The rationale of this system call is to provide resiliance against file descriptor exhaustion attacks, where the attacker consumes all available file descriptors, forcing the use of the fallback code where /dev/[u]random is not available. Since the fallback code is often not well-tested, it is better to eliminate this potential failure mode entirely. The other feature provided by this new system call is the ability to request randomness from the /dev/urandom entropy pool, but to block until at least 128 bits of entropy has been accumulated in the /dev/urandom entropy pool. Historically, the emphasis in the /dev/urandom development has been to ensure that urandom pool is initialized as quickly as possible after system boot, and preferably before the init scripts start execution. This is because changing /dev/urandom reads to block represents an interface change that could potentially break userspace which is not acceptable. In practice, on most x86 desktop and server systems, in general the entropy pool can be initialized before it is needed (and in modern kernels, we will printk a warning message if not). However, on an embedded system, this may not be the case. And so with this new interface, we can provide the functionality of blocking until the urandom pool has been initialized. Any userspace program which uses this new functionality must take care to assure that if it is used during the boot process, that it will not cause the init scripts or other portions of the system startup to hang indefinitely. SYNOPSIS #include int getrandom(void *buf, size_t buflen, unsigned int flags); DESCRIPTION The system call getrandom() fills the buffer pointed to by buf with up to buflen random bytes which can be used to seed user space random number generators (i.e., DRBG's) or for other cryptographic uses. It should not be used for Monte Carlo simulations or other programs/algorithms which are doing probabilistic sampling. If the GRND_RANDOM flags bit is set, then draw from the /dev/random pool instead of the /dev/urandom pool. The /dev/random pool is limited based on the entropy that can be obtained from environmental noise, so if there is insufficient entropy, the requested number of bytes may not be returned. If there is no entropy available at all, getrandom(2) will either block, or return an error with errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags. If the GRND_RANDOM bit is not set, then the /dev/urandom pool will be used. Unlike using read(2) to fetch data from /dev/urandom, if the urandom pool has not been sufficiently initialized, getrandom(2) will block (or return -1 with the errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags). The getentropy(2) system call in OpenBSD can be emulated using the following function: int getentropy(void *buf, size_t buflen) { int ret; if (buflen > 256) goto failure; ret = getrandom(buf, buflen, 0); if (ret < 0) return ret; if (ret == buflen) return 0; failure: errno = EIO; return -1; } RETURN VALUE On success, the number of bytes that was filled in the buf is returned. This may not be all the bytes requested by the caller via buflen if insufficient entropy was present in the /dev/random pool, or if the system call was interrupted by a signal. On error, -1 is returned, and errno is set appropriately. ERRORS EINVAL An invalid flag was passed to getrandom(2) EFAULT buf is outside the accessible address space. EAGAIN The requested entropy was not available, and getentropy(2) would have blocked if the GRND_NONBLOCK flag was not set. EINTR While blocked waiting for entropy, the call was interrupted by a signal handler; see the description of how interrupted read(2) calls on "slow" devices are handled with and without the SA_RESTART flag in the signal(7) man page. NOTES For small requests (buflen <= 256) getrandom(2) will not return EINTR when reading from the urandom pool once the entropy pool has been initialized, and it will return all of the bytes that have been requested. This is the recommended way to use getrandom(2), and is designed for compatibility with OpenBSD's getentropy() system call. However, if you are using GRND_RANDOM, then getrandom(2) may block until the entropy accounting determines that sufficient environmental noise has been gathered such that getrandom(2) will be operating as a NRBG instead of a DRBG for those people who are working in the NIST SP 800-90 regime. Since it may block for a long time, these guarantees do *not* apply. The user may want to interrupt a hanging process using a signal, so blocking until all of the requested bytes are returned would be unfriendly. For this reason, the user of getrandom(2) MUST always check the return value, in case it returns some error, or if fewer bytes than requested was returned. In the case of !GRND_RANDOM and small request, the latter should never happen, but the careful userspace code (and all crypto code should be careful) should check for this anyway! Finally, unless you are doing long-term key generation (and perhaps not even then), you probably shouldn't be using GRND_RANDOM. The cryptographic algorithms used for /dev/urandom are quite conservative, and so should be sufficient for all purposes. The disadvantage of GRND_RANDOM is that it can block, and the increased complexity required to deal with partially fulfilled getrandom(2) requests. Signed-off-by: Theodore Ts'o Reviewed-by: Zach Brown --- include/linux/syscalls.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include/linux/syscalls.h') diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index b0881a0ed322..43324a897cf2 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -866,4 +866,7 @@ asmlinkage long sys_process_vm_writev(pid_t pid, asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1, unsigned long idx2); asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags); +asmlinkage long sys_getrandom(char __user *buf, size_t count, + unsigned int flags); + #endif -- cgit v1.2.3 From 9183df25fe7b194563db3fec6dc3202a5855839c Mon Sep 17 00:00:00 2001 From: David Herrmann Date: Fri, 8 Aug 2014 14:25:29 -0700 Subject: shm: add memfd_create() syscall memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor that you can pass to mmap(). It can support sealing and avoids any connection to user-visible mount-points. Thus, it's not subject to quotas on mounted file-systems, but can be used like malloc()'ed memory, but with a file-descriptor to it. memfd_create() returns the raw shmem file, so calls like ftruncate() can be used to modify the underlying inode. Also calls like fstat() will return proper information and mark the file as regular file. If you want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not supported (like on all other regular files). Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not subject to a filesystem size limit. It is still properly accounted to memcg limits, though, and to the same overcommit or no-overcommit accounting as all user memory. Signed-off-by: David Herrmann Acked-by: Hugh Dickins Cc: Michael Kerrisk Cc: Ryan Lortie Cc: Lennart Poettering Cc: Daniel Mack Cc: Andy Lutomirski Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/x86/syscalls/syscall_32.tbl | 1 + arch/x86/syscalls/syscall_64.tbl | 1 + include/linux/syscalls.h | 1 + include/uapi/linux/memfd.h | 8 +++++ kernel/sys_ni.c | 1 + mm/shmem.c | 73 ++++++++++++++++++++++++++++++++++++++++ 6 files changed, 85 insertions(+) create mode 100644 include/uapi/linux/memfd.h (limited to 'include/linux/syscalls.h') diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl index d1b4a119d4a5..028b78168d85 100644 --- a/arch/x86/syscalls/syscall_32.tbl +++ b/arch/x86/syscalls/syscall_32.tbl @@ -362,3 +362,4 @@ 353 i386 renameat2 sys_renameat2 354 i386 seccomp sys_seccomp 355 i386 getrandom sys_getrandom +356 i386 memfd_create sys_memfd_create diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl index 252c804bb1aa..ca2b9aa78c81 100644 --- a/arch/x86/syscalls/syscall_64.tbl +++ b/arch/x86/syscalls/syscall_64.tbl @@ -325,6 +325,7 @@ 316 common renameat2 sys_renameat2 317 common seccomp sys_seccomp 318 common getrandom sys_getrandom +319 common memfd_create sys_memfd_create # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 701daff5d899..15a069425cbf 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -802,6 +802,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags, asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr); asmlinkage long sys_eventfd(unsigned int count); asmlinkage long sys_eventfd2(unsigned int count, int flags); +asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags); asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len); asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int); asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *, diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h new file mode 100644 index 000000000000..534e364bda92 --- /dev/null +++ b/include/uapi/linux/memfd.h @@ -0,0 +1,8 @@ +#ifndef _UAPI_LINUX_MEMFD_H +#define _UAPI_LINUX_MEMFD_H + +/* flags for memfd_create(2) (unsigned int) */ +#define MFD_CLOEXEC 0x0001U +#define MFD_ALLOW_SEALING 0x0002U + +#endif /* _UAPI_LINUX_MEMFD_H */ diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 2904a2105914..1f79e3714533 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -197,6 +197,7 @@ cond_syscall(compat_sys_timerfd_settime); cond_syscall(compat_sys_timerfd_gettime); cond_syscall(sys_eventfd); cond_syscall(sys_eventfd2); +cond_syscall(sys_memfd_create); /* performance counters: */ cond_syscall(sys_perf_event_open); diff --git a/mm/shmem.c b/mm/shmem.c index 8b43bb7a4efe..4a5498795a2b 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -66,7 +66,9 @@ static struct vfsmount *shm_mnt; #include #include #include +#include #include +#include #include #include @@ -2732,6 +2734,77 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root) shmem_show_mpol(seq, sbinfo->mpol); return 0; } + +#define MFD_NAME_PREFIX "memfd:" +#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) +#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN) + +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING) + +SYSCALL_DEFINE2(memfd_create, + const char __user *, uname, + unsigned int, flags) +{ + struct shmem_inode_info *info; + struct file *file; + int fd, error; + char *name; + long len; + + if (flags & ~(unsigned int)MFD_ALL_FLAGS) + return -EINVAL; + + /* length includes terminating zero */ + len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1); + if (len <= 0) + return -EFAULT; + if (len > MFD_NAME_MAX_LEN + 1) + return -EINVAL; + + name = kmalloc(len + MFD_NAME_PREFIX_LEN, GFP_TEMPORARY); + if (!name) + return -ENOMEM; + + strcpy(name, MFD_NAME_PREFIX); + if (copy_from_user(&name[MFD_NAME_PREFIX_LEN], uname, len)) { + error = -EFAULT; + goto err_name; + } + + /* terminating-zero may have changed after strnlen_user() returned */ + if (name[len + MFD_NAME_PREFIX_LEN - 1]) { + error = -EFAULT; + goto err_name; + } + + fd = get_unused_fd_flags((flags & MFD_CLOEXEC) ? O_CLOEXEC : 0); + if (fd < 0) { + error = fd; + goto err_name; + } + + file = shmem_file_setup(name, 0, VM_NORESERVE); + if (IS_ERR(file)) { + error = PTR_ERR(file); + goto err_fd; + } + info = SHMEM_I(file_inode(file)); + file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; + file->f_flags |= O_RDWR | O_LARGEFILE; + if (flags & MFD_ALLOW_SEALING) + info->seals &= ~F_SEAL_SEAL; + + fd_install(fd, file); + kfree(name); + return fd; + +err_fd: + put_unused_fd(fd); +err_name: + kfree(name); + return error; +} + #endif /* CONFIG_TMPFS */ static void shmem_put_super(struct super_block *sb) -- cgit v1.2.3 From f0895685c7fd8c938c91a9d8a6f7c11f22df58d2 Mon Sep 17 00:00:00 2001 From: Vivek Goyal Date: Fri, 8 Aug 2014 14:25:55 -0700 Subject: kexec: new syscall kexec_file_load() declaration This is the new syscall kexec_file_load() declaration/interface. I have reserved the syscall number only for x86_64 so far. Other architectures (including i386) can reserve syscall number when they enable the support for this new syscall. Signed-off-by: Vivek Goyal Cc: Michael Kerrisk Cc: Borislav Petkov Cc: Yinghai Lu Cc: Eric Biederman Cc: H. Peter Anvin Cc: Matthew Garrett Cc: Greg Kroah-Hartman Cc: Dave Young Cc: WANG Chao Cc: Baoquan He Cc: Andy Lutomirski Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/x86/syscalls/syscall_64.tbl | 1 + include/linux/syscalls.h | 4 ++++ kernel/kexec.c | 7 +++++++ kernel/sys_ni.c | 1 + 4 files changed, 13 insertions(+) (limited to 'include/linux/syscalls.h') diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl index ca2b9aa78c81..35dd922727b9 100644 --- a/arch/x86/syscalls/syscall_64.tbl +++ b/arch/x86/syscalls/syscall_64.tbl @@ -326,6 +326,7 @@ 317 common seccomp sys_seccomp 318 common getrandom sys_getrandom 319 common memfd_create sys_memfd_create +320 common kexec_file_load sys_kexec_file_load # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 15a069425cbf..0f86d85a9ce4 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -317,6 +317,10 @@ asmlinkage long sys_restart_syscall(void); asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments, struct kexec_segment __user *segments, unsigned long flags); +asmlinkage long sys_kexec_file_load(int kernel_fd, int initrd_fd, + unsigned long cmdline_len, + const char __user *cmdline_ptr, + unsigned long flags); asmlinkage long sys_exit(int error_code); asmlinkage long sys_exit_group(int error_code); diff --git a/kernel/kexec.c b/kernel/kexec.c index bfdda316697d..ec4386c1b94f 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -1058,6 +1058,13 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, entry, } #endif +SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, + unsigned long, cmdline_len, const char __user *, cmdline_ptr, + unsigned long, flags) +{ + return -ENOSYS; +} + void crash_kexec(struct pt_regs *regs) { /* Take the kexec_mutex here to prevent sys_kexec_load diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 1f79e3714533..391d4ddb6f4b 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -25,6 +25,7 @@ cond_syscall(sys_swapon); cond_syscall(sys_swapoff); cond_syscall(sys_kexec_load); cond_syscall(compat_sys_kexec_load); +cond_syscall(sys_kexec_file_load); cond_syscall(sys_init_module); cond_syscall(sys_finit_module); cond_syscall(sys_delete_module); -- cgit v1.2.3