From b8eddff8886b173b0a0f21a3bb1a594cc6d974d1 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Mon, 14 Dec 2020 19:06:20 -0800 Subject: mm: memcontrol: add file_thp, shmem_thp to memory.stat As huge page usage in the page cache and for shmem files proliferates in our production environment, the performance monitoring team has asked for per-cgroup stats on those pages. We already track and export anon_thp per cgroup. We already track file THP and shmem THP per node, so making them per-cgroup is only a matter of switching from node to lruvec counters. All callsites are in places where the pages are charged and locked, so page->memcg is stable. [hannes@cmpxchg.org: add documentation] Link: https://lkml.kernel.org/r/20201026174029.GC548555@cmpxchg.org Link: https://lkml.kernel.org/r/20201022151844.489337-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Reviewed-by: Rik van Riel Reviewed-by: Shakeel Butt Acked-by: David Rientjes Acked-by: Michal Hocko Acked-by: Song Liu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/cgroup-v2.rst | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 608d7c279396..515bb13084a0 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1300,6 +1300,14 @@ PAGE_SIZE multiple when read back. Amount of memory used in anonymous mappings backed by transparent hugepages + file_thp + Amount of cached filesystem data backed by transparent + hugepages + + shmem_thp + Amount of shm, tmpfs, shared anonymous mmap()s backed by + transparent hugepages + inactive_anon, active_anon, inactive_file, active_file, unevictable Amount of memory, swap-backed and filesystem-backed, on the internal memory management lists used by the -- cgit v1.2.3 From 184218639a6f2a1cb84cf3ba583cee93a3ff4b81 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Mon, 14 Dec 2020 19:06:52 -0800 Subject: docs: cgroup-v1: reflect the deprecation of the non-hierarchical mode Update cgroup v1 docs after the deprecation of the non-hierarchical mode of the memory controller. Link: https://lkml.kernel.org/r/20201110220800.929549-3-guro@fb.com Signed-off-by: Roman Gushchin Reviewed-by: Shakeel Butt Acked-by: Michal Hocko Acked-by: David Rientjes Acked-by: Johannes Weiner Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/cgroup-v1/memcg_test.rst | 8 ++--- Documentation/admin-guide/cgroup-v1/memory.rst | 40 +++++++--------------- 2 files changed, 16 insertions(+), 32 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst index 3f7115e07b5d..4f83de2dab6e 100644 --- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst +++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst @@ -219,13 +219,11 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. This is an easy way to test page migration, too. -9.5 mkdir/rmdir ---------------- +9.5 nested cgroups +------------------ - When using hierarchy, mkdir/rmdir test should be done. - Use tests like the following:: + Use tests like the following for testing nested cgroups:: - echo 1 >/opt/cgroup/01/memory/use_hierarchy mkdir /opt/cgroup/01/child_a mkdir /opt/cgroup/01/child_b diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 12757e63b26c..a44cd467d218 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -77,6 +77,8 @@ Brief summary of control files. memory.soft_limit_in_bytes set/show soft limit of memory usage memory.stat show various statistics memory.use_hierarchy set/show hierarchical account enabled + This knob is deprecated and shouldn't be + used. memory.force_empty trigger forced page reclaim memory.pressure_level set memory pressure notifications memory.swappiness set/show swappiness parameter of vmscan @@ -495,16 +497,13 @@ cgroup might have some charge associated with it, even though all tasks have migrated away from it. (because we charge against pages, not against tasks.) -We move the stats to root (if use_hierarchy==0) or parent (if -use_hierarchy==1), and no change on the charge except uncharging +We move the stats to parent, and no change on the charge except uncharging from the child. Charges recorded in swap information is not updated at removal of cgroup. Recorded information is discarded and a cgroup which uses swap (swapcache) will be charged as a new owner of it. -About use_hierarchy, see Section 6. - 5. Misc. interfaces =================== @@ -527,8 +526,6 @@ About use_hierarchy, see Section 6. write will still return success. In this case, it is expected that memory.kmem.usage_in_bytes == memory.usage_in_bytes. - About use_hierarchy, see Section 6. - 5.2 stat file ------------- @@ -675,31 +672,20 @@ hierarchy:: d e In the diagram above, with hierarchical accounting enabled, all memory -usage of e, is accounted to its ancestors up until the root (i.e, c and root), -that has memory.use_hierarchy enabled. If one of the ancestors goes over its -limit, the reclaim algorithm reclaims from the tasks in the ancestor and the -children of the ancestor. - -6.1 Enabling hierarchical accounting and reclaim ------------------------------------------------- - -A memory cgroup by default disables the hierarchy feature. Support -can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup:: +usage of e, is accounted to its ancestors up until the root (i.e, c and root). +If one of the ancestors goes over its limit, the reclaim algorithm reclaims +from the tasks in the ancestor and the children of the ancestor. - # echo 1 > memory.use_hierarchy - -The feature can be disabled by:: +6.1 Hierarchical accounting and reclaim +--------------------------------------- - # echo 0 > memory.use_hierarchy +Hierarchical accounting is enabled by default. Disabling the hierarchical +accounting is deprecated. An attempt to do it will result in a failure +and a warning printed to dmesg. -NOTE1: - Enabling/disabling will fail if either the cgroup already has other - cgroups created below it, or if the parent cgroup has use_hierarchy - enabled. +For compatibility reasons writing 1 to memory.use_hierarchy will always pass:: -NOTE2: - When panic_on_oom is set to "2", the whole system will panic in - case of an OOM event in any cgroup. + # echo 1 > memory.use_hierarchy 7. Soft limits ============== -- cgit v1.2.3 From f0c0c115fb81940f4dba0644ac2a8a43b39c83f3 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Mon, 14 Dec 2020 19:07:17 -0800 Subject: mm: memcontrol: account pagetables per node For many workloads, pagetable consumption is significant and it makes sense to expose it in the memory.stat for the memory cgroups. However at the moment, the pagetables are accounted per-zone. Converting them to per-node and using the right interface will correctly account for the memory cgroups as well. [akpm@linux-foundation.org: export __mod_lruvec_page_state to modules for arch/mips/kvm/] Link: https://lkml.kernel.org/r/20201130212541.2781790-3-shakeelb@google.com Signed-off-by: Shakeel Butt Acked-by: Johannes Weiner Acked-by: Roman Gushchin Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/cgroup-v2.rst | 3 +++ arch/nds32/mm/mm-nds32.c | 6 +++--- drivers/base/node.c | 2 +- fs/proc/meminfo.c | 2 +- include/linux/mm.h | 8 ++++---- include/linux/mmzone.h | 2 +- mm/memcontrol.c | 2 ++ mm/page_alloc.c | 6 +++--- mm/vmstat.c | 2 +- 9 files changed, 19 insertions(+), 14 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 515bb13084a0..63521cd36ce5 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1274,6 +1274,9 @@ PAGE_SIZE multiple when read back. kernel_stack Amount of memory allocated to kernel stacks. + pagetables + Amount of memory allocated for page tables. + percpu(npn) Amount of memory used for storing per-cpu kernel data structures. diff --git a/arch/nds32/mm/mm-nds32.c b/arch/nds32/mm/mm-nds32.c index 55bec50ccc03..f2778f2b39f6 100644 --- a/arch/nds32/mm/mm-nds32.c +++ b/arch/nds32/mm/mm-nds32.c @@ -34,8 +34,8 @@ pgd_t *pgd_alloc(struct mm_struct *mm) cpu_dcache_wb_range((unsigned long)new_pgd, (unsigned long)new_pgd + PTRS_PER_PGD * sizeof(pgd_t)); - inc_zone_page_state(virt_to_page((unsigned long *)new_pgd), - NR_PAGETABLE); + inc_lruvec_page_state(virt_to_page((unsigned long *)new_pgd), + NR_PAGETABLE); return new_pgd; } @@ -59,7 +59,7 @@ void pgd_free(struct mm_struct *mm, pgd_t * pgd) pte = pmd_page(*pmd); pmd_clear(pmd); - dec_zone_page_state(virt_to_page((unsigned long *)pgd), NR_PAGETABLE); + dec_lruvec_page_state(virt_to_page((unsigned long *)pgd), NR_PAGETABLE); pte_free(mm, pte); mm_dec_nr_ptes(mm); pmd_free(mm, pmd); diff --git a/drivers/base/node.c b/drivers/base/node.c index 6ffa470e2984..04f71c7bc3f8 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -450,7 +450,7 @@ static ssize_t node_read_meminfo(struct device *dev, #ifdef CONFIG_SHADOW_CALL_STACK nid, node_page_state(pgdat, NR_KERNEL_SCS_KB), #endif - nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)), + nid, K(node_page_state(pgdat, NR_PAGETABLE)), nid, 0UL, nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)), nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 887a5532e449..d6fc74619625 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -107,7 +107,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) global_node_page_state(NR_KERNEL_SCS_KB)); #endif show_val_kb(m, "PageTables: ", - global_zone_page_state(NR_PAGETABLE)); + global_node_page_state(NR_PAGETABLE)); show_val_kb(m, "NFS_Unstable: ", 0); show_val_kb(m, "Bounce: ", diff --git a/include/linux/mm.h b/include/linux/mm.h index db6ae4d3fb4e..5bbbf4aeee94 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2203,7 +2203,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page) if (!ptlock_init(page)) return false; __SetPageTable(page); - inc_zone_page_state(page, NR_PAGETABLE); + inc_lruvec_page_state(page, NR_PAGETABLE); return true; } @@ -2211,7 +2211,7 @@ static inline void pgtable_pte_page_dtor(struct page *page) { ptlock_free(page); __ClearPageTable(page); - dec_zone_page_state(page, NR_PAGETABLE); + dec_lruvec_page_state(page, NR_PAGETABLE); } #define pte_offset_map_lock(mm, pmd, address, ptlp) \ @@ -2298,7 +2298,7 @@ static inline bool pgtable_pmd_page_ctor(struct page *page) if (!pmd_ptlock_init(page)) return false; __SetPageTable(page); - inc_zone_page_state(page, NR_PAGETABLE); + inc_lruvec_page_state(page, NR_PAGETABLE); return true; } @@ -2306,7 +2306,7 @@ static inline void pgtable_pmd_page_dtor(struct page *page) { pmd_ptlock_free(page); __ClearPageTable(page); - dec_zone_page_state(page, NR_PAGETABLE); + dec_lruvec_page_state(page, NR_PAGETABLE); } /* diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fb3bf696c05e..cca2a4443c9c 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -152,7 +152,6 @@ enum zone_stat_item { NR_ZONE_UNEVICTABLE, NR_ZONE_WRITE_PENDING, /* Count of dirty, writeback and unstable pages */ NR_MLOCK, /* mlock()ed pages found and moved off LRU */ - NR_PAGETABLE, /* used for pagetables */ /* Second 128 byte cacheline */ NR_BOUNCE, #if IS_ENABLED(CONFIG_ZSMALLOC) @@ -207,6 +206,7 @@ enum node_stat_item { #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK) NR_KERNEL_SCS_KB, /* measured in KiB */ #endif + NR_PAGETABLE, /* used for pagetables */ NR_VM_NODE_STAT_ITEMS }; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 52837d68bbec..b9419a3605eb 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -869,6 +869,7 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx, lruvec = mem_cgroup_lruvec(head->mem_cgroup, pgdat); __mod_lruvec_state(lruvec, idx, val); } +EXPORT_SYMBOL(__mod_lruvec_page_state); void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val) { @@ -1493,6 +1494,7 @@ static struct memory_stat memory_stats[] = { { "anon", PAGE_SIZE, NR_ANON_MAPPED }, { "file", PAGE_SIZE, NR_FILE_PAGES }, { "kernel_stack", 1024, NR_KERNEL_STACK_KB }, + { "pagetables", PAGE_SIZE, NR_PAGETABLE }, { "percpu", 1, MEMCG_PERCPU_B }, { "sock", PAGE_SIZE, MEMCG_SOCK }, { "shmem", PAGE_SIZE, NR_SHMEM }, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index eaa227a479e4..743fb2bccecc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5465,7 +5465,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B), global_node_page_state(NR_FILE_MAPPED), global_node_page_state(NR_SHMEM), - global_zone_page_state(NR_PAGETABLE), + global_node_page_state(NR_PAGETABLE), global_zone_page_state(NR_BOUNCE), global_zone_page_state(NR_FREE_PAGES), free_pcp, @@ -5497,6 +5497,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) #ifdef CONFIG_SHADOW_CALL_STACK " shadow_call_stack:%lukB" #endif + " pagetables:%lukB" " all_unreclaimable? %s" "\n", pgdat->node_id, @@ -5522,6 +5523,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) #ifdef CONFIG_SHADOW_CALL_STACK node_page_state(pgdat, NR_KERNEL_SCS_KB), #endif + K(node_page_state(pgdat, NR_PAGETABLE)), pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ? "yes" : "no"); } @@ -5553,7 +5555,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) " present:%lukB" " managed:%lukB" " mlocked:%lukB" - " pagetables:%lukB" " bounce:%lukB" " free_pcp:%lukB" " local_pcp:%ukB" @@ -5574,7 +5575,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) K(zone->present_pages), K(zone_managed_pages(zone)), K(zone_page_state(zone, NR_MLOCK)), - K(zone_page_state(zone, NR_PAGETABLE)), K(zone_page_state(zone, NR_BOUNCE)), K(free_pcp), K(this_cpu_read(zone->pageset->pcp.count)), diff --git a/mm/vmstat.c b/mm/vmstat.c index 698bc0bc18d1..da36e3b0aab2 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1157,7 +1157,6 @@ const char * const vmstat_text[] = { "nr_zone_unevictable", "nr_zone_write_pending", "nr_mlock", - "nr_page_table_pages", "nr_bounce", #if IS_ENABLED(CONFIG_ZSMALLOC) "nr_zspages", @@ -1215,6 +1214,7 @@ const char * const vmstat_text[] = { #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK) "nr_shadow_call_stack", #endif + "nr_page_table_pages", /* enum writeback_stat_item counters */ "nr_dirty_threshold", -- cgit v1.2.3 From 56db19fef3f1c28a2fac37079eb276aaffec2e3d Mon Sep 17 00:00:00 2001 From: Alex Shi Date: Mon, 14 Dec 2020 19:09:02 -0800 Subject: docs/vm: remove unused 3 items explanation for /proc/vmstat Commit 5647bc293ab1 ("mm: compaction: Move migration fail/success stats to migrate.c"), removed 3 items in /proc/vmstat. but the docs still has their explanation. let's remove them. "compact_blocks_moved", "compact_pages_moved", "compact_pagemigrate_failed", Link: https://lkml.kernel.org/r/1605520282-51993-1-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi Reviewed-by: Zi Yan Cc: Jonathan Corbet Cc: Yang Shi Cc: "Kirill A. Shutemov" Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/mm/transhuge.rst | 15 --------------- 1 file changed, 15 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index b2acd0d395ca..3b8a336511a4 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -401,21 +401,6 @@ compact_fail is incremented if the system tries to compact memory but failed. -compact_pages_moved - is incremented each time a page is moved. If - this value is increasing rapidly, it implies that the system - is copying a lot of data to satisfy the huge page allocation. - It is possible that the cost of copying exceeds any savings - from reduced TLB misses. - -compact_pagemigrate_failed - is incremented when the underlying mechanism - for moving a page failed. - -compact_blocks_moved - is incremented each time memory compaction examines - a huge page aligned range of pages. - It is possible to establish how long the stalls were using the function tracer to record how long was spent in __alloc_pages_nodemask and using the mm_page_alloc tracepoint to identify which allocations were -- cgit v1.2.3 From d0d4730ac2e404a5b0da9a87ef38c73e51cb1664 Mon Sep 17 00:00:00 2001 From: Lokesh Gidra Date: Mon, 14 Dec 2020 19:13:54 -0800 Subject: userfaultfd: add user-mode only option to unprivileged_userfaultfd sysctl knob With this change, when the knob is set to 0, it allows unprivileged users to call userfaultfd, like when it is set to 1, but with the restriction that page faults from only user-mode can be handled. In this mode, an unprivileged user (without SYS_CAP_PTRACE capability) must pass UFFD_USER_MODE_ONLY to userfaultd or the API will fail with EPERM. This enables administrators to reduce the likelihood that an attacker with access to userfaultfd can delay faulting kernel code to widen timing windows for other exploits. The default value of this knob is changed to 0. This is required for correct functioning of pipe mutex. However, this will fail postcopy live migration, which will be unnoticeable to the VM guests. To avoid this, set 'vm.userfault = 1' in /sys/sysctl.conf. The main reason this change is desirable as in the short term is that the Android userland will behave as with the sysctl set to zero. So without this commit, any Linux binary using userfaultfd to manage its memory would behave differently if run within the Android userland. For more details, refer to Andrea's reply [1]. [1] https://lore.kernel.org/lkml/20200904033438.GI9411@redhat.com/ Link: https://lkml.kernel.org/r/20201120030411.2690816-3-lokeshgidra@google.com Signed-off-by: Lokesh Gidra Reviewed-by: Andrea Arcangeli Cc: Kees Cook Cc: Jonathan Corbet Cc: Peter Xu Cc: Sebastian Andrzej Siewior Cc: Alexander Viro Cc: Stephen Smalley Cc: Eric Biggers Cc: Daniel Colascione Cc: "Joel Fernandes (Google)" Cc: Kalesh Singh Cc: Suren Baghdasaryan Cc: Jeff Vander Stoep Cc: Cc: Mike Rapoport Cc: Shaohua Li Cc: Jerome Glisse Cc: Mauro Carvalho Chehab Cc: Johannes Weiner Cc: Mel Gorman Cc: Nitin Gupta Cc: Vlastimil Babka Cc: Iurii Zaikin Cc: Luis Chamberlain Cc: Daniel Colascione Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/sysctl/vm.rst | 15 ++++++++++----- fs/userfaultfd.c | 10 ++++++++-- 2 files changed, 18 insertions(+), 7 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index f455fa00c00f..d06a98b2a4e7 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -873,12 +873,17 @@ file-backed pages is less than the high watermark in a zone. unprivileged_userfaultfd ======================== -This flag controls whether unprivileged users can use the userfaultfd -system calls. Set this to 1 to allow unprivileged users to use the -userfaultfd system calls, or set this to 0 to restrict userfaultfd to only -privileged users (with SYS_CAP_PTRACE capability). +This flag controls the mode in which unprivileged users can use the +userfaultfd system calls. Set this to 0 to restrict unprivileged users +to handle page faults in user mode only. In this case, users without +SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd to +succeed. Prohibiting use of userfaultfd for handling faults from kernel +mode may make certain vulnerabilities more difficult to exploit. -The default value is 1. +Set this to 1 to allow unprivileged users to use the userfaultfd system +calls without any restrictions. + +The default value is 0. user_reserve_kbytes diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 605599fde015..894cc28142e7 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -28,7 +28,7 @@ #include #include -int sysctl_unprivileged_userfaultfd __read_mostly = 1; +int sysctl_unprivileged_userfaultfd __read_mostly; static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly; @@ -1966,8 +1966,14 @@ SYSCALL_DEFINE1(userfaultfd, int, flags) struct userfaultfd_ctx *ctx; int fd; - if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE)) + if (!sysctl_unprivileged_userfaultfd && + (flags & UFFD_USER_MODE_ONLY) == 0 && + !capable(CAP_SYS_PTRACE)) { + printk_once(KERN_WARNING "uffd: Set unprivileged_userfaultfd " + "sysctl knob to 1 if kernel faults must be handled " + "without obtaining CAP_SYS_PTRACE capability\n"); return -EPERM; + } BUG_ON(!current->mm); -- cgit v1.2.3 From 0d8359620d9be9823b6b9b3cf2dbe006cbfec594 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Mon, 14 Dec 2020 19:14:28 -0800 Subject: zram: support page writeback There is demand to writeback specific process pages to backing store instead of all idles pages in the system due to storage wear out concerns and to launching latency of apps which are most of the time idle but are critical for resume latency. This patch extends the writeback knob to support a specific page writeback. Link: https://lkml.kernel.org/r/20201020190506.3758660-1-minchan@kernel.org Signed-off-by: Minchan Kim Reviewed-by: Sergey Senozhatsky Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/blockdev/zram.rst | 5 +++++ drivers/block/zram/zram_drv.c | 21 +++++++++++++++++---- 2 files changed, 22 insertions(+), 4 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index a6fd1f9b5faf..8f3cfa42e443 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -334,6 +334,11 @@ Admin can request writeback of those idle pages at right timing via:: With the command, zram writeback idle pages from memory to the storage. +If admin want to write a specific page in zram device to backing device, +they could write a page index into the interface. + + echo "page_index=1251" > /sys/block/zramX/writeback + If there are lots of write IO with flash device, potentially, it has flash wearout problem so that admin needs to design write limitation to guarantee storage health for entire product life. diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 1b697208d661..9a1e6ee4dbef 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -620,15 +620,19 @@ static int read_from_bdev_async(struct zram *zram, struct bio_vec *bvec, return 1; } +#define PAGE_WB_SIG "page_index=" + +#define PAGE_WRITEBACK 0 #define HUGE_WRITEBACK 1 #define IDLE_WRITEBACK 2 + static ssize_t writeback_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t len) { struct zram *zram = dev_to_zram(dev); unsigned long nr_pages = zram->disksize >> PAGE_SHIFT; - unsigned long index; + unsigned long index = 0; struct bio bio; struct bio_vec bio_vec; struct page *page; @@ -640,8 +644,17 @@ static ssize_t writeback_store(struct device *dev, mode = IDLE_WRITEBACK; else if (sysfs_streq(buf, "huge")) mode = HUGE_WRITEBACK; - else - return -EINVAL; + else { + if (strncmp(buf, PAGE_WB_SIG, sizeof(PAGE_WB_SIG) - 1)) + return -EINVAL; + + ret = kstrtol(buf + sizeof(PAGE_WB_SIG) - 1, 10, &index); + if (ret || index >= nr_pages) + return -EINVAL; + + nr_pages = 1; + mode = PAGE_WRITEBACK; + } down_read(&zram->init_lock); if (!init_done(zram)) { @@ -660,7 +673,7 @@ static ssize_t writeback_store(struct device *dev, goto release_init_lock; } - for (index = 0; index < nr_pages; index++) { + while (nr_pages--) { struct bio_vec bvec; bvec.bv_page = page; -- cgit v1.2.3 From 194e28da1a0279ef6a106a5b621fd79c410432ef Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Mon, 14 Dec 2020 19:14:32 -0800 Subject: zram: add stat to gather incompressible pages since zram set up Currently, zram supports the stat via /sys/block/zram/mm_stat to represent how many of incompressible pages are stored at the moment but it couldn't show how many times incompressible pages were wrote down since zram set up. It's also good indication to see how zram is effective in the system. Link: https://lkml.kernel.org/r/20201130201907.1284910-1-minchan@kernel.org Signed-off-by: Minchan Kim Reviewed-by: Sergey Senozhatsky Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/blockdev/zram.rst | 1 + drivers/block/zram/zram_drv.c | 6 ++++-- drivers/block/zram/zram_drv.h | 1 + 3 files changed, 6 insertions(+), 2 deletions(-) (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 8f3cfa42e443..03f1105b21b7 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -266,6 +266,7 @@ line of text and contains the following stats separated by whitespace: No memory is allocated for such pages. pages_compacted the number of pages freed during compaction huge_pages the number of incompressible pages + huge_pages_since the number of incompressible pages since zram set up ================ ============================================================= File /sys/block/zram/bd_stat diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 9a1e6ee4dbef..9481fa9eedd0 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -1084,7 +1084,7 @@ static ssize_t mm_stat_show(struct device *dev, max_used = atomic_long_read(&zram->stats.max_used_pages); ret = scnprintf(buf, PAGE_SIZE, - "%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu\n", + "%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu %8llu\n", orig_size << PAGE_SHIFT, (u64)atomic64_read(&zram->stats.compr_data_size), mem_used << PAGE_SHIFT, @@ -1092,7 +1092,8 @@ static ssize_t mm_stat_show(struct device *dev, max_used << PAGE_SHIFT, (u64)atomic64_read(&zram->stats.same_pages), pool_stats.pages_compacted, - (u64)atomic64_read(&zram->stats.huge_pages)); + (u64)atomic64_read(&zram->stats.huge_pages), + (u64)atomic64_read(&zram->stats.huge_pages_since)); up_read(&zram->init_lock); return ret; @@ -1424,6 +1425,7 @@ out: if (comp_len == PAGE_SIZE) { zram_set_flag(zram, index, ZRAM_HUGE); atomic64_inc(&zram->stats.huge_pages); + atomic64_inc(&zram->stats.huge_pages_since); } if (flags) { diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h index f2fd46daa760..9cabcbb13fd9 100644 --- a/drivers/block/zram/zram_drv.h +++ b/drivers/block/zram/zram_drv.h @@ -78,6 +78,7 @@ struct zram_stats { atomic64_t notify_free; /* no. of swap slot free notifications */ atomic64_t same_pages; /* no. of same element filled pages */ atomic64_t huge_pages; /* no. of huge pages */ + atomic64_t huge_pages_since; /* no. of huge pages since zram set up */ atomic64_t pages_stored; /* no. of pages currently stored */ atomic_long_t max_used_pages; /* no. of maximum pages stored */ atomic64_t writestall; /* no. of write slow paths */ -- cgit v1.2.3