x86/mm/tlb: Avoid reading mm_tlb_gen when possible

On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly contended and reading it should (arguably) be avoided as much as possible. Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally, even when it is not necessary (e.g., the mm was already switched). This is wasteful. Moreover, one of the existing optimizations is to read mm's tlb_gen to see if there are additional in-flight TLB invalidations and flush the entire TLB in such a case. However, if the request's tlb_gen was already flushed, the benefit of checking the mm's tlb_gen is likely to be offset by the overhead of the check itself. Running will-it-scale with tlb_flush1_threads show a considerable benefit on 56-core Skylake (up to +24%): threads Baseline (v5.17+) +Patch 1 159960 160202 5 310808 308378 (-0.7%) 10 479110 490728 15 526771 562528 20 534495 587316 25 547462 628296 30 579616 666313 35 594134 701814 40 612288 732967 45 617517 749727 50 637476 735497 55 614363 778913 (+24%) Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/20220606180123.2485171-1-namit@vmware.com
author: Nadav Amit <namit@vmware.com> 2022-06-06 11:01:23 -0700
committer: Dave Hansen <dave.hansen@linux.intel.com> 2022-06-07 08:48:03 -0700
commit: aa44284960d550eb4d8614afdffebc68a432a9b4 (patch)
tree: a33218a6c83a849a64c50b01b3d42ee53db2276d /arch/x86/mm/tlb.c
parent: e19d11267f0e6c8aff2d15d2dfed12365b4c9184 (diff)
1 files changed, 17 insertions, 1 deletions
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d400b6d9d246..d9314cc8b81f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -734,10 +734,10 @@ static void flush_tlb_func(void *info)
 	const struct flush_tlb_info *f = info;
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
 	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
 	bool local = smp_processor_id() == f->initiating_cpu;
 	unsigned long nr_invalidate = 0;
+	u64 mm_tlb_gen;
 
 	/* This code cannot presently handle being reentered. */
 	VM_WARN_ON(!irqs_disabled());
@@ -771,6 +771,22 @@ static void flush_tlb_func(void *info)
 		return;
 	}
 
+	if (f->new_tlb_gen <= local_tlb_gen) {
+		/*
+		 * The TLB is already up to date in respect to f->new_tlb_gen.
+		 * While the core might be still behind mm_tlb_gen, checking
+		 * mm_tlb_gen unnecessarily would have negative caching effects
+		 * so avoid it.
+		 */
+		return;
+	}
+
+	/*
+	 * Defer mm_tlb_gen reading as long as possible to avoid cache
+	 * contention.
+	 */
+	mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
+
 	if (unlikely(local_tlb_gen == mm_tlb_gen)) {
 		/*
 		 * There's nothing to do: we're already up to date.  This can
author	Nadav Amit <namit@vmware.com>	2022-06-06 11:01:23 -0700
committer	Dave Hansen <dave.hansen@linux.intel.com>	2022-06-07 08:48:03 -0700
commit	aa44284960d550eb4d8614afdffebc68a432a9b4 (patch)
tree	a33218a6c83a849a64c50b01b3d42ee53db2276d /arch/x86/mm/tlb.c
parent	e19d11267f0e6c8aff2d15d2dfed12365b4c9184 (diff)