From 6b05dfacd761c6ace11def4b3b42fc6a7583fec3 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 21 Apr 2020 19:04:02 +0200 Subject: docs: RCU: Convert checklist.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Use the right list markups; - Add it to RCU/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Paul E. McKenney --- Documentation/RCU/checklist.rst | 465 ++++++++++++++++++++++++++++++++++++++++ Documentation/RCU/checklist.txt | 458 --------------------------------------- Documentation/RCU/index.rst | 3 + 3 files changed, 468 insertions(+), 458 deletions(-) create mode 100644 Documentation/RCU/checklist.rst delete mode 100644 Documentation/RCU/checklist.txt diff --git a/Documentation/RCU/checklist.rst b/Documentation/RCU/checklist.rst new file mode 100644 index 000000000000..2efed9926c3f --- /dev/null +++ b/Documentation/RCU/checklist.rst @@ -0,0 +1,465 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================ +Review Checklist for RCU Patches +================================ + + +This document contains a checklist for producing and reviewing patches +that make use of RCU. Violating any of the rules listed below will +result in the same sorts of problems that leaving out a locking primitive +would cause. This list is based on experiences reviewing such patches +over a rather long period of time, but improvements are always welcome! + +0. Is RCU being applied to a read-mostly situation? If the data + structure is updated more than about 10% of the time, then you + should strongly consider some other approach, unless detailed + performance measurements show that RCU is nonetheless the right + tool for the job. Yes, RCU does reduce read-side overhead by + increasing write-side overhead, which is exactly why normal uses + of RCU will do much more reading than updating. + + Another exception is where performance is not an issue, and RCU + provides a simpler implementation. An example of this situation + is the dynamic NMI code in the Linux 2.6 kernel, at least on + architectures where NMIs are rare. + + Yet another exception is where the low real-time latency of RCU's + read-side primitives is critically important. + + One final exception is where RCU readers are used to prevent + the ABA problem (https://en.wikipedia.org/wiki/ABA_problem) + for lockless updates. This does result in the mildly + counter-intuitive situation where rcu_read_lock() and + rcu_read_unlock() are used to protect updates, however, this + approach provides the same potential simplifications that garbage + collectors do. + +1. Does the update code have proper mutual exclusion? + + RCU does allow -readers- to run (almost) naked, but -writers- must + still use some sort of mutual exclusion, such as: + + a. locking, + b. atomic operations, or + c. restricting updates to a single task. + + If you choose #b, be prepared to describe how you have handled + memory barriers on weakly ordered machines (pretty much all of + them -- even x86 allows later loads to be reordered to precede + earlier stores), and be prepared to explain why this added + complexity is worthwhile. If you choose #c, be prepared to + explain how this single task does not become a major bottleneck on + big multiprocessor machines (for example, if the task is updating + information relating to itself that other tasks can read, there + by definition can be no bottleneck). Note that the definition + of "large" has changed significantly: Eight CPUs was "large" + in the year 2000, but a hundred CPUs was unremarkable in 2017. + +2. Do the RCU read-side critical sections make proper use of + rcu_read_lock() and friends? These primitives are needed + to prevent grace periods from ending prematurely, which + could result in data being unceremoniously freed out from + under your read-side code, which can greatly increase the + actuarial risk of your kernel. + + As a rough rule of thumb, any dereference of an RCU-protected + pointer must be covered by rcu_read_lock(), rcu_read_lock_bh(), + rcu_read_lock_sched(), or by the appropriate update-side lock. + Disabling of preemption can serve as rcu_read_lock_sched(), but + is less readable and prevents lockdep from detecting locking issues. + + Letting RCU-protected pointers "leak" out of an RCU read-side + critical section is every bid as bad as letting them leak out + from under a lock. Unless, of course, you have arranged some + other means of protection, such as a lock or a reference count + -before- letting them out of the RCU read-side critical section. + +3. Does the update code tolerate concurrent accesses? + + The whole point of RCU is to permit readers to run without + any locks or atomic operations. This means that readers will + be running while updates are in progress. There are a number + of ways to handle this concurrency, depending on the situation: + + a. Use the RCU variants of the list and hlist update + primitives to add, remove, and replace elements on + an RCU-protected list. Alternatively, use the other + RCU-protected data structures that have been added to + the Linux kernel. + + This is almost always the best approach. + + b. Proceed as in (a) above, but also maintain per-element + locks (that are acquired by both readers and writers) + that guard per-element state. Of course, fields that + the readers refrain from accessing can be guarded by + some other lock acquired only by updaters, if desired. + + This works quite well, also. + + c. Make updates appear atomic to readers. For example, + pointer updates to properly aligned fields will + appear atomic, as will individual atomic primitives. + Sequences of operations performed under a lock will -not- + appear to be atomic to RCU readers, nor will sequences + of multiple atomic primitives. + + This can work, but is starting to get a bit tricky. + + d. Carefully order the updates and the reads so that + readers see valid data at all phases of the update. + This is often more difficult than it sounds, especially + given modern CPUs' tendency to reorder memory references. + One must usually liberally sprinkle memory barriers + (smp_wmb(), smp_rmb(), smp_mb()) through the code, + making it difficult to understand and to test. + + It is usually better to group the changing data into + a separate structure, so that the change may be made + to appear atomic by updating a pointer to reference + a new structure containing updated values. + +4. Weakly ordered CPUs pose special challenges. Almost all CPUs + are weakly ordered -- even x86 CPUs allow later loads to be + reordered to precede earlier stores. RCU code must take all of + the following measures to prevent memory-corruption problems: + + a. Readers must maintain proper ordering of their memory + accesses. The rcu_dereference() primitive ensures that + the CPU picks up the pointer before it picks up the data + that the pointer points to. This really is necessary + on Alpha CPUs. If you don't believe me, see: + + http://www.openvms.compaq.com/wizard/wiz_2637.html + + The rcu_dereference() primitive is also an excellent + documentation aid, letting the person reading the + code know exactly which pointers are protected by RCU. + Please note that compilers can also reorder code, and + they are becoming increasingly aggressive about doing + just that. The rcu_dereference() primitive therefore also + prevents destructive compiler optimizations. However, + with a bit of devious creativity, it is possible to + mishandle the return value from rcu_dereference(). + Please see rcu_dereference.txt in this directory for + more information. + + The rcu_dereference() primitive is used by the + various "_rcu()" list-traversal primitives, such + as the list_for_each_entry_rcu(). Note that it is + perfectly legal (if redundant) for update-side code to + use rcu_dereference() and the "_rcu()" list-traversal + primitives. This is particularly useful in code that + is common to readers and updaters. However, lockdep + will complain if you access rcu_dereference() outside + of an RCU read-side critical section. See lockdep.txt + to learn what to do about this. + + Of course, neither rcu_dereference() nor the "_rcu()" + list-traversal primitives can substitute for a good + concurrency design coordinating among multiple updaters. + + b. If the list macros are being used, the list_add_tail_rcu() + and list_add_rcu() primitives must be used in order + to prevent weakly ordered machines from misordering + structure initialization and pointer planting. + Similarly, if the hlist macros are being used, the + hlist_add_head_rcu() primitive is required. + + c. If the list macros are being used, the list_del_rcu() + primitive must be used to keep list_del()'s pointer + poisoning from inflicting toxic effects on concurrent + readers. Similarly, if the hlist macros are being used, + the hlist_del_rcu() primitive is required. + + The list_replace_rcu() and hlist_replace_rcu() primitives + may be used to replace an old structure with a new one + in their respective types of RCU-protected lists. + + d. Rules similar to (4b) and (4c) apply to the "hlist_nulls" + type of RCU-protected linked lists. + + e. Updates must ensure that initialization of a given + structure happens before pointers to that structure are + publicized. Use the rcu_assign_pointer() primitive + when publicizing a pointer to a structure that can + be traversed by an RCU read-side critical section. + +5. If call_rcu() or call_srcu() is used, the callback function will + be called from softirq context. In particular, it cannot block. + +6. Since synchronize_rcu() can block, it cannot be called + from any sort of irq context. The same rule applies + for synchronize_srcu(), synchronize_rcu_expedited(), and + synchronize_srcu_expedited(). + + The expedited forms of these primitives have the same semantics + as the non-expedited forms, but expediting is both expensive and + (with the exception of synchronize_srcu_expedited()) unfriendly + to real-time workloads. Use of the expedited primitives should + be restricted to rare configuration-change operations that would + not normally be undertaken while a real-time workload is running. + However, real-time workloads can use rcupdate.rcu_normal kernel + boot parameter to completely disable expedited grace periods, + though this might have performance implications. + + In particular, if you find yourself invoking one of the expedited + primitives repeatedly in a loop, please do everyone a favor: + Restructure your code so that it batches the updates, allowing + a single non-expedited primitive to cover the entire batch. + This will very likely be faster than the loop containing the + expedited primitive, and will be much much easier on the rest + of the system, especially to real-time workloads running on + the rest of the system. + +7. As of v4.20, a given kernel implements only one RCU flavor, + which is RCU-sched for PREEMPT=n and RCU-preempt for PREEMPT=y. + If the updater uses call_rcu() or synchronize_rcu(), + then the corresponding readers my use rcu_read_lock() and + rcu_read_unlock(), rcu_read_lock_bh() and rcu_read_unlock_bh(), + or any pair of primitives that disables and re-enables preemption, + for example, rcu_read_lock_sched() and rcu_read_unlock_sched(). + If the updater uses synchronize_srcu() or call_srcu(), + then the corresponding readers must use srcu_read_lock() and + srcu_read_unlock(), and with the same srcu_struct. The rules for + the expedited primitives are the same as for their non-expedited + counterparts. Mixing things up will result in confusion and + broken kernels, and has even resulted in an exploitable security + issue. + + One exception to this rule: rcu_read_lock() and rcu_read_unlock() + may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh() + in cases where local bottom halves are already known to be + disabled, for example, in irq or softirq context. Commenting + such cases is a must, of course! And the jury is still out on + whether the increased speed is worth it. + +8. Although synchronize_rcu() is slower than is call_rcu(), it + usually results in simpler code. So, unless update performance is + critically important, the updaters cannot block, or the latency of + synchronize_rcu() is visible from userspace, synchronize_rcu() + should be used in preference to call_rcu(). Furthermore, + kfree_rcu() usually results in even simpler code than does + synchronize_rcu() without synchronize_rcu()'s multi-millisecond + latency. So please take advantage of kfree_rcu()'s "fire and + forget" memory-freeing capabilities where it applies. + + An especially important property of the synchronize_rcu() + primitive is that it automatically self-limits: if grace periods + are delayed for whatever reason, then the synchronize_rcu() + primitive will correspondingly delay updates. In contrast, + code using call_rcu() should explicitly limit update rate in + cases where grace periods are delayed, as failing to do so can + result in excessive realtime latencies or even OOM conditions. + + Ways of gaining this self-limiting property when using call_rcu() + include: + + a. Keeping a count of the number of data-structure elements + used by the RCU-protected data structure, including + those waiting for a grace period to elapse. Enforce a + limit on this number, stalling updates as needed to allow + previously deferred frees to complete. Alternatively, + limit only the number awaiting deferred free rather than + the total number of elements. + + One way to stall the updates is to acquire the update-side + mutex. (Don't try this with a spinlock -- other CPUs + spinning on the lock could prevent the grace period + from ever ending.) Another way to stall the updates + is for the updates to use a wrapper function around + the memory allocator, so that this wrapper function + simulates OOM when there is too much memory awaiting an + RCU grace period. There are of course many other + variations on this theme. + + b. Limiting update rate. For example, if updates occur only + once per hour, then no explicit rate limiting is + required, unless your system is already badly broken. + Older versions of the dcache subsystem take this approach, + guarding updates with a global lock, limiting their rate. + + c. Trusted update -- if updates can only be done manually by + superuser or some other trusted user, then it might not + be necessary to automatically limit them. The theory + here is that superuser already has lots of ways to crash + the machine. + + d. Periodically invoke synchronize_rcu(), permitting a limited + number of updates per grace period. + + The same cautions apply to call_srcu() and kfree_rcu(). + + Note that although these primitives do take action to avoid memory + exhaustion when any given CPU has too many callbacks, a determined + user could still exhaust memory. This is especially the case + if a system with a large number of CPUs has been configured to + offload all of its RCU callbacks onto a single CPU, or if the + system has relatively little free memory. + +9. All RCU list-traversal primitives, which include + rcu_dereference(), list_for_each_entry_rcu(), and + list_for_each_safe_rcu(), must be either within an RCU read-side + critical section or must be protected by appropriate update-side + locks. RCU read-side critical sections are delimited by + rcu_read_lock() and rcu_read_unlock(), or by similar primitives + such as rcu_read_lock_bh() and rcu_read_unlock_bh(), in which + case the matching rcu_dereference() primitive must be used in + order to keep lockdep happy, in this case, rcu_dereference_bh(). + + The reason that it is permissible to use RCU list-traversal + primitives when the update-side lock is held is that doing so + can be quite helpful in reducing code bloat when common code is + shared between readers and updaters. Additional primitives + are provided for this case, as discussed in lockdep.txt. + +10. Conversely, if you are in an RCU read-side critical section, + and you don't hold the appropriate update-side lock, you -must- + use the "_rcu()" variants of the list macros. Failing to do so + will break Alpha, cause aggressive compilers to generate bad code, + and confuse people trying to read your code. + +11. Any lock acquired by an RCU callback must be acquired elsewhere + with softirq disabled, e.g., via spin_lock_irqsave(), + spin_lock_bh(), etc. Failing to disable softirq on a given + acquisition of that lock will result in deadlock as soon as + the RCU softirq handler happens to run your RCU callback while + interrupting that acquisition's critical section. + +12. RCU callbacks can be and are executed in parallel. In many cases, + the callback code simply wrappers around kfree(), so that this + is not an issue (or, more accurately, to the extent that it is + an issue, the memory-allocator locking handles it). However, + if the callbacks do manipulate a shared data structure, they + must use whatever locking or other synchronization is required + to safely access and/or modify that data structure. + + Do not assume that RCU callbacks will be executed on the same + CPU that executed the corresponding call_rcu() or call_srcu(). + For example, if a given CPU goes offline while having an RCU + callback pending, then that RCU callback will execute on some + surviving CPU. (If this was not the case, a self-spawning RCU + callback would prevent the victim CPU from ever going offline.) + Furthermore, CPUs designated by rcu_nocbs= might well -always- + have their RCU callbacks executed on some other CPUs, in fact, + for some real-time workloads, this is the whole point of using + the rcu_nocbs= kernel boot parameter. + +13. Unlike other forms of RCU, it -is- permissible to block in an + SRCU read-side critical section (demarked by srcu_read_lock() + and srcu_read_unlock()), hence the "SRCU": "sleepable RCU". + Please note that if you don't need to sleep in read-side critical + sections, you should be using RCU rather than SRCU, because RCU + is almost always faster and easier to use than is SRCU. + + Also unlike other forms of RCU, explicit initialization and + cleanup is required either at build time via DEFINE_SRCU() + or DEFINE_STATIC_SRCU() or at runtime via init_srcu_struct() + and cleanup_srcu_struct(). These last two are passed a + "struct srcu_struct" that defines the scope of a given + SRCU domain. Once initialized, the srcu_struct is passed + to srcu_read_lock(), srcu_read_unlock() synchronize_srcu(), + synchronize_srcu_expedited(), and call_srcu(). A given + synchronize_srcu() waits only for SRCU read-side critical + sections governed by srcu_read_lock() and srcu_read_unlock() + calls that have been passed the same srcu_struct. This property + is what makes sleeping read-side critical sections tolerable -- + a given subsystem delays only its own updates, not those of other + subsystems using SRCU. Therefore, SRCU is less prone to OOM the + system than RCU would be if RCU's read-side critical sections + were permitted to sleep. + + The ability to sleep in read-side critical sections does not + come for free. First, corresponding srcu_read_lock() and + srcu_read_unlock() calls must be passed the same srcu_struct. + Second, grace-period-detection overhead is amortized only + over those updates sharing a given srcu_struct, rather than + being globally amortized as they are for other forms of RCU. + Therefore, SRCU should be used in preference to rw_semaphore + only in extremely read-intensive situations, or in situations + requiring SRCU's read-side deadlock immunity or low read-side + realtime latency. You should also consider percpu_rw_semaphore + when you need lightweight readers. + + SRCU's expedited primitive (synchronize_srcu_expedited()) + never sends IPIs to other CPUs, so it is easier on + real-time workloads than is synchronize_rcu_expedited(). + + Note that rcu_assign_pointer() relates to SRCU just as it does to + other forms of RCU, but instead of rcu_dereference() you should + use srcu_dereference() in order to avoid lockdep splats. + +14. The whole point of call_rcu(), synchronize_rcu(), and friends + is to wait until all pre-existing readers have finished before + carrying out some otherwise-destructive operation. It is + therefore critically important to -first- remove any path + that readers can follow that could be affected by the + destructive operation, and -only- -then- invoke call_rcu(), + synchronize_rcu(), or friends. + + Because these primitives only wait for pre-existing readers, it + is the caller's responsibility to guarantee that any subsequent + readers will execute safely. + +15. The various RCU read-side primitives do -not- necessarily contain + memory barriers. You should therefore plan for the CPU + and the compiler to freely reorder code into and out of RCU + read-side critical sections. It is the responsibility of the + RCU update-side primitives to deal with this. + + For SRCU readers, you can use smp_mb__after_srcu_read_unlock() + immediately after an srcu_read_unlock() to get a full barrier. + +16. Use CONFIG_PROVE_LOCKING, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the + __rcu sparse checks to validate your RCU code. These can help + find problems as follows: + + CONFIG_PROVE_LOCKING: + check that accesses to RCU-protected data + structures are carried out under the proper RCU + read-side critical section, while holding the right + combination of locks, or whatever other conditions + are appropriate. + + CONFIG_DEBUG_OBJECTS_RCU_HEAD: + check that you don't pass the + same object to call_rcu() (or friends) before an RCU + grace period has elapsed since the last time that you + passed that same object to call_rcu() (or friends). + + __rcu sparse checks: + tag the pointer to the RCU-protected data + structure with __rcu, and sparse will warn you if you + access that pointer without the services of one of the + variants of rcu_dereference(). + + These debugging aids can help you find problems that are + otherwise extremely difficult to spot. + +17. If you register a callback using call_rcu() or call_srcu(), and + pass in a function defined within a loadable module, then it in + necessary to wait for all pending callbacks to be invoked after + the last invocation and before unloading that module. Note that + it is absolutely -not- sufficient to wait for a grace period! + The current (say) synchronize_rcu() implementation is -not- + guaranteed to wait for callbacks registered on other CPUs. + Or even on the current CPU if that CPU recently went offline + and came back online. + + You instead need to use one of the barrier functions: + + - call_rcu() -> rcu_barrier() + - call_srcu() -> srcu_barrier() + + However, these barrier functions are absolutely -not- guaranteed + to wait for a grace period. In fact, if there are no call_rcu() + callbacks waiting anywhere in the system, rcu_barrier() is within + its rights to return immediately. + + So if you need to wait for both an RCU grace period and for + all pre-existing call_rcu() callbacks, you will need to execute + both rcu_barrier() and synchronize_rcu(), if necessary, using + something like workqueues to to execute them concurrently. + + See rcubarrier.txt for more information. diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt deleted file mode 100644 index e98ff261a438..000000000000 --- a/Documentation/RCU/checklist.txt +++ /dev/null @@ -1,458 +0,0 @@ -Review Checklist for RCU Patches - - -This document contains a checklist for producing and reviewing patches -that make use of RCU. Violating any of the rules listed below will -result in the same sorts of problems that leaving out a locking primitive -would cause. This list is based on experiences reviewing such patches -over a rather long period of time, but improvements are always welcome! - -0. Is RCU being applied to a read-mostly situation? If the data - structure is updated more than about 10% of the time, then you - should strongly consider some other approach, unless detailed - performance measurements show that RCU is nonetheless the right - tool for the job. Yes, RCU does reduce read-side overhead by - increasing write-side overhead, which is exactly why normal uses - of RCU will do much more reading than updating. - - Another exception is where performance is not an issue, and RCU - provides a simpler implementation. An example of this situation - is the dynamic NMI code in the Linux 2.6 kernel, at least on - architectures where NMIs are rare. - - Yet another exception is where the low real-time latency of RCU's - read-side primitives is critically important. - - One final exception is where RCU readers are used to prevent - the ABA problem (https://en.wikipedia.org/wiki/ABA_problem) - for lockless updates. This does result in the mildly - counter-intuitive situation where rcu_read_lock() and - rcu_read_unlock() are used to protect updates, however, this - approach provides the same potential simplifications that garbage - collectors do. - -1. Does the update code have proper mutual exclusion? - - RCU does allow -readers- to run (almost) naked, but -writers- must - still use some sort of mutual exclusion, such as: - - a. locking, - b. atomic operations, or - c. restricting updates to a single task. - - If you choose #b, be prepared to describe how you have handled - memory barriers on weakly ordered machines (pretty much all of - them -- even x86 allows later loads to be reordered to precede - earlier stores), and be prepared to explain why this added - complexity is worthwhile. If you choose #c, be prepared to - explain how this single task does not become a major bottleneck on - big multiprocessor machines (for example, if the task is updating - information relating to itself that other tasks can read, there - by definition can be no bottleneck). Note that the definition - of "large" has changed significantly: Eight CPUs was "large" - in the year 2000, but a hundred CPUs was unremarkable in 2017. - -2. Do the RCU read-side critical sections make proper use of - rcu_read_lock() and friends? These primitives are needed - to prevent grace periods from ending prematurely, which - could result in data being unceremoniously freed out from - under your read-side code, which can greatly increase the - actuarial risk of your kernel. - - As a rough rule of thumb, any dereference of an RCU-protected - pointer must be covered by rcu_read_lock(), rcu_read_lock_bh(), - rcu_read_lock_sched(), or by the appropriate update-side lock. - Disabling of preemption can serve as rcu_read_lock_sched(), but - is less readable and prevents lockdep from detecting locking issues. - - Letting RCU-protected pointers "leak" out of an RCU read-side - critical section is every bid as bad as letting them leak out - from under a lock. Unless, of course, you have arranged some - other means of protection, such as a lock or a reference count - -before- letting them out of the RCU read-side critical section. - -3. Does the update code tolerate concurrent accesses? - - The whole point of RCU is to permit readers to run without - any locks or atomic operations. This means that readers will - be running while updates are in progress. There are a number - of ways to handle this concurrency, depending on the situation: - - a. Use the RCU variants of the list and hlist update - primitives to add, remove, and replace elements on - an RCU-protected list. Alternatively, use the other - RCU-protected data structures that have been added to - the Linux kernel. - - This is almost always the best approach. - - b. Proceed as in (a) above, but also maintain per-element - locks (that are acquired by both readers and writers) - that guard per-element state. Of course, fields that - the readers refrain from accessing can be guarded by - some other lock acquired only by updaters, if desired. - - This works quite well, also. - - c. Make updates appear atomic to readers. For example, - pointer updates to properly aligned fields will - appear atomic, as will individual atomic primitives. - Sequences of operations performed under a lock will -not- - appear to be atomic to RCU readers, nor will sequences - of multiple atomic primitives. - - This can work, but is starting to get a bit tricky. - - d. Carefully order the updates and the reads so that - readers see valid data at all phases of the update. - This is often more difficult than it sounds, especially - given modern CPUs' tendency to reorder memory references. - One must usually liberally sprinkle memory barriers - (smp_wmb(), smp_rmb(), smp_mb()) through the code, - making it difficult to understand and to test. - - It is usually better to group the changing data into - a separate structure, so that the change may be made - to appear atomic by updating a pointer to reference - a new structure containing updated values. - -4. Weakly ordered CPUs pose special challenges. Almost all CPUs - are weakly ordered -- even x86 CPUs allow later loads to be - reordered to precede earlier stores. RCU code must take all of - the following measures to prevent memory-corruption problems: - - a. Readers must maintain proper ordering of their memory - accesses. The rcu_dereference() primitive ensures that - the CPU picks up the pointer before it picks up the data - that the pointer points to. This really is necessary - on Alpha CPUs. If you don't believe me, see: - - http://www.openvms.compaq.com/wizard/wiz_2637.html - - The rcu_dereference() primitive is also an excellent - documentation aid, letting the person reading the - code know exactly which pointers are protected by RCU. - Please note that compilers can also reorder code, and - they are becoming increasingly aggressive about doing - just that. The rcu_dereference() primitive therefore also - prevents destructive compiler optimizations. However, - with a bit of devious creativity, it is possible to - mishandle the return value from rcu_dereference(). - Please see rcu_dereference.txt in this directory for - more information. - - The rcu_dereference() primitive is used by the - various "_rcu()" list-traversal primitives, such - as the list_for_each_entry_rcu(). Note that it is - perfectly legal (if redundant) for update-side code to - use rcu_dereference() and the "_rcu()" list-traversal - primitives. This is particularly useful in code that - is common to readers and updaters. However, lockdep - will complain if you access rcu_dereference() outside - of an RCU read-side critical section. See lockdep.txt - to learn what to do about this. - - Of course, neither rcu_dereference() nor the "_rcu()" - list-traversal primitives can substitute for a good - concurrency design coordinating among multiple updaters. - - b. If the list macros are being used, the list_add_tail_rcu() - and list_add_rcu() primitives must be used in order - to prevent weakly ordered machines from misordering - structure initialization and pointer planting. - Similarly, if the hlist macros are being used, the - hlist_add_head_rcu() primitive is required. - - c. If the list macros are being used, the list_del_rcu() - primitive must be used to keep list_del()'s pointer - poisoning from inflicting toxic effects on concurrent - readers. Similarly, if the hlist macros are being used, - the hlist_del_rcu() primitive is required. - - The list_replace_rcu() and hlist_replace_rcu() primitives - may be used to replace an old structure with a new one - in their respective types of RCU-protected lists. - - d. Rules similar to (4b) and (4c) apply to the "hlist_nulls" - type of RCU-protected linked lists. - - e. Updates must ensure that initialization of a given - structure happens before pointers to that structure are - publicized. Use the rcu_assign_pointer() primitive - when publicizing a pointer to a structure that can - be traversed by an RCU read-side critical section. - -5. If call_rcu() or call_srcu() is used, the callback function will - be called from softirq context. In particular, it cannot block. - -6. Since synchronize_rcu() can block, it cannot be called - from any sort of irq context. The same rule applies - for synchronize_srcu(), synchronize_rcu_expedited(), and - synchronize_srcu_expedited(). - - The expedited forms of these primitives have the same semantics - as the non-expedited forms, but expediting is both expensive and - (with the exception of synchronize_srcu_expedited()) unfriendly - to real-time workloads. Use of the expedited primitives should - be restricted to rare configuration-change operations that would - not normally be undertaken while a real-time workload is running. - However, real-time workloads can use rcupdate.rcu_normal kernel - boot parameter to completely disable expedited grace periods, - though this might have performance implications. - - In particular, if you find yourself invoking one of the expedited - primitives repeatedly in a loop, please do everyone a favor: - Restructure your code so that it batches the updates, allowing - a single non-expedited primitive to cover the entire batch. - This will very likely be faster than the loop containing the - expedited primitive, and will be much much easier on the rest - of the system, especially to real-time workloads running on - the rest of the system. - -7. As of v4.20, a given kernel implements only one RCU flavor, - which is RCU-sched for PREEMPT=n and RCU-preempt for PREEMPT=y. - If the updater uses call_rcu() or synchronize_rcu(), - then the corresponding readers my use rcu_read_lock() and - rcu_read_unlock(), rcu_read_lock_bh() and rcu_read_unlock_bh(), - or any pair of primitives that disables and re-enables preemption, - for example, rcu_read_lock_sched() and rcu_read_unlock_sched(). - If the updater uses synchronize_srcu() or call_srcu(), - then the corresponding readers must use srcu_read_lock() and - srcu_read_unlock(), and with the same srcu_struct. The rules for - the expedited primitives are the same as for their non-expedited - counterparts. Mixing things up will result in confusion and - broken kernels, and has even resulted in an exploitable security - issue. - - One exception to this rule: rcu_read_lock() and rcu_read_unlock() - may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh() - in cases where local bottom halves are already known to be - disabled, for example, in irq or softirq context. Commenting - such cases is a must, of course! And the jury is still out on - whether the increased speed is worth it. - -8. Although synchronize_rcu() is slower than is call_rcu(), it - usually results in simpler code. So, unless update performance is - critically important, the updaters cannot block, or the latency of - synchronize_rcu() is visible from userspace, synchronize_rcu() - should be used in preference to call_rcu(). Furthermore, - kfree_rcu() usually results in even simpler code than does - synchronize_rcu() without synchronize_rcu()'s multi-millisecond - latency. So please take advantage of kfree_rcu()'s "fire and - forget" memory-freeing capabilities where it applies. - - An especially important property of the synchronize_rcu() - primitive is that it automatically self-limits: if grace periods - are delayed for whatever reason, then the synchronize_rcu() - primitive will correspondingly delay updates. In contrast, - code using call_rcu() should explicitly limit update rate in - cases where grace periods are delayed, as failing to do so can - result in excessive realtime latencies or even OOM conditions. - - Ways of gaining this self-limiting property when using call_rcu() - include: - - a. Keeping a count of the number of data-structure elements - used by the RCU-protected data structure, including - those waiting for a grace period to elapse. Enforce a - limit on this number, stalling updates as needed to allow - previously deferred frees to complete. Alternatively, - limit only the number awaiting deferred free rather than - the total number of elements. - - One way to stall the updates is to acquire the update-side - mutex. (Don't try this with a spinlock -- other CPUs - spinning on the lock could prevent the grace period - from ever ending.) Another way to stall the updates - is for the updates to use a wrapper function around - the memory allocator, so that this wrapper function - simulates OOM when there is too much memory awaiting an - RCU grace period. There are of course many other - variations on this theme. - - b. Limiting update rate. For example, if updates occur only - once per hour, then no explicit rate limiting is - required, unless your system is already badly broken. - Older versions of the dcache subsystem take this approach, - guarding updates with a global lock, limiting their rate. - - c. Trusted update -- if updates can only be done manually by - superuser or some other trusted user, then it might not - be necessary to automatically limit them. The theory - here is that superuser already has lots of ways to crash - the machine. - - d. Periodically invoke synchronize_rcu(), permitting a limited - number of updates per grace period. - - The same cautions apply to call_srcu() and kfree_rcu(). - - Note that although these primitives do take action to avoid memory - exhaustion when any given CPU has too many callbacks, a determined - user could still exhaust memory. This is especially the case - if a system with a large number of CPUs has been configured to - offload all of its RCU callbacks onto a single CPU, or if the - system has relatively little free memory. - -9. All RCU list-traversal primitives, which include - rcu_dereference(), list_for_each_entry_rcu(), and - list_for_each_safe_rcu(), must be either within an RCU read-side - critical section or must be protected by appropriate update-side - locks. RCU read-side critical sections are delimited by - rcu_read_lock() and rcu_read_unlock(), or by similar primitives - such as rcu_read_lock_bh() and rcu_read_unlock_bh(), in which - case the matching rcu_dereference() primitive must be used in - order to keep lockdep happy, in this case, rcu_dereference_bh(). - - The reason that it is permissible to use RCU list-traversal - primitives when the update-side lock is held is that doing so - can be quite helpful in reducing code bloat when common code is - shared between readers and updaters. Additional primitives - are provided for this case, as discussed in lockdep.txt. - -10. Conversely, if you are in an RCU read-side critical section, - and you don't hold the appropriate update-side lock, you -must- - use the "_rcu()" variants of the list macros. Failing to do so - will break Alpha, cause aggressive compilers to generate bad code, - and confuse people trying to read your code. - -11. Any lock acquired by an RCU callback must be acquired elsewhere - with softirq disabled, e.g., via spin_lock_irqsave(), - spin_lock_bh(), etc. Failing to disable softirq on a given - acquisition of that lock will result in deadlock as soon as - the RCU softirq handler happens to run your RCU callback while - interrupting that acquisition's critical section. - -12. RCU callbacks can be and are executed in parallel. In many cases, - the callback code simply wrappers around kfree(), so that this - is not an issue (or, more accurately, to the extent that it is - an issue, the memory-allocator locking handles it). However, - if the callbacks do manipulate a shared data structure, they - must use whatever locking or other synchronization is required - to safely access and/or modify that data structure. - - Do not assume that RCU callbacks will be executed on the same - CPU that executed the corresponding call_rcu() or call_srcu(). - For example, if a given CPU goes offline while having an RCU - callback pending, then that RCU callback will execute on some - surviving CPU. (If this was not the case, a self-spawning RCU - callback would prevent the victim CPU from ever going offline.) - Furthermore, CPUs designated by rcu_nocbs= might well -always- - have their RCU callbacks executed on some other CPUs, in fact, - for some real-time workloads, this is the whole point of using - the rcu_nocbs= kernel boot parameter. - -13. Unlike other forms of RCU, it -is- permissible to block in an - SRCU read-side critical section (demarked by srcu_read_lock() - and srcu_read_unlock()), hence the "SRCU": "sleepable RCU". - Please note that if you don't need to sleep in read-side critical - sections, you should be using RCU rather than SRCU, because RCU - is almost always faster and easier to use than is SRCU. - - Also unlike other forms of RCU, explicit initialization and - cleanup is required either at build time via DEFINE_SRCU() - or DEFINE_STATIC_SRCU() or at runtime via init_srcu_struct() - and cleanup_srcu_struct(). These last two are passed a - "struct srcu_struct" that defines the scope of a given - SRCU domain. Once initialized, the srcu_struct is passed - to srcu_read_lock(), srcu_read_unlock() synchronize_srcu(), - synchronize_srcu_expedited(), and call_srcu(). A given - synchronize_srcu() waits only for SRCU read-side critical - sections governed by srcu_read_lock() and srcu_read_unlock() - calls that have been passed the same srcu_struct. This property - is what makes sleeping read-side critical sections tolerable -- - a given subsystem delays only its own updates, not those of other - subsystems using SRCU. Therefore, SRCU is less prone to OOM the - system than RCU would be if RCU's read-side critical sections - were permitted to sleep. - - The ability to sleep in read-side critical sections does not - come for free. First, corresponding srcu_read_lock() and - srcu_read_unlock() calls must be passed the same srcu_struct. - Second, grace-period-detection overhead is amortized only - over those updates sharing a given srcu_struct, rather than - being globally amortized as they are for other forms of RCU. - Therefore, SRCU should be used in preference to rw_semaphore - only in extremely read-intensive situations, or in situations - requiring SRCU's read-side deadlock immunity or low read-side - realtime latency. You should also consider percpu_rw_semaphore - when you need lightweight readers. - - SRCU's expedited primitive (synchronize_srcu_expedited()) - never sends IPIs to other CPUs, so it is easier on - real-time workloads than is synchronize_rcu_expedited(). - - Note that rcu_assign_pointer() relates to SRCU just as it does to - other forms of RCU, but instead of rcu_dereference() you should - use srcu_dereference() in order to avoid lockdep splats. - -14. The whole point of call_rcu(), synchronize_rcu(), and friends - is to wait until all pre-existing readers have finished before - carrying out some otherwise-destructive operation. It is - therefore critically important to -first- remove any path - that readers can follow that could be affected by the - destructive operation, and -only- -then- invoke call_rcu(), - synchronize_rcu(), or friends. - - Because these primitives only wait for pre-existing readers, it - is the caller's responsibility to guarantee that any subsequent - readers will execute safely. - -15. The various RCU read-side primitives do -not- necessarily contain - memory barriers. You should therefore plan for the CPU - and the compiler to freely reorder code into and out of RCU - read-side critical sections. It is the responsibility of the - RCU update-side primitives to deal with this. - - For SRCU readers, you can use smp_mb__after_srcu_read_unlock() - immediately after an srcu_read_unlock() to get a full barrier. - -16. Use CONFIG_PROVE_LOCKING, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the - __rcu sparse checks to validate your RCU code. These can help - find problems as follows: - - CONFIG_PROVE_LOCKING: check that accesses to RCU-protected data - structures are carried out under the proper RCU - read-side critical section, while holding the right - combination of locks, or whatever other conditions - are appropriate. - - CONFIG_DEBUG_OBJECTS_RCU_HEAD: check that you don't pass the - same object to call_rcu() (or friends) before an RCU - grace period has elapsed since the last time that you - passed that same object to call_rcu() (or friends). - - __rcu sparse checks: tag the pointer to the RCU-protected data - structure with __rcu, and sparse will warn you if you - access that pointer without the services of one of the - variants of rcu_dereference(). - - These debugging aids can help you find problems that are - otherwise extremely difficult to spot. - -17. If you register a callback using call_rcu() or call_srcu(), and - pass in a function defined within a loadable module, then it in - necessary to wait for all pending callbacks to be invoked after - the last invocation and before unloading that module. Note that - it is absolutely -not- sufficient to wait for a grace period! - The current (say) synchronize_rcu() implementation is -not- - guaranteed to wait for callbacks registered on other CPUs. - Or even on the current CPU if that CPU recently went offline - and came back online. - - You instead need to use one of the barrier functions: - - o call_rcu() -> rcu_barrier() - o call_srcu() -> srcu_barrier() - - However, these barrier functions are absolutely -not- guaranteed - to wait for a grace period. In fact, if there are no call_rcu() - callbacks waiting anywhere in the system, rcu_barrier() is within - its rights to return immediately. - - So if you need to wait for both an RCU grace period and for - all pre-existing call_rcu() callbacks, you will need to execute - both rcu_barrier() and synchronize_rcu(), if necessary, using - something like workqueues to to execute them concurrently. - - See rcubarrier.txt for more information. diff --git a/Documentation/RCU/index.rst b/Documentation/RCU/index.rst index 81a0a1e5f767..c1ba4d130bb0 100644 --- a/Documentation/RCU/index.rst +++ b/Documentation/RCU/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + .. _rcu_concepts: ============ @@ -8,6 +10,7 @@ RCU concepts :maxdepth: 3 arrayRCU + checklist rcubarrier rcu_dereference whatisRCU -- cgit v1.2.3 From a3b0a79f8903f955250505f99d1e37b6c7d7b060 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 21 Apr 2020 19:04:03 +0200 Subject: docs: RCU: Convert lockdep-splat.txt to ReST - Add a SPDX header; - Add a document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to RCU/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Paul E. McKenney --- Documentation/RCU/index.rst | 1 + Documentation/RCU/lockdep-splat.rst | 115 ++++++++++++++++++++++++++++++++++++ Documentation/RCU/lockdep-splat.txt | 110 ---------------------------------- 3 files changed, 116 insertions(+), 110 deletions(-) create mode 100644 Documentation/RCU/lockdep-splat.rst delete mode 100644 Documentation/RCU/lockdep-splat.txt diff --git a/Documentation/RCU/index.rst b/Documentation/RCU/index.rst index c1ba4d130bb0..430a37132b2c 100644 --- a/Documentation/RCU/index.rst +++ b/Documentation/RCU/index.rst @@ -11,6 +11,7 @@ RCU concepts arrayRCU checklist + lockdep-splat rcubarrier rcu_dereference whatisRCU diff --git a/Documentation/RCU/lockdep-splat.rst b/Documentation/RCU/lockdep-splat.rst new file mode 100644 index 000000000000..2a5c79db57dc --- /dev/null +++ b/Documentation/RCU/lockdep-splat.rst @@ -0,0 +1,115 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +Lockdep-RCU Splat +================= + +Lockdep-RCU was added to the Linux kernel in early 2010 +(http://lwn.net/Articles/371986/). This facility checks for some common +misuses of the RCU API, most notably using one of the rcu_dereference() +family to access an RCU-protected pointer without the proper protection. +When such misuse is detected, an lockdep-RCU splat is emitted. + +The usual cause of a lockdep-RCU slat is someone accessing an +RCU-protected data structure without either (1) being in the right kind of +RCU read-side critical section or (2) holding the right update-side lock. +This problem can therefore be serious: it might result in random memory +overwriting or worse. There can of course be false positives, this +being the real world and all that. + +So let's look at an example RCU lockdep splat from 3.0-rc5, one that +has long since been fixed:: + + ============================= + WARNING: suspicious RCU usage + ----------------------------- + block/cfq-iosched.c:2776 suspicious rcu_dereference_protected() usage! + +other info that might help us debug this:: + + rcu_scheduler_active = 1, debug_locks = 0 + 3 locks held by scsi_scan_6/1552: + #0: (&shost->scan_mutex){+.+.}, at: [] + scsi_scan_host_selected+0x5a/0x150 + #1: (&eq->sysfs_lock){+.+.}, at: [] + elevator_exit+0x22/0x60 + #2: (&(&q->__queue_lock)->rlock){-.-.}, at: [] + cfq_exit_queue+0x43/0x190 + + stack backtrace: + Pid: 1552, comm: scsi_scan_6 Not tainted 3.0.0-rc5 #17 + Call Trace: + [] lockdep_rcu_dereference+0xbb/0xc0 + [] __cfq_exit_single_io_context+0xe9/0x120 + [] cfq_exit_queue+0x7c/0x190 + [] elevator_exit+0x36/0x60 + [] blk_cleanup_queue+0x4a/0x60 + [] scsi_free_queue+0x9/0x10 + [] __scsi_remove_device+0x84/0xd0 + [] scsi_probe_and_add_lun+0x353/0xb10 + [] ? error_exit+0x29/0xb0 + [] ? _raw_spin_unlock_irqrestore+0x3d/0x80 + [] __scsi_scan_target+0x112/0x680 + [] ? trace_hardirqs_off_thunk+0x3a/0x3c + [] ? error_exit+0x29/0xb0 + [] ? kobject_del+0x40/0x40 + [] scsi_scan_channel+0x86/0xb0 + [] scsi_scan_host_selected+0x140/0x150 + [] do_scsi_scan_host+0x89/0x90 + [] do_scan_async+0x20/0x160 + [] ? do_scsi_scan_host+0x90/0x90 + [] kthread+0xa6/0xb0 + [] kernel_thread_helper+0x4/0x10 + [] ? finish_task_switch+0x80/0x110 + [] ? retint_restore_args+0xe/0xe + [] ? __kthread_init_worker+0x70/0x70 + [] ? gs_change+0xb/0xb + +Line 2776 of block/cfq-iosched.c in v3.0-rc5 is as follows:: + + if (rcu_dereference(ioc->ioc_data) == cic) { + +This form says that it must be in a plain vanilla RCU read-side critical +section, but the "other info" list above shows that this is not the +case. Instead, we hold three locks, one of which might be RCU related. +And maybe that lock really does protect this reference. If so, the fix +is to inform RCU, perhaps by changing __cfq_exit_single_io_context() to +take the struct request_queue "q" from cfq_exit_queue() as an argument, +which would permit us to invoke rcu_dereference_protected as follows:: + + if (rcu_dereference_protected(ioc->ioc_data, + lockdep_is_held(&q->queue_lock)) == cic) { + +With this change, there would be no lockdep-RCU splat emitted if this +code was invoked either from within an RCU read-side critical section +or with the ->queue_lock held. In particular, this would have suppressed +the above lockdep-RCU splat because ->queue_lock is held (see #2 in the +list above). + +On the other hand, perhaps we really do need an RCU read-side critical +section. In this case, the critical section must span the use of the +return value from rcu_dereference(), or at least until there is some +reference count incremented or some such. One way to handle this is to +add rcu_read_lock() and rcu_read_unlock() as follows:: + + rcu_read_lock(); + if (rcu_dereference(ioc->ioc_data) == cic) { + spin_lock(&ioc->lock); + rcu_assign_pointer(ioc->ioc_data, NULL); + spin_unlock(&ioc->lock); + } + rcu_read_unlock(); + +With this change, the rcu_dereference() is always within an RCU +read-side critical section, which again would have suppressed the +above lockdep-RCU splat. + +But in this particular case, we don't actually dereference the pointer +returned from rcu_dereference(). Instead, that pointer is just compared +to the cic pointer, which means that the rcu_dereference() can be replaced +by rcu_access_pointer() as follows:: + + if (rcu_access_pointer(ioc->ioc_data) == cic) { + +Because it is legal to invoke rcu_access_pointer() without protection, +this change would also suppress the above lockdep-RCU splat. diff --git a/Documentation/RCU/lockdep-splat.txt b/Documentation/RCU/lockdep-splat.txt deleted file mode 100644 index b8096316fd11..000000000000 --- a/Documentation/RCU/lockdep-splat.txt +++ /dev/null @@ -1,110 +0,0 @@ -Lockdep-RCU was added to the Linux kernel in early 2010 -(http://lwn.net/Articles/371986/). This facility checks for some common -misuses of the RCU API, most notably using one of the rcu_dereference() -family to access an RCU-protected pointer without the proper protection. -When such misuse is detected, an lockdep-RCU splat is emitted. - -The usual cause of a lockdep-RCU slat is someone accessing an -RCU-protected data structure without either (1) being in the right kind of -RCU read-side critical section or (2) holding the right update-side lock. -This problem can therefore be serious: it might result in random memory -overwriting or worse. There can of course be false positives, this -being the real world and all that. - -So let's look at an example RCU lockdep splat from 3.0-rc5, one that -has long since been fixed: - -============================= -WARNING: suspicious RCU usage ------------------------------ -block/cfq-iosched.c:2776 suspicious rcu_dereference_protected() usage! - -other info that might help us debug this: - - -rcu_scheduler_active = 1, debug_locks = 0 -3 locks held by scsi_scan_6/1552: - #0: (&shost->scan_mutex){+.+.}, at: [] -scsi_scan_host_selected+0x5a/0x150 - #1: (&eq->sysfs_lock){+.+.}, at: [] -elevator_exit+0x22/0x60 - #2: (&(&q->__queue_lock)->rlock){-.-.}, at: [] -cfq_exit_queue+0x43/0x190 - -stack backtrace: -Pid: 1552, comm: scsi_scan_6 Not tainted 3.0.0-rc5 #17 -Call Trace: - [] lockdep_rcu_dereference+0xbb/0xc0 - [] __cfq_exit_single_io_context+0xe9/0x120 - [] cfq_exit_queue+0x7c/0x190 - [] elevator_exit+0x36/0x60 - [] blk_cleanup_queue+0x4a/0x60 - [] scsi_free_queue+0x9/0x10 - [] __scsi_remove_device+0x84/0xd0 - [] scsi_probe_and_add_lun+0x353/0xb10 - [] ? error_exit+0x29/0xb0 - [] ? _raw_spin_unlock_irqrestore+0x3d/0x80 - [] __scsi_scan_target+0x112/0x680 - [] ? trace_hardirqs_off_thunk+0x3a/0x3c - [] ? error_exit+0x29/0xb0 - [] ? kobject_del+0x40/0x40 - [] scsi_scan_channel+0x86/0xb0 - [] scsi_scan_host_selected+0x140/0x150 - [] do_scsi_scan_host+0x89/0x90 - [] do_scan_async+0x20/0x160 - [] ? do_scsi_scan_host+0x90/0x90 - [] kthread+0xa6/0xb0 - [] kernel_thread_helper+0x4/0x10 - [] ? finish_task_switch+0x80/0x110 - [] ? retint_restore_args+0xe/0xe - [] ? __kthread_init_worker+0x70/0x70 - [] ? gs_change+0xb/0xb - -Line 2776 of block/cfq-iosched.c in v3.0-rc5 is as follows: - - if (rcu_dereference(ioc->ioc_data) == cic) { - -This form says that it must be in a plain vanilla RCU read-side critical -section, but the "other info" list above shows that this is not the -case. Instead, we hold three locks, one of which might be RCU related. -And maybe that lock really does protect this reference. If so, the fix -is to inform RCU, perhaps by changing __cfq_exit_single_io_context() to -take the struct request_queue "q" from cfq_exit_queue() as an argument, -which would permit us to invoke rcu_dereference_protected as follows: - - if (rcu_dereference_protected(ioc->ioc_data, - lockdep_is_held(&q->queue_lock)) == cic) { - -With this change, there would be no lockdep-RCU splat emitted if this -code was invoked either from within an RCU read-side critical section -or with the ->queue_lock held. In particular, this would have suppressed -the above lockdep-RCU splat because ->queue_lock is held (see #2 in the -list above). - -On the other hand, perhaps we really do need an RCU read-side critical -section. In this case, the critical section must span the use of the -return value from rcu_dereference(), or at least until there is some -reference count incremented or some such. One way to handle this is to -add rcu_read_lock() and rcu_read_unlock() as follows: - - rcu_read_lock(); - if (rcu_dereference(ioc->ioc_data) == cic) { - spin_lock(&ioc->lock); - rcu_assign_pointer(ioc->ioc_data, NULL); - spin_unlock(&ioc->lock); - } - rcu_read_unlock(); - -With this change, the rcu_dereference() is always within an RCU -read-side critical section, which again would have suppressed the -above lockdep-RCU splat. - -But in this particular case, we don't actually dereference the pointer -returned from rcu_dereference(). Instead, that pointer is just compared -to the cic pointer, which means that the rcu_dereference() can be replaced -by rcu_access_pointer() as follows: - - if (rcu_access_pointer(ioc->ioc_data) == cic) { - -Because it is legal to invoke rcu_access_pointer() without protection, -this change would also suppress the above lockdep-RCU splat. -- cgit v1.2.3 From 058cc23bcad08aca62987cc795fe406ac39146d0 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 21 Apr 2020 19:04:04 +0200 Subject: docs: RCU: Convert lockdep.txt to ReST - Add a SPDX header; - Adjust document title; - Mark literal blocks as such; - Add it to RCU/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Paul E. McKenney --- Documentation/RCU/index.rst | 1 + Documentation/RCU/lockdep.rst | 116 ++++++++++++++++++++++++++++++++++++++++++ Documentation/RCU/lockdep.txt | 112 ---------------------------------------- 3 files changed, 117 insertions(+), 112 deletions(-) create mode 100644 Documentation/RCU/lockdep.rst delete mode 100644 Documentation/RCU/lockdep.txt diff --git a/Documentation/RCU/index.rst b/Documentation/RCU/index.rst index 430a37132b2c..fa7a2a8949b7 100644 --- a/Documentation/RCU/index.rst +++ b/Documentation/RCU/index.rst @@ -11,6 +11,7 @@ RCU concepts arrayRCU checklist + lockdep lockdep-splat rcubarrier rcu_dereference diff --git a/Documentation/RCU/lockdep.rst b/Documentation/RCU/lockdep.rst new file mode 100644 index 000000000000..f1fc8ae3846a --- /dev/null +++ b/Documentation/RCU/lockdep.rst @@ -0,0 +1,116 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================== +RCU and lockdep checking +======================== + +All flavors of RCU have lockdep checking available, so that lockdep is +aware of when each task enters and leaves any flavor of RCU read-side +critical section. Each flavor of RCU is tracked separately (but note +that this is not the case in 2.6.32 and earlier). This allows lockdep's +tracking to include RCU state, which can sometimes help when debugging +deadlocks and the like. + +In addition, RCU provides the following primitives that check lockdep's +state:: + + rcu_read_lock_held() for normal RCU. + rcu_read_lock_bh_held() for RCU-bh. + rcu_read_lock_sched_held() for RCU-sched. + srcu_read_lock_held() for SRCU. + +These functions are conservative, and will therefore return 1 if they +aren't certain (for example, if CONFIG_DEBUG_LOCK_ALLOC is not set). +This prevents things like WARN_ON(!rcu_read_lock_held()) from giving false +positives when lockdep is disabled. + +In addition, a separate kernel config parameter CONFIG_PROVE_RCU enables +checking of rcu_dereference() primitives: + + rcu_dereference(p): + Check for RCU read-side critical section. + rcu_dereference_bh(p): + Check for RCU-bh read-side critical section. + rcu_dereference_sched(p): + Check for RCU-sched read-side critical section. + srcu_dereference(p, sp): + Check for SRCU read-side critical section. + rcu_dereference_check(p, c): + Use explicit check expression "c" along with + rcu_read_lock_held(). This is useful in code that is + invoked by both RCU readers and updaters. + rcu_dereference_bh_check(p, c): + Use explicit check expression "c" along with + rcu_read_lock_bh_held(). This is useful in code that + is invoked by both RCU-bh readers and updaters. + rcu_dereference_sched_check(p, c): + Use explicit check expression "c" along with + rcu_read_lock_sched_held(). This is useful in code that + is invoked by both RCU-sched readers and updaters. + srcu_dereference_check(p, c): + Use explicit check expression "c" along with + srcu_read_lock_held()(). This is useful in code that + is invoked by both SRCU readers and updaters. + rcu_dereference_raw(p): + Don't check. (Use sparingly, if at all.) + rcu_dereference_protected(p, c): + Use explicit check expression "c", and omit all barriers + and compiler constraints. This is useful when the data + structure cannot change, for example, in code that is + invoked only by updaters. + rcu_access_pointer(p): + Return the value of the pointer and omit all barriers, + but retain the compiler constraints that prevent duplicating + or coalescsing. This is useful when when testing the + value of the pointer itself, for example, against NULL. + +The rcu_dereference_check() check expression can be any boolean +expression, but would normally include a lockdep expression. However, +any boolean expression can be used. For a moderately ornate example, +consider the following:: + + file = rcu_dereference_check(fdt->fd[fd], + lockdep_is_held(&files->file_lock) || + atomic_read(&files->count) == 1); + +This expression picks up the pointer "fdt->fd[fd]" in an RCU-safe manner, +and, if CONFIG_PROVE_RCU is configured, verifies that this expression +is used in: + +1. An RCU read-side critical section (implicit), or +2. with files->file_lock held, or +3. on an unshared files_struct. + +In case (1), the pointer is picked up in an RCU-safe manner for vanilla +RCU read-side critical sections, in case (2) the ->file_lock prevents +any change from taking place, and finally, in case (3) the current task +is the only task accessing the file_struct, again preventing any change +from taking place. If the above statement was invoked only from updater +code, it could instead be written as follows:: + + file = rcu_dereference_protected(fdt->fd[fd], + lockdep_is_held(&files->file_lock) || + atomic_read(&files->count) == 1); + +This would verify cases #2 and #3 above, and furthermore lockdep would +complain if this was used in an RCU read-side critical section unless one +of these two cases held. Because rcu_dereference_protected() omits all +barriers and compiler constraints, it generates better code than do the +other flavors of rcu_dereference(). On the other hand, it is illegal +to use rcu_dereference_protected() if either the RCU-protected pointer +or the RCU-protected data that it points to can change concurrently. + +Like rcu_dereference(), when lockdep is enabled, RCU list and hlist +traversal primitives check for being called from within an RCU read-side +critical section. However, a lockdep expression can be passed to them +as a additional optional argument. With this lockdep expression, these +traversal primitives will complain only if the lockdep expression is +false and they are called from outside any RCU read-side critical section. + +For example, the workqueue for_each_pwq() macro is intended to be used +either within an RCU read-side critical section or with wq->mutex held. +It is thus implemented as follows:: + + #define for_each_pwq(pwq, wq) + list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node, + lock_is_held(&(wq->mutex).dep_map)) diff --git a/Documentation/RCU/lockdep.txt b/Documentation/RCU/lockdep.txt deleted file mode 100644 index 89db949eeca0..000000000000 --- a/Documentation/RCU/lockdep.txt +++ /dev/null @@ -1,112 +0,0 @@ -RCU and lockdep checking - -All flavors of RCU have lockdep checking available, so that lockdep is -aware of when each task enters and leaves any flavor of RCU read-side -critical section. Each flavor of RCU is tracked separately (but note -that this is not the case in 2.6.32 and earlier). This allows lockdep's -tracking to include RCU state, which can sometimes help when debugging -deadlocks and the like. - -In addition, RCU provides the following primitives that check lockdep's -state: - - rcu_read_lock_held() for normal RCU. - rcu_read_lock_bh_held() for RCU-bh. - rcu_read_lock_sched_held() for RCU-sched. - srcu_read_lock_held() for SRCU. - -These functions are conservative, and will therefore return 1 if they -aren't certain (for example, if CONFIG_DEBUG_LOCK_ALLOC is not set). -This prevents things like WARN_ON(!rcu_read_lock_held()) from giving false -positives when lockdep is disabled. - -In addition, a separate kernel config parameter CONFIG_PROVE_RCU enables -checking of rcu_dereference() primitives: - - rcu_dereference(p): - Check for RCU read-side critical section. - rcu_dereference_bh(p): - Check for RCU-bh read-side critical section. - rcu_dereference_sched(p): - Check for RCU-sched read-side critical section. - srcu_dereference(p, sp): - Check for SRCU read-side critical section. - rcu_dereference_check(p, c): - Use explicit check expression "c" along with - rcu_read_lock_held(). This is useful in code that is - invoked by both RCU readers and updaters. - rcu_dereference_bh_check(p, c): - Use explicit check expression "c" along with - rcu_read_lock_bh_held(). This is useful in code that - is invoked by both RCU-bh readers and updaters. - rcu_dereference_sched_check(p, c): - Use explicit check expression "c" along with - rcu_read_lock_sched_held(). This is useful in code that - is invoked by both RCU-sched readers and updaters. - srcu_dereference_check(p, c): - Use explicit check expression "c" along with - srcu_read_lock_held()(). This is useful in code that - is invoked by both SRCU readers and updaters. - rcu_dereference_raw(p): - Don't check. (Use sparingly, if at all.) - rcu_dereference_protected(p, c): - Use explicit check expression "c", and omit all barriers - and compiler constraints. This is useful when the data - structure cannot change, for example, in code that is - invoked only by updaters. - rcu_access_pointer(p): - Return the value of the pointer and omit all barriers, - but retain the compiler constraints that prevent duplicating - or coalescsing. This is useful when when testing the - value of the pointer itself, for example, against NULL. - -The rcu_dereference_check() check expression can be any boolean -expression, but would normally include a lockdep expression. However, -any boolean expression can be used. For a moderately ornate example, -consider the following: - - file = rcu_dereference_check(fdt->fd[fd], - lockdep_is_held(&files->file_lock) || - atomic_read(&files->count) == 1); - -This expression picks up the pointer "fdt->fd[fd]" in an RCU-safe manner, -and, if CONFIG_PROVE_RCU is configured, verifies that this expression -is used in: - -1. An RCU read-side critical section (implicit), or -2. with files->file_lock held, or -3. on an unshared files_struct. - -In case (1), the pointer is picked up in an RCU-safe manner for vanilla -RCU read-side critical sections, in case (2) the ->file_lock prevents -any change from taking place, and finally, in case (3) the current task -is the only task accessing the file_struct, again preventing any change -from taking place. If the above statement was invoked only from updater -code, it could instead be written as follows: - - file = rcu_dereference_protected(fdt->fd[fd], - lockdep_is_held(&files->file_lock) || - atomic_read(&files->count) == 1); - -This would verify cases #2 and #3 above, and furthermore lockdep would -complain if this was used in an RCU read-side critical section unless one -of these two cases held. Because rcu_dereference_protected() omits all -barriers and compiler constraints, it generates better code than do the -other flavors of rcu_dereference(). On the other hand, it is illegal -to use rcu_dereference_protected() if either the RCU-protected pointer -or the RCU-protected data that it points to can change concurrently. - -Like rcu_dereference(), when lockdep is enabled, RCU list and hlist -traversal primitives check for being called from within an RCU read-side -critical section. However, a lockdep expression can be passed to them -as a additional optional argument. With this lockdep expression, these -traversal primitives will complain only if the lockdep expression is -false and they are called from outside any RCU read-side critical section. - -For example, the workqueue for_each_pwq() macro is intended to be used -either within an RCU read-side critical section or with wq->mutex held. -It is thus implemented as follows: - - #define for_each_pwq(pwq, wq) - list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node, - lock_is_held(&(wq->mutex).dep_map)) -- cgit v1.2.3 From 2cdb54c93a7e5beb6f3f8b63575d9fb664dfc603 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 21 Apr 2020 19:04:05 +0200 Subject: docs: RCU: Convert rculist_nulls.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to RCU/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Paul E. McKenney --- Documentation/RCU/index.rst | 1 + Documentation/RCU/rculist_nulls.rst | 194 ++++++++++++++++++++++++++++++++++++ Documentation/RCU/rculist_nulls.txt | 172 -------------------------------- include/linux/rculist_nulls.h | 2 +- net/core/sock.c | 4 +- 5 files changed, 198 insertions(+), 175 deletions(-) create mode 100644 Documentation/RCU/rculist_nulls.rst delete mode 100644 Documentation/RCU/rculist_nulls.txt diff --git a/Documentation/RCU/index.rst b/Documentation/RCU/index.rst index fa7a2a8949b7..577a47e27f5d 100644 --- a/Documentation/RCU/index.rst +++ b/Documentation/RCU/index.rst @@ -17,6 +17,7 @@ RCU concepts rcu_dereference whatisRCU rcu + rculist_nulls listRCU NMI-RCU UP diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst new file mode 100644 index 000000000000..d40374221d69 --- /dev/null +++ b/Documentation/RCU/rculist_nulls.rst @@ -0,0 +1,194 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================================= +Using RCU hlist_nulls to protect list and objects +================================================= + +This section describes how to use hlist_nulls to +protect read-mostly linked lists and +objects using SLAB_TYPESAFE_BY_RCU allocations. + +Please read the basics in Documentation/RCU/listRCU.rst + +Using special makers (called 'nulls') is a convenient way +to solve following problem : + +A typical RCU linked list managing objects which are +allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can +use following algos : + +1) Lookup algo +-------------- + +:: + + rcu_read_lock() + begin: + obj = lockless_lookup(key); + if (obj) { + if (!try_get_ref(obj)) // might fail for free objects + goto begin; + /* + * Because a writer could delete object, and a writer could + * reuse these object before the RCU grace period, we + * must check key after getting the reference on object + */ + if (obj->key != key) { // not the object we expected + put_ref(obj); + goto begin; + } + } + rcu_read_unlock(); + +Beware that lockless_lookup(key) cannot use traditional hlist_for_each_entry_rcu() +but a version with an additional memory barrier (smp_rmb()) + +:: + + lockless_lookup(key) + { + struct hlist_node *node, *next; + for (pos = rcu_dereference((head)->first); + pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) && + ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); + pos = rcu_dereference(next)) + if (obj->key == key) + return obj; + return NULL; + } + +And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb():: + + struct hlist_node *node; + for (pos = rcu_dereference((head)->first); + pos && ({ prefetch(pos->next); 1; }) && + ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); + pos = rcu_dereference(pos->next)) + if (obj->key == key) + return obj; + return NULL; + +Quoting Corey Minyard:: + + "If the object is moved from one list to another list in-between the + time the hash is calculated and the next field is accessed, and the + object has moved to the end of a new list, the traversal will not + complete properly on the list it should have, since the object will + be on the end of the new list and there's not a way to tell it's on a + new list and restart the list traversal. I think that this can be + solved by pre-fetching the "next" field (with proper barriers) before + checking the key." + +2) Insert algo +-------------- + +We need to make sure a reader cannot read the new 'obj->obj_next' value +and previous value of 'obj->key'. Or else, an item could be deleted +from a chain, and inserted into another chain. If new chain was empty +before the move, 'next' pointer is NULL, and lockless reader can +not detect it missed following items in original chain. + +:: + + /* + * Please note that new inserts are done at the head of list, + * not in the middle or end. + */ + obj = kmem_cache_alloc(...); + lock_chain(); // typically a spin_lock() + obj->key = key; + /* + * we need to make sure obj->key is updated before obj->next + * or obj->refcnt + */ + smp_wmb(); + atomic_set(&obj->refcnt, 1); + hlist_add_head_rcu(&obj->obj_node, list); + unlock_chain(); // typically a spin_unlock() + + +3) Remove algo +-------------- +Nothing special here, we can use a standard RCU hlist deletion. +But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused +very very fast (before the end of RCU grace period) + +:: + + if (put_last_reference_on(obj) { + lock_chain(); // typically a spin_lock() + hlist_del_init_rcu(&obj->obj_node); + unlock_chain(); // typically a spin_unlock() + kmem_cache_free(cachep, obj); + } + + + +-------------------------------------------------------------------------- + +With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup() +and extra smp_wmb() in insert function. + +For example, if we choose to store the slot number as the 'nulls' +end-of-list marker for each slot of the hash table, we can detect +a race (some writer did a delete and/or a move of an object +to another chain) checking the final 'nulls' value if +the lookup met the end of chain. If final 'nulls' value +is not the slot number, then we must restart the lookup at +the beginning. If the object was moved to the same chain, +then the reader doesn't care : It might eventually +scan the list again without harm. + + +1) lookup algo +-------------- + +:: + + head = &table[slot]; + rcu_read_lock(); + begin: + hlist_nulls_for_each_entry_rcu(obj, node, head, member) { + if (obj->key == key) { + if (!try_get_ref(obj)) // might fail for free objects + goto begin; + if (obj->key != key) { // not the object we expected + put_ref(obj); + goto begin; + } + goto out; + } + /* + * if the nulls value we got at the end of this lookup is + * not the expected one, we must restart lookup. + * We probably met an item that was moved to another chain. + */ + if (get_nulls_value(node) != slot) + goto begin; + obj = NULL; + + out: + rcu_read_unlock(); + +2) Insert function +------------------ + +:: + + /* + * Please note that new inserts are done at the head of list, + * not in the middle or end. + */ + obj = kmem_cache_alloc(cachep); + lock_chain(); // typically a spin_lock() + obj->key = key; + /* + * changes to obj->key must be visible before refcnt one + */ + smp_wmb(); + atomic_set(&obj->refcnt, 1); + /* + * insert obj in RCU way (readers might be traversing chain) + */ + hlist_nulls_add_head_rcu(&obj->obj_node, list); + unlock_chain(); // typically a spin_unlock() diff --git a/Documentation/RCU/rculist_nulls.txt b/Documentation/RCU/rculist_nulls.txt deleted file mode 100644 index 23f115dc87cf..000000000000 --- a/Documentation/RCU/rculist_nulls.txt +++ /dev/null @@ -1,172 +0,0 @@ -Using hlist_nulls to protect read-mostly linked lists and -objects using SLAB_TYPESAFE_BY_RCU allocations. - -Please read the basics in Documentation/RCU/listRCU.rst - -Using special makers (called 'nulls') is a convenient way -to solve following problem : - -A typical RCU linked list managing objects which are -allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can -use following algos : - -1) Lookup algo --------------- -rcu_read_lock() -begin: -obj = lockless_lookup(key); -if (obj) { - if (!try_get_ref(obj)) // might fail for free objects - goto begin; - /* - * Because a writer could delete object, and a writer could - * reuse these object before the RCU grace period, we - * must check key after getting the reference on object - */ - if (obj->key != key) { // not the object we expected - put_ref(obj); - goto begin; - } -} -rcu_read_unlock(); - -Beware that lockless_lookup(key) cannot use traditional hlist_for_each_entry_rcu() -but a version with an additional memory barrier (smp_rmb()) - -lockless_lookup(key) -{ - struct hlist_node *node, *next; - for (pos = rcu_dereference((head)->first); - pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) && - ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); - pos = rcu_dereference(next)) - if (obj->key == key) - return obj; - return NULL; - -And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb() : - - struct hlist_node *node; - for (pos = rcu_dereference((head)->first); - pos && ({ prefetch(pos->next); 1; }) && - ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); - pos = rcu_dereference(pos->next)) - if (obj->key == key) - return obj; - return NULL; -} - -Quoting Corey Minyard : - -"If the object is moved from one list to another list in-between the - time the hash is calculated and the next field is accessed, and the - object has moved to the end of a new list, the traversal will not - complete properly on the list it should have, since the object will - be on the end of the new list and there's not a way to tell it's on a - new list and restart the list traversal. I think that this can be - solved by pre-fetching the "next" field (with proper barriers) before - checking the key." - -2) Insert algo : ----------------- - -We need to make sure a reader cannot read the new 'obj->obj_next' value -and previous value of 'obj->key'. Or else, an item could be deleted -from a chain, and inserted into another chain. If new chain was empty -before the move, 'next' pointer is NULL, and lockless reader can -not detect it missed following items in original chain. - -/* - * Please note that new inserts are done at the head of list, - * not in the middle or end. - */ -obj = kmem_cache_alloc(...); -lock_chain(); // typically a spin_lock() -obj->key = key; -/* - * we need to make sure obj->key is updated before obj->next - * or obj->refcnt - */ -smp_wmb(); -atomic_set(&obj->refcnt, 1); -hlist_add_head_rcu(&obj->obj_node, list); -unlock_chain(); // typically a spin_unlock() - - -3) Remove algo --------------- -Nothing special here, we can use a standard RCU hlist deletion. -But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused -very very fast (before the end of RCU grace period) - -if (put_last_reference_on(obj) { - lock_chain(); // typically a spin_lock() - hlist_del_init_rcu(&obj->obj_node); - unlock_chain(); // typically a spin_unlock() - kmem_cache_free(cachep, obj); -} - - - --------------------------------------------------------------------------- -With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup() -and extra smp_wmb() in insert function. - -For example, if we choose to store the slot number as the 'nulls' -end-of-list marker for each slot of the hash table, we can detect -a race (some writer did a delete and/or a move of an object -to another chain) checking the final 'nulls' value if -the lookup met the end of chain. If final 'nulls' value -is not the slot number, then we must restart the lookup at -the beginning. If the object was moved to the same chain, -then the reader doesn't care : It might eventually -scan the list again without harm. - - -1) lookup algo - - head = &table[slot]; - rcu_read_lock(); -begin: - hlist_nulls_for_each_entry_rcu(obj, node, head, member) { - if (obj->key == key) { - if (!try_get_ref(obj)) // might fail for free objects - goto begin; - if (obj->key != key) { // not the object we expected - put_ref(obj); - goto begin; - } - goto out; - } -/* - * if the nulls value we got at the end of this lookup is - * not the expected one, we must restart lookup. - * We probably met an item that was moved to another chain. - */ - if (get_nulls_value(node) != slot) - goto begin; - obj = NULL; - -out: - rcu_read_unlock(); - -2) Insert function : --------------------- - -/* - * Please note that new inserts are done at the head of list, - * not in the middle or end. - */ -obj = kmem_cache_alloc(cachep); -lock_chain(); // typically a spin_lock() -obj->key = key; -/* - * changes to obj->key must be visible before refcnt one - */ -smp_wmb(); -atomic_set(&obj->refcnt, 1); -/* - * insert obj in RCU way (readers might be traversing chain) - */ -hlist_nulls_add_head_rcu(&obj->obj_node, list); -unlock_chain(); // typically a spin_unlock() diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h index 9670b54b484a..ff3e94779e73 100644 --- a/include/linux/rculist_nulls.h +++ b/include/linux/rculist_nulls.h @@ -162,7 +162,7 @@ static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n) * The barrier() is needed to make sure compiler doesn't cache first element [1], * as this loop can be restarted [2] * [1] Documentation/core-api/atomic_ops.rst around line 114 - * [2] Documentation/RCU/rculist_nulls.txt around line 146 + * [2] Documentation/RCU/rculist_nulls.rst around line 146 */ #define hlist_nulls_for_each_entry_rcu(tpos, pos, head, member) \ for (({barrier();}), \ diff --git a/net/core/sock.c b/net/core/sock.c index d832c650287c..6921a85a1177 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1973,7 +1973,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority) /* * Before updating sk_refcnt, we must commit prior changes to memory - * (Documentation/RCU/rculist_nulls.txt for details) + * (Documentation/RCU/rculist_nulls.rst for details) */ smp_wmb(); refcount_set(&newsk->sk_refcnt, 2); @@ -3035,7 +3035,7 @@ void sock_init_data(struct socket *sock, struct sock *sk) sk_rx_queue_clear(sk); /* * Before updating sk_refcnt, we must commit prior changes to memory - * (Documentation/RCU/rculist_nulls.txt for details) + * (Documentation/RCU/rculist_nulls.rst for details) */ smp_wmb(); refcount_set(&sk->sk_refcnt, 1); -- cgit v1.2.3 From 43cb5451dffe0bc5d59688d4898c9a1f7c40d3b4 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 21 Apr 2020 19:04:06 +0200 Subject: docs: RCU: Convert torture.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to RCU/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Paul E. McKenney --- Documentation/RCU/index.rst | 1 + Documentation/RCU/torture.rst | 293 ++++++++++++++++++++++++++++++++++ Documentation/RCU/torture.txt | 282 -------------------------------- Documentation/locking/locktorture.rst | 2 +- MAINTAINERS | 4 +- kernel/rcu/rcutorture.c | 2 +- 6 files changed, 298 insertions(+), 286 deletions(-) create mode 100644 Documentation/RCU/torture.rst delete mode 100644 Documentation/RCU/torture.txt diff --git a/Documentation/RCU/index.rst b/Documentation/RCU/index.rst index 577a47e27f5d..5d5f9a1ab8f9 100644 --- a/Documentation/RCU/index.rst +++ b/Documentation/RCU/index.rst @@ -18,6 +18,7 @@ RCU concepts whatisRCU rcu rculist_nulls + torture listRCU NMI-RCU UP diff --git a/Documentation/RCU/torture.rst b/Documentation/RCU/torture.rst new file mode 100644 index 000000000000..a90147713062 --- /dev/null +++ b/Documentation/RCU/torture.rst @@ -0,0 +1,293 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +RCU Torture Test Operation +========================== + + +CONFIG_RCU_TORTURE_TEST +======================= + +The CONFIG_RCU_TORTURE_TEST config option is available for all RCU +implementations. It creates an rcutorture kernel module that can +be loaded to run a torture test. The test periodically outputs +status messages via printk(), which can be examined via the dmesg +command (perhaps grepping for "torture"). The test is started +when the module is loaded, and stops when the module is unloaded. + +Module parameters are prefixed by "rcutorture." in +Documentation/admin-guide/kernel-parameters.txt. + +Output +====== + +The statistics output is as follows:: + + rcu-torture:--- Start of test: nreaders=16 nfakewriters=4 stat_interval=30 verbose=0 test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4 + rcu-torture: rtc: (null) ver: 155441 tfle: 0 rta: 155441 rtaf: 8884 rtf: 155440 rtmbe: 0 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 nt: 3055767 + rcu-torture: Reader Pipe: 727860534 34213 0 0 0 0 0 0 0 0 0 + rcu-torture: Reader Batch: 727877838 17003 0 0 0 0 0 0 0 0 0 + rcu-torture: Free-Block Circulation: 155440 155440 155440 155440 155440 155440 155440 155440 155440 155440 0 + rcu-torture:--- End of test: SUCCESS: nreaders=16 nfakewriters=4 stat_interval=30 verbose=0 test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4 + +The command "dmesg | grep torture:" will extract this information on +most systems. On more esoteric configurations, it may be necessary to +use other commands to access the output of the printk()s used by +the RCU torture test. The printk()s use KERN_ALERT, so they should +be evident. ;-) + +The first and last lines show the rcutorture module parameters, and the +last line shows either "SUCCESS" or "FAILURE", based on rcutorture's +automatic determination as to whether RCU operated correctly. + +The entries are as follows: + +* "rtc": The hexadecimal address of the structure currently visible + to readers. + +* "ver": The number of times since boot that the RCU writer task + has changed the structure visible to readers. + +* "tfle": If non-zero, indicates that the "torture freelist" + containing structures to be placed into the "rtc" area is empty. + This condition is important, since it can fool you into thinking + that RCU is working when it is not. :-/ + +* "rta": Number of structures allocated from the torture freelist. + +* "rtaf": Number of allocations from the torture freelist that have + failed due to the list being empty. It is not unusual for this + to be non-zero, but it is bad for it to be a large fraction of + the value indicated by "rta". + +* "rtf": Number of frees into the torture freelist. + +* "rtmbe": A non-zero value indicates that rcutorture believes that + rcu_assign_pointer() and rcu_dereference() are not working + correctly. This value should be zero. + +* "rtbe": A non-zero value indicates that one of the rcu_barrier() + family of functions is not working correctly. + +* "rtbke": rcutorture was unable to create the real-time kthreads + used to force RCU priority inversion. This value should be zero. + +* "rtbre": Although rcutorture successfully created the kthreads + used to force RCU priority inversion, it was unable to set them + to the real-time priority level of 1. This value should be zero. + +* "rtbf": The number of times that RCU priority boosting failed + to resolve RCU priority inversion. + +* "rtb": The number of times that rcutorture attempted to force + an RCU priority inversion condition. If you are testing RCU + priority boosting via the "test_boost" module parameter, this + value should be non-zero. + +* "nt": The number of times rcutorture ran RCU read-side code from + within a timer handler. This value should be non-zero only + if you specified the "irqreader" module parameter. + +* "Reader Pipe": Histogram of "ages" of structures seen by readers. + If any entries past the first two are non-zero, RCU is broken. + And rcutorture prints the error flag string "!!!" to make sure + you notice. The age of a newly allocated structure is zero, + it becomes one when removed from reader visibility, and is + incremented once per grace period subsequently -- and is freed + after passing through (RCU_TORTURE_PIPE_LEN-2) grace periods. + + The output displayed above was taken from a correctly working + RCU. If you want to see what it looks like when broken, break + it yourself. ;-) + +* "Reader Batch": Another histogram of "ages" of structures seen + by readers, but in terms of counter flips (or batches) rather + than in terms of grace periods. The legal number of non-zero + entries is again two. The reason for this separate view is that + it is sometimes easier to get the third entry to show up in the + "Reader Batch" list than in the "Reader Pipe" list. + +* "Free-Block Circulation": Shows the number of torture structures + that have reached a given point in the pipeline. The first element + should closely correspond to the number of structures allocated, + the second to the number that have been removed from reader view, + and all but the last remaining to the corresponding number of + passes through a grace period. The last entry should be zero, + as it is only incremented if a torture structure's counter + somehow gets incremented farther than it should. + +Different implementations of RCU can provide implementation-specific +additional information. For example, Tree SRCU provides the following +additional line:: + + srcud-torture: Tree SRCU per-CPU(idx=0): 0(35,-21) 1(-4,24) 2(1,1) 3(-26,20) 4(28,-47) 5(-9,4) 6(-10,14) 7(-14,11) T(1,6) + +This line shows the per-CPU counter state, in this case for Tree SRCU +using a dynamically allocated srcu_struct (hence "srcud-" rather than +"srcu-"). The numbers in parentheses are the values of the "old" and +"current" counters for the corresponding CPU. The "idx" value maps the +"old" and "current" values to the underlying array, and is useful for +debugging. The final "T" entry contains the totals of the counters. + +Usage on Specific Kernel Builds +=============================== + +It is sometimes desirable to torture RCU on a specific kernel build, +for example, when preparing to put that kernel build into production. +In that case, the kernel should be built with CONFIG_RCU_TORTURE_TEST=m +so that the test can be started using modprobe and terminated using rmmod. + +For example, the following script may be used to torture RCU:: + + #!/bin/sh + + modprobe rcutorture + sleep 3600 + rmmod rcutorture + dmesg | grep torture: + +The output can be manually inspected for the error flag of "!!!". +One could of course create a more elaborate script that automatically +checked for such errors. The "rmmod" command forces a "SUCCESS", +"FAILURE", or "RCU_HOTPLUG" indication to be printk()ed. The first +two are self-explanatory, while the last indicates that while there +were no RCU failures, CPU-hotplug problems were detected. + + +Usage on Mainline Kernels +========================= + +When using rcutorture to test changes to RCU itself, it is often +necessary to build a number of kernels in order to test that change +across a broad range of combinations of the relevant Kconfig options +and of the relevant kernel boot parameters. In this situation, use +of modprobe and rmmod can be quite time-consuming and error-prone. + +Therefore, the tools/testing/selftests/rcutorture/bin/kvm.sh +script is available for mainline testing for x86, arm64, and +powerpc. By default, it will run the series of tests specified by +tools/testing/selftests/rcutorture/configs/rcu/CFLIST, with each test +running for 30 minutes within a guest OS using a minimal userspace +supplied by an automatically generated initrd. After the tests are +complete, the resulting build products and console output are analyzed +for errors and the results of the runs are summarized. + +On larger systems, rcutorture testing can be accelerated by passing the +--cpus argument to kvm.sh. For example, on a 64-CPU system, "--cpus 43" +would use up to 43 CPUs to run tests concurrently, which as of v5.4 would +complete all the scenarios in two batches, reducing the time to complete +from about eight hours to about one hour (not counting the time to build +the sixteen kernels). The "--dryrun sched" argument will not run tests, +but rather tell you how the tests would be scheduled into batches. This +can be useful when working out how many CPUs to specify in the --cpus +argument. + +Not all changes require that all scenarios be run. For example, a change +to Tree SRCU might run only the SRCU-N and SRCU-P scenarios using the +--configs argument to kvm.sh as follows: "--configs 'SRCU-N SRCU-P'". +Large systems can run multiple copies of of the full set of scenarios, +for example, a system with 448 hardware threads can run five instances +of the full set concurrently. To make this happen:: + + kvm.sh --cpus 448 --configs '5*CFLIST' + +Alternatively, such a system can run 56 concurrent instances of a single +eight-CPU scenario:: + + kvm.sh --cpus 448 --configs '56*TREE04' + +Or 28 concurrent instances of each of two eight-CPU scenarios:: + + kvm.sh --cpus 448 --configs '28*TREE03 28*TREE04' + +Of course, each concurrent instance will use memory, which can be +limited using the --memory argument, which defaults to 512M. Small +values for memory may require disabling the callback-flooding tests +using the --bootargs parameter discussed below. + +Sometimes additional debugging is useful, and in such cases the --kconfig +parameter to kvm.sh may be used, for example, ``--kconfig 'CONFIG_KASAN=y'``. + +Kernel boot arguments can also be supplied, for example, to control +rcutorture's module parameters. For example, to test a change to RCU's +CPU stall-warning code, use "--bootargs 'rcutorture.stall_cpu=30'". +This will of course result in the scripting reporting a failure, namely +the resuling RCU CPU stall warning. As noted above, reducing memory may +require disabling rcutorture's callback-flooding tests:: + + kvm.sh --cpus 448 --configs '56*TREE04' --memory 128M \ + --bootargs 'rcutorture.fwd_progress=0' + +Sometimes all that is needed is a full set of kernel builds. This is +what the --buildonly argument does. + +Finally, the --trust-make argument allows each kernel build to reuse what +it can from the previous kernel build. + +There are additional more arcane arguments that are documented in the +source code of the kvm.sh script. + +If a run contains failures, the number of buildtime and runtime failures +is listed at the end of the kvm.sh output, which you really should redirect +to a file. The build products and console output of each run is kept in +tools/testing/selftests/rcutorture/res in timestamped directories. A +given directory can be supplied to kvm-find-errors.sh in order to have +it cycle you through summaries of errors and full error logs. For example:: + + tools/testing/selftests/rcutorture/bin/kvm-find-errors.sh \ + tools/testing/selftests/rcutorture/res/2020.01.20-15.54.23 + +However, it is often more convenient to access the files directly. +Files pertaining to all scenarios in a run reside in the top-level +directory (2020.01.20-15.54.23 in the example above), while per-scenario +files reside in a subdirectory named after the scenario (for example, +"TREE04"). If a given scenario ran more than once (as in "--configs +'56*TREE04'" above), the directories corresponding to the second and +subsequent runs of that scenario include a sequence number, for example, +"TREE04.2", "TREE04.3", and so on. + +The most frequently used file in the top-level directory is testid.txt. +If the test ran in a git repository, then this file contains the commit +that was tested and any uncommitted changes in diff format. + +The most frequently used files in each per-scenario-run directory are: + +.config: + This file contains the Kconfig options. + +Make.out: + This contains build output for a specific scenario. + +console.log: + This contains the console output for a specific scenario. + This file may be examined once the kernel has booted, but + it might not exist if the build failed. + +vmlinux: + This contains the kernel, which can be useful with tools like + objdump and gdb. + +A number of additional files are available, but are less frequently used. +Many are intended for debugging of rcutorture itself or of its scripting. + +As of v5.4, a successful run with the default set of scenarios produces +the following summary at the end of the run on a 12-CPU system:: + + SRCU-N ------- 804233 GPs (148.932/s) [srcu: g10008272 f0x0 ] + SRCU-P ------- 202320 GPs (37.4667/s) [srcud: g1809476 f0x0 ] + SRCU-t ------- 1122086 GPs (207.794/s) [srcu: g0 f0x0 ] + SRCU-u ------- 1111285 GPs (205.794/s) [srcud: g1 f0x0 ] + TASKS01 ------- 19666 GPs (3.64185/s) [tasks: g0 f0x0 ] + TASKS02 ------- 20541 GPs (3.80389/s) [tasks: g0 f0x0 ] + TASKS03 ------- 19416 GPs (3.59556/s) [tasks: g0 f0x0 ] + TINY01 ------- 836134 GPs (154.84/s) [rcu: g0 f0x0 ] n_max_cbs: 34198 + TINY02 ------- 850371 GPs (157.476/s) [rcu: g0 f0x0 ] n_max_cbs: 2631 + TREE01 ------- 162625 GPs (30.1157/s) [rcu: g1124169 f0x0 ] + TREE02 ------- 333003 GPs (61.6672/s) [rcu: g2647753 f0x0 ] n_max_cbs: 35844 + TREE03 ------- 306623 GPs (56.782/s) [rcu: g2975325 f0x0 ] n_max_cbs: 1496497 + CPU count limited from 16 to 12 + TREE04 ------- 246149 GPs (45.5831/s) [rcu: g1695737 f0x0 ] n_max_cbs: 434961 + TREE05 ------- 314603 GPs (58.2598/s) [rcu: g2257741 f0x2 ] n_max_cbs: 193997 + TREE07 ------- 167347 GPs (30.9902/s) [rcu: g1079021 f0x0 ] n_max_cbs: 478732 + CPU count limited from 16 to 12 + TREE09 ------- 752238 GPs (139.303/s) [rcu: g13075057 f0x0 ] n_max_cbs: 99011 diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt deleted file mode 100644 index af712a3c5b6a..000000000000 --- a/Documentation/RCU/torture.txt +++ /dev/null @@ -1,282 +0,0 @@ -RCU Torture Test Operation - - -CONFIG_RCU_TORTURE_TEST - -The CONFIG_RCU_TORTURE_TEST config option is available for all RCU -implementations. It creates an rcutorture kernel module that can -be loaded to run a torture test. The test periodically outputs -status messages via printk(), which can be examined via the dmesg -command (perhaps grepping for "torture"). The test is started -when the module is loaded, and stops when the module is unloaded. - -Module parameters are prefixed by "rcutorture." in -Documentation/admin-guide/kernel-parameters.txt. - -OUTPUT - -The statistics output is as follows: - - rcu-torture:--- Start of test: nreaders=16 nfakewriters=4 stat_interval=30 verbose=0 test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4 - rcu-torture: rtc: (null) ver: 155441 tfle: 0 rta: 155441 rtaf: 8884 rtf: 155440 rtmbe: 0 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 nt: 3055767 - rcu-torture: Reader Pipe: 727860534 34213 0 0 0 0 0 0 0 0 0 - rcu-torture: Reader Batch: 727877838 17003 0 0 0 0 0 0 0 0 0 - rcu-torture: Free-Block Circulation: 155440 155440 155440 155440 155440 155440 155440 155440 155440 155440 0 - rcu-torture:--- End of test: SUCCESS: nreaders=16 nfakewriters=4 stat_interval=30 verbose=0 test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4 - -The command "dmesg | grep torture:" will extract this information on -most systems. On more esoteric configurations, it may be necessary to -use other commands to access the output of the printk()s used by -the RCU torture test. The printk()s use KERN_ALERT, so they should -be evident. ;-) - -The first and last lines show the rcutorture module parameters, and the -last line shows either "SUCCESS" or "FAILURE", based on rcutorture's -automatic determination as to whether RCU operated correctly. - -The entries are as follows: - -o "rtc": The hexadecimal address of the structure currently visible - to readers. - -o "ver": The number of times since boot that the RCU writer task - has changed the structure visible to readers. - -o "tfle": If non-zero, indicates that the "torture freelist" - containing structures to be placed into the "rtc" area is empty. - This condition is important, since it can fool you into thinking - that RCU is working when it is not. :-/ - -o "rta": Number of structures allocated from the torture freelist. - -o "rtaf": Number of allocations from the torture freelist that have - failed due to the list being empty. It is not unusual for this - to be non-zero, but it is bad for it to be a large fraction of - the value indicated by "rta". - -o "rtf": Number of frees into the torture freelist. - -o "rtmbe": A non-zero value indicates that rcutorture believes that - rcu_assign_pointer() and rcu_dereference() are not working - correctly. This value should be zero. - -o "rtbe": A non-zero value indicates that one of the rcu_barrier() - family of functions is not working correctly. - -o "rtbke": rcutorture was unable to create the real-time kthreads - used to force RCU priority inversion. This value should be zero. - -o "rtbre": Although rcutorture successfully created the kthreads - used to force RCU priority inversion, it was unable to set them - to the real-time priority level of 1. This value should be zero. - -o "rtbf": The number of times that RCU priority boosting failed - to resolve RCU priority inversion. - -o "rtb": The number of times that rcutorture attempted to force - an RCU priority inversion condition. If you are testing RCU - priority boosting via the "test_boost" module parameter, this - value should be non-zero. - -o "nt": The number of times rcutorture ran RCU read-side code from - within a timer handler. This value should be non-zero only - if you specified the "irqreader" module parameter. - -o "Reader Pipe": Histogram of "ages" of structures seen by readers. - If any entries past the first two are non-zero, RCU is broken. - And rcutorture prints the error flag string "!!!" to make sure - you notice. The age of a newly allocated structure is zero, - it becomes one when removed from reader visibility, and is - incremented once per grace period subsequently -- and is freed - after passing through (RCU_TORTURE_PIPE_LEN-2) grace periods. - - The output displayed above was taken from a correctly working - RCU. If you want to see what it looks like when broken, break - it yourself. ;-) - -o "Reader Batch": Another histogram of "ages" of structures seen - by readers, but in terms of counter flips (or batches) rather - than in terms of grace periods. The legal number of non-zero - entries is again two. The reason for this separate view is that - it is sometimes easier to get the third entry to show up in the - "Reader Batch" list than in the "Reader Pipe" list. - -o "Free-Block Circulation": Shows the number of torture structures - that have reached a given point in the pipeline. The first element - should closely correspond to the number of structures allocated, - the second to the number that have been removed from reader view, - and all but the last remaining to the corresponding number of - passes through a grace period. The last entry should be zero, - as it is only incremented if a torture structure's counter - somehow gets incremented farther than it should. - -Different implementations of RCU can provide implementation-specific -additional information. For example, Tree SRCU provides the following -additional line: - - srcud-torture: Tree SRCU per-CPU(idx=0): 0(35,-21) 1(-4,24) 2(1,1) 3(-26,20) 4(28,-47) 5(-9,4) 6(-10,14) 7(-14,11) T(1,6) - -This line shows the per-CPU counter state, in this case for Tree SRCU -using a dynamically allocated srcu_struct (hence "srcud-" rather than -"srcu-"). The numbers in parentheses are the values of the "old" and -"current" counters for the corresponding CPU. The "idx" value maps the -"old" and "current" values to the underlying array, and is useful for -debugging. The final "T" entry contains the totals of the counters. - - -USAGE ON SPECIFIC KERNEL BUILDS - -It is sometimes desirable to torture RCU on a specific kernel build, -for example, when preparing to put that kernel build into production. -In that case, the kernel should be built with CONFIG_RCU_TORTURE_TEST=m -so that the test can be started using modprobe and terminated using rmmod. - -For example, the following script may be used to torture RCU: - - #!/bin/sh - - modprobe rcutorture - sleep 3600 - rmmod rcutorture - dmesg | grep torture: - -The output can be manually inspected for the error flag of "!!!". -One could of course create a more elaborate script that automatically -checked for such errors. The "rmmod" command forces a "SUCCESS", -"FAILURE", or "RCU_HOTPLUG" indication to be printk()ed. The first -two are self-explanatory, while the last indicates that while there -were no RCU failures, CPU-hotplug problems were detected. - - -USAGE ON MAINLINE KERNELS - -When using rcutorture to test changes to RCU itself, it is often -necessary to build a number of kernels in order to test that change -across a broad range of combinations of the relevant Kconfig options -and of the relevant kernel boot parameters. In this situation, use -of modprobe and rmmod can be quite time-consuming and error-prone. - -Therefore, the tools/testing/selftests/rcutorture/bin/kvm.sh -script is available for mainline testing for x86, arm64, and -powerpc. By default, it will run the series of tests specified by -tools/testing/selftests/rcutorture/configs/rcu/CFLIST, with each test -running for 30 minutes within a guest OS using a minimal userspace -supplied by an automatically generated initrd. After the tests are -complete, the resulting build products and console output are analyzed -for errors and the results of the runs are summarized. - -On larger systems, rcutorture testing can be accelerated by passing the ---cpus argument to kvm.sh. For example, on a 64-CPU system, "--cpus 43" -would use up to 43 CPUs to run tests concurrently, which as of v5.4 would -complete all the scenarios in two batches, reducing the time to complete -from about eight hours to about one hour (not counting the time to build -the sixteen kernels). The "--dryrun sched" argument will not run tests, -but rather tell you how the tests would be scheduled into batches. This -can be useful when working out how many CPUs to specify in the --cpus -argument. - -Not all changes require that all scenarios be run. For example, a change -to Tree SRCU might run only the SRCU-N and SRCU-P scenarios using the ---configs argument to kvm.sh as follows: "--configs 'SRCU-N SRCU-P'". -Large systems can run multiple copies of of the full set of scenarios, -for example, a system with 448 hardware threads can run five instances -of the full set concurrently. To make this happen: - - kvm.sh --cpus 448 --configs '5*CFLIST' - -Alternatively, such a system can run 56 concurrent instances of a single -eight-CPU scenario: - - kvm.sh --cpus 448 --configs '56*TREE04' - -Or 28 concurrent instances of each of two eight-CPU scenarios: - - kvm.sh --cpus 448 --configs '28*TREE03 28*TREE04' - -Of course, each concurrent instance will use memory, which can be -limited using the --memory argument, which defaults to 512M. Small -values for memory may require disabling the callback-flooding tests -using the --bootargs parameter discussed below. - -Sometimes additional debugging is useful, and in such cases the --kconfig -parameter to kvm.sh may be used, for example, "--kconfig 'CONFIG_KASAN=y'". - -Kernel boot arguments can also be supplied, for example, to control -rcutorture's module parameters. For example, to test a change to RCU's -CPU stall-warning code, use "--bootargs 'rcutorture.stall_cpu=30'". -This will of course result in the scripting reporting a failure, namely -the resuling RCU CPU stall warning. As noted above, reducing memory may -require disabling rcutorture's callback-flooding tests: - - kvm.sh --cpus 448 --configs '56*TREE04' --memory 128M \ - --bootargs 'rcutorture.fwd_progress=0' - -Sometimes all that is needed is a full set of kernel builds. This is -what the --buildonly argument does. - -Finally, the --trust-make argument allows each kernel build to reuse what -it can from the previous kernel build. - -There are additional more arcane arguments that are documented in the -source code of the kvm.sh script. - -If a run contains failures, the number of buildtime and runtime failures -is listed at the end of the kvm.sh output, which you really should redirect -to a file. The build products and console output of each run is kept in -tools/testing/selftests/rcutorture/res in timestamped directories. A -given directory can be supplied to kvm-find-errors.sh in order to have -it cycle you through summaries of errors and full error logs. For example: - - tools/testing/selftests/rcutorture/bin/kvm-find-errors.sh \ - tools/testing/selftests/rcutorture/res/2020.01.20-15.54.23 - -However, it is often more convenient to access the files directly. -Files pertaining to all scenarios in a run reside in the top-level -directory (2020.01.20-15.54.23 in the example above), while per-scenario -files reside in a subdirectory named after the scenario (for example, -"TREE04"). If a given scenario ran more than once (as in "--configs -'56*TREE04'" above), the directories corresponding to the second and -subsequent runs of that scenario include a sequence number, for example, -"TREE04.2", "TREE04.3", and so on. - -The most frequently used file in the top-level directory is testid.txt. -If the test ran in a git repository, then this file contains the commit -that was tested and any uncommitted changes in diff format. - -The most frequently used files in each per-scenario-run directory are: - -.config: This file contains the Kconfig options. - -Make.out: This contains build output for a specific scenario. - -console.log: This contains the console output for a specific scenario. - This file may be examined once the kernel has booted, but - it might not exist if the build failed. - -vmlinux: This contains the kernel, which can be useful with tools like - objdump and gdb. - -A number of additional files are available, but are less frequently used. -Many are intended for debugging of rcutorture itself or of its scripting. - -As of v5.4, a successful run with the default set of scenarios produces -the following summary at the end of the run on a 12-CPU system: - -SRCU-N ------- 804233 GPs (148.932/s) [srcu: g10008272 f0x0 ] -SRCU-P ------- 202320 GPs (37.4667/s) [srcud: g1809476 f0x0 ] -SRCU-t ------- 1122086 GPs (207.794/s) [srcu: g0 f0x0 ] -SRCU-u ------- 1111285 GPs (205.794/s) [srcud: g1 f0x0 ] -TASKS01 ------- 19666 GPs (3.64185/s) [tasks: g0 f0x0 ] -TASKS02 ------- 20541 GPs (3.80389/s) [tasks: g0 f0x0 ] -TASKS03 ------- 19416 GPs (3.59556/s) [tasks: g0 f0x0 ] -TINY01 ------- 836134 GPs (154.84/s) [rcu: g0 f0x0 ] n_max_cbs: 34198 -TINY02 ------- 850371 GPs (157.476/s) [rcu: g0 f0x0 ] n_max_cbs: 2631 -TREE01 ------- 162625 GPs (30.1157/s) [rcu: g1124169 f0x0 ] -TREE02 ------- 333003 GPs (61.6672/s) [rcu: g2647753 f0x0 ] n_max_cbs: 35844 -TREE03 ------- 306623 GPs (56.782/s) [rcu: g2975325 f0x0 ] n_max_cbs: 1496497 -CPU count limited from 16 to 12 -TREE04 ------- 246149 GPs (45.5831/s) [rcu: g1695737 f0x0 ] n_max_cbs: 434961 -TREE05 ------- 314603 GPs (58.2598/s) [rcu: g2257741 f0x2 ] n_max_cbs: 193997 -TREE07 ------- 167347 GPs (30.9902/s) [rcu: g1079021 f0x0 ] n_max_cbs: 478732 -CPU count limited from 16 to 12 -TREE09 ------- 752238 GPs (139.303/s) [rcu: g13075057 f0x0 ] n_max_cbs: 99011 diff --git a/Documentation/locking/locktorture.rst b/Documentation/locking/locktorture.rst index 8012a74555e7..dfaf9fc883f4 100644 --- a/Documentation/locking/locktorture.rst +++ b/Documentation/locking/locktorture.rst @@ -166,4 +166,4 @@ checked for such errors. The "rmmod" command forces a "SUCCESS", two are self-explanatory, while the last indicates that while there were no locking failures, CPU-hotplug problems were detected. -Also see: Documentation/RCU/torture.txt +Also see: Documentation/RCU/torture.rst diff --git a/MAINTAINERS b/MAINTAINERS index 496fd4eafb68..4429ce965b3a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14437,7 +14437,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev F: Documentation/RCU/ F: include/linux/rcu* F: kernel/rcu/ -X: Documentation/RCU/torture.txt +X: Documentation/RCU/torture.rst X: include/linux/srcu*.h X: kernel/rcu/srcu*.c @@ -17288,7 +17288,7 @@ M: Josh Triplett L: linux-kernel@vger.kernel.org S: Supported T: git git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev -F: Documentation/RCU/torture.txt +F: Documentation/RCU/torture.rst F: kernel/locking/locktorture.c F: kernel/rcu/rcuperf.c F: kernel/rcu/rcutorture.c diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index efb792e13fca..8205295fc33e 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -7,7 +7,7 @@ * Authors: Paul E. McKenney * Josh Triplett * - * See also: Documentation/RCU/torture.txt + * See also: Documentation/RCU/torture.rst */ #define pr_fmt(fmt) fmt -- cgit v1.2.3 From 90c73cb2c65f9e78eb09a8cbcd4bcd4add2d3f4d Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 21 Apr 2020 19:04:07 +0200 Subject: docs: RCU: Convert rcuref.txt to ReST - Add a SPDX header; - Adjust document title; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to RCU/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Paul E. McKenney --- Documentation/RCU/index.rst | 1 + Documentation/RCU/rcuref.rst | 158 +++++++++++++++++++++++++++++++++++++++++++ Documentation/RCU/rcuref.txt | 151 ----------------------------------------- 3 files changed, 159 insertions(+), 151 deletions(-) create mode 100644 Documentation/RCU/rcuref.rst delete mode 100644 Documentation/RCU/rcuref.txt diff --git a/Documentation/RCU/index.rst b/Documentation/RCU/index.rst index 5d5f9a1ab8f9..9a1d51f394dc 100644 --- a/Documentation/RCU/index.rst +++ b/Documentation/RCU/index.rst @@ -18,6 +18,7 @@ RCU concepts whatisRCU rcu rculist_nulls + rcuref torture listRCU NMI-RCU diff --git a/Documentation/RCU/rcuref.rst b/Documentation/RCU/rcuref.rst new file mode 100644 index 000000000000..b33aeb14fde3 --- /dev/null +++ b/Documentation/RCU/rcuref.rst @@ -0,0 +1,158 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================================================================== +Reference-count design for elements of lists/arrays protected by RCU +==================================================================== + + +Please note that the percpu-ref feature is likely your first +stop if you need to combine reference counts and RCU. Please see +include/linux/percpu-refcount.h for more information. However, in +those unusual cases where percpu-ref would consume too much memory, +please read on. + +------------------------------------------------------------------------ + +Reference counting on elements of lists which are protected by traditional +reader/writer spinlocks or semaphores are straightforward: + +CODE LISTING A:: + + 1. 2. + add() search_and_reference() + { { + alloc_object read_lock(&list_lock); + ... search_for_element + atomic_set(&el->rc, 1); atomic_inc(&el->rc); + write_lock(&list_lock); ... + add_element read_unlock(&list_lock); + ... ... + write_unlock(&list_lock); } + } + + 3. 4. + release_referenced() delete() + { { + ... write_lock(&list_lock); + if(atomic_dec_and_test(&el->rc)) ... + kfree(el); + ... remove_element + } write_unlock(&list_lock); + ... + if (atomic_dec_and_test(&el->rc)) + kfree(el); + ... + } + +If this list/array is made lock free using RCU as in changing the +write_lock() in add() and delete() to spin_lock() and changing read_lock() +in search_and_reference() to rcu_read_lock(), the atomic_inc() in +search_and_reference() could potentially hold reference to an element which +has already been deleted from the list/array. Use atomic_inc_not_zero() +in this scenario as follows: + +CODE LISTING B:: + + 1. 2. + add() search_and_reference() + { { + alloc_object rcu_read_lock(); + ... search_for_element + atomic_set(&el->rc, 1); if (!atomic_inc_not_zero(&el->rc)) { + spin_lock(&list_lock); rcu_read_unlock(); + return FAIL; + add_element } + ... ... + spin_unlock(&list_lock); rcu_read_unlock(); + } } + 3. 4. + release_referenced() delete() + { { + ... spin_lock(&list_lock); + if (atomic_dec_and_test(&el->rc)) ... + call_rcu(&el->head, el_free); remove_element + ... spin_unlock(&list_lock); + } ... + if (atomic_dec_and_test(&el->rc)) + call_rcu(&el->head, el_free); + ... + } + +Sometimes, a reference to the element needs to be obtained in the +update (write) stream. In such cases, atomic_inc_not_zero() might be +overkill, since we hold the update-side spinlock. One might instead +use atomic_inc() in such cases. + +It is not always convenient to deal with "FAIL" in the +search_and_reference() code path. In such cases, the +atomic_dec_and_test() may be moved from delete() to el_free() +as follows: + +CODE LISTING C:: + + 1. 2. + add() search_and_reference() + { { + alloc_object rcu_read_lock(); + ... search_for_element + atomic_set(&el->rc, 1); atomic_inc(&el->rc); + spin_lock(&list_lock); ... + + add_element rcu_read_unlock(); + ... } + spin_unlock(&list_lock); 4. + } delete() + 3. { + release_referenced() spin_lock(&list_lock); + { ... + ... remove_element + if (atomic_dec_and_test(&el->rc)) spin_unlock(&list_lock); + kfree(el); ... + ... call_rcu(&el->head, el_free); + } ... + 5. } + void el_free(struct rcu_head *rhp) + { + release_referenced(); + } + +The key point is that the initial reference added by add() is not removed +until after a grace period has elapsed following removal. This means that +search_and_reference() cannot find this element, which means that the value +of el->rc cannot increase. Thus, once it reaches zero, there are no +readers that can or ever will be able to reference the element. The +element can therefore safely be freed. This in turn guarantees that if +any reader finds the element, that reader may safely acquire a reference +without checking the value of the reference counter. + +A clear advantage of the RCU-based pattern in listing C over the one +in listing B is that any call to search_and_reference() that locates +a given object will succeed in obtaining a reference to that object, +even given a concurrent invocation of delete() for that same object. +Similarly, a clear advantage of both listings B and C over listing A is +that a call to delete() is not delayed even if there are an arbitrarily +large number of calls to search_and_reference() searching for the same +object that delete() was invoked on. Instead, all that is delayed is +the eventual invocation of kfree(), which is usually not a problem on +modern computer systems, even the small ones. + +In cases where delete() can sleep, synchronize_rcu() can be called from +delete(), so that el_free() can be subsumed into delete as follows:: + + 4. + delete() + { + spin_lock(&list_lock); + ... + remove_element + spin_unlock(&list_lock); + ... + synchronize_rcu(); + if (atomic_dec_and_test(&el->rc)) + kfree(el); + ... + } + +As additional examples in the kernel, the pattern in listing C is used by +reference counting of struct pid, while the pattern in listing B is used by +struct posix_acl. diff --git a/Documentation/RCU/rcuref.txt b/Documentation/RCU/rcuref.txt deleted file mode 100644 index 5e6429d66c24..000000000000 --- a/Documentation/RCU/rcuref.txt +++ /dev/null @@ -1,151 +0,0 @@ -Reference-count design for elements of lists/arrays protected by RCU. - - -Please note that the percpu-ref feature is likely your first -stop if you need to combine reference counts and RCU. Please see -include/linux/percpu-refcount.h for more information. However, in -those unusual cases where percpu-ref would consume too much memory, -please read on. - ------------------------------------------------------------------------- - -Reference counting on elements of lists which are protected by traditional -reader/writer spinlocks or semaphores are straightforward: - -CODE LISTING A: -1. 2. -add() search_and_reference() -{ { - alloc_object read_lock(&list_lock); - ... search_for_element - atomic_set(&el->rc, 1); atomic_inc(&el->rc); - write_lock(&list_lock); ... - add_element read_unlock(&list_lock); - ... ... - write_unlock(&list_lock); } -} - -3. 4. -release_referenced() delete() -{ { - ... write_lock(&list_lock); - if(atomic_dec_and_test(&el->rc)) ... - kfree(el); - ... remove_element -} write_unlock(&list_lock); - ... - if (atomic_dec_and_test(&el->rc)) - kfree(el); - ... - } - -If this list/array is made lock free using RCU as in changing the -write_lock() in add() and delete() to spin_lock() and changing read_lock() -in search_and_reference() to rcu_read_lock(), the atomic_inc() in -search_and_reference() could potentially hold reference to an element which -has already been deleted from the list/array. Use atomic_inc_not_zero() -in this scenario as follows: - -CODE LISTING B: -1. 2. -add() search_and_reference() -{ { - alloc_object rcu_read_lock(); - ... search_for_element - atomic_set(&el->rc, 1); if (!atomic_inc_not_zero(&el->rc)) { - spin_lock(&list_lock); rcu_read_unlock(); - return FAIL; - add_element } - ... ... - spin_unlock(&list_lock); rcu_read_unlock(); -} } -3. 4. -release_referenced() delete() -{ { - ... spin_lock(&list_lock); - if (atomic_dec_and_test(&el->rc)) ... - call_rcu(&el->head, el_free); remove_element - ... spin_unlock(&list_lock); -} ... - if (atomic_dec_and_test(&el->rc)) - call_rcu(&el->head, el_free); - ... - } - -Sometimes, a reference to the element needs to be obtained in the -update (write) stream. In such cases, atomic_inc_not_zero() might be -overkill, since we hold the update-side spinlock. One might instead -use atomic_inc() in such cases. - -It is not always convenient to deal with "FAIL" in the -search_and_reference() code path. In such cases, the -atomic_dec_and_test() may be moved from delete() to el_free() -as follows: - -CODE LISTING C: -1. 2. -add() search_and_reference() -{ { - alloc_object rcu_read_lock(); - ... search_for_element - atomic_set(&el->rc, 1); atomic_inc(&el->rc); - spin_lock(&list_lock); ... - - add_element rcu_read_unlock(); - ... } - spin_unlock(&list_lock); 4. -} delete() -3. { -release_referenced() spin_lock(&list_lock); -{ ... - ... remove_element - if (atomic_dec_and_test(&el->rc)) spin_unlock(&list_lock); - kfree(el); ... - ... call_rcu(&el->head, el_free); -} ... -5. } -void el_free(struct rcu_head *rhp) -{ - release_referenced(); -} - -The key point is that the initial reference added by add() is not removed -until after a grace period has elapsed following removal. This means that -search_and_reference() cannot find this element, which means that the value -of el->rc cannot increase. Thus, once it reaches zero, there are no -readers that can or ever will be able to reference the element. The -element can therefore safely be freed. This in turn guarantees that if -any reader finds the element, that reader may safely acquire a reference -without checking the value of the reference counter. - -A clear advantage of the RCU-based pattern in listing C over the one -in listing B is that any call to search_and_reference() that locates -a given object will succeed in obtaining a reference to that object, -even given a concurrent invocation of delete() for that same object. -Similarly, a clear advantage of both listings B and C over listing A is -that a call to delete() is not delayed even if there are an arbitrarily -large number of calls to search_and_reference() searching for the same -object that delete() was invoked on. Instead, all that is delayed is -the eventual invocation of kfree(), which is usually not a problem on -modern computer systems, even the small ones. - -In cases where delete() can sleep, synchronize_rcu() can be called from -delete(), so that el_free() can be subsumed into delete as follows: - -4. -delete() -{ - spin_lock(&list_lock); - ... - remove_element - spin_unlock(&list_lock); - ... - synchronize_rcu(); - if (atomic_dec_and_test(&el->rc)) - kfree(el); - ... -} - -As additional examples in the kernel, the pattern in listing C is used by -reference counting of struct pid, while the pattern in listing B is used by -struct posix_acl. -- cgit v1.2.3 From f2286ab99549271f3cec73e305b9ecca95d91394 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 21 Apr 2020 19:04:10 +0200 Subject: docs: RCU: Convert stallwarn.txt to ReST - Add a SPDX header; - Adjust document and section titles; - Fix list markups; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to RCU/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Paul E. McKenney --- Documentation/RCU/index.rst | 1 + Documentation/RCU/stallwarn.rst | 329 ++++++++++++++++++++++++++++++++++++++++ Documentation/RCU/stallwarn.txt | 316 -------------------------------------- kernel/rcu/tree_stall.h | 4 +- 4 files changed, 332 insertions(+), 318 deletions(-) create mode 100644 Documentation/RCU/stallwarn.rst delete mode 100644 Documentation/RCU/stallwarn.txt diff --git a/Documentation/RCU/index.rst b/Documentation/RCU/index.rst index 9a1d51f394dc..e703d3dbe60c 100644 --- a/Documentation/RCU/index.rst +++ b/Documentation/RCU/index.rst @@ -20,6 +20,7 @@ RCU concepts rculist_nulls rcuref torture + stallwarn listRCU NMI-RCU UP diff --git a/Documentation/RCU/stallwarn.rst b/Documentation/RCU/stallwarn.rst new file mode 100644 index 000000000000..08bc9aec4606 --- /dev/null +++ b/Documentation/RCU/stallwarn.rst @@ -0,0 +1,329 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================== +Using RCU's CPU Stall Detector +============================== + +This document first discusses what sorts of issues RCU's CPU stall +detector can locate, and then discusses kernel parameters and Kconfig +options that can be used to fine-tune the detector's operation. Finally, +this document explains the stall detector's "splat" format. + + +What Causes RCU CPU Stall Warnings? +=================================== + +So your kernel printed an RCU CPU stall warning. The next question is +"What caused it?" The following problems can result in RCU CPU stall +warnings: + +- A CPU looping in an RCU read-side critical section. + +- A CPU looping with interrupts disabled. + +- A CPU looping with preemption disabled. + +- A CPU looping with bottom halves disabled. + +- For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel + without invoking schedule(). If the looping in the kernel is + really expected and desirable behavior, you might need to add + some calls to cond_resched(). + +- Booting Linux using a console connection that is too slow to + keep up with the boot-time console-message rate. For example, + a 115Kbaud serial console can be -way- too slow to keep up + with boot-time message rates, and will frequently result in + RCU CPU stall warning messages. Especially if you have added + debug printk()s. + +- Anything that prevents RCU's grace-period kthreads from running. + This can result in the "All QSes seen" console-log message. + This message will include information on when the kthread last + ran and how often it should be expected to run. It can also + result in the ``rcu_.*kthread starved for`` console-log message, + which will include additional debugging information. + +- A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might + happen to preempt a low-priority task in the middle of an RCU + read-side critical section. This is especially damaging if + that low-priority task is not permitted to run on any other CPU, + in which case the next RCU grace period can never complete, which + will eventually cause the system to run out of memory and hang. + While the system is in the process of running itself out of + memory, you might see stall-warning messages. + +- A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that + is running at a higher priority than the RCU softirq threads. + This will prevent RCU callbacks from ever being invoked, + and in a CONFIG_PREEMPT_RCU kernel will further prevent + RCU grace periods from ever completing. Either way, the + system will eventually run out of memory and hang. In the + CONFIG_PREEMPT_RCU case, you might see stall-warning + messages. + + You can use the rcutree.kthread_prio kernel boot parameter to + increase the scheduling priority of RCU's kthreads, which can + help avoid this problem. However, please note that doing this + can increase your system's context-switch rate and thus degrade + performance. + +- A periodic interrupt whose handler takes longer than the time + interval between successive pairs of interrupts. This can + prevent RCU's kthreads and softirq handlers from running. + Note that certain high-overhead debugging options, for example + the function_graph tracer, can result in interrupt handler taking + considerably longer than normal, which can in turn result in + RCU CPU stall warnings. + +- Testing a workload on a fast system, tuning the stall-warning + timeout down to just barely avoid RCU CPU stall warnings, and then + running the same workload with the same stall-warning timeout on a + slow system. Note that thermal throttling and on-demand governors + can cause a single system to be sometimes fast and sometimes slow! + +- A hardware or software issue shuts off the scheduler-clock + interrupt on a CPU that is not in dyntick-idle mode. This + problem really has happened, and seems to be most likely to + result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels. + +- A bug in the RCU implementation. + +- A hardware failure. This is quite unlikely, but has occurred + at least once in real life. A CPU failed in a running system, + becoming unresponsive, but not causing an immediate crash. + This resulted in a series of RCU CPU stall warnings, eventually + leading the realization that the CPU had failed. + +The RCU, RCU-sched, and RCU-tasks implementations have CPU stall warning. +Note that SRCU does -not- have CPU stall warnings. Please note that +RCU only detects CPU stalls when there is a grace period in progress. +No grace period, no CPU stall warnings. + +To diagnose the cause of the stall, inspect the stack traces. +The offending function will usually be near the top of the stack. +If you have a series of stall warnings from a single extended stall, +comparing the stack traces can often help determine where the stall +is occurring, which will usually be in the function nearest the top of +that portion of the stack which remains the same from trace to trace. +If you can reliably trigger the stall, ftrace can be quite helpful. + +RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE +and with RCU's event tracing. For information on RCU's event tracing, +see include/trace/events/rcu.h. + + +Fine-Tuning the RCU CPU Stall Detector +====================================== + +The rcuupdate.rcu_cpu_stall_suppress module parameter disables RCU's +CPU stall detector, which detects conditions that unduly delay RCU grace +periods. This module parameter enables CPU stall detection by default, +but may be overridden via boot-time parameter or at runtime via sysfs. +The stall detector's idea of what constitutes "unduly delayed" is +controlled by a set of kernel configuration variables and cpp macros: + +CONFIG_RCU_CPU_STALL_TIMEOUT +---------------------------- + + This kernel configuration parameter defines the period of time + that RCU will wait from the beginning of a grace period until it + issues an RCU CPU stall warning. This time period is normally + 21 seconds. + + This configuration parameter may be changed at runtime via the + /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout, however + this parameter is checked only at the beginning of a cycle. + So if you are 10 seconds into a 40-second stall, setting this + sysfs parameter to (say) five will shorten the timeout for the + -next- stall, or the following warning for the current stall + (assuming the stall lasts long enough). It will not affect the + timing of the next warning for the current stall. + + Stall-warning messages may be enabled and disabled completely via + /sys/module/rcupdate/parameters/rcu_cpu_stall_suppress. + +RCU_STALL_DELAY_DELTA +--------------------- + + Although the lockdep facility is extremely useful, it does add + some overhead. Therefore, under CONFIG_PROVE_RCU, the + RCU_STALL_DELAY_DELTA macro allows five extra seconds before + giving an RCU CPU stall warning message. (This is a cpp + macro, not a kernel configuration parameter.) + +RCU_STALL_RAT_DELAY +------------------- + + The CPU stall detector tries to make the offending CPU print its + own warnings, as this often gives better-quality stack traces. + However, if the offending CPU does not detect its own stall in + the number of jiffies specified by RCU_STALL_RAT_DELAY, then + some other CPU will complain. This delay is normally set to + two jiffies. (This is a cpp macro, not a kernel configuration + parameter.) + +rcupdate.rcu_task_stall_timeout +------------------------------- + + This boot/sysfs parameter controls the RCU-tasks stall warning + interval. A value of zero or less suppresses RCU-tasks stall + warnings. A positive value sets the stall-warning interval + in seconds. An RCU-tasks stall warning starts with the line: + + INFO: rcu_tasks detected stalls on tasks: + + And continues with the output of sched_show_task() for each + task stalling the current RCU-tasks grace period. + + +Interpreting RCU's CPU Stall-Detector "Splats" +============================================== + +For non-RCU-tasks flavors of RCU, when a CPU detects that it is stalling, +it will print a message similar to the following:: + + INFO: rcu_sched detected stalls on CPUs/tasks: + 2-...: (3 GPs behind) idle=06c/0/0 softirq=1453/1455 fqs=0 + 16-...: (0 ticks this GP) idle=81c/0/0 softirq=764/764 fqs=0 + (detected by 32, t=2603 jiffies, g=7075, q=625) + +This message indicates that CPU 32 detected that CPUs 2 and 16 were both +causing stalls, and that the stall was affecting RCU-sched. This message +will normally be followed by stack dumps for each CPU. Please note that +PREEMPT_RCU builds can be stalled by tasks as well as by CPUs, and that +the tasks will be indicated by PID, for example, "P3421". It is even +possible for an rcu_state stall to be caused by both CPUs -and- tasks, +in which case the offending CPUs and tasks will all be called out in the list. + +CPU 2's "(3 GPs behind)" indicates that this CPU has not interacted with +the RCU core for the past three grace periods. In contrast, CPU 16's "(0 +ticks this GP)" indicates that this CPU has not taken any scheduling-clock +interrupts during the current stalled grace period. + +The "idle=" portion of the message prints the dyntick-idle state. +The hex number before the first "/" is the low-order 12 bits of the +dynticks counter, which will have an even-numbered value if the CPU +is in dyntick-idle mode and an odd-numbered value otherwise. The hex +number between the two "/"s is the value of the nesting, which will be +a small non-negative number if in the idle loop (as shown above) and a +very large positive number otherwise. + +The "softirq=" portion of the message tracks the number of RCU softirq +handlers that the stalled CPU has executed. The number before the "/" +is the number that had executed since boot at the time that this CPU +last noted the beginning of a grace period, which might be the current +(stalled) grace period, or it might be some earlier grace period (for +example, if the CPU might have been in dyntick-idle mode for an extended +time period. The number after the "/" is the number that have executed +since boot until the current time. If this latter number stays constant +across repeated stall-warning messages, it is possible that RCU's softirq +handlers are no longer able to execute on this CPU. This can happen if +the stalled CPU is spinning with interrupts are disabled, or, in -rt +kernels, if a high-priority process is starving RCU's softirq handler. + +The "fqs=" shows the number of force-quiescent-state idle/offline +detection passes that the grace-period kthread has made across this +CPU since the last time that this CPU noted the beginning of a grace +period. + +The "detected by" line indicates which CPU detected the stall (in this +case, CPU 32), how many jiffies have elapsed since the start of the grace +period (in this case 2603), the grace-period sequence number (7075), and +an estimate of the total number of RCU callbacks queued across all CPUs +(625 in this case). + +In kernels with CONFIG_RCU_FAST_NO_HZ, more information is printed +for each CPU:: + + 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 dyntick_enabled: 1 + +The "last_accelerate:" prints the low-order 16 bits (in hex) of the +jiffies counter when this CPU last invoked rcu_try_advance_all_cbs() +from rcu_needs_cpu() or last invoked rcu_accelerate_cbs() from +rcu_prepare_for_idle(). "dyntick_enabled: 1" indicates that dyntick-idle +processing is enabled. + +If the grace period ends just as the stall warning starts printing, +there will be a spurious stall-warning message, which will include +the following:: + + INFO: Stall ended before state dump start + +This is rare, but does happen from time to time in real life. It is also +possible for a zero-jiffy stall to be flagged in this case, depending +on how the stall warning and the grace-period initialization happen to +interact. Please note that it is not possible to entirely eliminate this +sort of false positive without resorting to things like stop_machine(), +which is overkill for this sort of problem. + +If all CPUs and tasks have passed through quiescent states, but the +grace period has nevertheless failed to end, the stall-warning splat +will include something like the following:: + + All QSes seen, last rcu_preempt kthread activity 23807 (4297905177-4297881370), jiffies_till_next_fqs=3, root ->qsmask 0x0 + +The "23807" indicates that it has been more than 23 thousand jiffies +since the grace-period kthread ran. The "jiffies_till_next_fqs" +indicates how frequently that kthread should run, giving the number +of jiffies between force-quiescent-state scans, in this case three, +which is way less than 23807. Finally, the root rcu_node structure's +->qsmask field is printed, which will normally be zero. + +If the relevant grace-period kthread has been unable to run prior to +the stall warning, as was the case in the "All QSes seen" line above, +the following additional line is printed:: + + kthread starved for 23807 jiffies! g7075 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1 ->cpu=5 + +Starving the grace-period kthreads of CPU time can of course result +in RCU CPU stall warnings even when all CPUs and tasks have passed +through the required quiescent states. The "g" number shows the current +grace-period sequence number, the "f" precedes the ->gp_flags command +to the grace-period kthread, the "RCU_GP_WAIT_FQS" indicates that the +kthread is waiting for a short timeout, the "state" precedes value of the +task_struct ->state field, and the "cpu" indicates that the grace-period +kthread last ran on CPU 5. + + +Multiple Warnings From One Stall +================================ + +If a stall lasts long enough, multiple stall-warning messages will be +printed for it. The second and subsequent messages are printed at +longer intervals, so that the time between (say) the first and second +message will be about three times the interval between the beginning +of the stall and the first message. + + +Stall Warnings for Expedited Grace Periods +========================================== + +If an expedited grace period detects a stall, it will place a message +like the following in dmesg:: + + INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 7-... } 21119 jiffies s: 73 root: 0x2/. + +This indicates that CPU 7 has failed to respond to a reschedule IPI. +The three periods (".") following the CPU number indicate that the CPU +is online (otherwise the first period would instead have been "O"), +that the CPU was online at the beginning of the expedited grace period +(otherwise the second period would have instead been "o"), and that +the CPU has been online at least once since boot (otherwise, the third +period would instead have been "N"). The number before the "jiffies" +indicates that the expedited grace period has been going on for 21,119 +jiffies. The number following the "s:" indicates that the expedited +grace-period sequence counter is 73. The fact that this last value is +odd indicates that an expedited grace period is in flight. The number +following "root:" is a bitmask that indicates which children of the root +rcu_node structure correspond to CPUs and/or tasks that are blocking the +current expedited grace period. If the tree had more than one level, +additional hex numbers would be printed for the states of the other +rcu_node structures in the tree. + +As with normal grace periods, PREEMPT_RCU builds can be stalled by +tasks as well as by CPUs, and that the tasks will be indicated by PID, +for example, "P3421". + +It is entirely possible to see stall warnings from normal and from +expedited grace periods at about the same time during the same run. diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt deleted file mode 100644 index a360a8796710..000000000000 --- a/Documentation/RCU/stallwarn.txt +++ /dev/null @@ -1,316 +0,0 @@ -Using RCU's CPU Stall Detector - -This document first discusses what sorts of issues RCU's CPU stall -detector can locate, and then discusses kernel parameters and Kconfig -options that can be used to fine-tune the detector's operation. Finally, -this document explains the stall detector's "splat" format. - - -What Causes RCU CPU Stall Warnings? - -So your kernel printed an RCU CPU stall warning. The next question is -"What caused it?" The following problems can result in RCU CPU stall -warnings: - -o A CPU looping in an RCU read-side critical section. - -o A CPU looping with interrupts disabled. - -o A CPU looping with preemption disabled. - -o A CPU looping with bottom halves disabled. - -o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel - without invoking schedule(). If the looping in the kernel is - really expected and desirable behavior, you might need to add - some calls to cond_resched(). - -o Booting Linux using a console connection that is too slow to - keep up with the boot-time console-message rate. For example, - a 115Kbaud serial console can be -way- too slow to keep up - with boot-time message rates, and will frequently result in - RCU CPU stall warning messages. Especially if you have added - debug printk()s. - -o Anything that prevents RCU's grace-period kthreads from running. - This can result in the "All QSes seen" console-log message. - This message will include information on when the kthread last - ran and how often it should be expected to run. It can also - result in the "rcu_.*kthread starved for" console-log message, - which will include additional debugging information. - -o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might - happen to preempt a low-priority task in the middle of an RCU - read-side critical section. This is especially damaging if - that low-priority task is not permitted to run on any other CPU, - in which case the next RCU grace period can never complete, which - will eventually cause the system to run out of memory and hang. - While the system is in the process of running itself out of - memory, you might see stall-warning messages. - -o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that - is running at a higher priority than the RCU softirq threads. - This will prevent RCU callbacks from ever being invoked, - and in a CONFIG_PREEMPT_RCU kernel will further prevent - RCU grace periods from ever completing. Either way, the - system will eventually run out of memory and hang. In the - CONFIG_PREEMPT_RCU case, you might see stall-warning - messages. - - You can use the rcutree.kthread_prio kernel boot parameter to - increase the scheduling priority of RCU's kthreads, which can - help avoid this problem. However, please note that doing this - can increase your system's context-switch rate and thus degrade - performance. - -o A periodic interrupt whose handler takes longer than the time - interval between successive pairs of interrupts. This can - prevent RCU's kthreads and softirq handlers from running. - Note that certain high-overhead debugging options, for example - the function_graph tracer, can result in interrupt handler taking - considerably longer than normal, which can in turn result in - RCU CPU stall warnings. - -o Testing a workload on a fast system, tuning the stall-warning - timeout down to just barely avoid RCU CPU stall warnings, and then - running the same workload with the same stall-warning timeout on a - slow system. Note that thermal throttling and on-demand governors - can cause a single system to be sometimes fast and sometimes slow! - -o A hardware or software issue shuts off the scheduler-clock - interrupt on a CPU that is not in dyntick-idle mode. This - problem really has happened, and seems to be most likely to - result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels. - -o A bug in the RCU implementation. - -o A hardware failure. This is quite unlikely, but has occurred - at least once in real life. A CPU failed in a running system, - becoming unresponsive, but not causing an immediate crash. - This resulted in a series of RCU CPU stall warnings, eventually - leading the realization that the CPU had failed. - -The RCU, RCU-sched, and RCU-tasks implementations have CPU stall warning. -Note that SRCU does -not- have CPU stall warnings. Please note that -RCU only detects CPU stalls when there is a grace period in progress. -No grace period, no CPU stall warnings. - -To diagnose the cause of the stall, inspect the stack traces. -The offending function will usually be near the top of the stack. -If you have a series of stall warnings from a single extended stall, -comparing the stack traces can often help determine where the stall -is occurring, which will usually be in the function nearest the top of -that portion of the stack which remains the same from trace to trace. -If you can reliably trigger the stall, ftrace can be quite helpful. - -RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE -and with RCU's event tracing. For information on RCU's event tracing, -see include/trace/events/rcu.h. - - -Fine-Tuning the RCU CPU Stall Detector - -The rcuupdate.rcu_cpu_stall_suppress module parameter disables RCU's -CPU stall detector, which detects conditions that unduly delay RCU grace -periods. This module parameter enables CPU stall detection by default, -but may be overridden via boot-time parameter or at runtime via sysfs. -The stall detector's idea of what constitutes "unduly delayed" is -controlled by a set of kernel configuration variables and cpp macros: - -CONFIG_RCU_CPU_STALL_TIMEOUT - - This kernel configuration parameter defines the period of time - that RCU will wait from the beginning of a grace period until it - issues an RCU CPU stall warning. This time period is normally - 21 seconds. - - This configuration parameter may be changed at runtime via the - /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout, however - this parameter is checked only at the beginning of a cycle. - So if you are 10 seconds into a 40-second stall, setting this - sysfs parameter to (say) five will shorten the timeout for the - -next- stall, or the following warning for the current stall - (assuming the stall lasts long enough). It will not affect the - timing of the next warning for the current stall. - - Stall-warning messages may be enabled and disabled completely via - /sys/module/rcupdate/parameters/rcu_cpu_stall_suppress. - -RCU_STALL_DELAY_DELTA - - Although the lockdep facility is extremely useful, it does add - some overhead. Therefore, under CONFIG_PROVE_RCU, the - RCU_STALL_DELAY_DELTA macro allows five extra seconds before - giving an RCU CPU stall warning message. (This is a cpp - macro, not a kernel configuration parameter.) - -RCU_STALL_RAT_DELAY - - The CPU stall detector tries to make the offending CPU print its - own warnings, as this often gives better-quality stack traces. - However, if the offending CPU does not detect its own stall in - the number of jiffies specified by RCU_STALL_RAT_DELAY, then - some other CPU will complain. This delay is normally set to - two jiffies. (This is a cpp macro, not a kernel configuration - parameter.) - -rcupdate.rcu_task_stall_timeout - - This boot/sysfs parameter controls the RCU-tasks stall warning - interval. A value of zero or less suppresses RCU-tasks stall - warnings. A positive value sets the stall-warning interval - in seconds. An RCU-tasks stall warning starts with the line: - - INFO: rcu_tasks detected stalls on tasks: - - And continues with the output of sched_show_task() for each - task stalling the current RCU-tasks grace period. - - -Interpreting RCU's CPU Stall-Detector "Splats" - -For non-RCU-tasks flavors of RCU, when a CPU detects that it is stalling, -it will print a message similar to the following: - - INFO: rcu_sched detected stalls on CPUs/tasks: - 2-...: (3 GPs behind) idle=06c/0/0 softirq=1453/1455 fqs=0 - 16-...: (0 ticks this GP) idle=81c/0/0 softirq=764/764 fqs=0 - (detected by 32, t=2603 jiffies, g=7075, q=625) - -This message indicates that CPU 32 detected that CPUs 2 and 16 were both -causing stalls, and that the stall was affecting RCU-sched. This message -will normally be followed by stack dumps for each CPU. Please note that -PREEMPT_RCU builds can be stalled by tasks as well as by CPUs, and that -the tasks will be indicated by PID, for example, "P3421". It is even -possible for an rcu_state stall to be caused by both CPUs -and- tasks, -in which case the offending CPUs and tasks will all be called out in the list. - -CPU 2's "(3 GPs behind)" indicates that this CPU has not interacted with -the RCU core for the past three grace periods. In contrast, CPU 16's "(0 -ticks this GP)" indicates that this CPU has not taken any scheduling-clock -interrupts during the current stalled grace period. - -The "idle=" portion of the message prints the dyntick-idle state. -The hex number before the first "/" is the low-order 12 bits of the -dynticks counter, which will have an even-numbered value if the CPU -is in dyntick-idle mode and an odd-numbered value otherwise. The hex -number between the two "/"s is the value of the nesting, which will be -a small non-negative number if in the idle loop (as shown above) and a -very large positive number otherwise. - -The "softirq=" portion of the message tracks the number of RCU softirq -handlers that the stalled CPU has executed. The number before the "/" -is the number that had executed since boot at the time that this CPU -last noted the beginning of a grace period, which might be the current -(stalled) grace period, or it might be some earlier grace period (for -example, if the CPU might have been in dyntick-idle mode for an extended -time period. The number after the "/" is the number that have executed -since boot until the current time. If this latter number stays constant -across repeated stall-warning messages, it is possible that RCU's softirq -handlers are no longer able to execute on this CPU. This can happen if -the stalled CPU is spinning with interrupts are disabled, or, in -rt -kernels, if a high-priority process is starving RCU's softirq handler. - -The "fqs=" shows the number of force-quiescent-state idle/offline -detection passes that the grace-period kthread has made across this -CPU since the last time that this CPU noted the beginning of a grace -period. - -The "detected by" line indicates which CPU detected the stall (in this -case, CPU 32), how many jiffies have elapsed since the start of the grace -period (in this case 2603), the grace-period sequence number (7075), and -an estimate of the total number of RCU callbacks queued across all CPUs -(625 in this case). - -In kernels with CONFIG_RCU_FAST_NO_HZ, more information is printed -for each CPU: - - 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 dyntick_enabled: 1 - -The "last_accelerate:" prints the low-order 16 bits (in hex) of the -jiffies counter when this CPU last invoked rcu_try_advance_all_cbs() -from rcu_needs_cpu() or last invoked rcu_accelerate_cbs() from -rcu_prepare_for_idle(). "dyntick_enabled: 1" indicates that dyntick-idle -processing is enabled. - -If the grace period ends just as the stall warning starts printing, -there will be a spurious stall-warning message, which will include -the following: - - INFO: Stall ended before state dump start - -This is rare, but does happen from time to time in real life. It is also -possible for a zero-jiffy stall to be flagged in this case, depending -on how the stall warning and the grace-period initialization happen to -interact. Please note that it is not possible to entirely eliminate this -sort of false positive without resorting to things like stop_machine(), -which is overkill for this sort of problem. - -If all CPUs and tasks have passed through quiescent states, but the -grace period has nevertheless failed to end, the stall-warning splat -will include something like the following: - - All QSes seen, last rcu_preempt kthread activity 23807 (4297905177-4297881370), jiffies_till_next_fqs=3, root ->qsmask 0x0 - -The "23807" indicates that it has been more than 23 thousand jiffies -since the grace-period kthread ran. The "jiffies_till_next_fqs" -indicates how frequently that kthread should run, giving the number -of jiffies between force-quiescent-state scans, in this case three, -which is way less than 23807. Finally, the root rcu_node structure's -->qsmask field is printed, which will normally be zero. - -If the relevant grace-period kthread has been unable to run prior to -the stall warning, as was the case in the "All QSes seen" line above, -the following additional line is printed: - - kthread starved for 23807 jiffies! g7075 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1 ->cpu=5 - -Starving the grace-period kthreads of CPU time can of course result -in RCU CPU stall warnings even when all CPUs and tasks have passed -through the required quiescent states. The "g" number shows the current -grace-period sequence number, the "f" precedes the ->gp_flags command -to the grace-period kthread, the "RCU_GP_WAIT_FQS" indicates that the -kthread is waiting for a short timeout, the "state" precedes value of the -task_struct ->state field, and the "cpu" indicates that the grace-period -kthread last ran on CPU 5. - - -Multiple Warnings From One Stall - -If a stall lasts long enough, multiple stall-warning messages will be -printed for it. The second and subsequent messages are printed at -longer intervals, so that the time between (say) the first and second -message will be about three times the interval between the beginning -of the stall and the first message. - - -Stall Warnings for Expedited Grace Periods - -If an expedited grace period detects a stall, it will place a message -like the following in dmesg: - - INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 7-... } 21119 jiffies s: 73 root: 0x2/. - -This indicates that CPU 7 has failed to respond to a reschedule IPI. -The three periods (".") following the CPU number indicate that the CPU -is online (otherwise the first period would instead have been "O"), -that the CPU was online at the beginning of the expedited grace period -(otherwise the second period would have instead been "o"), and that -the CPU has been online at least once since boot (otherwise, the third -period would instead have been "N"). The number before the "jiffies" -indicates that the expedited grace period has been going on for 21,119 -jiffies. The number following the "s:" indicates that the expedited -grace-period sequence counter is 73. The fact that this last value is -odd indicates that an expedited grace period is in flight. The number -following "root:" is a bitmask that indicates which children of the root -rcu_node structure correspond to CPUs and/or tasks that are blocking the -current expedited grace period. If the tree had more than one level, -additional hex numbers would be printed for the states of the other -rcu_node structures in the tree. - -As with normal grace periods, PREEMPT_RCU builds can be stalled by -tasks as well as by CPUs, and that the tasks will be indicated by PID, -for example, "P3421". - -It is entirely possible to see stall warnings from normal and from -expedited grace periods at about the same time during the same run. diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 54a6dba0280d..b04256cd7e12 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -468,7 +468,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) /* * OK, time to rat on our buddy... - * See Documentation/RCU/stallwarn.txt for info on how to debug + * See Documentation/RCU/stallwarn.rst for info on how to debug * RCU CPU stall warnings. */ pr_err("INFO: %s detected stalls on CPUs/tasks:\n", rcu_state.name); @@ -535,7 +535,7 @@ static void print_cpu_stall(unsigned long gps) /* * OK, time to rat on ourselves... - * See Documentation/RCU/stallwarn.txt for info on how to debug + * See Documentation/RCU/stallwarn.rst for info on how to debug * RCU CPU stall warnings. */ pr_err("INFO: %s self-detected stall on CPU\n", rcu_state.name); -- cgit v1.2.3 From 2d9c318bfd15394da014737bee30e7b2e22c5eac Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 21 Apr 2020 19:04:11 +0200 Subject: docs: RCU: Don't duplicate chapter names in rculist_nulls.rst Since changeset 58ad30cf91f0 ("docs: fix reference to core-api/namespaces.rst"), auto-references for chapters are generated. This is a nice feature, but has a drawback: no chapters can have the same sumber. So, we need to add two higher hierarchy chapters on this document, in order to avoid such duplication. Fixes: 58ad30cf91f0 ("docs: fix reference to core-api/namespaces.rst") Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Paul E. McKenney --- Documentation/RCU/rculist_nulls.rst | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst index d40374221d69..a9fc774bc400 100644 --- a/Documentation/RCU/rculist_nulls.rst +++ b/Documentation/RCU/rculist_nulls.rst @@ -10,6 +10,9 @@ objects using SLAB_TYPESAFE_BY_RCU allocations. Please read the basics in Documentation/RCU/listRCU.rst +Using 'nulls' +============= + Using special makers (called 'nulls') is a convenient way to solve following problem : @@ -126,6 +129,9 @@ very very fast (before the end of RCU grace period) -------------------------------------------------------------------------- +Avoiding extra smp_rmb() +======================== + With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup() and extra smp_wmb() in insert function. -- cgit v1.2.3 From b81898e3d2133715e4475d25757595a3e18502ed Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 30 Apr 2020 12:23:11 -0700 Subject: doc: Timer problems can cause RCU CPU stall warnings Over the past few years, there have been several cases where timekeeping bugs have caused RCU CPU stall warnings, particularly during hardware bringup. This commit therefore adds such bugs to the list of things that can result in RCU CPU stall warnings. Signed-off-by: Paul E. McKenney --- Documentation/RCU/stallwarn.rst | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/Documentation/RCU/stallwarn.rst b/Documentation/RCU/stallwarn.rst index 08bc9aec4606..c9ab6af4d3be 100644 --- a/Documentation/RCU/stallwarn.rst +++ b/Documentation/RCU/stallwarn.rst @@ -87,6 +87,13 @@ warnings: problem really has happened, and seems to be most likely to result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels. +- A hardware or software issue that prevents time-based wakeups + from occurring. These issues can range from misconfigured or + buggy timer hardware through bugs in the interrupt or exception + path (whether hardware, firmware, or software) through bugs + in Linux's timer subsystem through bugs in the scheduler, and, + yes, even including bugs in RCU itself. + - A bug in the RCU implementation. - A hardware failure. This is quite unlikely, but has occurred -- cgit v1.2.3 From d93d97cbe0d4369153fb04954f1481a9f42aa5b6 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 11 May 2020 19:52:34 -0700 Subject: doc: Tasks RCU must protect instructions before trampoline Protecting the code in a trampoline can also require protecting a number of instructions prior to actually entering the trampoline. For example, these earlier instructions might be computing the address of the trampoline. This commit therefore updates RCU's requirements to record this for posterity. Link: https://lore.kernel.org/lkml/20200511154824.09a18c46@gandalf.local.home/ Reported-by: Lai Jiangshan Reported-by: Steven Rostedt Signed-off-by: Paul E. McKenney --- Documentation/RCU/Design/Requirements/Requirements.rst | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst index 75b8ca007a11..a69b5c43a10c 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.rst +++ b/Documentation/RCU/Design/Requirements/Requirements.rst @@ -2583,7 +2583,12 @@ not work to have these markers in the trampoline itself, because there would need to be instructions following ``rcu_read_unlock()``. Although ``synchronize_rcu()`` would guarantee that execution reached the ``rcu_read_unlock()``, it would not be able to guarantee that execution -had completely left the trampoline. +had completely left the trampoline. Worse yet, in some situations +the trampoline's protection must extend a few instructions *prior* to +execution reaching the trampoline. For example, these few instructions +might calculate the address of the trampoline, so that entering the +trampoline would be pre-ordained a surprisingly long time before execution +actually reached the trampoline itself. The solution, in the form of `Tasks RCU `__, is to have implicit read-side -- cgit v1.2.3 From 7ee880b7bf1dea88d0a472b775aebdb4fb6bf860 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Wed, 15 Apr 2020 22:26:55 +0000 Subject: rcu: Initialize and destroy rcu_synchronize only when necessary The __wait_rcu_gp() function unconditionally initializes and cleans up each element of rs_array[], whether used or not. This is slightly wasteful and rather confusing, so this commit skips both initialization and cleanup for duplicate callback functions. Signed-off-by: Wei Yang Signed-off-by: Paul E. McKenney --- kernel/rcu/update.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 84843adfd939..f5a82e107bcb 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -390,13 +390,14 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array, might_sleep(); continue; } - init_rcu_head_on_stack(&rs_array[i].head); - init_completion(&rs_array[i].completion); for (j = 0; j < i; j++) if (crcu_array[j] == crcu_array[i]) break; - if (j == i) + if (j == i) { + init_rcu_head_on_stack(&rs_array[i].head); + init_completion(&rs_array[i].completion); (crcu_array[i])(&rs_array[i].head, wakeme_after_rcu); + } } /* Wait for all callbacks to be invoked. */ @@ -407,9 +408,10 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array, for (j = 0; j < i; j++) if (crcu_array[j] == crcu_array[i]) break; - if (j == i) + if (j == i) { wait_for_completion(&rs_array[i].completion); - destroy_rcu_head_on_stack(&rs_array[i].head); + destroy_rcu_head_on_stack(&rs_array[i].head); + } } } EXPORT_SYMBOL_GPL(__wait_rcu_gp); -- cgit v1.2.3 From 0a3b3c253a1eb2c7fe7f34086d46660c909abeb3 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 16 Apr 2020 16:46:10 -0700 Subject: mm/mmap.c: Add cond_resched() for exit_mmap() CPU stalls A large process running on a heavily loaded system can encounter the following RCU CPU stall warning: rcu: INFO: rcu_sched self-detected stall on CPU rcu: 3-....: (20998 ticks this GP) idle=4ea/1/0x4000000000000002 softirq=556558/556558 fqs=5190 (t=21013 jiffies g=1005461 q=132576) NMI backtrace for cpu 3 CPU: 3 PID: 501900 Comm: aio-free-ring-w Kdump: loaded Not tainted 5.2.9-108_fbk12_rc3_3858_gb83b75af7909 #1 Hardware name: Wiwynn HoneyBadger/PantherPlus, BIOS HBM6.71 02/03/2016 Call Trace: dump_stack+0x46/0x60 nmi_cpu_backtrace.cold.3+0x13/0x50 ? lapic_can_unplug_cpu.cold.27+0x34/0x34 nmi_trigger_cpumask_backtrace+0xba/0xca rcu_dump_cpu_stacks+0x99/0xc7 rcu_sched_clock_irq.cold.87+0x1aa/0x397 ? tick_sched_do_timer+0x60/0x60 update_process_times+0x28/0x60 tick_sched_timer+0x37/0x70 __hrtimer_run_queues+0xfe/0x270 hrtimer_interrupt+0xf4/0x210 smp_apic_timer_interrupt+0x5e/0x120 apic_timer_interrupt+0xf/0x20 RIP: 0010:kmem_cache_free+0x223/0x300 Code: 88 00 00 00 0f 85 ca 00 00 00 41 8b 55 18 31 f6 f7 da 41 f6 45 0a 02 40 0f 94 c6 83 c6 05 9c 41 5e fa e8 a0 a7 01 00 41 56 9d <49> 8b 47 08 a8 03 0f 85 87 00 00 00 65 48 ff 08 e9 3d fe ff ff 65 RSP: 0018:ffffc9000e8e3da8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13 RAX: 0000000000020000 RBX: ffff88861b9de960 RCX: 0000000000000030 RDX: fffffffffffe41e8 RSI: 000060777fe3a100 RDI: 000000000001be18 RBP: ffffea00186e7780 R08: ffffffffffffffff R09: ffffffffffffffff R10: ffff88861b9dea28 R11: ffff88887ffde000 R12: ffffffff81230a1f R13: ffff888854684dc0 R14: 0000000000000206 R15: ffff8888547dbc00 ? remove_vma+0x4f/0x60 remove_vma+0x4f/0x60 exit_mmap+0xd6/0x160 mmput+0x4a/0x110 do_exit+0x278/0xae0 ? syscall_trace_enter+0x1d3/0x2b0 ? handle_mm_fault+0xaa/0x1c0 do_group_exit+0x3a/0xa0 __x64_sys_exit_group+0x14/0x20 do_syscall_64+0x42/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 And on a PREEMPT=n kernel, the "while (vma)" loop in exit_mmap() can run for a very long time given a large process. This commit therefore adds a cond_resched() to this loop, providing RCU any needed quiescent states. Cc: Andrew Morton Cc: Reviewed-by: Shakeel Butt Reviewed-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- mm/mmap.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/mmap.c b/mm/mmap.c index 59a4682ebf3f..972f839c6ec8 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3159,6 +3159,7 @@ void exit_mmap(struct mm_struct *mm) if (vma->vm_flags & VM_ACCOUNT) nr_accounted += vma_pages(vma); vma = remove_vma(vma); + cond_resched(); } vm_unacct_memory(nr_accounted); } -- cgit v1.2.3 From abfce0414814149f716e1d30da1fb3140d1b3473 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Sun, 19 Apr 2020 21:57:15 +0000 Subject: rcu: Simplify the calculation of rcu_state.ncpus There is only 1 bit set in mask, which means that the only difference between oldmask and the new one will be at the position where the bit is set in mask. This commit therefore updates rcu_state.ncpus by checking whether the bit in mask is already set in rnp->expmaskinitnext. Signed-off-by: Wei Yang Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 6c6569e0586c..bef1dc91bfbe 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3842,10 +3842,9 @@ void rcu_cpu_starting(unsigned int cpu) { unsigned long flags; unsigned long mask; - int nbits; - unsigned long oldmask; struct rcu_data *rdp; struct rcu_node *rnp; + bool newcpu; if (per_cpu(rcu_cpu_started, cpu)) return; @@ -3857,12 +3856,10 @@ void rcu_cpu_starting(unsigned int cpu) mask = rdp->grpmask; raw_spin_lock_irqsave_rcu_node(rnp, flags); WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask); - oldmask = rnp->expmaskinitnext; + newcpu = !(rnp->expmaskinitnext & mask); rnp->expmaskinitnext |= mask; - oldmask ^= rnp->expmaskinitnext; - nbits = bitmap_weight(&oldmask, BITS_PER_LONG); /* Allow lockless access for expedited grace periods. */ - smp_store_release(&rcu_state.ncpus, rcu_state.ncpus + nbits); /* ^^^ */ + smp_store_release(&rcu_state.ncpus, rcu_state.ncpus + newcpu); /* ^^^ */ ASSERT_EXCLUSIVE_WRITER(rcu_state.ncpus); rcu_gpnum_ovf(rnp, rdp); /* Offline-induced counter wrap? */ rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq); -- cgit v1.2.3 From e816d56fad57ba9817cef6606b12f5e14647c3bf Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 1 May 2020 16:49:48 -0700 Subject: rcu: Add callbacks-invoked counters This commit adds a count of the callbacks invoked to the per-CPU rcu_data structure. This count is printed by the show_rcu_gp_kthreads() that is invoked by rcutorture and the RCU CPU stall-warning code. It is also intended for use by drgn. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 1 + kernel/rcu/tree.h | 1 + kernel/rcu/tree_stall.h | 3 +++ 3 files changed, 5 insertions(+) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index bef1dc91bfbe..874c831bcc45 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2443,6 +2443,7 @@ static void rcu_do_batch(struct rcu_data *rdp) local_irq_save(flags); rcu_nocb_lock(rdp); count = -rcl.len; + rdp->n_cbs_invoked += count; trace_rcu_batch_end(rcu_state.name, count, !!rcl.head, need_resched(), is_idle_task(current), rcu_is_callbacks_kthread()); diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 43991a40b084..9c6f7343bec0 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -171,6 +171,7 @@ struct rcu_data { /* different grace periods. */ long qlen_last_fqs_check; /* qlen at last check for QS forcing */ + unsigned long n_cbs_invoked; /* # callbacks invoked since boot. */ unsigned long n_force_qs_snap; /* did other CPU force QS recently? */ long blimit; /* Upper limit on a processed batch */ diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 54a6dba0280d..2768ce6bf657 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -649,6 +649,7 @@ static void check_cpu_stall(struct rcu_data *rdp) */ void show_rcu_gp_kthreads(void) { + unsigned long cbs = 0; int cpu; unsigned long j; unsigned long ja; @@ -690,9 +691,11 @@ void show_rcu_gp_kthreads(void) } for_each_possible_cpu(cpu) { rdp = per_cpu_ptr(&rcu_data, cpu); + cbs += data_race(rdp->n_cbs_invoked); if (rcu_segcblist_is_offloaded(&rdp->cblist)) show_rcu_nocb_state(rdp); } + pr_info("RCU callbacks invoked since boot: %lu\n", cbs); show_rcu_tasks_gp_kthreads(); } EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads); -- cgit v1.2.3 From f8466f94685b5bd931384526cf51e090fd2ac706 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 3 May 2020 19:16:09 -0700 Subject: rcu: Add comment documenting rcu_callback_map's purpose The rcu_callback_map lockdep_map structure was added back in 2013, but its purpose has become obscure. This commit therefore documments that the purpose of rcu_callback map is, in the words of commit 24ef659a857 ("rcu: Provide better diagnostics for blocking in RCU callback functions"), to help lockdep to tie an "inappropriate voluntary context switch back to the fact that the function is being invoked from within a callback." Signed-off-by: Paul E. McKenney --- kernel/rcu/update.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index f5a82e107bcb..ca17b771ad60 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -279,6 +279,7 @@ struct lockdep_map rcu_sched_lock_map = { }; EXPORT_SYMBOL_GPL(rcu_sched_lock_map); +// Tell lockdep when RCU callbacks are being invoked. static struct lock_class_key rcu_callback_key; struct lockdep_map rcu_callback_map = STATIC_LOCKDEP_MAP_INIT("rcu_callback", &rcu_callback_key); -- cgit v1.2.3 From 88748e330040ecf4681a2c8f344fd386862bf913 Mon Sep 17 00:00:00 2001 From: Madhuparna Bhowmik Date: Mon, 4 May 2020 08:05:05 -0400 Subject: trace: events: rcu: Change description of rcu_dyntick trace event The different strings used for describing the polarity are Start, End and StillNonIdle. Since StillIdle is not used in any trace point for rcu_dyntick, it can be removed and StillNonIdle can be added in the description. Because StillNonIdle is used in a few tracepoints for rcu_dyntick. Similarly, USER, IDLE and IRQ are used for describing context in the rcu_dyntick tracepoints. Since, "KERNEL" is not used for any of the rcu_dyntick tracepoints, remove it from the description. Signed-off-by: Madhuparna Bhowmik Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- include/trace/events/rcu.h | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h index f9a7811148e2..af274d1532bf 100644 --- a/include/trace/events/rcu.h +++ b/include/trace/events/rcu.h @@ -435,11 +435,12 @@ TRACE_EVENT_RCU(rcu_fqs, #endif /* #if defined(CONFIG_TREE_RCU) */ /* - * Tracepoint for dyntick-idle entry/exit events. These take a string - * as argument: "Start" for entering dyntick-idle mode, "Startirq" for - * entering it from irq/NMI, "End" for leaving it, "Endirq" for leaving it - * to irq/NMI, "--=" for events moving towards idle, and "++=" for events - * moving away from idle. + * Tracepoint for dyntick-idle entry/exit events. These take 2 strings + * as argument: + * polarity: "Start", "End", "StillNonIdle" for entering, exiting or still not + * being in dyntick-idle mode. + * context: "USER" or "IDLE" or "IRQ". + * NMIs nested in IRQs are inferred with dynticks_nesting > 1 in IRQ context. * * These events also take a pair of numbers, which indicate the nesting * depth before and after the event of interest, and a third number that is -- cgit v1.2.3 From 77865dea25c4f45ce0c5bf61a8470af01fccd944 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 7 May 2020 15:44:46 -0700 Subject: rcu: Grace-period-kthread related sleeps to idle priority This commit converts the long-standing schedule_timeout_interruptible() and schedule_timeout_uninterruptible() calls used by RCU's grace-period kthread to schedule_timeout_idle(). This conversion avoids polluting the load-average with RCU-related sleeping. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 874c831bcc45..feb31c201dee 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1638,7 +1638,7 @@ static void rcu_gp_slow(int delay) if (delay > 0 && !(rcu_seq_ctr(rcu_state.gp_seq) % (rcu_num_nodes * PER_RCU_NODE_PERIOD * delay))) - schedule_timeout_uninterruptible(delay); + schedule_timeout_idle(delay); } static unsigned long sleep_duration; @@ -1661,7 +1661,7 @@ static void rcu_gp_torture_wait(void) duration = xchg(&sleep_duration, 0UL); if (duration > 0) { pr_alert("%s: Waiting %lu jiffies\n", __func__, duration); - schedule_timeout_uninterruptible(duration); + schedule_timeout_idle(duration); pr_alert("%s: Wait complete\n", __func__); } } @@ -2727,7 +2727,7 @@ static void rcu_cpu_kthread(unsigned int cpu) } *statusp = RCU_KTHREAD_YIELDING; trace_rcu_utilization(TPS("Start CPU kthread@rcu_yield")); - schedule_timeout_interruptible(2); + schedule_timeout_idle(2); trace_rcu_utilization(TPS("End CPU kthread@rcu_yield")); *statusp = RCU_KTHREAD_WAITING; } -- cgit v1.2.3 From a9352f72d6a9e8fe4840b9f0d97af8f5a6c52c79 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 7 May 2020 16:34:38 -0700 Subject: rcu: Priority-boost-related sleeps to idle priority This commit converts the long-standing schedule_timeout_interruptible() call used by RCU's priority-boosting kthreads to schedule_timeout_idle(). This conversion avoids polluting the load-average with RCU-related sleeping. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 352223664ebd..25296c17a30d 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1033,7 +1033,7 @@ static int rcu_boost_kthread(void *arg) if (spincnt > 10) { WRITE_ONCE(rnp->boost_kthread_status, RCU_KTHREAD_YIELDING); trace_rcu_utilization(TPS("End boost kthread@rcu_yield")); - schedule_timeout_interruptible(2); + schedule_timeout_idle(2); trace_rcu_utilization(TPS("Start boost kthread@rcu_yield")); spincnt = 0; } -- cgit v1.2.3 From f5ca34643bbd84f514bdeee194c45dd1fb066ef2 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 7 May 2020 16:36:10 -0700 Subject: rcu: No-CBs-related sleeps to idle priority This commit converts the schedule_timeout_interruptible() call used by RCU's no-CBs grace-period kthreads to schedule_timeout_idle(). This conversion avoids polluting the load-average with RCU-related sleeping. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 25296c17a30d..982fc5be5269 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2005,7 +2005,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) /* Polling, so trace if first poll in the series. */ if (gotcbs) trace_rcu_nocb_wake(rcu_state.name, cpu, TPS("Poll")); - schedule_timeout_interruptible(1); + schedule_timeout_idle(1); } else if (!needwait_gp) { /* Wait for callbacks to appear. */ trace_rcu_nocb_wake(rcu_state.name, cpu, TPS("Sleep")); -- cgit v1.2.3 From 68c2f27e01f61760e6ae76fff9682e1ffe9bacb6 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 7 May 2020 16:38:29 -0700 Subject: rcu: Expedited grace-period sleeps to idle priority This commit converts the schedule_timeout_uninterruptible() call used by RCU's expedited grace-period processing to schedule_timeout_idle(). This conversion avoids polluting the load-average with RCU-related sleeping. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_exp.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 72952edad1e4..1888c0eb1216 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -403,7 +403,7 @@ retry_ipi: /* Online, so delay for a bit and try again. */ raw_spin_unlock_irqrestore_rcu_node(rnp, flags); trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("selectofl")); - schedule_timeout_uninterruptible(1); + schedule_timeout_idle(1); goto retry_ipi; } /* CPU really is offline, so we must report its QS. */ -- cgit v1.2.3 From 9f47eb5461aaeb6cb8696f9d11503ae90e4d5cb0 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 8 May 2020 14:15:37 -0700 Subject: fs/btrfs: Add cond_resched() for try_release_extent_mapping() stalls Very large I/Os can cause the following RCU CPU stall warning: RIP: 0010:rb_prev+0x8/0x50 Code: 49 89 c0 49 89 d1 48 89 c2 48 89 f8 e9 e5 fd ff ff 4c 89 48 10 c3 4c = 89 06 c3 4c 89 40 10 c3 0f 1f 00 48 8b 0f 48 39 cf 74 38 <48> 8b 47 10 48 85 c0 74 22 48 8b 50 08 48 85 d2 74 0c 48 89 d0 48 RSP: 0018:ffffc9002212bab0 EFLAGS: 00000287 ORIG_RAX: ffffffffffffff13 RAX: ffff888821f93630 RBX: ffff888821f93630 RCX: ffff888821f937e0 RDX: 0000000000000000 RSI: 0000000000102000 RDI: ffff888821f93630 RBP: 0000000000103000 R08: 000000000006c000 R09: 0000000000000238 R10: 0000000000102fff R11: ffffc9002212bac8 R12: 0000000000000001 R13: ffffffffffffffff R14: 0000000000102000 R15: ffff888821f937e0 __lookup_extent_mapping+0xa0/0x110 try_release_extent_mapping+0xdc/0x220 btrfs_releasepage+0x45/0x70 shrink_page_list+0xa39/0xb30 shrink_inactive_list+0x18f/0x3b0 shrink_lruvec+0x38e/0x6b0 shrink_node+0x14d/0x690 do_try_to_free_pages+0xc6/0x3e0 try_to_free_mem_cgroup_pages+0xe6/0x1e0 reclaim_high.constprop.73+0x87/0xc0 mem_cgroup_handle_over_high+0x66/0x150 exit_to_usermode_loop+0x82/0xd0 do_syscall_64+0xd4/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 On a PREEMPT=n kernel, the try_release_extent_mapping() function's "while" loop might run for a very long time on a large I/O. This commit therefore adds a cond_resched() to this loop, providing RCU any needed quiescent states. Signed-off-by: Paul E. McKenney --- fs/btrfs/extent_io.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 68c96057ad2d..704239546093 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4515,6 +4515,8 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) /* once for us */ free_extent_map(em); + + cond_resched(); /* Allow large-extent preemption. */ } } return try_release_extent_state(tree, page, mask); -- cgit v1.2.3 From 360fbbb4897c98971e8955b063c01250817a2191 Mon Sep 17 00:00:00 2001 From: Lihao Liang Date: Thu, 14 May 2020 21:34:34 +0100 Subject: rcu: Update comment from rsp->rcu_gp_seq to rsp->gp_seq Signed-off-by: Lihao Liang Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 9c6f7343bec0..575745f0a464 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -41,7 +41,7 @@ struct rcu_node { raw_spinlock_t __private lock; /* Root rcu_node's lock protects */ /* some rcu_state fields as well as */ /* following. */ - unsigned long gp_seq; /* Track rsp->rcu_gp_seq. */ + unsigned long gp_seq; /* Track rsp->gp_seq. */ unsigned long gp_seq_needed; /* Track furthest future GP request. */ unsigned long completedqs; /* All QSes done for this node. */ unsigned long qsmask; /* CPUs or groups that need to switch in */ @@ -149,7 +149,7 @@ union rcu_noqs { /* Per-CPU data for read-copy update. */ struct rcu_data { /* 1) quiescent-state and grace-period handling : */ - unsigned long gp_seq; /* Track rsp->rcu_gp_seq counter. */ + unsigned long gp_seq; /* Track rsp->gp_seq counter. */ unsigned long gp_seq_needed; /* Track furthest future GP request. */ union rcu_noqs cpu_no_qs; /* No QSes yet for this CPU. */ bool core_needs_qs; /* Core waits for quiesc state. */ -- cgit v1.2.3 From 3c8920e2dbd1a55f72dc14d656df9d0097cf5c72 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Fri, 15 May 2020 02:34:29 +0200 Subject: tick/nohz: Narrow down noise while setting current task's tick dependency Setting a tick dependency on any task, including the case where a task sets that dependency on itself, triggers an IPI to all CPUs. That is of course suboptimal but it had previously not been an issue because it was only used by POSIX CPU timers on nohz_full, which apparently never occurs in latency-sensitive workloads in production. (Or users of such systems are suffering in silence on the one hand or venting their ire on the wrong people on the other.) But RCU now sets a task tick dependency on the current task in order to fix stall issues that can occur during RCU callback processing. Thus, RCU callback processing triggers frequent system-wide IPIs from nohz_full CPUs. This is quite counter-productive, after all, avoiding IPIs is what nohz_full is supposed to be all about. This commit therefore optimizes tasks' self-setting of a task tick dependency by using tick_nohz_full_kick() to avoid the system-wide IPI. Instead, only the execution of the one task is disturbed, which is acceptable given that this disturbance is well down into the noise compared to the degree to which the RCU callback processing itself disturbs execution. Fixes: 6a949b7af82d (rcu: Force on tick when invoking lots of callbacks) Reported-by: Matt Fleming Signed-off-by: Frederic Weisbecker Cc: stable@kernel.org Cc: Paul E. McKenney Cc: Thomas Gleixner Cc: Peter Zijlstra Cc: Ingo Molnar Signed-off-by: Paul E. McKenney --- kernel/time/tick-sched.c | 22 +++++++++++++++------- 1 file changed, 15 insertions(+), 7 deletions(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 3e2dc9b8858c..f0199a4ba1ad 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -351,16 +351,24 @@ void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_cpu); /* - * Set a per-task tick dependency. Posix CPU timers need this in order to elapse - * per task timers. + * Set a per-task tick dependency. RCU need this. Also posix CPU timers + * in order to elapse per task timers. */ void tick_nohz_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit) { - /* - * We could optimize this with just kicking the target running the task - * if that noise matters for nohz full users. - */ - tick_nohz_dep_set_all(&tsk->tick_dep_mask, bit); + if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask)) { + if (tsk == current) { + preempt_disable(); + tick_nohz_full_kick(); + preempt_enable(); + } else { + /* + * Some future tick_nohz_full_kick_task() + * should optimize this. + */ + tick_nohz_full_kick_all(); + } + } } EXPORT_SYMBOL_GPL(tick_nohz_dep_set_task); -- cgit v1.2.3 From 55fbe86ef303bc8ab040e579fba34a750c08200e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 19 May 2020 15:02:02 -0700 Subject: rcu: Remove initialized but unused rnp from check_slow_task() This commit removes the variable rnp from check_slow_task(), which is defined, assigned to, but not otherwise used. Reported-by: kbuild test robot Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_stall.h | 2 -- 1 file changed, 2 deletions(-) diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 2768ce6bf657..d203f82a380a 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -237,14 +237,12 @@ struct rcu_stall_chk_rdr { */ static bool check_slow_task(struct task_struct *t, void *arg) { - struct rcu_node *rnp; struct rcu_stall_chk_rdr *rscrp = arg; if (task_curr(t)) return false; // It is running, so decline to inspect it. rscrp->nesting = t->rcu_read_lock_nesting; rscrp->rs = t->rcu_read_unlock_special; - rnp = t->rcu_blocked_node; rscrp->on_blkd_list = !list_empty(&t->rcu_node_entry); return true; } -- cgit v1.2.3 From 04b25a495bd68c1dad07263fb91e8b5a31c00a9e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 19 May 2020 17:00:54 -0700 Subject: rcu: Mark rcu_nmi_enter() call to rcu_cleanup_after_idle() noinstr The objtool complains about the call to rcu_cleanup_after_idle() from rcu_nmi_enter(), so this commit adds instrumentation_begin() before that call and instrumentation_end() after it. Acked-by: Peter Zijlstra (Intel) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index feb31c201dee..d17e5a08bf43 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -990,8 +990,11 @@ noinstr void rcu_nmi_enter(void) rcu_dynticks_eqs_exit(); // ... but is watching here. - if (!in_nmi()) + if (!in_nmi()) { + instrumentation_begin(); rcu_cleanup_after_idle(); + instrumentation_end(); + } instrumentation_begin(); // instrumentation for the noinstr rcu_dynticks_curr_cpu_in_eqs() -- cgit v1.2.3 From d29e0b26b020422cc51b5b51733cc50fcf443965 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 28 May 2020 08:49:29 -0700 Subject: lockdep: Complain only once about RCU in extended quiescent state Currently, lockdep_rcu_suspicious() complains twice about RCU read-side critical sections being invoked from within extended quiescent states, for example: RCU used illegally from idle CPU! rcu_scheduler_active = 2, debug_locks = 1 RCU used illegally from extended quiescent state! This commit therefore saves a couple lines of code and one line of console-log output by eliminating the first of these two complaints. Link: https://lore.kernel.org/lkml/87wo4wnpzb.fsf@nanos.tec.linutronix.de Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Will Deacon Signed-off-by: Paul E. McKenney --- kernel/locking/lockdep.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 29a8de4c50b9..0a7549d159ed 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -5851,9 +5851,7 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s) pr_warn("\n%srcu_scheduler_active = %d, debug_locks = %d\n", !rcu_lockdep_current_cpu_online() ? "RCU used illegally from offline CPU!\n" - : !rcu_is_watching() - ? "RCU used illegally from idle CPU!\n" - : "", + : "", rcu_scheduler_active, debug_locks); /* -- cgit v1.2.3 From e40bb921119814c6f746891af9cd37eccda616a4 Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Mon, 1 Jun 2020 19:45:49 +0100 Subject: rcu: Replace 1 with true Coccinelle reports a warning WARNING: Assignment of 0/1 to bool variable The root cause is that the variable lastphase is a bool, but is initialised with integer 1. This commit therefore replaces the 1 with a true. Signed-off-by: Jules Irenge Signed-off-by: Paul E. McKenney --- kernel/rcu/update.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index ca17b771ad60..a0ba8858dd35 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -207,7 +207,7 @@ void rcu_end_inkernel_boot(void) rcu_unexpedite_gp(); if (rcu_normal_after_boot) WRITE_ONCE(rcu_normal, 1); - rcu_boot_ended = 1; + rcu_boot_ended = true; } /* -- cgit v1.2.3 From c6dfd72b7a3b70a2054db0f73245ea2f762a8452 Mon Sep 17 00:00:00 2001 From: Peter Enderborg Date: Thu, 4 Jun 2020 12:23:20 +0200 Subject: rcu: Stop shrinker loop The count and scan can be separated in time, and there is a fair chance that all work is already done when the scan starts, which might in turn result in a needless retry. This commit therefore avoids this retry by returning SHRINK_STOP. Reviewed-by: Uladzislau Rezki (Sony) Signed-off-by: Peter Enderborg Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index d17e5a08bf43..c8196fab563c 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3332,7 +3332,7 @@ kfree_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) break; } - return freed; + return freed == 0 ? SHRINK_STOP : freed; } static struct shrinker kfree_rcu_shrinker = { -- cgit v1.2.3 From 00943a609d7ad0f08e58bc9c214f38b0ba163c88 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 12 Jun 2020 10:07:52 +0800 Subject: rcu: gp_max is protected by root rcu_node's lock Because gp_max is protected by root rcu_node's lock, this commit moves the gp_max definition to the region of the rcu_node structure containing fields protected by this lock. Signed-off-by: Wei Yang Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 575745f0a464..09ec93b16f28 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -302,6 +302,8 @@ struct rcu_state { u8 boost ____cacheline_internodealigned_in_smp; /* Subject to priority boost. */ unsigned long gp_seq; /* Grace-period sequence #. */ + unsigned long gp_max; /* Maximum GP duration in */ + /* jiffies. */ struct task_struct *gp_kthread; /* Task for grace periods. */ struct swait_queue_head gp_wq; /* Where GP task waits. */ short gp_flags; /* Commands for GP task. */ @@ -347,8 +349,6 @@ struct rcu_state { /* a reluctant CPU. */ unsigned long n_force_qs_gpstart; /* Snapshot of n_force_qs at */ /* GP start. */ - unsigned long gp_max; /* Maximum GP duration in */ - /* jiffies. */ const char *name; /* Name of structure. */ char abbr; /* Abbreviated name. */ -- cgit v1.2.3 From a2dae43088d51c4869e7fa91ca09bcc890e277fc Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 12 Jun 2020 10:07:53 +0800 Subject: rcu: grplo/grphi just records CPU number The ->grplo and ->grphi fields store the lowest and highest CPU number covered by to a rcu_node structure, which is not the group number. This commit therefore adjusts these fields' comments to match reality. Signed-off-by: Wei Yang Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 09ec93b16f28..9f903f5c9fa1 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -73,8 +73,8 @@ struct rcu_node { unsigned long ffmask; /* Fully functional CPUs. */ unsigned long grpmask; /* Mask to apply to parent qsmask. */ /* Only one bit will be set in this mask. */ - int grplo; /* lowest-numbered CPU or group here. */ - int grphi; /* highest-numbered CPU or group here. */ + int grplo; /* lowest-numbered CPU here. */ + int grphi; /* highest-numbered CPU here. */ u8 grpnum; /* CPU/group number for next level up. */ u8 level; /* root is at level 0. */ bool wait_blkd_tasks;/* Necessary to wait for blocked tasks to */ -- cgit v1.2.3 From 7a0c2b0940c13a06573320ab7118375b35feef8b Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 12 Jun 2020 10:07:54 +0800 Subject: rcu: grpnum just records group number The ->grpnum field in the rcu_node structure contains the bit position in this structure's parent's bitmasks, which is not the CPU number. This commit therefore adjusts this field's comment accordingly. Signed-off-by: Wei Yang Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 9f903f5c9fa1..c96ae351688b 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -75,7 +75,7 @@ struct rcu_node { /* Only one bit will be set in this mask. */ int grplo; /* lowest-numbered CPU here. */ int grphi; /* highest-numbered CPU here. */ - u8 grpnum; /* CPU/group number for next level up. */ + u8 grpnum; /* group number for next level up. */ u8 level; /* root is at level 0. */ bool wait_blkd_tasks;/* Necessary to wait for blocked tasks to */ /* exit RCU read-side critical sections */ -- cgit v1.2.3 From c3cb47a6cc74af0b79579ba167d7124eb669fbaa Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Mon, 15 Jun 2020 12:28:05 -0700 Subject: kernel/rcu/tree.c: Fix kernel-doc warnings Fix kernel-doc warning: ../kernel/rcu/tree.c:959: warning: Excess function parameter 'irq' description in 'rcu_nmi_enter' Fixes: cf7614e13c8f ("rcu: Refactor rcu_{nmi,irq}_{enter,exit}()") Signed-off-by: Randy Dunlap Cc: Byungchul Park Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 1 - 1 file changed, 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index c8196fab563c..ef05aac7f9d3 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -954,7 +954,6 @@ void __rcu_irq_enter_check_tick(void) /** * rcu_nmi_enter - inform RCU of entry to NMI context - * @irq: Is this call from rcu_irq_enter? * * If the CPU was idle from RCU's viewpoint, update rdp->dynticks and * rdp->dynticks_nmi_nesting to let the RCU grace-period handling know -- cgit v1.2.3 From 24692fa22c30cb8fcfcabdc07a3c82964475b639 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 15 Jun 2020 08:46:49 +0200 Subject: rcu: Fix some kernel-doc warnings The current code provokes some kernel-doc warnings: ./kernel/rcu/tree.c:2915: warning: Function parameter or member 'count' not described in 'kfree_rcu_cpu' ./include/linux/rculist.h:517: warning: bad line: [@right ][node2 ... ] ./include/linux/rculist.h:2: WARNING: Unexpected indentation. This commit therefore moves the comment for "count" to the kernel-doc markup and adds a missing "*" on one kernel-doc continuation line. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Paul E. McKenney --- include/linux/rculist.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/rculist.h b/include/linux/rculist.h index df587d181844..7eed65b5f713 100644 --- a/include/linux/rculist.h +++ b/include/linux/rculist.h @@ -512,7 +512,7 @@ static inline void hlist_replace_rcu(struct hlist_node *old, * @right: The hlist head on the right * * The lists start out as [@left ][node1 ... ] and - [@right ][node2 ... ] + * [@right ][node2 ... ] * The lists end up as [@left ][node2 ... ] * [@right ][node1 ... ] */ -- cgit v1.2.3 From 8e11690d2f5a9823d66f68918c3986b4e9e160ab Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 4 May 2020 14:35:00 +0200 Subject: rcu: Fix a kernel-doc warnings for "count" There are some kernel-doc warnings: ./kernel/rcu/tree.c:2915: warning: Function parameter or member 'count' not described in 'kfree_rcu_cpu' This commit therefore moves the comment for "count" to the kernel-doc markup. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 6c6569e0586c..ba4c477495b5 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3004,6 +3004,7 @@ struct kfree_rcu_cpu_work { * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES * @monitor_todo: Tracks whether a @monitor_work delayed work is pending * @initialized: The @lock and @rcu_work fields have been initialized + * @count: Number of objects for which GP not started * * This is a per-CPU structure. The reason that it is not included in * the rcu_data structure is to permit this code to be extracted from @@ -3019,7 +3020,6 @@ struct kfree_rcu_cpu { struct delayed_work monitor_work; bool monitor_todo; bool initialized; - // Number of objects for which GP not started int count; }; -- cgit v1.2.3 From 8ac88f7177c75bf9b7b8c29a8054115e1c712baf Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Mon, 25 May 2020 23:47:45 +0200 Subject: rcu/tree: Keep kfree_rcu() awake during lock contention On PREEMPT_RT kernels, the krcp spinlock gets converted to an rt-mutex and causes kfree_rcu() callers to sleep. This makes it unusable for callers in purely atomic sections such as non-threaded IRQ handlers and raw spinlock sections. Fix it by converting the spinlock to a raw spinlock. Vetting all code paths, there is no reason to believe that the raw spinlock will hurt RT latencies as it is not held for a long time. Cc: bigeasy@linutronix.de Cc: Uladzislau Rezki Reviewed-by: Uladzislau Rezki Signed-off-by: Joel Fernandes (Google) Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index ba4c477495b5..c5de5adca0dd 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3016,7 +3016,7 @@ struct kfree_rcu_cpu { struct kfree_rcu_bulk_data *bhead; struct kfree_rcu_bulk_data *bcached; struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES]; - spinlock_t lock; + raw_spinlock_t lock; struct delayed_work monitor_work; bool monitor_todo; bool initialized; @@ -3049,12 +3049,12 @@ static void kfree_rcu_work(struct work_struct *work) krwp = container_of(to_rcu_work(work), struct kfree_rcu_cpu_work, rcu_work); krcp = krwp->krcp; - spin_lock_irqsave(&krcp->lock, flags); + raw_spin_lock_irqsave(&krcp->lock, flags); head = krwp->head_free; krwp->head_free = NULL; bhead = krwp->bhead_free; krwp->bhead_free = NULL; - spin_unlock_irqrestore(&krcp->lock, flags); + raw_spin_unlock_irqrestore(&krcp->lock, flags); /* "bhead" is now private, so traverse locklessly. */ for (; bhead; bhead = bnext) { @@ -3157,14 +3157,14 @@ static inline void kfree_rcu_drain_unlock(struct kfree_rcu_cpu *krcp, krcp->monitor_todo = false; if (queue_kfree_rcu_work(krcp)) { // Success! Our job is done here. - spin_unlock_irqrestore(&krcp->lock, flags); + raw_spin_unlock_irqrestore(&krcp->lock, flags); return; } // Previous RCU batch still in progress, try again later. krcp->monitor_todo = true; schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES); - spin_unlock_irqrestore(&krcp->lock, flags); + raw_spin_unlock_irqrestore(&krcp->lock, flags); } /* @@ -3177,11 +3177,11 @@ static void kfree_rcu_monitor(struct work_struct *work) struct kfree_rcu_cpu *krcp = container_of(work, struct kfree_rcu_cpu, monitor_work.work); - spin_lock_irqsave(&krcp->lock, flags); + raw_spin_lock_irqsave(&krcp->lock, flags); if (krcp->monitor_todo) kfree_rcu_drain_unlock(krcp, flags); else - spin_unlock_irqrestore(&krcp->lock, flags); + raw_spin_unlock_irqrestore(&krcp->lock, flags); } static inline bool @@ -3252,7 +3252,7 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) local_irq_save(flags); // For safely calling this_cpu_ptr(). krcp = this_cpu_ptr(&krc); if (krcp->initialized) - spin_lock(&krcp->lock); + raw_spin_lock(&krcp->lock); // Queue the object but don't yet schedule the batch. if (debug_rcu_head_queue(head)) { @@ -3283,7 +3283,7 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) unlock_return: if (krcp->initialized) - spin_unlock(&krcp->lock); + raw_spin_unlock(&krcp->lock); local_irq_restore(flags); } EXPORT_SYMBOL_GPL(kfree_call_rcu); @@ -3315,11 +3315,11 @@ kfree_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu); count = krcp->count; - spin_lock_irqsave(&krcp->lock, flags); + raw_spin_lock_irqsave(&krcp->lock, flags); if (krcp->monitor_todo) kfree_rcu_drain_unlock(krcp, flags); else - spin_unlock_irqrestore(&krcp->lock, flags); + raw_spin_unlock_irqrestore(&krcp->lock, flags); sc->nr_to_scan -= count; freed += count; @@ -3346,15 +3346,15 @@ void __init kfree_rcu_scheduler_running(void) for_each_online_cpu(cpu) { struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu); - spin_lock_irqsave(&krcp->lock, flags); + raw_spin_lock_irqsave(&krcp->lock, flags); if (!krcp->head || krcp->monitor_todo) { - spin_unlock_irqrestore(&krcp->lock, flags); + raw_spin_unlock_irqrestore(&krcp->lock, flags); continue; } krcp->monitor_todo = true; schedule_delayed_work_on(cpu, &krcp->monitor_work, KFREE_DRAIN_JIFFIES); - spin_unlock_irqrestore(&krcp->lock, flags); + raw_spin_unlock_irqrestore(&krcp->lock, flags); } } @@ -4250,7 +4250,7 @@ static void __init kfree_rcu_batch_init(void) for_each_possible_cpu(cpu) { struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu); - spin_lock_init(&krcp->lock); + raw_spin_lock_init(&krcp->lock); for (i = 0; i < KFREE_N_BATCHES; i++) { INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work); krcp->krw_arr[i].krcp = krcp; -- cgit v1.2.3 From 4d2919411867848fab78c7cb13139e17ad8b85bc Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Mon, 25 May 2020 23:47:46 +0200 Subject: rcu/tree: Skip entry into the page allocator for PREEMPT_RT To keep the kfree_rcu() code working in purely atomic sections on RT, such as non-threaded IRQ handlers and raw spinlock sections, avoid calling into the page allocator which uses sleeping locks on RT. In fact, even if the caller is preemptible, the kfree_rcu() code is not, as the krcp->lock is a raw spinlock. Calling into the page allocator is optional and avoiding it should be Ok, especially with the page pre-allocation support in future patches. Such pre-allocation would further avoid the a need for a dynamically allocated page in the first place. Cc: Sebastian Andrzej Siewior Reviewed-by: Uladzislau Rezki Co-developed-by: Uladzislau Rezki Signed-off-by: Uladzislau Rezki Signed-off-by: Joel Fernandes (Google) Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index c5de5adca0dd..e0425faf3b3b 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3202,6 +3202,18 @@ kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, if (!bnode) { WARN_ON_ONCE(sizeof(struct kfree_rcu_bulk_data) > PAGE_SIZE); + /* + * To keep this path working on raw non-preemptible + * sections, prevent the optional entry into the + * allocator as it uses sleeping locks. In fact, even + * if the caller of kfree_rcu() is preemptible, this + * path still is not, as krcp->lock is a raw spinlock. + * With additional page pre-allocation in the works, + * hitting this return is going to be much less likely. + */ + if (IS_ENABLED(CONFIG_PREEMPT_RT)) + return false; + bnode = (struct kfree_rcu_bulk_data *) __get_free_page(GFP_NOWAIT | __GFP_NOWARN); } -- cgit v1.2.3 From 594aa5975b9b5cfe9edaec06170e43b8c0607377 Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Mon, 25 May 2020 23:47:47 +0200 Subject: rcu/tree: Repeat the monitor if any free channel is busy It is possible that one of the channels cannot be detached because its free channel is busy and previously queued data has not been processed yet. On the other hand, another channel can be successfully detached causing the monitor work to stop. Prevent that by rescheduling the monitor work if there are any channels in the pending state after a detach attempt. Fixes: 34c881745549e ("rcu: Support kfree_bulk() interface in kfree_rcu()") Acked-by: Joel Fernandes (Google) Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index e0425faf3b3b..5151fe4e1429 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3105,7 +3105,7 @@ static void kfree_rcu_work(struct work_struct *work) static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp) { struct kfree_rcu_cpu_work *krwp; - bool queued = false; + bool repeat = false; int i; lockdep_assert_held(&krcp->lock); @@ -3143,11 +3143,14 @@ static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp) * been detached following each other, one by one. */ queue_rcu_work(system_wq, &krwp->rcu_work); - queued = true; } + + /* Repeat if any "free" corresponding channel is still busy. */ + if (krcp->bhead || krcp->head) + repeat = true; } - return queued; + return !repeat; } static inline void kfree_rcu_drain_unlock(struct kfree_rcu_cpu *krcp, -- cgit v1.2.3 From 446044eb9c9c335d3ae1be4665193ab43ebb284e Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Mon, 25 May 2020 23:47:48 +0200 Subject: rcu/tree: Make debug_objects logic independent of rcu_head kfree_rcu()'s debug_objects logic uses the address of the object's embedded rcu_head to queue/unqueue. Instead of this, make use of the object's address itself as preparation for future headless kfree_rcu() support. Reviewed-by: Uladzislau Rezki Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 29 +++++++++++++---------------- 1 file changed, 13 insertions(+), 16 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 5151fe4e1429..143c1e9265b6 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2970,13 +2970,11 @@ EXPORT_SYMBOL_GPL(call_rcu); * @nr_records: Number of active pointers in the array * @records: Array of the kfree_rcu() pointers * @next: Next bulk object in the block chain - * @head_free_debug: For debug, when CONFIG_DEBUG_OBJECTS_RCU_HEAD is set */ struct kfree_rcu_bulk_data { unsigned long nr_records; void *records[KFREE_BULK_MAX_ENTR]; struct kfree_rcu_bulk_data *next; - struct rcu_head *head_free_debug; }; /** @@ -3026,11 +3024,13 @@ struct kfree_rcu_cpu { static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc); static __always_inline void -debug_rcu_head_unqueue_bulk(struct rcu_head *head) +debug_rcu_bhead_unqueue(struct kfree_rcu_bulk_data *bhead) { #ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD - for (; head; head = head->next) - debug_rcu_head_unqueue(head); + int i; + + for (i = 0; i < bhead->nr_records; i++) + debug_rcu_head_unqueue((struct rcu_head *)(bhead->records[i])); #endif } @@ -3060,7 +3060,7 @@ static void kfree_rcu_work(struct work_struct *work) for (; bhead; bhead = bnext) { bnext = bhead->next; - debug_rcu_head_unqueue_bulk(bhead->head_free_debug); + debug_rcu_bhead_unqueue(bhead); rcu_lock_acquire(&rcu_callback_map); trace_rcu_invoke_kfree_bulk_callback(rcu_state.name, @@ -3082,14 +3082,15 @@ static void kfree_rcu_work(struct work_struct *work) */ for (; head; head = next) { unsigned long offset = (unsigned long)head->func; + void *ptr = (void *)head - offset; next = head->next; - debug_rcu_head_unqueue(head); + debug_rcu_head_unqueue((struct rcu_head *)ptr); rcu_lock_acquire(&rcu_callback_map); trace_rcu_invoke_kfree_callback(rcu_state.name, head, offset); if (!WARN_ON_ONCE(!__is_kfree_rcu_offset(offset))) - kfree((void *)head - offset); + kfree(ptr); rcu_lock_release(&rcu_callback_map); cond_resched_tasks_rcu_qs(); @@ -3228,18 +3229,11 @@ kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, /* Initialize the new block. */ bnode->nr_records = 0; bnode->next = krcp->bhead; - bnode->head_free_debug = NULL; /* Attach it to the head. */ krcp->bhead = bnode; } -#ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD - head->func = func; - head->next = krcp->bhead->head_free_debug; - krcp->bhead->head_free_debug = head; -#endif - /* Finally insert. */ krcp->bhead->records[krcp->bhead->nr_records++] = (void *) head - (unsigned long) func; @@ -3263,14 +3257,17 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) { unsigned long flags; struct kfree_rcu_cpu *krcp; + void *ptr; local_irq_save(flags); // For safely calling this_cpu_ptr(). krcp = this_cpu_ptr(&krc); if (krcp->initialized) raw_spin_lock(&krcp->lock); + ptr = (void *)head - (unsigned long)func; + // Queue the object but don't yet schedule the batch. - if (debug_rcu_head_queue(head)) { + if (debug_rcu_head_queue(ptr)) { // Probable double kfree_rcu(), just leak. WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", __func__, head); -- cgit v1.2.3 From 3af84862817403d317dc33312e7a88d76e79401a Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Mon, 25 May 2020 23:47:49 +0200 Subject: rcu/tree: Simplify KFREE_BULK_MAX_ENTR macro We can simplify KFREE_BULK_MAX_ENTR macro and get rid of magic numbers which were used to make the structure to be exactly one page. Suggested-by: Boqun Feng Reviewed-by: Joel Fernandes (Google) Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 143c1e9265b6..bcdc06364426 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2958,13 +2958,6 @@ EXPORT_SYMBOL_GPL(call_rcu); #define KFREE_DRAIN_JIFFIES (HZ / 50) #define KFREE_N_BATCHES 2 -/* - * This macro defines how many entries the "records" array - * will contain. It is based on the fact that the size of - * kfree_rcu_bulk_data structure becomes exactly one page. - */ -#define KFREE_BULK_MAX_ENTR ((PAGE_SIZE / sizeof(void *)) - 3) - /** * struct kfree_rcu_bulk_data - single block to store kfree_rcu() pointers * @nr_records: Number of active pointers in the array @@ -2973,10 +2966,18 @@ EXPORT_SYMBOL_GPL(call_rcu); */ struct kfree_rcu_bulk_data { unsigned long nr_records; - void *records[KFREE_BULK_MAX_ENTR]; struct kfree_rcu_bulk_data *next; + void *records[]; }; +/* + * This macro defines how many entries the "records" array + * will contain. It is based on the fact that the size of + * kfree_rcu_bulk_data structure becomes exactly one page. + */ +#define KFREE_BULK_MAX_ENTR \ + ((PAGE_SIZE - sizeof(struct kfree_rcu_bulk_data)) / sizeof(void *)) + /** * struct kfree_rcu_cpu_work - single batch of kfree_rcu() requests * @rcu_work: Let queue_rcu_work() invoke workqueue handler after grace period -- cgit v1.2.3 From 952371d6fc0bc360d1d5780f86bb355836117ca2 Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Mon, 25 May 2020 23:47:50 +0200 Subject: rcu/tree: Move kfree_rcu_cpu locking/unlocking to separate functions Introduce helpers to lock and unlock per-cpu "kfree_rcu_cpu" structures. That will make kfree_call_rcu() more readable and prevent programming errors. Reviewed-by: Joel Fernandes (Google) Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 31 +++++++++++++++++++++++-------- 1 file changed, 23 insertions(+), 8 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index bcdc06364426..368bdc441ffb 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3035,6 +3035,27 @@ debug_rcu_bhead_unqueue(struct kfree_rcu_bulk_data *bhead) #endif } +static inline struct kfree_rcu_cpu * +krc_this_cpu_lock(unsigned long *flags) +{ + struct kfree_rcu_cpu *krcp; + + local_irq_save(*flags); // For safely calling this_cpu_ptr(). + krcp = this_cpu_ptr(&krc); + if (likely(krcp->initialized)) + raw_spin_lock(&krcp->lock); + + return krcp; +} + +static inline void +krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags) +{ + if (likely(krcp->initialized)) + raw_spin_unlock(&krcp->lock); + local_irq_restore(flags); +} + /* * This function is invoked in workqueue context after a grace period. * It frees all the objects queued on ->bhead_free or ->head_free. @@ -3260,11 +3281,7 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) struct kfree_rcu_cpu *krcp; void *ptr; - local_irq_save(flags); // For safely calling this_cpu_ptr(). - krcp = this_cpu_ptr(&krc); - if (krcp->initialized) - raw_spin_lock(&krcp->lock); - + krcp = krc_this_cpu_lock(&flags); ptr = (void *)head - (unsigned long)func; // Queue the object but don't yet schedule the batch. @@ -3295,9 +3312,7 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) } unlock_return: - if (krcp->initialized) - raw_spin_unlock(&krcp->lock); - local_irq_restore(flags); + krc_this_cpu_unlock(krcp, flags); } EXPORT_SYMBOL_GPL(kfree_call_rcu); -- cgit v1.2.3 From 69f08d3999dbef1553a3332b8055282dd3893b6c Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Mon, 25 May 2020 23:47:51 +0200 Subject: rcu/tree: Use static initializer for krc.lock The per-CPU variable is initialized at runtime in kfree_rcu_batch_init(). This function is invoked before 'rcu_scheduler_active' is set to 'RCU_SCHEDULER_RUNNING'. After the initialisation, '->initialized' is to true. The raw_spin_lock is only acquired if '->initialized' is set to true. The worqueue item is only used if 'rcu_scheduler_active' set to RCU_SCHEDULER_RUNNING which happens after initialisation. Use a static initializer for krc.lock and remove the runtime initialisation of the lock. Since the lock can now be always acquired, remove the '->initialized' check. Cc: Sebastian Andrzej Siewior Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 368bdc441ffb..a42a4693f161 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3002,7 +3002,7 @@ struct kfree_rcu_cpu_work { * @lock: Synchronize access to this structure * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES * @monitor_todo: Tracks whether a @monitor_work delayed work is pending - * @initialized: The @lock and @rcu_work fields have been initialized + * @initialized: The @rcu_work fields have been initialized * @count: Number of objects for which GP not started * * This is a per-CPU structure. The reason that it is not included in @@ -3022,7 +3022,9 @@ struct kfree_rcu_cpu { int count; }; -static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc); +static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = { + .lock = __RAW_SPIN_LOCK_UNLOCKED(krc.lock), +}; static __always_inline void debug_rcu_bhead_unqueue(struct kfree_rcu_bulk_data *bhead) @@ -3042,8 +3044,7 @@ krc_this_cpu_lock(unsigned long *flags) local_irq_save(*flags); // For safely calling this_cpu_ptr(). krcp = this_cpu_ptr(&krc); - if (likely(krcp->initialized)) - raw_spin_lock(&krcp->lock); + raw_spin_lock(&krcp->lock); return krcp; } @@ -3051,8 +3052,7 @@ krc_this_cpu_lock(unsigned long *flags) static inline void krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags) { - if (likely(krcp->initialized)) - raw_spin_unlock(&krcp->lock); + raw_spin_unlock(&krcp->lock); local_irq_restore(flags); } @@ -4278,7 +4278,6 @@ static void __init kfree_rcu_batch_init(void) for_each_possible_cpu(cpu) { struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu); - raw_spin_lock_init(&krcp->lock); for (i = 0; i < KFREE_N_BATCHES; i++) { INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work); krcp->krw_arr[i].krcp = krcp; -- cgit v1.2.3 From 53c72b590b3a0afd6747d6f7957e6838003e90a4 Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Mon, 25 May 2020 23:47:52 +0200 Subject: rcu/tree: cache specified number of objects In order to reduce the dynamic need for pages in kfree_rcu(), pre-allocate a configurable number of pages per CPU and link them in a list. When kfree_rcu() reclaims objects, the object's container page is cached into a list instead of being released to the low-level page allocator. Such an approach provides O(1) access to free pages while also reducing the number of requests to the page allocator. It also makes the kfree_rcu() code to have free pages available during a low memory condition. A read-only sysfs parameter (rcu_min_cached_objs) reflects the minimum number of allowed cached pages per CPU. Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 8 +++ kernel/rcu/tree.c | 66 +++++++++++++++++++++++-- 2 files changed, 70 insertions(+), 4 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index fb95fad81c79..befaa63652ff 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4038,6 +4038,14 @@ latencies, which will choose a value aligned with the appropriate hardware boundaries. + rcutree.rcu_min_cached_objs= [KNL] + Minimum number of objects which are cached and + maintained per one CPU. Object size is equal + to PAGE_SIZE. The cache allows to reduce the + pressure to page allocator, also it makes the + whole algorithm to behave better in low memory + condition. + rcutree.jiffies_till_first_fqs= [KNL] Set delay from grace-period initialization to first attempt to force quiescent states. diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index a42a4693f161..37c0cd0332f8 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -175,6 +175,15 @@ module_param(gp_init_delay, int, 0444); static int gp_cleanup_delay; module_param(gp_cleanup_delay, int, 0444); +/* + * This rcu parameter is runtime-read-only. It reflects + * a minimum allowed number of objects which can be cached + * per-CPU. Object size is equal to one page. This value + * can be changed at boot time. + */ +static int rcu_min_cached_objs = 2; +module_param(rcu_min_cached_objs, int, 0444); + /* Retrieve RCU kthreads priority for rcutorture */ int rcu_get_gp_kthreads_prio(void) { @@ -2997,7 +3006,6 @@ struct kfree_rcu_cpu_work { * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace period * @head: List of kfree_rcu() objects not yet waiting for a grace period * @bhead: Bulk-List of kfree_rcu() objects not yet waiting for a grace period - * @bcached: Keeps at most one object for later reuse when build chain blocks * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period * @lock: Synchronize access to this structure * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES @@ -3013,13 +3021,22 @@ struct kfree_rcu_cpu_work { struct kfree_rcu_cpu { struct rcu_head *head; struct kfree_rcu_bulk_data *bhead; - struct kfree_rcu_bulk_data *bcached; struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES]; raw_spinlock_t lock; struct delayed_work monitor_work; bool monitor_todo; bool initialized; int count; + + /* + * A simple cache list that contains objects for + * reuse purpose. In order to save some per-cpu + * space the list is singular. Even though it is + * lockless an access has to be protected by the + * per-cpu lock. + */ + struct llist_head bkvcache; + int nr_bkv_objs; }; static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = { @@ -3056,6 +3073,31 @@ krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags) local_irq_restore(flags); } +static inline struct kfree_rcu_bulk_data * +get_cached_bnode(struct kfree_rcu_cpu *krcp) +{ + if (!krcp->nr_bkv_objs) + return NULL; + + krcp->nr_bkv_objs--; + return (struct kfree_rcu_bulk_data *) + llist_del_first(&krcp->bkvcache); +} + +static inline bool +put_cached_bnode(struct kfree_rcu_cpu *krcp, + struct kfree_rcu_bulk_data *bnode) +{ + // Check the limit. + if (krcp->nr_bkv_objs >= rcu_min_cached_objs) + return false; + + llist_add((struct llist_node *) bnode, &krcp->bkvcache); + krcp->nr_bkv_objs++; + return true; + +} + /* * This function is invoked in workqueue context after a grace period. * It frees all the objects queued on ->bhead_free or ->head_free. @@ -3091,7 +3133,12 @@ static void kfree_rcu_work(struct work_struct *work) kfree_bulk(bhead->nr_records, bhead->records); rcu_lock_release(&rcu_callback_map); - if (cmpxchg(&krcp->bcached, NULL, bhead)) + krcp = krc_this_cpu_lock(&flags); + if (put_cached_bnode(krcp, bhead)) + bhead = NULL; + krc_this_cpu_unlock(krcp, flags); + + if (bhead) free_page((unsigned long) bhead); cond_resched_tasks_rcu_qs(); @@ -3224,7 +3271,7 @@ kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, /* Check if a new block is required. */ if (!krcp->bhead || krcp->bhead->nr_records == KFREE_BULK_MAX_ENTR) { - bnode = xchg(&krcp->bcached, NULL); + bnode = get_cached_bnode(krcp); if (!bnode) { WARN_ON_ONCE(sizeof(struct kfree_rcu_bulk_data) > PAGE_SIZE); @@ -4277,12 +4324,23 @@ static void __init kfree_rcu_batch_init(void) for_each_possible_cpu(cpu) { struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu); + struct kfree_rcu_bulk_data *bnode; for (i = 0; i < KFREE_N_BATCHES; i++) { INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work); krcp->krw_arr[i].krcp = krcp; } + for (i = 0; i < rcu_min_cached_objs; i++) { + bnode = (struct kfree_rcu_bulk_data *) + __get_free_page(GFP_NOWAIT | __GFP_NOWARN); + + if (bnode) + put_cached_bnode(krcp, bnode); + else + pr_err("Failed to preallocate for %d CPU!\n", cpu); + } + INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor); krcp->initialized = true; } -- cgit v1.2.3 From 5f3c8d620447d509e534962e23f7edfb85f4e533 Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Mon, 25 May 2020 23:47:53 +0200 Subject: rcu/tree: Maintain separate array for vmalloc ptrs To do so, we use an array of kvfree_rcu_bulk_data structures. It consists of two elements: - index number 0 corresponds to slab pointers. - index number 1 corresponds to vmalloc pointers. Keeping vmalloc pointers separated from slab pointers makes it possible to invoke the right freeing API for the right kind of pointer. It also prepares us for future headless support for vmalloc and SLAB objects. Such objects cannot be queued on a linked list and are instead directly into an array. Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Joel Fernandes (Google) Reviewed-by: Joel Fernandes (Google) Co-developed-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 173 +++++++++++++++++++++++++++++++----------------------- 1 file changed, 100 insertions(+), 73 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 37c0cd0332f8..67c4b984c499 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -57,6 +57,8 @@ #include #include #include +#include +#include #include "../time/tick-internal.h" #include "tree.h" @@ -2966,46 +2968,47 @@ EXPORT_SYMBOL_GPL(call_rcu); /* Maximum number of jiffies to wait before draining a batch. */ #define KFREE_DRAIN_JIFFIES (HZ / 50) #define KFREE_N_BATCHES 2 +#define FREE_N_CHANNELS 2 /** - * struct kfree_rcu_bulk_data - single block to store kfree_rcu() pointers + * struct kvfree_rcu_bulk_data - single block to store kvfree_rcu() pointers * @nr_records: Number of active pointers in the array - * @records: Array of the kfree_rcu() pointers * @next: Next bulk object in the block chain + * @records: Array of the kvfree_rcu() pointers */ -struct kfree_rcu_bulk_data { +struct kvfree_rcu_bulk_data { unsigned long nr_records; - struct kfree_rcu_bulk_data *next; + struct kvfree_rcu_bulk_data *next; void *records[]; }; /* * This macro defines how many entries the "records" array * will contain. It is based on the fact that the size of - * kfree_rcu_bulk_data structure becomes exactly one page. + * kvfree_rcu_bulk_data structure becomes exactly one page. */ -#define KFREE_BULK_MAX_ENTR \ - ((PAGE_SIZE - sizeof(struct kfree_rcu_bulk_data)) / sizeof(void *)) +#define KVFREE_BULK_MAX_ENTR \ + ((PAGE_SIZE - sizeof(struct kvfree_rcu_bulk_data)) / sizeof(void *)) /** * struct kfree_rcu_cpu_work - single batch of kfree_rcu() requests * @rcu_work: Let queue_rcu_work() invoke workqueue handler after grace period * @head_free: List of kfree_rcu() objects waiting for a grace period - * @bhead_free: Bulk-List of kfree_rcu() objects waiting for a grace period + * @bkvhead_free: Bulk-List of kvfree_rcu() objects waiting for a grace period * @krcp: Pointer to @kfree_rcu_cpu structure */ struct kfree_rcu_cpu_work { struct rcu_work rcu_work; struct rcu_head *head_free; - struct kfree_rcu_bulk_data *bhead_free; + struct kvfree_rcu_bulk_data *bkvhead_free[FREE_N_CHANNELS]; struct kfree_rcu_cpu *krcp; }; /** * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace period * @head: List of kfree_rcu() objects not yet waiting for a grace period - * @bhead: Bulk-List of kfree_rcu() objects not yet waiting for a grace period + * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a grace period * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period * @lock: Synchronize access to this structure * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES @@ -3020,7 +3023,7 @@ struct kfree_rcu_cpu_work { */ struct kfree_rcu_cpu { struct rcu_head *head; - struct kfree_rcu_bulk_data *bhead; + struct kvfree_rcu_bulk_data *bkvhead[FREE_N_CHANNELS]; struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES]; raw_spinlock_t lock; struct delayed_work monitor_work; @@ -3044,7 +3047,7 @@ static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = { }; static __always_inline void -debug_rcu_bhead_unqueue(struct kfree_rcu_bulk_data *bhead) +debug_rcu_bhead_unqueue(struct kvfree_rcu_bulk_data *bhead) { #ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD int i; @@ -3073,20 +3076,20 @@ krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags) local_irq_restore(flags); } -static inline struct kfree_rcu_bulk_data * +static inline struct kvfree_rcu_bulk_data * get_cached_bnode(struct kfree_rcu_cpu *krcp) { if (!krcp->nr_bkv_objs) return NULL; krcp->nr_bkv_objs--; - return (struct kfree_rcu_bulk_data *) + return (struct kvfree_rcu_bulk_data *) llist_del_first(&krcp->bkvcache); } static inline bool put_cached_bnode(struct kfree_rcu_cpu *krcp, - struct kfree_rcu_bulk_data *bnode) + struct kvfree_rcu_bulk_data *bnode) { // Check the limit. if (krcp->nr_bkv_objs >= rcu_min_cached_objs) @@ -3105,43 +3108,63 @@ put_cached_bnode(struct kfree_rcu_cpu *krcp, static void kfree_rcu_work(struct work_struct *work) { unsigned long flags; + struct kvfree_rcu_bulk_data *bkvhead[FREE_N_CHANNELS], *bnext; struct rcu_head *head, *next; - struct kfree_rcu_bulk_data *bhead, *bnext; struct kfree_rcu_cpu *krcp; struct kfree_rcu_cpu_work *krwp; + int i, j; krwp = container_of(to_rcu_work(work), struct kfree_rcu_cpu_work, rcu_work); krcp = krwp->krcp; + raw_spin_lock_irqsave(&krcp->lock, flags); + // Channels 1 and 2. + for (i = 0; i < FREE_N_CHANNELS; i++) { + bkvhead[i] = krwp->bkvhead_free[i]; + krwp->bkvhead_free[i] = NULL; + } + + // Channel 3. head = krwp->head_free; krwp->head_free = NULL; - bhead = krwp->bhead_free; - krwp->bhead_free = NULL; raw_spin_unlock_irqrestore(&krcp->lock, flags); - /* "bhead" is now private, so traverse locklessly. */ - for (; bhead; bhead = bnext) { - bnext = bhead->next; - - debug_rcu_bhead_unqueue(bhead); - - rcu_lock_acquire(&rcu_callback_map); - trace_rcu_invoke_kfree_bulk_callback(rcu_state.name, - bhead->nr_records, bhead->records); - - kfree_bulk(bhead->nr_records, bhead->records); - rcu_lock_release(&rcu_callback_map); + // Handle two first channels. + for (i = 0; i < FREE_N_CHANNELS; i++) { + for (; bkvhead[i]; bkvhead[i] = bnext) { + bnext = bkvhead[i]->next; + debug_rcu_bhead_unqueue(bkvhead[i]); + + rcu_lock_acquire(&rcu_callback_map); + if (i == 0) { // kmalloc() / kfree(). + trace_rcu_invoke_kfree_bulk_callback( + rcu_state.name, bkvhead[i]->nr_records, + bkvhead[i]->records); + + kfree_bulk(bkvhead[i]->nr_records, + bkvhead[i]->records); + } else { // vmalloc() / vfree(). + for (j = 0; j < bkvhead[i]->nr_records; j++) { + trace_rcu_invoke_kfree_callback( + rcu_state.name, + bkvhead[i]->records[j], 0); + + vfree(bkvhead[i]->records[j]); + } + } + rcu_lock_release(&rcu_callback_map); - krcp = krc_this_cpu_lock(&flags); - if (put_cached_bnode(krcp, bhead)) - bhead = NULL; - krc_this_cpu_unlock(krcp, flags); + krcp = krc_this_cpu_lock(&flags); + if (put_cached_bnode(krcp, bkvhead[i])) + bkvhead[i] = NULL; + krc_this_cpu_unlock(krcp, flags); - if (bhead) - free_page((unsigned long) bhead); + if (bkvhead[i]) + free_page((unsigned long) bkvhead[i]); - cond_resched_tasks_rcu_qs(); + cond_resched_tasks_rcu_qs(); + } } /* @@ -3159,7 +3182,7 @@ static void kfree_rcu_work(struct work_struct *work) trace_rcu_invoke_kfree_callback(rcu_state.name, head, offset); if (!WARN_ON_ONCE(!__is_kfree_rcu_offset(offset))) - kfree(ptr); + kvfree(ptr); rcu_lock_release(&rcu_callback_map); cond_resched_tasks_rcu_qs(); @@ -3176,7 +3199,7 @@ static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp) { struct kfree_rcu_cpu_work *krwp; bool repeat = false; - int i; + int i, j; lockdep_assert_held(&krcp->lock); @@ -3184,21 +3207,25 @@ static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp) krwp = &(krcp->krw_arr[i]); /* - * Try to detach bhead or head and attach it over any + * Try to detach bkvhead or head and attach it over any * available corresponding free channel. It can be that * a previous RCU batch is in progress, it means that * immediately to queue another one is not possible so * return false to tell caller to retry. */ - if ((krcp->bhead && !krwp->bhead_free) || + if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) || + (krcp->bkvhead[1] && !krwp->bkvhead_free[1]) || (krcp->head && !krwp->head_free)) { - /* Channel 1. */ - if (!krwp->bhead_free) { - krwp->bhead_free = krcp->bhead; - krcp->bhead = NULL; + // Channel 1 corresponds to SLAB ptrs. + // Channel 2 corresponds to vmalloc ptrs. + for (j = 0; j < FREE_N_CHANNELS; j++) { + if (!krwp->bkvhead_free[j]) { + krwp->bkvhead_free[j] = krcp->bkvhead[j]; + krcp->bkvhead[j] = NULL; + } } - /* Channel 2. */ + // Channel 3 corresponds to emergency path. if (!krwp->head_free) { krwp->head_free = krcp->head; krcp->head = NULL; @@ -3207,16 +3234,17 @@ static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp) WRITE_ONCE(krcp->count, 0); /* - * One work is per one batch, so there are two "free channels", - * "bhead_free" and "head_free" the batch can handle. It can be - * that the work is in the pending state when two channels have - * been detached following each other, one by one. + * One work is per one batch, so there are three + * "free channels", the batch can handle. It can + * be that the work is in the pending state when + * channels have been detached following by each + * other. */ queue_rcu_work(system_wq, &krwp->rcu_work); } - /* Repeat if any "free" corresponding channel is still busy. */ - if (krcp->bhead || krcp->head) + // Repeat if any "free" corresponding channel is still busy. + if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) repeat = true; } @@ -3258,23 +3286,22 @@ static void kfree_rcu_monitor(struct work_struct *work) } static inline bool -kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, - struct rcu_head *head, rcu_callback_t func) +kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr) { - struct kfree_rcu_bulk_data *bnode; + struct kvfree_rcu_bulk_data *bnode; + int idx; if (unlikely(!krcp->initialized)) return false; lockdep_assert_held(&krcp->lock); + idx = !!is_vmalloc_addr(ptr); /* Check if a new block is required. */ - if (!krcp->bhead || - krcp->bhead->nr_records == KFREE_BULK_MAX_ENTR) { + if (!krcp->bkvhead[idx] || + krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) { bnode = get_cached_bnode(krcp); if (!bnode) { - WARN_ON_ONCE(sizeof(struct kfree_rcu_bulk_data) > PAGE_SIZE); - /* * To keep this path working on raw non-preemptible * sections, prevent the optional entry into the @@ -3287,7 +3314,7 @@ kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, if (IS_ENABLED(CONFIG_PREEMPT_RT)) return false; - bnode = (struct kfree_rcu_bulk_data *) + bnode = (struct kvfree_rcu_bulk_data *) __get_free_page(GFP_NOWAIT | __GFP_NOWARN); } @@ -3297,30 +3324,30 @@ kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, /* Initialize the new block. */ bnode->nr_records = 0; - bnode->next = krcp->bhead; + bnode->next = krcp->bkvhead[idx]; /* Attach it to the head. */ - krcp->bhead = bnode; + krcp->bkvhead[idx] = bnode; } /* Finally insert. */ - krcp->bhead->records[krcp->bhead->nr_records++] = - (void *) head - (unsigned long) func; + krcp->bkvhead[idx]->records + [krcp->bkvhead[idx]->nr_records++] = ptr; return true; } /* - * Queue a request for lazy invocation of kfree_bulk()/kfree() after a grace - * period. Please note there are two paths are maintained, one is the main one - * that uses kfree_bulk() interface and second one is emergency one, that is - * used only when the main path can not be maintained temporary, due to memory - * pressure. + * Queue a request for lazy invocation of appropriate free routine after a + * grace period. Please note there are three paths are maintained, two are the + * main ones that use array of pointers interface and third one is emergency + * one, that is used only when the main path can not be maintained temporary, + * due to memory pressure. * * Each kfree_call_rcu() request is added to a batch. The batch will be drained * every KFREE_DRAIN_JIFFIES number of jiffies. All the objects in the batch will * be free'd in workqueue context. This allows us to: batch requests together to - * reduce the number of grace periods during heavy kfree_rcu() load. + * reduce the number of grace periods during heavy kfree_rcu()/kvfree_rcu() load. */ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) { @@ -3343,7 +3370,7 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) * Under high memory pressure GFP_NOWAIT can fail, * in that case the emergency path is maintained. */ - if (unlikely(!kfree_call_rcu_add_ptr_to_bulk(krcp, head, func))) { + if (unlikely(!kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr))) { head->func = func; head->next = krcp->head; krcp->head = head; @@ -4324,7 +4351,7 @@ static void __init kfree_rcu_batch_init(void) for_each_possible_cpu(cpu) { struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu); - struct kfree_rcu_bulk_data *bnode; + struct kvfree_rcu_bulk_data *bnode; for (i = 0; i < KFREE_N_BATCHES; i++) { INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work); @@ -4332,7 +4359,7 @@ static void __init kfree_rcu_batch_init(void) } for (i = 0; i < rcu_min_cached_objs; i++) { - bnode = (struct kfree_rcu_bulk_data *) + bnode = (struct kvfree_rcu_bulk_data *) __get_free_page(GFP_NOWAIT | __GFP_NOWARN); if (bnode) -- cgit v1.2.3 From 64d1d06ccb1b7de245ccf781b91517f328bebd9f Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Mon, 25 May 2020 23:47:54 +0200 Subject: rcu/tiny: support vmalloc in tiny-RCU Replace kfree() with kvfree() in rcu_reclaim_tiny(). This makes it possible to release either SLAB or vmalloc objects after a GP. Reviewed-by: Joel Fernandes (Google) Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- kernel/rcu/tiny.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c index dd572ce7c747..4b99f7b88bee 100644 --- a/kernel/rcu/tiny.c +++ b/kernel/rcu/tiny.c @@ -23,6 +23,7 @@ #include #include #include +#include #include "rcu.h" @@ -86,7 +87,7 @@ static inline bool rcu_reclaim_tiny(struct rcu_head *head) rcu_lock_acquire(&rcu_callback_map); if (__is_kfree_rcu_offset(offset)) { trace_rcu_invoke_kfree_callback("", head, offset); - kfree((void *)head - offset); + kvfree((void *)head - offset); rcu_lock_release(&rcu_callback_map); return true; } -- cgit v1.2.3 From c408b215f58f7156bb6bafb64c0263ee907033df Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Mon, 25 May 2020 23:47:55 +0200 Subject: rcu: Rename *_kfree_callback/*_kfree_rcu_offset/kfree_call_* The following changes are introduced: 1. Rename rcu_invoke_kfree_callback() to rcu_invoke_kvfree_callback(), as well as the associated trace events, so the rcu_kfree_callback(), becomes rcu_kvfree_callback(). The reason is to be aligned with kvfree() notation. 2. Rename __is_kfree_rcu_offset to __is_kvfree_rcu_offset. All RCU paths use kvfree() now instead of kfree(), thus rename it. 3. Rename kfree_call_rcu() to the kvfree_call_rcu(). The reason is, it is capable of freeing vmalloc() memory now. Do the same with __kfree_rcu() macro, it becomes __kvfree_rcu(), the goal is the same. Reviewed-by: Joel Fernandes (Google) Co-developed-by: Joel Fernandes (Google) Signed-off-by: Joel Fernandes (Google) Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- include/linux/rcupdate.h | 14 +++++++------- include/linux/rcutiny.h | 2 +- include/linux/rcutree.h | 2 +- include/trace/events/rcu.h | 8 ++++---- kernel/rcu/tiny.c | 4 ++-- kernel/rcu/tree.c | 16 ++++++++-------- 6 files changed, 23 insertions(+), 23 deletions(-) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index 659cbfa7581a..b344fc800a9b 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -828,17 +828,17 @@ static inline notrace void rcu_read_unlock_sched_notrace(void) /* * Does the specified offset indicate that the corresponding rcu_head - * structure can be handled by kfree_rcu()? + * structure can be handled by kvfree_rcu()? */ -#define __is_kfree_rcu_offset(offset) ((offset) < 4096) +#define __is_kvfree_rcu_offset(offset) ((offset) < 4096) /* * Helper macro for kfree_rcu() to prevent argument-expansion eyestrain. */ -#define __kfree_rcu(head, offset) \ +#define __kvfree_rcu(head, offset) \ do { \ - BUILD_BUG_ON(!__is_kfree_rcu_offset(offset)); \ - kfree_call_rcu(head, (rcu_callback_t)(unsigned long)(offset)); \ + BUILD_BUG_ON(!__is_kvfree_rcu_offset(offset)); \ + kvfree_call_rcu(head, (rcu_callback_t)(unsigned long)(offset)); \ } while (0) /** @@ -857,7 +857,7 @@ static inline notrace void rcu_read_unlock_sched_notrace(void) * Because the functions are not allowed in the low-order 4096 bytes of * kernel virtual memory, offsets up to 4095 bytes can be accommodated. * If the offset is larger than 4095 bytes, a compile-time error will - * be generated in __kfree_rcu(). If this error is triggered, you can + * be generated in __kvfree_rcu(). If this error is triggered, you can * either fall back to use of call_rcu() or rearrange the structure to * position the rcu_head structure into the first 4096 bytes. * @@ -872,7 +872,7 @@ do { \ typeof (ptr) ___p = (ptr); \ \ if (___p) \ - __kfree_rcu(&((___p)->rhf), offsetof(typeof(*(ptr)), rhf)); \ + __kvfree_rcu(&((___p)->rhf), offsetof(typeof(*(ptr)), rhf)); \ } while (0) /* diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h index 8512caeb7682..fb2eb39c484f 100644 --- a/include/linux/rcutiny.h +++ b/include/linux/rcutiny.h @@ -34,7 +34,7 @@ static inline void synchronize_rcu_expedited(void) synchronize_rcu(); } -static inline void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) +static inline void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func) { call_rcu(head, func); } diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h index d5cc9d675987..d2f4064ebd1d 100644 --- a/include/linux/rcutree.h +++ b/include/linux/rcutree.h @@ -33,7 +33,7 @@ static inline void rcu_virt_note_context_switch(int cpu) } void synchronize_rcu_expedited(void); -void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func); +void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func); void rcu_barrier(void); bool rcu_eqs_special_set(int cpu); diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h index f9a7811148e2..0ee93d0b1daa 100644 --- a/include/trace/events/rcu.h +++ b/include/trace/events/rcu.h @@ -506,13 +506,13 @@ TRACE_EVENT_RCU(rcu_callback, /* * Tracepoint for the registration of a single RCU callback of the special - * kfree() form. The first argument is the RCU type, the second argument + * kvfree() form. The first argument is the RCU type, the second argument * is a pointer to the RCU callback, the third argument is the offset * of the callback within the enclosing RCU-protected data structure, * the fourth argument is the number of lazy callbacks queued, and the * fifth argument is the total number of callbacks queued. */ -TRACE_EVENT_RCU(rcu_kfree_callback, +TRACE_EVENT_RCU(rcu_kvfree_callback, TP_PROTO(const char *rcuname, struct rcu_head *rhp, unsigned long offset, long qlen), @@ -596,12 +596,12 @@ TRACE_EVENT_RCU(rcu_invoke_callback, /* * Tracepoint for the invocation of a single RCU callback of the special - * kfree() form. The first argument is the RCU flavor, the second + * kvfree() form. The first argument is the RCU flavor, the second * argument is a pointer to the RCU callback, and the third argument * is the offset of the callback within the enclosing RCU-protected * data structure. */ -TRACE_EVENT_RCU(rcu_invoke_kfree_callback, +TRACE_EVENT_RCU(rcu_invoke_kvfree_callback, TP_PROTO(const char *rcuname, struct rcu_head *rhp, unsigned long offset), diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c index 4b99f7b88bee..aa897c3f2e92 100644 --- a/kernel/rcu/tiny.c +++ b/kernel/rcu/tiny.c @@ -85,8 +85,8 @@ static inline bool rcu_reclaim_tiny(struct rcu_head *head) unsigned long offset = (unsigned long)head->func; rcu_lock_acquire(&rcu_callback_map); - if (__is_kfree_rcu_offset(offset)) { - trace_rcu_invoke_kfree_callback("", head, offset); + if (__is_kvfree_rcu_offset(offset)) { + trace_rcu_invoke_kvfree_callback("", head, offset); kvfree((void *)head - offset); rcu_lock_release(&rcu_callback_map); return true; diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 67c4b984c499..f22c47e72287 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2905,8 +2905,8 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func) return; // Enqueued onto ->nocb_bypass, so just leave. // If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock. rcu_segcblist_enqueue(&rdp->cblist, head); - if (__is_kfree_rcu_offset((unsigned long)func)) - trace_rcu_kfree_callback(rcu_state.name, head, + if (__is_kvfree_rcu_offset((unsigned long)func)) + trace_rcu_kvfree_callback(rcu_state.name, head, (unsigned long)func, rcu_segcblist_n_cbs(&rdp->cblist)); else @@ -3146,7 +3146,7 @@ static void kfree_rcu_work(struct work_struct *work) bkvhead[i]->records); } else { // vmalloc() / vfree(). for (j = 0; j < bkvhead[i]->nr_records; j++) { - trace_rcu_invoke_kfree_callback( + trace_rcu_invoke_kvfree_callback( rcu_state.name, bkvhead[i]->records[j], 0); @@ -3179,9 +3179,9 @@ static void kfree_rcu_work(struct work_struct *work) next = head->next; debug_rcu_head_unqueue((struct rcu_head *)ptr); rcu_lock_acquire(&rcu_callback_map); - trace_rcu_invoke_kfree_callback(rcu_state.name, head, offset); + trace_rcu_invoke_kvfree_callback(rcu_state.name, head, offset); - if (!WARN_ON_ONCE(!__is_kfree_rcu_offset(offset))) + if (!WARN_ON_ONCE(!__is_kvfree_rcu_offset(offset))) kvfree(ptr); rcu_lock_release(&rcu_callback_map); @@ -3344,12 +3344,12 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr) * one, that is used only when the main path can not be maintained temporary, * due to memory pressure. * - * Each kfree_call_rcu() request is added to a batch. The batch will be drained + * Each kvfree_call_rcu() request is added to a batch. The batch will be drained * every KFREE_DRAIN_JIFFIES number of jiffies. All the objects in the batch will * be free'd in workqueue context. This allows us to: batch requests together to * reduce the number of grace periods during heavy kfree_rcu()/kvfree_rcu() load. */ -void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) +void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func) { unsigned long flags; struct kfree_rcu_cpu *krcp; @@ -3388,7 +3388,7 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) unlock_return: krc_this_cpu_unlock(krcp, flags); } -EXPORT_SYMBOL_GPL(kfree_call_rcu); +EXPORT_SYMBOL_GPL(kvfree_call_rcu); static unsigned long kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc) -- cgit v1.2.3 From e0feed08ab41df0fedc38d35938891ef5715c1d3 Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Mon, 25 May 2020 23:47:56 +0200 Subject: mm/list_lru.c: Rename kvfree_rcu() to local variant Rename kvfree_rcu() function to the kvfree_rcu_local() one. The purpose is to prevent a conflict of two same function declarations. The kvfree_rcu() will be globally visible what would lead to a build error. No functional change. Cc: linux-mm@kvack.org Cc: rcu@vger.kernel.org Cc: Andrew Morton Signed-off-by: Uladzislau Rezki (Sony) Reviewed-by: Joel Fernandes (Google) Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- mm/list_lru.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/list_lru.c b/mm/list_lru.c index 9222910ab1cb..e825804b3928 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -373,14 +373,14 @@ static void memcg_destroy_list_lru_node(struct list_lru_node *nlru) struct list_lru_memcg *memcg_lrus; /* * This is called when shrinker has already been unregistered, - * and nobody can use it. So, there is no need to use kvfree_rcu(). + * and nobody can use it. So, there is no need to use kvfree_rcu_local(). */ memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true); __memcg_destroy_list_lru_node(memcg_lrus, 0, memcg_nr_cache_ids); kvfree(memcg_lrus); } -static void kvfree_rcu(struct rcu_head *head) +static void kvfree_rcu_local(struct rcu_head *head) { struct list_lru_memcg *mlru; @@ -419,7 +419,7 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru, rcu_assign_pointer(nlru->memcg_lrus, new); spin_unlock_irq(&nlru->lock); - call_rcu(&old->rcu, kvfree_rcu); + call_rcu(&old->rcu, kvfree_rcu_local); return 0; } -- cgit v1.2.3 From ce4dce123fdcb5f209752d13f9f06926be65fc78 Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Mon, 25 May 2020 23:47:57 +0200 Subject: rcu: Introduce 2 arg kvfree_rcu() interface kvmalloc() can allocate two types of objects: SLAB backed and vmalloc backed. How it behaves depends on requested object's size and memory pressure. Add a kvfree_rcu() interface that can free memory allocated via kvmalloc(). It is a simple alias to kfree_rcu() which can now handle either type of object. struct test_kvfree_rcu { struct rcu_head rcu; unsigned char array[100]; }; struct test_kvfree_rcu *p; p = kvmalloc(10 * PAGE_SIZE); if (p) kvfree_rcu(p, rcu); Signed-off-by: Uladzislau Rezki (Sony) Co-developed-by: Joel Fernandes (Google) Reviewed-by: Joel Fernandes (Google) Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- include/linux/rcupdate.h | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index b344fc800a9b..51b26ab02878 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -875,6 +875,15 @@ do { \ __kvfree_rcu(&((___p)->rhf), offsetof(typeof(*(ptr)), rhf)); \ } while (0) +/** + * kvfree_rcu() - kvfree an object after a grace period. + * @ptr: pointer to kvfree + * @rhf: the name of the struct rcu_head within the type of @ptr. + * + * Same as kfree_rcu(), just simple alias. + */ +#define kvfree_rcu(ptr, rhf) kfree_rcu(ptr, rhf) + /* * Place this after a lock-acquisition primitive to guarantee that * an UNLOCK+LOCK pair acts as a full barrier. This guarantee applies -- cgit v1.2.3 From 3042f83f19bec2e0cd356f72b39e4d816e8cd5ff Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Mon, 25 May 2020 23:47:58 +0200 Subject: rcu: Support reclaim for head-less object Update the kvfree_call_rcu() function with head-less support. This allows RCU to reclaim objects without an embedded rcu_head. tree-RCU: We introduce two chains of arrays to store SLAB-backed and vmalloc pointers, each. Storage in either of these arrays does not require embedding an rcu_head within the object. Maintaining the arrays may become impossible due to high memory pressure. For such cases there is an emergency path. Objects with rcu_head inside are just queued on a backup rcu_head list. Later on that list is drained. As for the head-less variant, as the current context can sleep, the following emergency measures are applied: a) Synchronously wait until a grace period has elapsed. b) Call kvfree(). tiny-RCU: For double argument calls, there are no new changes in behavior. For single argument call, kvfree() is directly inlined on the current stack after a synchronize_rcu() call. Note that for tiny-RCU, any call to synchronize_rcu() is actually a quiescent state, therefore it does nothing. Reviewed-by: Joel Fernandes (Google) Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Joel Fernandes (Google) Co-developed-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- include/linux/rcutiny.h | 18 +++++++++++++++++- kernel/rcu/tree.c | 45 +++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 60 insertions(+), 3 deletions(-) diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h index fb2eb39c484f..5cc9637cac16 100644 --- a/include/linux/rcutiny.h +++ b/include/linux/rcutiny.h @@ -34,9 +34,25 @@ static inline void synchronize_rcu_expedited(void) synchronize_rcu(); } +/* + * Add one more declaration of kvfree() here. It is + * not so straight forward to just include + * where it is defined due to getting many compile + * errors caused by that include. + */ +extern void kvfree(const void *addr); + static inline void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func) { - call_rcu(head, func); + if (head) { + call_rcu(head, func); + return; + } + + // kvfree_rcu(one_arg) call. + might_sleep(); + synchronize_rcu(); + kvfree((void *) func); } void rcu_qs(void); diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index f22c47e72287..01f29e4500ba 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3314,6 +3314,13 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr) if (IS_ENABLED(CONFIG_PREEMPT_RT)) return false; + /* + * NOTE: For one argument of kvfree_rcu() we can + * drop the lock and get the page in sleepable + * context. That would allow to maintain an array + * for the CONFIG_PREEMPT_RT as well if no cached + * pages are available. + */ bnode = (struct kvfree_rcu_bulk_data *) __get_free_page(GFP_NOWAIT | __GFP_NOWARN); } @@ -3353,16 +3360,33 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func) { unsigned long flags; struct kfree_rcu_cpu *krcp; + bool success; void *ptr; + if (head) { + ptr = (void *) head - (unsigned long) func; + } else { + /* + * Please note there is a limitation for the head-less + * variant, that is why there is a clear rule for such + * objects: it can be used from might_sleep() context + * only. For other places please embed an rcu_head to + * your data. + */ + might_sleep(); + ptr = (unsigned long *) func; + } + krcp = krc_this_cpu_lock(&flags); - ptr = (void *)head - (unsigned long)func; // Queue the object but don't yet schedule the batch. if (debug_rcu_head_queue(ptr)) { // Probable double kfree_rcu(), just leak. WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", __func__, head); + + // Mark as success and leave. + success = true; goto unlock_return; } @@ -3370,10 +3394,16 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func) * Under high memory pressure GFP_NOWAIT can fail, * in that case the emergency path is maintained. */ - if (unlikely(!kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr))) { + success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr); + if (!success) { + if (head == NULL) + // Inline if kvfree_rcu(one_arg) call. + goto unlock_return; + head->func = func; head->next = krcp->head; krcp->head = head; + success = true; } WRITE_ONCE(krcp->count, krcp->count + 1); @@ -3387,6 +3417,17 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func) unlock_return: krc_this_cpu_unlock(krcp, flags); + + /* + * Inline kvfree() after synchronize_rcu(). We can do + * it from might_sleep() context only, so the current + * CPU can pass the QS state. + */ + if (!success) { + debug_rcu_head_unqueue((struct rcu_head *) ptr); + synchronize_rcu(); + kvfree(ptr); + } } EXPORT_SYMBOL_GPL(kvfree_call_rcu); -- cgit v1.2.3 From 1835f475e3518ade61e25a57572c78b953778656 Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Mon, 25 May 2020 23:47:59 +0200 Subject: rcu: Introduce single argument kvfree_rcu() interface Make kvfree_rcu() capable of freeing objects that will not embed an rcu_head within it. This saves storage overhead in such objects. Reclaiming headless objects this way requires only a single argument (pointer to the object). After this patch, there are two ways to use kvfree_rcu(): a) kvfree_rcu(ptr, rhf); struct X { struct rcu_head rhf; unsigned char data[100]; }; void *ptr = kvmalloc(sizeof(struct X), GFP_KERNEL); if (ptr) kvfree_rcu(ptr, rhf); b) kvfree_rcu(ptr); void *ptr = kvmalloc(some_bytes, GFP_KERNEL); if (ptr) kvfree_rcu(ptr); Note that the headless usage (example b) can only be used in a code that can sleep. This is enforced by the CONFIG_DEBUG_ATOMIC_SLEEP option. Co-developed-by: Joel Fernandes (Google) Reviewed-by: Joel Fernandes (Google) Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- include/linux/rcupdate.h | 38 ++++++++++++++++++++++++++++++++++---- 1 file changed, 34 insertions(+), 4 deletions(-) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index 51b26ab02878..d15d46db61f7 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -877,12 +877,42 @@ do { \ /** * kvfree_rcu() - kvfree an object after a grace period. - * @ptr: pointer to kvfree - * @rhf: the name of the struct rcu_head within the type of @ptr. * - * Same as kfree_rcu(), just simple alias. + * This macro consists of one or two arguments and it is + * based on whether an object is head-less or not. If it + * has a head then a semantic stays the same as it used + * to be before: + * + * kvfree_rcu(ptr, rhf); + * + * where @ptr is a pointer to kvfree(), @rhf is the name + * of the rcu_head structure within the type of @ptr. + * + * When it comes to head-less variant, only one argument + * is passed and that is just a pointer which has to be + * freed after a grace period. Therefore the semantic is + * + * kvfree_rcu(ptr); + * + * where @ptr is a pointer to kvfree(). + * + * Please note, head-less way of freeing is permitted to + * use from a context that has to follow might_sleep() + * annotation. Otherwise, please switch and embed the + * rcu_head structure within the type of @ptr. */ -#define kvfree_rcu(ptr, rhf) kfree_rcu(ptr, rhf) +#define kvfree_rcu(...) KVFREE_GET_MACRO(__VA_ARGS__, \ + kvfree_rcu_arg_2, kvfree_rcu_arg_1)(__VA_ARGS__) + +#define KVFREE_GET_MACRO(_1, _2, NAME, ...) NAME +#define kvfree_rcu_arg_2(ptr, rhf) kfree_rcu(ptr, rhf) +#define kvfree_rcu_arg_1(ptr) \ +do { \ + typeof(ptr) ___p = (ptr); \ + \ + if (___p) \ + kvfree_call_rcu(NULL, (rcu_callback_t) (___p)); \ +} while (0) /* * Place this after a lock-acquisition primitive to guarantee that -- cgit v1.2.3 From da4fc00abb97ce1269b0940abe86e25456e28424 Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Mon, 25 May 2020 23:48:00 +0200 Subject: lib/test_vmalloc.c: Add test cases for kvfree_rcu() Introduce four new test cases for testing the kvfree_rcu() interface. Two of them belong to single argument functionality and another two for 2-argument functionality. The aim is to stress and check how kvfree_rcu() behaves under different load and memory conditions and analyze its performance throughput. Reviewed-by: Joel Fernandes (Google) Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- lib/test_vmalloc.c | 103 ++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 95 insertions(+), 8 deletions(-) diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c index ddc9685702b1..5cf2fe9aab9e 100644 --- a/lib/test_vmalloc.c +++ b/lib/test_vmalloc.c @@ -15,6 +15,8 @@ #include #include #include +#include +#include #define __param(type, name, init, msg) \ static type name = init; \ @@ -35,14 +37,18 @@ __param(int, test_loop_count, 1000000, __param(int, run_test_mask, INT_MAX, "Set tests specified in the mask.\n\n" - "\t\tid: 1, name: fix_size_alloc_test\n" - "\t\tid: 2, name: full_fit_alloc_test\n" - "\t\tid: 4, name: long_busy_list_alloc_test\n" - "\t\tid: 8, name: random_size_alloc_test\n" - "\t\tid: 16, name: fix_align_alloc_test\n" - "\t\tid: 32, name: random_size_align_alloc_test\n" - "\t\tid: 64, name: align_shift_alloc_test\n" - "\t\tid: 128, name: pcpu_alloc_test\n" + "\t\tid: 1, name: fix_size_alloc_test\n" + "\t\tid: 2, name: full_fit_alloc_test\n" + "\t\tid: 4, name: long_busy_list_alloc_test\n" + "\t\tid: 8, name: random_size_alloc_test\n" + "\t\tid: 16, name: fix_align_alloc_test\n" + "\t\tid: 32, name: random_size_align_alloc_test\n" + "\t\tid: 64, name: align_shift_alloc_test\n" + "\t\tid: 128, name: pcpu_alloc_test\n" + "\t\tid: 256, name: kvfree_rcu_1_arg_vmalloc_test\n" + "\t\tid: 512, name: kvfree_rcu_2_arg_vmalloc_test\n" + "\t\tid: 1024, name: kvfree_rcu_1_arg_slab_test\n" + "\t\tid: 2048, name: kvfree_rcu_2_arg_slab_test\n" /* Add a new test case description here. */ ); @@ -316,6 +322,83 @@ pcpu_alloc_test(void) return rv; } +struct test_kvfree_rcu { + struct rcu_head rcu; + unsigned char array[20]; +}; + +static int +kvfree_rcu_1_arg_vmalloc_test(void) +{ + struct test_kvfree_rcu *p; + int i; + + for (i = 0; i < test_loop_count; i++) { + p = vmalloc(1 * PAGE_SIZE); + if (!p) + return -1; + + p->array[0] = 'a'; + kvfree_rcu(p); + } + + return 0; +} + +static int +kvfree_rcu_2_arg_vmalloc_test(void) +{ + struct test_kvfree_rcu *p; + int i; + + for (i = 0; i < test_loop_count; i++) { + p = vmalloc(1 * PAGE_SIZE); + if (!p) + return -1; + + p->array[0] = 'a'; + kvfree_rcu(p, rcu); + } + + return 0; +} + +static int +kvfree_rcu_1_arg_slab_test(void) +{ + struct test_kvfree_rcu *p; + int i; + + for (i = 0; i < test_loop_count; i++) { + p = kmalloc(sizeof(*p), GFP_KERNEL); + if (!p) + return -1; + + p->array[0] = 'a'; + kvfree_rcu(p); + } + + return 0; +} + +static int +kvfree_rcu_2_arg_slab_test(void) +{ + struct test_kvfree_rcu *p; + int i; + + for (i = 0; i < test_loop_count; i++) { + p = kmalloc(sizeof(*p), GFP_KERNEL); + if (!p) + return -1; + + p->array[0] = 'a'; + kvfree_rcu(p, rcu); + } + + return 0; +} + struct test_case_desc { const char *test_name; int (*test_func)(void); @@ -330,6 +413,10 @@ static struct test_case_desc test_case_array[] = { { "random_size_align_alloc_test", random_size_align_alloc_test }, { "align_shift_alloc_test", align_shift_alloc_test }, { "pcpu_alloc_test", pcpu_alloc_test }, + { "kvfree_rcu_1_arg_vmalloc_test", kvfree_rcu_1_arg_vmalloc_test }, + { "kvfree_rcu_2_arg_vmalloc_test", kvfree_rcu_2_arg_vmalloc_test }, + { "kvfree_rcu_1_arg_slab_test", kvfree_rcu_1_arg_slab_test }, + { "kvfree_rcu_2_arg_slab_test", kvfree_rcu_2_arg_slab_test }, /* Add a new test case here. */ }; -- cgit v1.2.3 From ea6eed9f7d7382c7230202d4c3bf74185f193394 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 7 May 2020 16:47:13 -0700 Subject: rcu-tasks: Convert sleeps to idle priority This commit converts the long-standing schedule_timeout_interruptible() and schedule_timeout_uninterruptible() calls used by the various Tasks RCU's grace-period kthreads to schedule_timeout_idle(). This conversion avoids polluting the load-average with Tasks-RCU-related sleeping. Signed-off-by: Paul E. McKenney --- kernel/rcu/tasks.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index ce23f6cc5043..91fee8122acd 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -205,7 +205,7 @@ static int __noreturn rcu_tasks_kthread(void *arg) if (!rtp->cbs_head) { WARN_ON(signal_pending(current)); set_tasks_gp_state(rtp, RTGS_WAIT_WAIT_CBS); - schedule_timeout_interruptible(HZ/10); + schedule_timeout_idle(HZ/10); } continue; } @@ -227,7 +227,7 @@ static int __noreturn rcu_tasks_kthread(void *arg) cond_resched(); } /* Paranoid sleep to keep this from entering a tight loop */ - schedule_timeout_uninterruptible(HZ/10); + schedule_timeout_idle(HZ/10); set_tasks_gp_state(rtp, RTGS_WAIT_CBS); } @@ -336,7 +336,7 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp) /* Slowly back off waiting for holdouts */ set_tasks_gp_state(rtp, RTGS_WAIT_SCAN_HOLDOUTS); - schedule_timeout_interruptible(HZ/fract); + schedule_timeout_idle(HZ/fract); if (fract > 1) fract--; -- cgit v1.2.3 From 04a3c5aa7a8cb2ce97f9beb627ba742bc8b0fe03 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 28 May 2020 19:27:06 -0700 Subject: rcu-tasks: Make rcu_tasks_postscan() be static The rcu_tasks_postscan() function is not used outside of RCU's tasks.h file, so this commit makes it be static. Reported-by: kbuild test robot Signed-off-by: Paul E. McKenney --- kernel/rcu/tasks.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index 91fee8122acd..da200e53d60d 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -402,7 +402,7 @@ static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop) } /* Processing between scanning taskslist and draining the holdout list. */ -void rcu_tasks_postscan(struct list_head *hop) +static void rcu_tasks_postscan(struct list_head *hop) { /* * Wait for tasks that are in the process of exiting. This -- cgit v1.2.3 From 5b3cc99bedf5885055fbaf35fe63d205f06b5be5 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 28 May 2020 19:33:47 -0700 Subject: rcu-tasks: Add #include of rcupdate_trace.h to update.c Although this is in some strict sense unnecessary, it is good to allow the compiler to compare the function declaration with its definition. This commit therefore adds a #include of linux/rcupdate_trace.h to kernel/rcu/update.c. Reported-by: kbuild test robot Signed-off-by: Paul E. McKenney --- kernel/rcu/update.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 84843adfd939..c0fea809d738 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -42,6 +42,7 @@ #include #include #include +#include #define CREATE_TRACE_POINTS -- cgit v1.2.3 From 8344496e8b49c4122c1808d6cd3f8dc71bccb595 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 28 May 2020 20:03:48 -0700 Subject: rcu-tasks: Conditionally compile show_rcu_tasks_gp_kthreads() The show_rcu_tasks_gp_kthreads() function is not invoked by Tiny RCU, but is nevertheless defined in Tiny RCU builds that enable Tasks Trace RCU. This commit therefore conditionally compiles this function so that it is defined only in builds that actually use it. Reported-by: kbuild test robot Signed-off-by: Paul E. McKenney --- kernel/rcu/tasks.h | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index da200e53d60d..d5c003c1972c 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -103,6 +103,7 @@ module_param(rcu_task_stall_timeout, int, 0644); #define RTGS_WAIT_READERS 9 #define RTGS_INVOKE_CBS 10 #define RTGS_WAIT_CBS 11 +#ifndef CONFIG_TINY_RCU static const char * const rcu_tasks_gp_state_names[] = { "RTGS_INIT", "RTGS_WAIT_WAIT_CBS", @@ -117,6 +118,7 @@ static const char * const rcu_tasks_gp_state_names[] = { "RTGS_INVOKE_CBS", "RTGS_WAIT_CBS", }; +#endif /* #ifndef CONFIG_TINY_RCU */ //////////////////////////////////////////////////////////////////////// // @@ -129,6 +131,7 @@ static void set_tasks_gp_state(struct rcu_tasks *rtp, int newstate) rtp->gp_jiffies = jiffies; } +#ifndef CONFIG_TINY_RCU /* Return state name. */ static const char *tasks_gp_state_getname(struct rcu_tasks *rtp) { @@ -139,6 +142,7 @@ static const char *tasks_gp_state_getname(struct rcu_tasks *rtp) return "???"; return rcu_tasks_gp_state_names[j]; } +#endif /* #ifndef CONFIG_TINY_RCU */ // Enqueue a callback for the specified flavor of Tasks RCU. static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func, @@ -268,6 +272,7 @@ static void __init rcu_tasks_bootup_oddness(void) #endif /* #ifndef CONFIG_TINY_RCU */ +#ifndef CONFIG_TINY_RCU /* Dump out rcutorture-relevant state common to all RCU-tasks flavors. */ static void show_rcu_tasks_generic_gp_kthread(struct rcu_tasks *rtp, char *s) { @@ -281,6 +286,7 @@ static void show_rcu_tasks_generic_gp_kthread(struct rcu_tasks *rtp, char *s) ".C"[!!data_race(rtp->cbs_head)], s); } +#endif /* #ifndef CONFIG_TINY_RCU */ static void exit_tasks_rcu_finish_trace(struct task_struct *t); @@ -557,10 +563,12 @@ static int __init rcu_spawn_tasks_kthread(void) } core_initcall(rcu_spawn_tasks_kthread); +#ifndef CONFIG_TINY_RCU static void show_rcu_tasks_classic_gp_kthread(void) { show_rcu_tasks_generic_gp_kthread(&rcu_tasks, ""); } +#endif /* #ifndef CONFIG_TINY_RCU */ /* Do the srcu_read_lock() for the above synchronize_srcu(). */ void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu) @@ -682,10 +690,12 @@ static int __init rcu_spawn_tasks_rude_kthread(void) } core_initcall(rcu_spawn_tasks_rude_kthread); +#ifndef CONFIG_TINY_RCU static void show_rcu_tasks_rude_gp_kthread(void) { show_rcu_tasks_generic_gp_kthread(&rcu_tasks_rude, ""); } +#endif /* #ifndef CONFIG_TINY_RCU */ #else /* #ifdef CONFIG_TASKS_RUDE_RCU */ static void show_rcu_tasks_rude_gp_kthread(void) {} @@ -1164,6 +1174,7 @@ static int __init rcu_spawn_tasks_trace_kthread(void) } core_initcall(rcu_spawn_tasks_trace_kthread); +#ifndef CONFIG_TINY_RCU static void show_rcu_tasks_trace_gp_kthread(void) { char buf[64]; @@ -1174,18 +1185,21 @@ static void show_rcu_tasks_trace_gp_kthread(void) data_race(n_heavy_reader_attempts)); show_rcu_tasks_generic_gp_kthread(&rcu_tasks_trace, buf); } +#endif /* #ifndef CONFIG_TINY_RCU */ #else /* #ifdef CONFIG_TASKS_TRACE_RCU */ static void exit_tasks_rcu_finish_trace(struct task_struct *t) { } static inline void show_rcu_tasks_trace_gp_kthread(void) {} #endif /* #else #ifdef CONFIG_TASKS_TRACE_RCU */ +#ifndef CONFIG_TINY_RCU void show_rcu_tasks_gp_kthreads(void) { show_rcu_tasks_classic_gp_kthread(); show_rcu_tasks_rude_gp_kthread(); show_rcu_tasks_trace_gp_kthread(); } +#endif /* #ifndef CONFIG_TINY_RCU */ #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */ static inline void rcu_tasks_bootup_oddness(void) {} -- cgit v1.2.3 From 30d8aa5128f12c9d781b67c9694c1abfa4f6ce6a Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 9 Jun 2020 09:24:51 -0700 Subject: rcu-tasks: Fix code-style issues This commit declares trc_n_readers_need_end and trc_wait static and replaced a "&" with "&&". The "&" happened to work because the values are bool, but accidents waiting to happen and all that... Reported-by: kbuild test robot Signed-off-by: Paul E. McKenney --- kernel/rcu/tasks.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index d5c003c1972c..828f222895f1 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -737,8 +737,8 @@ EXPORT_SYMBOL_GPL(rcu_trace_lock_map); #ifdef CONFIG_TASKS_TRACE_RCU -atomic_t trc_n_readers_need_end; // Number of waited-for readers. -DECLARE_WAIT_QUEUE_HEAD(trc_wait); // List of holdout tasks. +static atomic_t trc_n_readers_need_end; // Number of waited-for readers. +static DECLARE_WAIT_QUEUE_HEAD(trc_wait); // List of holdout tasks. // Record outstanding IPIs to each CPU. No point in sending two... static DEFINE_PER_CPU(bool, trc_ipi_to_cpu); @@ -845,7 +845,7 @@ static bool trc_inspect_reader(struct task_struct *t, void *arg) bool ofl = cpu_is_offline(cpu); if (task_curr(t)) { - WARN_ON_ONCE(ofl & !is_idle_task(t)); + WARN_ON_ONCE(ofl && !is_idle_task(t)); // If no chance of heavyweight readers, do it the hard way. if (!ofl && !IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB)) -- cgit v1.2.3 From 7e866460cc18797b3a59360f5f8c444598a21729 Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Mon, 25 May 2020 00:36:47 -0400 Subject: rcuperf: Remove useless while loops around wait_event wait_event() already retries if the condition for the wake up is not satisifed after wake up. Remove them from the rcuperf test. Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/rcuperf.c | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c index 16dd1e6b7c09..246da8fe199e 100644 --- a/kernel/rcu/rcuperf.c +++ b/kernel/rcu/rcuperf.c @@ -576,11 +576,8 @@ static int compute_real(int n) static int rcu_perf_shutdown(void *arg) { - do { - wait_event(shutdown_wq, - atomic_read(&n_rcu_perf_writer_finished) >= - nrealwriters); - } while (atomic_read(&n_rcu_perf_writer_finished) < nrealwriters); + wait_event(shutdown_wq, + atomic_read(&n_rcu_perf_writer_finished) >= nrealwriters); smp_mb(); /* Wake before output. */ rcu_perf_cleanup(); kernel_power_off(); @@ -693,11 +690,8 @@ kfree_perf_cleanup(void) static int kfree_perf_shutdown(void *arg) { - do { - wait_event(shutdown_wq, - atomic_read(&n_kfree_perf_thread_ended) >= - kfree_nrealthreads); - } while (atomic_read(&n_kfree_perf_thread_ended) < kfree_nrealthreads); + wait_event(shutdown_wq, + atomic_read(&n_kfree_perf_thread_ended) >= kfree_nrealthreads); smp_mb(); /* Wake before output. */ -- cgit v1.2.3 From 653ed64b01dc5989f8f579d0038e987476c2c023 Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Mon, 25 May 2020 00:36:48 -0400 Subject: refperf: Add a test to measure performance of read-side synchronization Add a test for comparing the performance of RCU with various read-side synchronization mechanisms. The test has proved useful for collecting data and performing these comparisons. Currently RCU, SRCU, reader-writer lock, reader-writer semaphore and reference counting can be measured using refperf.perf_type parameter. Each invocation of the test runs measures performance of a specific mechanism. The maximum number of CPUs to concurrently run readers on is chosen by the test itself and is 75% of the total number of CPUs. So if you had 24 CPUs, the test runs with a maximum of 18 parallel readers. A number of experiments are conducted, and in each experiment, the number of readers is increased by 1, upto the 75% of CPUs mark. During each experiment, all readers execute an empty loop with refperf.loops iterations and time the total loop duration. This is then averaged. Example output: Parameters "refperf.perf_type=srcu refperf.loops=2000000" looks like: [ 3.347133] srcu-ref-perf: [ 3.347133] Threads Time(ns) [ 3.347133] 1 36 [ 3.347133] 2 34 [ 3.347133] 3 34 [ 3.347133] 4 34 [ 3.347133] 5 33 [ 3.347133] 6 33 [ 3.347133] 7 33 [ 3.347133] 8 33 [ 3.347133] 9 33 [ 3.347133] 10 33 [ 3.347133] 11 33 [ 3.347133] 12 33 [ 3.347133] 13 33 [ 3.347133] 14 33 [ 3.347133] 15 32 [ 3.347133] 16 33 [ 3.347133] 17 33 [ 3.347133] 18 34 Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/Kconfig.debug | 19 ++ kernel/rcu/Makefile | 1 + kernel/rcu/refperf.c | 558 +++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 578 insertions(+) create mode 100644 kernel/rcu/refperf.c diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug index 452feae8de20..858765b7f644 100644 --- a/kernel/rcu/Kconfig.debug +++ b/kernel/rcu/Kconfig.debug @@ -61,6 +61,25 @@ config RCU_TORTURE_TEST Say M if you want the RCU torture tests to build as a module. Say N if you are unsure. +config RCU_REF_PERF_TEST + tristate "Performance tests for read-side synchronization (RCU and others)" + depends on DEBUG_KERNEL + select TORTURE_TEST + select SRCU + select TASKS_RCU + select TASKS_RUDE_RCU + select TASKS_TRACE_RCU + default n + help + This option provides a kernel module that runs performance tests + useful comparing RCU with various read-side synchronization mechanisms. + The kernel module may be built after the fact on the running kernel to be + tested, if desired. + + Say Y here if you want these performance tests built into the kernel. + Say M if you want to build it as a module instead. + Say N if you are unsure. + config RCU_CPU_STALL_TIMEOUT int "RCU CPU stall timeout in seconds" depends on RCU_STALL_COMMON diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile index f91f2c2cf138..ba7d82609cbe 100644 --- a/kernel/rcu/Makefile +++ b/kernel/rcu/Makefile @@ -12,6 +12,7 @@ obj-$(CONFIG_TREE_SRCU) += srcutree.o obj-$(CONFIG_TINY_SRCU) += srcutiny.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o obj-$(CONFIG_RCU_PERF_TEST) += rcuperf.o +obj-$(CONFIG_RCU_REF_PERF_TEST) += refperf.o obj-$(CONFIG_TREE_RCU) += tree.o obj-$(CONFIG_TINY_RCU) += tiny.o obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c new file mode 100644 index 000000000000..61161530acc8 --- /dev/null +++ b/kernel/rcu/refperf.c @@ -0,0 +1,558 @@ +// SPDX-License-Identifier: GPL-2.0+ +// +// Performance test comparing RCU vs other mechanisms +// for acquiring references on objects. +// +// Copyright (C) Google, 2020. +// +// Author: Joel Fernandes + +#define pr_fmt(fmt) fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "rcu.h" + +#define PERF_FLAG "-ref-perf: " + +#define PERFOUT(s, x...) \ + pr_alert("%s" PERF_FLAG s, perf_type, ## x) + +#define VERBOSE_PERFOUT(s, x...) \ + do { if (verbose) pr_alert("%s" PERF_FLAG s, perf_type, ## x); } while (0) + +#define VERBOSE_PERFOUT_ERRSTRING(s, x...) \ + do { if (verbose) pr_alert("%s" PERF_FLAG "!!! " s, perf_type, ## x); } while (0) + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Joel Fernandes (Google) "); + +static char *perf_type = "rcu"; +module_param(perf_type, charp, 0444); +MODULE_PARM_DESC(perf_type, "Type of test (rcu, srcu, refcnt, rwsem, rwlock."); + +torture_param(int, verbose, 0, "Enable verbose debugging printk()s"); + +// Number of loops per experiment, all readers execute an operation concurrently +torture_param(long, loops, 10000000, "Number of loops per experiment."); + +#ifdef MODULE +# define REFPERF_SHUTDOWN 0 +#else +# define REFPERF_SHUTDOWN 1 +#endif + +torture_param(bool, shutdown, REFPERF_SHUTDOWN, + "Shutdown at end of performance tests."); + +struct reader_task { + struct task_struct *task; + atomic_t start; + wait_queue_head_t wq; + u64 last_duration_ns; + + // The average latency When 1.. are concurrently + // running an experiment. For example, if this reader_task is + // of index 5 in the reader_tasks array, then result is for + // 6 cores. + u64 result_avg; +}; + +static struct task_struct *shutdown_task; +static wait_queue_head_t shutdown_wq; + +static struct task_struct *main_task; +static wait_queue_head_t main_wq; +static int shutdown_start; + +static struct reader_task *reader_tasks; +static int nreaders; + +// Number of readers that are part of the current experiment. +static atomic_t nreaders_exp; + +// Use to wait for all threads to start. +static atomic_t n_init; + +// Track which experiment is currently running. +static int exp_idx; + +// Operations vector for selecting different types of tests. +struct ref_perf_ops { + void (*init)(void); + void (*cleanup)(void); + int (*readlock)(void); + void (*readunlock)(int idx); + const char *name; +}; + +static struct ref_perf_ops *cur_ops; + +// Definitions for RCU ref perf testing. +static int ref_rcu_read_lock(void) __acquires(RCU) +{ + rcu_read_lock(); + return 0; +} + +static void ref_rcu_read_unlock(int idx) __releases(RCU) +{ + rcu_read_unlock(); +} + +static void rcu_sync_perf_init(void) +{ +} + +static struct ref_perf_ops rcu_ops = { + .init = rcu_sync_perf_init, + .readlock = ref_rcu_read_lock, + .readunlock = ref_rcu_read_unlock, + .name = "rcu" +}; + + +// Definitions for SRCU ref perf testing. +DEFINE_STATIC_SRCU(srcu_refctl_perf); +static struct srcu_struct *srcu_ctlp = &srcu_refctl_perf; + +static int srcu_ref_perf_read_lock(void) __acquires(srcu_ctlp) +{ + return srcu_read_lock(srcu_ctlp); +} + +static void srcu_ref_perf_read_unlock(int idx) __releases(srcu_ctlp) +{ + srcu_read_unlock(srcu_ctlp, idx); +} + +static struct ref_perf_ops srcu_ops = { + .init = rcu_sync_perf_init, + .readlock = srcu_ref_perf_read_lock, + .readunlock = srcu_ref_perf_read_unlock, + .name = "srcu" +}; + +// Definitions for reference count +static atomic_t refcnt; + +static int srcu_ref_perf_refcnt_lock(void) +{ + atomic_inc(&refcnt); + return 0; +} + +static void srcu_ref_perf_refcnt_unlock(int idx) __releases(srcu_ctlp) +{ + atomic_dec(&refcnt); + srcu_read_unlock(srcu_ctlp, idx); +} + +static struct ref_perf_ops refcnt_ops = { + .init = rcu_sync_perf_init, + .readlock = srcu_ref_perf_refcnt_lock, + .readunlock = srcu_ref_perf_refcnt_unlock, + .name = "refcnt" +}; + +// Definitions for rwlock +static rwlock_t test_rwlock; + +static void ref_perf_rwlock_init(void) +{ + rwlock_init(&test_rwlock); +} + +static int ref_perf_rwlock_lock(void) +{ + read_lock(&test_rwlock); + return 0; +} + +static void ref_perf_rwlock_unlock(int idx) +{ + read_unlock(&test_rwlock); +} + +static struct ref_perf_ops rwlock_ops = { + .init = ref_perf_rwlock_init, + .readlock = ref_perf_rwlock_lock, + .readunlock = ref_perf_rwlock_unlock, + .name = "rwlock" +}; + +// Definitions for rwsem +static struct rw_semaphore test_rwsem; + +static void ref_perf_rwsem_init(void) +{ + init_rwsem(&test_rwsem); +} + +static int ref_perf_rwsem_lock(void) +{ + down_read(&test_rwsem); + return 0; +} + +static void ref_perf_rwsem_unlock(int idx) +{ + up_read(&test_rwsem); +} + +static struct ref_perf_ops rwsem_ops = { + .init = ref_perf_rwsem_init, + .readlock = ref_perf_rwsem_lock, + .readunlock = ref_perf_rwsem_unlock, + .name = "rwsem" +}; + +// Reader kthread. Repeatedly does empty RCU read-side +// critical section, minimizing update-side interference. +static int +ref_perf_reader(void *arg) +{ + unsigned long flags; + long me = (long)arg; + struct reader_task *rt = &(reader_tasks[me]); + unsigned long spincnt; + int idx; + u64 start; + s64 duration; + + VERBOSE_PERFOUT("ref_perf_reader %ld: task started", me); + set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids)); + set_user_nice(current, MAX_NICE); + atomic_inc(&n_init); +repeat: + VERBOSE_PERFOUT("ref_perf_reader %ld: waiting to start next experiment on cpu %d", me, smp_processor_id()); + + // Wait for signal that this reader can start. + wait_event(rt->wq, (atomic_read(&nreaders_exp) && atomic_read(&rt->start)) || + torture_must_stop()); + + if (torture_must_stop()) + goto end; + + // Make sure that the CPU is affinitized appropriately during testing. + WARN_ON_ONCE(smp_processor_id() != me); + + atomic_dec(&rt->start); + + // To prevent noise, keep interrupts disabled. This also has the + // effect of preventing entries into slow path for rcu_read_unlock(). + local_irq_save(flags); + start = ktime_get_mono_fast_ns(); + + VERBOSE_PERFOUT("ref_perf_reader %ld: experiment %d started", me, exp_idx); + + for (spincnt = 0; spincnt < loops; spincnt++) { + idx = cur_ops->readlock(); + cur_ops->readunlock(idx); + } + + duration = ktime_get_mono_fast_ns() - start; + local_irq_restore(flags); + + rt->last_duration_ns = WARN_ON_ONCE(duration < 0) ? 0 : duration; + + atomic_dec(&nreaders_exp); + + VERBOSE_PERFOUT("ref_perf_reader %ld: experiment %d ended, (readers remaining=%d)", + me, exp_idx, atomic_read(&nreaders_exp)); + + if (!atomic_read(&nreaders_exp)) + wake_up(&main_wq); + + if (!torture_must_stop()) + goto repeat; +end: + torture_kthread_stopping("ref_perf_reader"); + return 0; +} + +void reset_readers(int n) +{ + int i; + struct reader_task *rt; + + for (i = 0; i < n; i++) { + rt = &(reader_tasks[i]); + + rt->last_duration_ns = 0; + } +} + +// Print the results of each reader and return the sum of all their durations. +u64 process_durations(int n) +{ + int i; + struct reader_task *rt; + char buf1[64]; + char buf[512]; + u64 sum = 0; + + buf[0] = 0; + sprintf(buf, "Experiment #%d (Format: :)", + exp_idx); + + for (i = 0; i <= n && !torture_must_stop(); i++) { + rt = &(reader_tasks[i]); + sprintf(buf1, "%d: %llu\t", i, rt->last_duration_ns); + + if (i % 5 == 0) + strcat(buf, "\n"); + strcat(buf, buf1); + + sum += rt->last_duration_ns; + } + strcat(buf, "\n"); + + PERFOUT("%s\n", buf); + + return sum; +} + +// The main_func is the main orchestrator, it performs a bunch of +// experiments. For every experiment, it orders all the readers +// involved to start and waits for them to finish the experiment. It +// then reads their timestamps and starts the next experiment. Each +// experiment progresses from 1 concurrent reader to N of them at which +// point all the timestamps are printed. +static int main_func(void *arg) +{ + int exp, r; + char buf1[64]; + char buf[512]; + + set_cpus_allowed_ptr(current, cpumask_of(nreaders % nr_cpu_ids)); + set_user_nice(current, MAX_NICE); + + VERBOSE_PERFOUT("main_func task started"); + atomic_inc(&n_init); + + // Wait for all threads to start. + wait_event(main_wq, atomic_read(&n_init) == (nreaders + 1)); + + // Start exp readers up per experiment + for (exp = 0; exp < nreaders && !torture_must_stop(); exp++) { + if (torture_must_stop()) + goto end; + + reset_readers(exp); + atomic_set(&nreaders_exp, exp + 1); + + exp_idx = exp; + + for (r = 0; r <= exp; r++) { + atomic_set(&reader_tasks[r].start, 1); + wake_up(&reader_tasks[r].wq); + } + + VERBOSE_PERFOUT("main_func: experiment started, waiting for %d readers", + exp); + + wait_event(main_wq, + !atomic_read(&nreaders_exp) || torture_must_stop()); + + VERBOSE_PERFOUT("main_func: experiment ended"); + + if (torture_must_stop()) + goto end; + + reader_tasks[exp].result_avg = process_durations(exp) / ((exp + 1) * loops); + } + + // Print the average of all experiments + PERFOUT("END OF TEST. Calculating average duration per loop (nanoseconds)...\n"); + + buf[0] = 0; + strcat(buf, "\n"); + strcat(buf, "Threads\tTime(ns)\n"); + + for (exp = 0; exp < nreaders; exp++) { + sprintf(buf1, "%d\t%llu\n", exp + 1, reader_tasks[exp].result_avg); + strcat(buf, buf1); + } + + PERFOUT("%s", buf); + + // This will shutdown everything including us. + if (shutdown) { + shutdown_start = 1; + wake_up(&shutdown_wq); + } + + // Wait for torture to stop us + while (!torture_must_stop()) + schedule_timeout_uninterruptible(1); + +end: + torture_kthread_stopping("main_func"); + return 0; +} + +static void +ref_perf_print_module_parms(struct ref_perf_ops *cur_ops, const char *tag) +{ + pr_alert("%s" PERF_FLAG + "--- %s: verbose=%d shutdown=%d loops=%ld\n", perf_type, tag, + verbose, shutdown, loops); +} + +static void +ref_perf_cleanup(void) +{ + int i; + + if (torture_cleanup_begin()) + return; + + if (!cur_ops) { + torture_cleanup_end(); + return; + } + + if (reader_tasks) { + for (i = 0; i < nreaders; i++) + torture_stop_kthread("ref_perf_reader", + reader_tasks[i].task); + } + kfree(reader_tasks); + + torture_stop_kthread("main_task", main_task); + kfree(main_task); + + // Do perf-type-specific cleanup operations. + if (cur_ops->cleanup != NULL) + cur_ops->cleanup(); + + torture_cleanup_end(); +} + +// Shutdown kthread. Just waits to be awakened, then shuts down system. +static int +ref_perf_shutdown(void *arg) +{ + wait_event(shutdown_wq, shutdown_start); + + smp_mb(); // Wake before output. + ref_perf_cleanup(); + kernel_power_off(); + + return -EINVAL; +} + +static int __init +ref_perf_init(void) +{ + long i; + int firsterr = 0; + static struct ref_perf_ops *perf_ops[] = { + &rcu_ops, &srcu_ops, &refcnt_ops, &rwlock_ops, &rwsem_ops, + }; + + if (!torture_init_begin(perf_type, verbose)) + return -EBUSY; + + for (i = 0; i < ARRAY_SIZE(perf_ops); i++) { + cur_ops = perf_ops[i]; + if (strcmp(perf_type, cur_ops->name) == 0) + break; + } + if (i == ARRAY_SIZE(perf_ops)) { + pr_alert("rcu-perf: invalid perf type: \"%s\"\n", perf_type); + pr_alert("rcu-perf types:"); + for (i = 0; i < ARRAY_SIZE(perf_ops); i++) + pr_cont(" %s", perf_ops[i]->name); + pr_cont("\n"); + WARN_ON(!IS_MODULE(CONFIG_RCU_REF_PERF_TEST)); + firsterr = -EINVAL; + cur_ops = NULL; + goto unwind; + } + if (cur_ops->init) + cur_ops->init(); + + ref_perf_print_module_parms(cur_ops, "Start of test"); + + // Shutdown task + if (shutdown) { + init_waitqueue_head(&shutdown_wq); + firsterr = torture_create_kthread(ref_perf_shutdown, NULL, + shutdown_task); + if (firsterr) + goto unwind; + schedule_timeout_uninterruptible(1); + } + + // Reader tasks (~75% of online CPUs). + nreaders = (num_online_cpus() >> 1) + (num_online_cpus() >> 2); + reader_tasks = kcalloc(nreaders, sizeof(reader_tasks[0]), + GFP_KERNEL); + if (!reader_tasks) { + VERBOSE_PERFOUT_ERRSTRING("out of memory"); + firsterr = -ENOMEM; + goto unwind; + } + + VERBOSE_PERFOUT("Starting %d reader threads\n", nreaders); + + for (i = 0; i < nreaders; i++) { + firsterr = torture_create_kthread(ref_perf_reader, (void *)i, + reader_tasks[i].task); + if (firsterr) + goto unwind; + + init_waitqueue_head(&(reader_tasks[i].wq)); + } + + // Main Task + init_waitqueue_head(&main_wq); + firsterr = torture_create_kthread(main_func, NULL, main_task); + if (firsterr) + goto unwind; + schedule_timeout_uninterruptible(1); + + + // Wait until all threads start + while (atomic_read(&n_init) < nreaders + 1) + schedule_timeout_uninterruptible(1); + + wake_up(&main_wq); + + torture_init_end(); + return 0; + +unwind: + torture_init_end(); + ref_perf_cleanup(); + return firsterr; +} + +module_init(ref_perf_init); +module_exit(ref_perf_cleanup); -- cgit v1.2.3 From 708cda31652c02e64adaeafafe7b996e4e14c3eb Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 25 May 2020 09:22:24 -0700 Subject: rcuperf: Add comments explaining the high reader overhead This commit adds comments explaining why the readers have otherwise insane levels of measurement overhead, namely that they are intended as a test load for update-side performance measurements, not as a straight-up read-side performance test. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcuperf.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c index 246da8fe199e..d906ca987936 100644 --- a/kernel/rcu/rcuperf.c +++ b/kernel/rcu/rcuperf.c @@ -69,6 +69,11 @@ MODULE_AUTHOR("Paul E. McKenney "); * value specified by nr_cpus for a read-only test. * * Various other use cases may of course be specified. + * + * Note that this test's readers are intended only as a test load for + * the writers. The reader performance statistics will be overly + * pessimistic due to the per-critical-section interrupt disabling, + * test-end checks, and the pair of calls through pointers. */ #ifdef MODULE @@ -309,8 +314,10 @@ static void rcu_perf_wait_shutdown(void) } /* - * RCU perf reader kthread. Repeatedly does empty RCU read-side - * critical section, minimizing update-side interference. + * RCU perf reader kthread. Repeatedly does empty RCU read-side critical + * section, minimizing update-side interference. However, the point of + * this test is not to evaluate reader performance, but instead to serve + * as a test load for update-side performance testing. */ static int rcu_perf_reader(void *arg) -- cgit v1.2.3 From f8b4bb23ec014a5d16663ad70b45d9f46c456ec4 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 25 May 2020 14:07:52 -0700 Subject: torture: Add refperf to the rcutorture scripting This commit updates the rcutorture scripting to include the new refperf torture-test module. Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- .../rcutorture/bin/kvm-recheck-refperf.sh | 67 ++++++++++++++++++++++ tools/testing/selftests/rcutorture/bin/kvm.sh | 9 +-- .../selftests/rcutorture/bin/parse-console.sh | 4 +- .../selftests/rcutorture/configs/refperf/CFLIST | 2 + .../selftests/rcutorture/configs/refperf/CFcommon | 2 + .../selftests/rcutorture/configs/refperf/NOPREEMPT | 18 ++++++ .../selftests/rcutorture/configs/refperf/PREEMPT | 18 ++++++ .../rcutorture/configs/refperf/ver_functions.sh | 16 ++++++ 8 files changed, 130 insertions(+), 6 deletions(-) create mode 100755 tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh create mode 100644 tools/testing/selftests/rcutorture/configs/refperf/CFLIST create mode 100644 tools/testing/selftests/rcutorture/configs/refperf/CFcommon create mode 100644 tools/testing/selftests/rcutorture/configs/refperf/NOPREEMPT create mode 100644 tools/testing/selftests/rcutorture/configs/refperf/PREEMPT create mode 100644 tools/testing/selftests/rcutorture/configs/refperf/ver_functions.sh diff --git a/tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh b/tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh new file mode 100755 index 000000000000..6fc06cd3538e --- /dev/null +++ b/tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh @@ -0,0 +1,67 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0+ +# +# Analyze a given results directory for refperf performance measurements. +# +# Usage: kvm-recheck-refperf.sh resdir +# +# Copyright (C) IBM Corporation, 2016 +# +# Authors: Paul E. McKenney + +i="$1" +if test -d "$i" -a -r "$i" +then + : +else + echo Unreadable results directory: $i + exit 1 +fi +PATH=`pwd`/tools/testing/selftests/rcutorture/bin:$PATH; export PATH +. functions.sh + +configfile=`echo $i | sed -e 's/^.*\///'` + +sed -e 's/^\[[^]]*]//' < $i/console.log | tr -d '\015' | +awk -v configfile="$configfile" ' +/^[ ]*Threads Time\(ns\) *$/ { + if (dataphase + 0 == 0) { + dataphase = 1; + # print configfile, $0; + } + next; +} + +/[^ ]*[0-9][0-9]* [0-9][0-9]*\.[0-9][0-9]*$/ { + if (dataphase == 1) { + # print $0; + readertimes[++n] = $2; + sum += $2; + } + next; +} + +{ + if (dataphase == 1) + dataphase == 2; + next; +} + +END { + print configfile " results:"; + newNR = asort(readertimes); + if (newNR <= 0) { + print "No refperf records found???" + exit; + } + medianidx = int(newNR / 2); + if (newNR == medianidx * 2) + medianvalue = (readertimes[medianidx - 1] + readertimes[medianidx]) / 2; + else + medianvalue = readertimes[medianidx]; + print "Average reader duration: " sum / newNR " nanoseconds"; + print "Minimum reader duration: " readertimes[1]; + print "Median reader duration: " medianvalue; + print "Maximum reader duration: " readertimes[newNR]; + print "Computed from refperf printk output."; +}' diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh index c279cf9cb010..48b6a7248f50 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm.sh @@ -180,13 +180,14 @@ do shift ;; --torture) - checkarg --torture "(suite name)" "$#" "$2" '^\(lock\|rcu\|rcuperf\)$' '^--' + checkarg --torture "(suite name)" "$#" "$2" '^\(lock\|rcu\|rcuperf\|refperf\)$' '^--' TORTURE_SUITE=$2 shift - if test "$TORTURE_SUITE" = rcuperf + if test "$TORTURE_SUITE" = rcuperf || test "$TORTURE_SUITE" = refperf then - # If you really want jitter for rcuperf, specify - # it after specifying rcuperf. (But why?) + # If you really want jitter for refperf or + # rcuperf, specify it after specifying the rcuperf + # or the refperf. (But why jitter in these cases?) jitter=0 fi ;; diff --git a/tools/testing/selftests/rcutorture/bin/parse-console.sh b/tools/testing/selftests/rcutorture/bin/parse-console.sh index 4bf62d7b1cbc..85af11d2d0cb 100755 --- a/tools/testing/selftests/rcutorture/bin/parse-console.sh +++ b/tools/testing/selftests/rcutorture/bin/parse-console.sh @@ -33,8 +33,8 @@ then fi cat /dev/null > $file.diags -# Check for proper termination, except that rcuperf runs don't indicate this. -if test "$TORTURE_SUITE" != rcuperf +# Check for proper termination, except for rcuperf and refperf. +if test "$TORTURE_SUITE" != rcuperf && test "$TORTURE_SUITE" != refperf then # check for abject failure diff --git a/tools/testing/selftests/rcutorture/configs/refperf/CFLIST b/tools/testing/selftests/rcutorture/configs/refperf/CFLIST new file mode 100644 index 000000000000..4d62eb4a39f9 --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/refperf/CFLIST @@ -0,0 +1,2 @@ +NOPREEMPT +PREEMPT diff --git a/tools/testing/selftests/rcutorture/configs/refperf/CFcommon b/tools/testing/selftests/rcutorture/configs/refperf/CFcommon new file mode 100644 index 000000000000..8ba5ba207503 --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/refperf/CFcommon @@ -0,0 +1,2 @@ +CONFIG_RCU_REF_PERF_TEST=y +CONFIG_PRINTK_TIME=y diff --git a/tools/testing/selftests/rcutorture/configs/refperf/NOPREEMPT b/tools/testing/selftests/rcutorture/configs/refperf/NOPREEMPT new file mode 100644 index 000000000000..1cd25b7314e3 --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/refperf/NOPREEMPT @@ -0,0 +1,18 @@ +CONFIG_SMP=y +CONFIG_PREEMPT_NONE=y +CONFIG_PREEMPT_VOLUNTARY=n +CONFIG_PREEMPT=n +#CHECK#CONFIG_PREEMPT_RCU=n +CONFIG_HZ_PERIODIC=n +CONFIG_NO_HZ_IDLE=y +CONFIG_NO_HZ_FULL=n +CONFIG_RCU_FAST_NO_HZ=n +CONFIG_HOTPLUG_CPU=n +CONFIG_SUSPEND=n +CONFIG_HIBERNATION=n +CONFIG_RCU_NOCB_CPU=n +CONFIG_DEBUG_LOCK_ALLOC=n +CONFIG_PROVE_LOCKING=n +CONFIG_RCU_BOOST=n +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n +CONFIG_RCU_EXPERT=y diff --git a/tools/testing/selftests/rcutorture/configs/refperf/PREEMPT b/tools/testing/selftests/rcutorture/configs/refperf/PREEMPT new file mode 100644 index 000000000000..d10bc694f42c --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/refperf/PREEMPT @@ -0,0 +1,18 @@ +CONFIG_SMP=y +CONFIG_PREEMPT_NONE=n +CONFIG_PREEMPT_VOLUNTARY=n +CONFIG_PREEMPT=y +#CHECK#CONFIG_PREEMPT_RCU=y +CONFIG_HZ_PERIODIC=n +CONFIG_NO_HZ_IDLE=y +CONFIG_NO_HZ_FULL=n +CONFIG_RCU_FAST_NO_HZ=n +CONFIG_HOTPLUG_CPU=n +CONFIG_SUSPEND=n +CONFIG_HIBERNATION=n +CONFIG_RCU_NOCB_CPU=n +CONFIG_DEBUG_LOCK_ALLOC=n +CONFIG_PROVE_LOCKING=n +CONFIG_RCU_BOOST=n +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n +CONFIG_RCU_EXPERT=y diff --git a/tools/testing/selftests/rcutorture/configs/refperf/ver_functions.sh b/tools/testing/selftests/rcutorture/configs/refperf/ver_functions.sh new file mode 100644 index 000000000000..489f05dd929a --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/refperf/ver_functions.sh @@ -0,0 +1,16 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0+ +# +# Torture-suite-dependent shell functions for the rest of the scripts. +# +# Copyright (C) IBM Corporation, 2015 +# +# Authors: Paul E. McKenney + +# per_version_boot_params bootparam-string config-file seconds +# +# Adds per-version torture-module parameters to kernels supporting them. +per_version_boot_params () { + echo $1 refperf.shutdown=1 \ + refperf.verbose=1 +} -- cgit v1.2.3 From 777a54c908ec69fa0eccab54068a49ecda38ffde Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 25 May 2020 14:16:44 -0700 Subject: refperf: Add holdoff parameter to allow CPUs to come online This commit adds an rcuperf module parameter named "holdoff" that defaults to 10 seconds if refperf is built in and to zero otherwise. The assumption is that all the CPUs are online by the time that the modprobe and insmod commands are going to do anything, and that normal systems will have all the CPUs online within ten seconds. Larger systems may take many tens of seconds or even minutes to get to this point, hence this being a module parameter instead of being a hard-coded constant. Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 61161530acc8..4d686fdc3105 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -57,7 +57,10 @@ MODULE_PARM_DESC(perf_type, "Type of test (rcu, srcu, refcnt, rwsem, rwlock."); torture_param(int, verbose, 0, "Enable verbose debugging printk()s"); -// Number of loops per experiment, all readers execute an operation concurrently +// Wait until there are multiple CPUs before starting test. +torture_param(int, holdoff, IS_BUILTIN(CONFIG_RCU_REF_PERF_TEST) ? 10 : 0, + "Holdoff time before test start (s)"); +// Number of loops per experiment, all readers execute operations concurrently. torture_param(long, loops, 10000000, "Number of loops per experiment."); #ifdef MODULE @@ -248,6 +251,8 @@ ref_perf_reader(void *arg) set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids)); set_user_nice(current, MAX_NICE); atomic_inc(&n_init); + if (holdoff) + schedule_timeout_interruptible(holdoff * HZ); repeat: VERBOSE_PERFOUT("ref_perf_reader %ld: waiting to start next experiment on cpu %d", me, smp_processor_id()); @@ -357,6 +362,8 @@ static int main_func(void *arg) // Wait for all threads to start. wait_event(main_wq, atomic_read(&n_init) == (nreaders + 1)); + if (holdoff) + schedule_timeout_interruptible(holdoff * HZ); // Start exp readers up per experiment for (exp = 0; exp < nreaders && !torture_must_stop(); exp++) { @@ -420,8 +427,8 @@ static void ref_perf_print_module_parms(struct ref_perf_ops *cur_ops, const char *tag) { pr_alert("%s" PERF_FLAG - "--- %s: verbose=%d shutdown=%d loops=%ld\n", perf_type, tag, - verbose, shutdown, loops); + "--- %s: verbose=%d shutdown=%d holdoff=%d loops=%ld\n", perf_type, tag, + verbose, shutdown, holdoff, loops); } static void -- cgit v1.2.3 From 75dd8efef56ed5959c398974c785026f84aa0d1a Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 25 May 2020 14:59:06 -0700 Subject: refperf: Hoist function-pointer calls out of the loop Current runs show PREEMPT=n rcu_read_lock()/rcu_read_unlock() pairs consuming between 20 and 30 nanoseconds, when in fact the actual value is zero, give or take the barrier() asm's effect on compiler optimizations. The additional overhead is caused by function calls through pointers (especially in these days of Spectre mitigations) and perhaps also needless argument passing, a non-const loop limit, and an upcounting loop. This commit therefore combines the ->readlock() and ->readunlock() function pointers into a single ->readsection() function pointer that takes the loop count as a const parameter and keeps any data passed from the read-lock to the read-unlock internal to this new function. These changes reduce the measured overhead of the aforementioned PREEMPT=n rcu_read_lock()/rcu_read_unlock() pairs from between 20 and 30 nanoseconds to somewhere south of 500 picoseconds. Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 92 ++++++++++++++++++++++------------------------------ 1 file changed, 38 insertions(+), 54 deletions(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 4d686fdc3105..57c7b7a40bd2 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -108,23 +108,20 @@ static int exp_idx; struct ref_perf_ops { void (*init)(void); void (*cleanup)(void); - int (*readlock)(void); - void (*readunlock)(int idx); + void (*readsection)(const int nloops); const char *name; }; static struct ref_perf_ops *cur_ops; -// Definitions for RCU ref perf testing. -static int ref_rcu_read_lock(void) __acquires(RCU) +static void ref_rcu_read_section(const int nloops) { - rcu_read_lock(); - return 0; -} + int i; -static void ref_rcu_read_unlock(int idx) __releases(RCU) -{ - rcu_read_unlock(); + for (i = nloops; i >= 0; i--) { + rcu_read_lock(); + rcu_read_unlock(); + } } static void rcu_sync_perf_init(void) @@ -133,8 +130,7 @@ static void rcu_sync_perf_init(void) static struct ref_perf_ops rcu_ops = { .init = rcu_sync_perf_init, - .readlock = ref_rcu_read_lock, - .readunlock = ref_rcu_read_unlock, + .readsection = ref_rcu_read_section, .name = "rcu" }; @@ -143,42 +139,39 @@ static struct ref_perf_ops rcu_ops = { DEFINE_STATIC_SRCU(srcu_refctl_perf); static struct srcu_struct *srcu_ctlp = &srcu_refctl_perf; -static int srcu_ref_perf_read_lock(void) __acquires(srcu_ctlp) +static void srcu_ref_perf_read_section(int nloops) { - return srcu_read_lock(srcu_ctlp); -} + int i; + int idx; -static void srcu_ref_perf_read_unlock(int idx) __releases(srcu_ctlp) -{ - srcu_read_unlock(srcu_ctlp, idx); + for (i = nloops; i >= 0; i--) { + idx = srcu_read_lock(srcu_ctlp); + srcu_read_unlock(srcu_ctlp, idx); + } } static struct ref_perf_ops srcu_ops = { .init = rcu_sync_perf_init, - .readlock = srcu_ref_perf_read_lock, - .readunlock = srcu_ref_perf_read_unlock, + .readsection = srcu_ref_perf_read_section, .name = "srcu" }; // Definitions for reference count static atomic_t refcnt; -static int srcu_ref_perf_refcnt_lock(void) +static void ref_perf_refcnt_section(const int nloops) { - atomic_inc(&refcnt); - return 0; -} + int i; -static void srcu_ref_perf_refcnt_unlock(int idx) __releases(srcu_ctlp) -{ - atomic_dec(&refcnt); - srcu_read_unlock(srcu_ctlp, idx); + for (i = nloops; i >= 0; i--) { + atomic_inc(&refcnt); + atomic_dec(&refcnt); + } } static struct ref_perf_ops refcnt_ops = { .init = rcu_sync_perf_init, - .readlock = srcu_ref_perf_refcnt_lock, - .readunlock = srcu_ref_perf_refcnt_unlock, + .readsection = ref_perf_refcnt_section, .name = "refcnt" }; @@ -190,21 +183,19 @@ static void ref_perf_rwlock_init(void) rwlock_init(&test_rwlock); } -static int ref_perf_rwlock_lock(void) +static void ref_perf_rwlock_section(const int nloops) { - read_lock(&test_rwlock); - return 0; -} + int i; -static void ref_perf_rwlock_unlock(int idx) -{ - read_unlock(&test_rwlock); + for (i = nloops; i >= 0; i--) { + read_lock(&test_rwlock); + read_unlock(&test_rwlock); + } } static struct ref_perf_ops rwlock_ops = { .init = ref_perf_rwlock_init, - .readlock = ref_perf_rwlock_lock, - .readunlock = ref_perf_rwlock_unlock, + .readsection = ref_perf_rwlock_section, .name = "rwlock" }; @@ -216,21 +207,19 @@ static void ref_perf_rwsem_init(void) init_rwsem(&test_rwsem); } -static int ref_perf_rwsem_lock(void) +static void ref_perf_rwsem_section(const int nloops) { - down_read(&test_rwsem); - return 0; -} + int i; -static void ref_perf_rwsem_unlock(int idx) -{ - up_read(&test_rwsem); + for (i = nloops; i >= 0; i--) { + down_read(&test_rwsem); + up_read(&test_rwsem); + } } static struct ref_perf_ops rwsem_ops = { .init = ref_perf_rwsem_init, - .readlock = ref_perf_rwsem_lock, - .readunlock = ref_perf_rwsem_unlock, + .readsection = ref_perf_rwsem_section, .name = "rwsem" }; @@ -242,8 +231,6 @@ ref_perf_reader(void *arg) unsigned long flags; long me = (long)arg; struct reader_task *rt = &(reader_tasks[me]); - unsigned long spincnt; - int idx; u64 start; s64 duration; @@ -275,10 +262,7 @@ repeat: VERBOSE_PERFOUT("ref_perf_reader %ld: experiment %d started", me, exp_idx); - for (spincnt = 0; spincnt < loops; spincnt++) { - idx = cur_ops->readlock(); - cur_ops->readunlock(idx); - } + cur_ops->readsection(loops); duration = ktime_get_mono_fast_ns() - start; local_irq_restore(flags); -- cgit v1.2.3 From 83b88c86da0e5f97faeac5a9bb19fe32f8c0394b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 25 May 2020 15:31:07 -0700 Subject: refperf: Allow decimal nanoseconds The CONFIG_PREEMPT=n rcu_read_lock()/rcu_read_unlock() pair's overhead, even including loop overhead, is far less than one nanosecond. Since logscale plots are not all that happy with zero values, provide picoseconds as decimals. Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 57c7b7a40bd2..e991d4820f51 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -375,7 +375,7 @@ static int main_func(void *arg) if (torture_must_stop()) goto end; - reader_tasks[exp].result_avg = process_durations(exp) / ((exp + 1) * loops); + reader_tasks[exp].result_avg = 1000 * process_durations(exp) / ((exp + 1) * loops); } // Print the average of all experiments @@ -386,7 +386,7 @@ static int main_func(void *arg) strcat(buf, "Threads\tTime(ns)\n"); for (exp = 0; exp < nreaders; exp++) { - sprintf(buf1, "%d\t%llu\n", exp + 1, reader_tasks[exp].result_avg); + sprintf(buf1, "%d\t%llu.%03d\n", exp + 1, reader_tasks[exp].result_avg / 1000, (int)(reader_tasks[exp].result_avg % 1000)); strcat(buf, buf1); } -- cgit v1.2.3 From 8fc28783a0c3704ea27505a25dbde8333d75380c Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 25 May 2020 15:48:38 -0700 Subject: refperf: Convert nreaders to a module parameter This commit converts nreaders to a module parameter, with the default of -1 specifying the old behavior of using 75% of the readers. Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index e991d4820f51..020e55a9a64b 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -62,6 +62,12 @@ torture_param(int, holdoff, IS_BUILTIN(CONFIG_RCU_REF_PERF_TEST) ? 10 : 0, "Holdoff time before test start (s)"); // Number of loops per experiment, all readers execute operations concurrently. torture_param(long, loops, 10000000, "Number of loops per experiment."); +// Number of readers, with -1 defaulting to about 75% of the CPUs. +torture_param(int, nreaders, -1, "Number of readers, -1 for 75% of CPUs."); +// Number of runs. +torture_param(int, nruns, 30, "Number of experiments to run."); +// Reader delay in nanoseconds, 0 for no delay. +torture_param(int, readdelay, 0, "Read-side delay in nanoseconds."); #ifdef MODULE # define REFPERF_SHUTDOWN 0 @@ -93,7 +99,6 @@ static wait_queue_head_t main_wq; static int shutdown_start; static struct reader_task *reader_tasks; -static int nreaders; // Number of readers that are part of the current experiment. static atomic_t nreaders_exp; @@ -411,8 +416,8 @@ static void ref_perf_print_module_parms(struct ref_perf_ops *cur_ops, const char *tag) { pr_alert("%s" PERF_FLAG - "--- %s: verbose=%d shutdown=%d holdoff=%d loops=%ld\n", perf_type, tag, - verbose, shutdown, holdoff, loops); + "--- %s: verbose=%d shutdown=%d holdoff=%d loops=%ld nreaders=%d\n", perf_type, tag, + verbose, shutdown, holdoff, loops, nreaders); } static void @@ -501,8 +506,9 @@ ref_perf_init(void) schedule_timeout_uninterruptible(1); } - // Reader tasks (~75% of online CPUs). - nreaders = (num_online_cpus() >> 1) + (num_online_cpus() >> 2); + // Reader tasks (default to ~75% of online CPUs). + if (nreaders < 0) + nreaders = (num_online_cpus() >> 1) + (num_online_cpus() >> 2); reader_tasks = kcalloc(nreaders, sizeof(reader_tasks[0]), GFP_KERNEL); if (!reader_tasks) { -- cgit v1.2.3 From dbf28efdae7bb51032eeb0fe1b6bd07d6f0f9b6c Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 25 May 2020 17:22:24 -0700 Subject: refperf: Provide module parameter to specify number of experiments The current code uses the number of threads both to limit the number of threads and to specify the number of experiments, but also varies the number of threads as the experiments progress. This commit takes a different approach by adding an refperf.nruns module parameter that specifies the number of experiments, and furthermore uses the same number of threads for each experiment. Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 43 +++++++++++++++++++++++-------------------- 1 file changed, 23 insertions(+), 20 deletions(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 020e55a9a64b..6324449db404 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -83,12 +83,6 @@ struct reader_task { atomic_t start; wait_queue_head_t wq; u64 last_duration_ns; - - // The average latency When 1.. are concurrently - // running an experiment. For example, if this reader_task is - // of index 5 in the reader_tasks array, then result is for - // 6 cores. - u64 result_avg; }; static struct task_struct *shutdown_task; @@ -289,12 +283,12 @@ end: return 0; } -void reset_readers(int n) +void reset_readers(void) { int i; struct reader_task *rt; - for (i = 0; i < n; i++) { + for (i = 0; i < nreaders; i++) { rt = &(reader_tasks[i]); rt->last_duration_ns = 0; @@ -314,7 +308,7 @@ u64 process_durations(int n) sprintf(buf, "Experiment #%d (Format: :)", exp_idx); - for (i = 0; i <= n && !torture_must_stop(); i++) { + for (i = 0; i < n && !torture_must_stop(); i++) { rt = &(reader_tasks[i]); sprintf(buf1, "%d: %llu\t", i, rt->last_duration_ns); @@ -342,11 +336,15 @@ static int main_func(void *arg) int exp, r; char buf1[64]; char buf[512]; + u64 *result_avg; set_cpus_allowed_ptr(current, cpumask_of(nreaders % nr_cpu_ids)); set_user_nice(current, MAX_NICE); VERBOSE_PERFOUT("main_func task started"); + result_avg = kzalloc(nruns * sizeof(*result_avg), GFP_KERNEL); + if (!result_avg) + VERBOSE_PERFOUT_ERRSTRING("out of memory"); atomic_inc(&n_init); // Wait for all threads to start. @@ -355,22 +353,24 @@ static int main_func(void *arg) schedule_timeout_interruptible(holdoff * HZ); // Start exp readers up per experiment - for (exp = 0; exp < nreaders && !torture_must_stop(); exp++) { + for (exp = 0; exp < nruns && !torture_must_stop(); exp++) { + if (!result_avg) + break; if (torture_must_stop()) goto end; - reset_readers(exp); - atomic_set(&nreaders_exp, exp + 1); + reset_readers(); + atomic_set(&nreaders_exp, nreaders); exp_idx = exp; - for (r = 0; r <= exp; r++) { + for (r = 0; r < nreaders; r++) { atomic_set(&reader_tasks[r].start, 1); wake_up(&reader_tasks[r].wq); } VERBOSE_PERFOUT("main_func: experiment started, waiting for %d readers", - exp); + nreaders); wait_event(main_wq, !atomic_read(&nreaders_exp) || torture_must_stop()); @@ -380,7 +380,7 @@ static int main_func(void *arg) if (torture_must_stop()) goto end; - reader_tasks[exp].result_avg = 1000 * process_durations(exp) / ((exp + 1) * loops); + result_avg[exp] = 1000 * process_durations(nreaders) / (nreaders * loops); } // Print the average of all experiments @@ -390,12 +390,15 @@ static int main_func(void *arg) strcat(buf, "\n"); strcat(buf, "Threads\tTime(ns)\n"); - for (exp = 0; exp < nreaders; exp++) { - sprintf(buf1, "%d\t%llu.%03d\n", exp + 1, reader_tasks[exp].result_avg / 1000, (int)(reader_tasks[exp].result_avg % 1000)); + for (exp = 0; exp < nruns; exp++) { + if (!result_avg) + break; + sprintf(buf1, "%d\t%llu.%03d\n", exp + 1, result_avg[exp] / 1000, (int)(result_avg[exp] % 1000)); strcat(buf, buf1); } - PERFOUT("%s", buf); + if (result_avg) + PERFOUT("%s", buf); // This will shutdown everything including us. if (shutdown) { @@ -416,8 +419,8 @@ static void ref_perf_print_module_parms(struct ref_perf_ops *cur_ops, const char *tag) { pr_alert("%s" PERF_FLAG - "--- %s: verbose=%d shutdown=%d holdoff=%d loops=%ld nreaders=%d\n", perf_type, tag, - verbose, shutdown, holdoff, loops, nreaders); + "--- %s: verbose=%d shutdown=%d holdoff=%d loops=%ld nreaders=%d nruns=%d\n", perf_type, tag, + verbose, shutdown, holdoff, loops, nreaders, nruns); } static void -- cgit v1.2.3 From f518f154ecef347777db33b7c9b0581f245159f0 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 25 May 2020 17:32:56 -0700 Subject: refperf: Dynamically allocate experiment-summary output buffer Currently, the buffer used to accumulate the experiment-summary output is fixed size, which will cause problems if someone decides to run one hundred experiments. This commit therefore dynamically allocates this buffer. Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 6324449db404..75b9cceaece1 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -333,9 +333,10 @@ u64 process_durations(int n) // point all the timestamps are printed. static int main_func(void *arg) { + bool errexit = false; int exp, r; char buf1[64]; - char buf[512]; + char *buf; u64 *result_avg; set_cpus_allowed_ptr(current, cpumask_of(nreaders % nr_cpu_ids)); @@ -343,8 +344,11 @@ static int main_func(void *arg) VERBOSE_PERFOUT("main_func task started"); result_avg = kzalloc(nruns * sizeof(*result_avg), GFP_KERNEL); - if (!result_avg) + buf = kzalloc(64 + nruns * 32, GFP_KERNEL); + if (!result_avg || !buf) { VERBOSE_PERFOUT_ERRSTRING("out of memory"); + errexit = true; + } atomic_inc(&n_init); // Wait for all threads to start. @@ -354,7 +358,7 @@ static int main_func(void *arg) // Start exp readers up per experiment for (exp = 0; exp < nruns && !torture_must_stop(); exp++) { - if (!result_avg) + if (errexit) break; if (torture_must_stop()) goto end; @@ -391,13 +395,13 @@ static int main_func(void *arg) strcat(buf, "Threads\tTime(ns)\n"); for (exp = 0; exp < nruns; exp++) { - if (!result_avg) + if (errexit) break; sprintf(buf1, "%d\t%llu.%03d\n", exp + 1, result_avg[exp] / 1000, (int)(result_avg[exp] % 1000)); strcat(buf, buf1); } - if (result_avg) + if (!errexit) PERFOUT("%s", buf); // This will shutdown everything including us. @@ -412,6 +416,8 @@ static int main_func(void *arg) end: torture_kthread_stopping("main_func"); + kfree(result_avg); + kfree(buf); return 0; } -- cgit v1.2.3 From 2e90de76f226f11fe26c871aa321be28152f565a Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 25 May 2020 17:45:03 -0700 Subject: refperf: Dynamically allocate thread-summary output buffer Currently, the buffer used to accumulate the thread-summary output is fixed size, which will cause problems if someone decides to run on a large number of PCUs. This commit therefore dynamically allocates this buffer. [ paulmck: Fix memory allocation as suggested by KASAN. ] Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 75b9cceaece1..fc940e3dba1f 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -301,9 +301,12 @@ u64 process_durations(int n) int i; struct reader_task *rt; char buf1[64]; - char buf[512]; + char *buf; u64 sum = 0; + buf = kmalloc(128 + nreaders * 32, GFP_KERNEL); + if (!buf) + return 0; buf[0] = 0; sprintf(buf, "Experiment #%d (Format: :)", exp_idx); @@ -322,6 +325,7 @@ u64 process_durations(int n) PERFOUT("%s\n", buf); + kfree(buf); return sum; } -- cgit v1.2.3 From 2990750bceb05c3cdeae3a6d2683cbc4ae4de15e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 26 May 2020 09:32:57 -0700 Subject: refperf: Make functions static Because the reset_readers() and process_durations() functions are used only within kernel/rcu/refperf.c, this commit makes them static. Reported-by: kbuild test robot Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index fc940e3dba1f..0a900f3ae151 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -283,7 +283,7 @@ end: return 0; } -void reset_readers(void) +static void reset_readers(void) { int i; struct reader_task *rt; @@ -296,7 +296,7 @@ void reset_readers(void) } // Print the results of each reader and return the sum of all their durations. -u64 process_durations(int n) +static u64 process_durations(int n) { int i; struct reader_task *rt; -- cgit v1.2.3 From b864f89ff61492f56b4e8c6713a5efec6540a0e2 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 26 May 2020 10:57:34 -0700 Subject: refperf: Tune reader measurement interval This commit moves a printk() out of the measurement interval, converts a atomic_dec()/atomic_read() pair to atomic_dec_and_test(), and adds a smp_mb__before_atomic() to avoid potential wake/wait hangs. These changes have the added benefit of reducing the number of loops required for amortizing loop overhead for CONFIG_PREEMPT=n RCU measurements from 1,000,000 to 10,000. This reduction in turn shortens the test, reducing the probability of interference. Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 0a900f3ae151..8815ccfb6f98 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -252,15 +252,16 @@ repeat: // Make sure that the CPU is affinitized appropriately during testing. WARN_ON_ONCE(smp_processor_id() != me); + smp_mb__before_atomic(); atomic_dec(&rt->start); + VERBOSE_PERFOUT("ref_perf_reader %ld: experiment %d started", me, exp_idx); + // To prevent noise, keep interrupts disabled. This also has the // effect of preventing entries into slow path for rcu_read_unlock(). local_irq_save(flags); start = ktime_get_mono_fast_ns(); - VERBOSE_PERFOUT("ref_perf_reader %ld: experiment %d started", me, exp_idx); - cur_ops->readsection(loops); duration = ktime_get_mono_fast_ns() - start; @@ -268,14 +269,12 @@ repeat: rt->last_duration_ns = WARN_ON_ONCE(duration < 0) ? 0 : duration; - atomic_dec(&nreaders_exp); + if (atomic_dec_and_test(&nreaders_exp)) + wake_up(&main_wq); VERBOSE_PERFOUT("ref_perf_reader %ld: experiment %d ended, (readers remaining=%d)", me, exp_idx, atomic_read(&nreaders_exp)); - if (!atomic_read(&nreaders_exp)) - wake_up(&main_wq); - if (!torture_must_stop()) goto repeat; end: -- cgit v1.2.3 From af2789db13b8dc38d16e969f8c11b9468be42d46 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 26 May 2020 11:22:03 -0700 Subject: refperf: Convert reader_task structure's "start" field to int This commit converts the reader_task structure's "start" field to int in order to demote a full barrier to an smp_load_acquire() and also to simplify the code a bit. While in the area, and to enlist the compiler's help in ensuring that nothing was missed, the field's name was changed to start_reader. Also while in the area, change the main_func() store to use smp_store_release() to further fortify against wait/wake races. Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 8815ccfb6f98..2fd3ed1a0d0d 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -80,7 +80,7 @@ torture_param(bool, shutdown, REFPERF_SHUTDOWN, struct reader_task { struct task_struct *task; - atomic_t start; + int start_reader; wait_queue_head_t wq; u64 last_duration_ns; }; @@ -243,7 +243,7 @@ repeat: VERBOSE_PERFOUT("ref_perf_reader %ld: waiting to start next experiment on cpu %d", me, smp_processor_id()); // Wait for signal that this reader can start. - wait_event(rt->wq, (atomic_read(&nreaders_exp) && atomic_read(&rt->start)) || + wait_event(rt->wq, (atomic_read(&nreaders_exp) && smp_load_acquire(&rt->start_reader)) || torture_must_stop()); if (torture_must_stop()) @@ -252,8 +252,7 @@ repeat: // Make sure that the CPU is affinitized appropriately during testing. WARN_ON_ONCE(smp_processor_id() != me); - smp_mb__before_atomic(); - atomic_dec(&rt->start); + WRITE_ONCE(rt->start_reader, 0); VERBOSE_PERFOUT("ref_perf_reader %ld: experiment %d started", me, exp_idx); @@ -372,7 +371,7 @@ static int main_func(void *arg) exp_idx = exp; for (r = 0; r < nreaders; r++) { - atomic_set(&reader_tasks[r].start, 1); + smp_store_release(&reader_tasks[r].start_reader, 1); wake_up(&reader_tasks[r].wq); } -- cgit v1.2.3 From 86e0da2bb8ed934d3dce5a337895f1118f59c087 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 26 May 2020 11:40:52 -0700 Subject: refperf: More closely synchronize reader start times Currently, readers are awakened individually. On most systems, this results in significant wakeup delay from one reader to the next, which can result in the first and last reader having sole access to the synchronization primitive in question. If that synchronization primitive involves shared memory, those readers will rack up a huge number of operations in a very short time, causing large perturbations in the results. This commit therefore has the readers busy-wait after being awakened, and uses a new n_started variable to synchronize their start times. Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 2fd3ed1a0d0d..234bb0e84a8b 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -99,6 +99,7 @@ static atomic_t nreaders_exp; // Use to wait for all threads to start. static atomic_t n_init; +static atomic_t n_started; // Track which experiment is currently running. static int exp_idx; @@ -253,6 +254,9 @@ repeat: WARN_ON_ONCE(smp_processor_id() != me); WRITE_ONCE(rt->start_reader, 0); + if (!atomic_dec_return(&n_started)) + while (atomic_read_acquire(&n_started)) + cpu_relax(); VERBOSE_PERFOUT("ref_perf_reader %ld: experiment %d started", me, exp_idx); @@ -367,6 +371,7 @@ static int main_func(void *arg) reset_readers(); atomic_set(&nreaders_exp, nreaders); + atomic_set(&n_started, nreaders); exp_idx = exp; -- cgit v1.2.3 From 2db0bda38453f472640f4ece1e2a495cbd44f892 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 26 May 2020 12:34:57 -0700 Subject: refperf: Add warmup and cooldown processing phases This commit causes all the readers to start running unmeasured load until all readers have done at least one such run (thus having warmed up), then run the measured load, and then run unmeasured load until all readers have completed their measured load. This approach avoids any thread running measured load while other readers are idle. Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 234bb0e84a8b..445190b97b05 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -100,6 +100,8 @@ static atomic_t nreaders_exp; // Use to wait for all threads to start. static atomic_t n_init; static atomic_t n_started; +static atomic_t n_warmedup; +static atomic_t n_cooleddown; // Track which experiment is currently running. static int exp_idx; @@ -260,8 +262,15 @@ repeat: VERBOSE_PERFOUT("ref_perf_reader %ld: experiment %d started", me, exp_idx); - // To prevent noise, keep interrupts disabled. This also has the - // effect of preventing entries into slow path for rcu_read_unlock(). + + // To reduce noise, do an initial cache-warming invocation, check + // in, and then keep warming until everyone has checked in. + cur_ops->readsection(loops); + if (!atomic_dec_return(&n_warmedup)) + while (atomic_read_acquire(&n_warmedup)) + cur_ops->readsection(loops); + // Also keep interrupts disabled. This also has the effect + // of preventing entries into slow path for rcu_read_unlock(). local_irq_save(flags); start = ktime_get_mono_fast_ns(); @@ -271,6 +280,11 @@ repeat: local_irq_restore(flags); rt->last_duration_ns = WARN_ON_ONCE(duration < 0) ? 0 : duration; + // To reduce runtime-skew noise, do maintain-load invocations until + // everyone is done. + if (!atomic_dec_return(&n_cooleddown)) + while (atomic_read_acquire(&n_cooleddown)) + cur_ops->readsection(loops); if (atomic_dec_and_test(&nreaders_exp)) wake_up(&main_wq); @@ -372,6 +386,8 @@ static int main_func(void *arg) reset_readers(); atomic_set(&nreaders_exp, nreaders); atomic_set(&n_started, nreaders); + atomic_set(&n_warmedup, nreaders); + atomic_set(&n_cooleddown, nreaders); exp_idx = exp; -- cgit v1.2.3 From 6efb06340846c788336f402e3a472a24fabb431e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 26 May 2020 14:26:25 -0700 Subject: refperf: Label experiment-number column "Runs" The experiment-number column is currently labeled "Threads", which is misleading at best. This commit therefore relabels it as "Runs", and adjusts the scripts accordingly. Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 2 +- tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 445190b97b05..2d2d227d761a 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -415,7 +415,7 @@ static int main_func(void *arg) buf[0] = 0; strcat(buf, "\n"); - strcat(buf, "Threads\tTime(ns)\n"); + strcat(buf, "Runs\tTime(ns)\n"); for (exp = 0; exp < nruns; exp++) { if (errexit) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh b/tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh index 6fc06cd3538e..0660f3fab215 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh @@ -24,7 +24,7 @@ configfile=`echo $i | sed -e 's/^.*\///'` sed -e 's/^\[[^]]*]//' < $i/console.log | tr -d '\015' | awk -v configfile="$configfile" ' -/^[ ]*Threads Time\(ns\) *$/ { +/^[ ]*Runs Time\(ns\) *$/ { if (dataphase + 0 == 0) { dataphase = 1; # print configfile, $0; -- cgit v1.2.3 From 9d1914d34cebe111a23ab1670633900fd770cec3 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 26 May 2020 15:30:09 -0700 Subject: refperf: Output per-experiment data points Currently, it is necessary to manually edit the console output to see anything more than statistics, and sometimes the statistics can indicate outliers that need more investigation. This commit therefore dumps out the per-experiment measurements, sorted in ascending order, just before dumping out the statistics. Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh b/tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh index 0660f3fab215..0e29cfd9986c 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh @@ -59,6 +59,10 @@ END { medianvalue = (readertimes[medianidx - 1] + readertimes[medianidx]) / 2; else medianvalue = readertimes[medianidx]; + points = "Points:"; + for (i = 1; i <= newNR; i++) + points = points " " readertimes[i]; + print points; print "Average reader duration: " sum / newNR " nanoseconds"; print "Minimum reader duration: " readertimes[1]; print "Median reader duration: " medianvalue; -- cgit v1.2.3 From 96af8669591d740a1e2695c4d96e544409dbf896 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 27 May 2020 16:46:56 -0700 Subject: refperf: Simplify initialization-time wakeup protocol This commit moves the reader-launch wait loop from ref_perf_init() to main_func(), removing one layer of wakeup and allowing slightly faster system boot. Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 17 +++++------------ 1 file changed, 5 insertions(+), 12 deletions(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 2d2d227d761a..7839237ffc17 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -369,13 +369,14 @@ static int main_func(void *arg) VERBOSE_PERFOUT_ERRSTRING("out of memory"); errexit = true; } - atomic_inc(&n_init); - - // Wait for all threads to start. - wait_event(main_wq, atomic_read(&n_init) == (nreaders + 1)); if (holdoff) schedule_timeout_interruptible(holdoff * HZ); + // Wait for all threads to start. + atomic_inc(&n_init); + while (atomic_read(&n_init) < nreaders + 1) + schedule_timeout_uninterruptible(1); + // Start exp readers up per experiment for (exp = 0; exp < nruns && !torture_must_stop(); exp++) { if (errexit) @@ -565,14 +566,6 @@ ref_perf_init(void) firsterr = torture_create_kthread(main_func, NULL, main_task); if (firsterr) goto unwind; - schedule_timeout_uninterruptible(1); - - - // Wait until all threads start - while (atomic_read(&n_init) < nreaders + 1) - schedule_timeout_uninterruptible(1); - - wake_up(&main_wq); torture_init_end(); return 0; -- cgit v1.2.3 From b4d1e34f6502a138e32275baabdb6d593d7ea432 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 28 May 2020 16:37:35 -0700 Subject: refperf: Add read-side delay module parameter This commit adds a refperf.readdelay module parameter that controls the duration of each critical section. This parameter allows gathering data showing how the performance differences between the various primitives vary with critical-section length. Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 108 ++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 89 insertions(+), 19 deletions(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 7839237ffc17..57a750bbcaca 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -66,8 +66,8 @@ torture_param(long, loops, 10000000, "Number of loops per experiment."); torture_param(int, nreaders, -1, "Number of readers, -1 for 75% of CPUs."); // Number of runs. torture_param(int, nruns, 30, "Number of experiments to run."); -// Reader delay in nanoseconds, 0 for no delay. -torture_param(int, readdelay, 0, "Read-side delay in nanoseconds."); +// Reader delay in microseconds, 0 for no delay. +torture_param(int, readdelay, 0, "Read-side delay in microseconds."); #ifdef MODULE # define REFPERF_SHUTDOWN 0 @@ -111,6 +111,7 @@ struct ref_perf_ops { void (*init)(void); void (*cleanup)(void); void (*readsection)(const int nloops); + void (*delaysection)(const int nloops, const int ndelay); const char *name; }; @@ -126,6 +127,17 @@ static void ref_rcu_read_section(const int nloops) } } +static void ref_rcu_delay_section(const int nloops, const int ndelay) +{ + int i; + + for (i = nloops; i >= 0; i--) { + rcu_read_lock(); + udelay(ndelay); + rcu_read_unlock(); + } +} + static void rcu_sync_perf_init(void) { } @@ -133,6 +145,7 @@ static void rcu_sync_perf_init(void) static struct ref_perf_ops rcu_ops = { .init = rcu_sync_perf_init, .readsection = ref_rcu_read_section, + .delaysection = ref_rcu_delay_section, .name = "rcu" }; @@ -141,7 +154,7 @@ static struct ref_perf_ops rcu_ops = { DEFINE_STATIC_SRCU(srcu_refctl_perf); static struct srcu_struct *srcu_ctlp = &srcu_refctl_perf; -static void srcu_ref_perf_read_section(int nloops) +static void srcu_ref_perf_read_section(const int nloops) { int i; int idx; @@ -152,16 +165,29 @@ static void srcu_ref_perf_read_section(int nloops) } } +static void srcu_ref_perf_delay_section(const int nloops, const int ndelay) +{ + int i; + int idx; + + for (i = nloops; i >= 0; i--) { + idx = srcu_read_lock(srcu_ctlp); + udelay(ndelay); + srcu_read_unlock(srcu_ctlp, idx); + } +} + static struct ref_perf_ops srcu_ops = { .init = rcu_sync_perf_init, .readsection = srcu_ref_perf_read_section, + .delaysection = srcu_ref_perf_delay_section, .name = "srcu" }; // Definitions for reference count static atomic_t refcnt; -static void ref_perf_refcnt_section(const int nloops) +static void ref_refcnt_section(const int nloops) { int i; @@ -171,45 +197,69 @@ static void ref_perf_refcnt_section(const int nloops) } } +static void ref_refcnt_delay_section(const int nloops, const int ndelay) +{ + int i; + + for (i = nloops; i >= 0; i--) { + atomic_inc(&refcnt); + udelay(ndelay); + atomic_dec(&refcnt); + } +} + static struct ref_perf_ops refcnt_ops = { .init = rcu_sync_perf_init, - .readsection = ref_perf_refcnt_section, + .readsection = ref_refcnt_section, + .delaysection = ref_refcnt_delay_section, .name = "refcnt" }; // Definitions for rwlock static rwlock_t test_rwlock; -static void ref_perf_rwlock_init(void) +static void ref_rwlock_init(void) { rwlock_init(&test_rwlock); } -static void ref_perf_rwlock_section(const int nloops) +static void ref_rwlock_section(const int nloops) +{ + int i; + + for (i = nloops; i >= 0; i--) { + read_lock(&test_rwlock); + read_unlock(&test_rwlock); + } +} + +static void ref_rwlock_delay_section(const int nloops, const int ndelay) { int i; for (i = nloops; i >= 0; i--) { read_lock(&test_rwlock); + udelay(ndelay); read_unlock(&test_rwlock); } } static struct ref_perf_ops rwlock_ops = { - .init = ref_perf_rwlock_init, - .readsection = ref_perf_rwlock_section, + .init = ref_rwlock_init, + .readsection = ref_rwlock_section, + .delaysection = ref_rwlock_delay_section, .name = "rwlock" }; // Definitions for rwsem static struct rw_semaphore test_rwsem; -static void ref_perf_rwsem_init(void) +static void ref_rwsem_init(void) { init_rwsem(&test_rwsem); } -static void ref_perf_rwsem_section(const int nloops) +static void ref_rwsem_section(const int nloops) { int i; @@ -219,12 +269,32 @@ static void ref_perf_rwsem_section(const int nloops) } } +static void ref_rwsem_delay_section(const int nloops, const int ndelay) +{ + int i; + + for (i = nloops; i >= 0; i--) { + down_read(&test_rwsem); + udelay(ndelay); + up_read(&test_rwsem); + } +} + static struct ref_perf_ops rwsem_ops = { - .init = ref_perf_rwsem_init, - .readsection = ref_perf_rwsem_section, + .init = ref_rwsem_init, + .readsection = ref_rwsem_section, + .delaysection = ref_rwsem_delay_section, .name = "rwsem" }; +static void rcu_perf_one_reader(void) +{ + if (readdelay <= 0) + cur_ops->readsection(loops); + else + cur_ops->delaysection(loops, readdelay); +} + // Reader kthread. Repeatedly does empty RCU read-side // critical section, minimizing update-side interference. static int @@ -265,16 +335,16 @@ repeat: // To reduce noise, do an initial cache-warming invocation, check // in, and then keep warming until everyone has checked in. - cur_ops->readsection(loops); + rcu_perf_one_reader(); if (!atomic_dec_return(&n_warmedup)) while (atomic_read_acquire(&n_warmedup)) - cur_ops->readsection(loops); + rcu_perf_one_reader(); // Also keep interrupts disabled. This also has the effect // of preventing entries into slow path for rcu_read_unlock(). local_irq_save(flags); start = ktime_get_mono_fast_ns(); - cur_ops->readsection(loops); + rcu_perf_one_reader(); duration = ktime_get_mono_fast_ns() - start; local_irq_restore(flags); @@ -284,7 +354,7 @@ repeat: // everyone is done. if (!atomic_dec_return(&n_cooleddown)) while (atomic_read_acquire(&n_cooleddown)) - cur_ops->readsection(loops); + rcu_perf_one_reader(); if (atomic_dec_and_test(&nreaders_exp)) wake_up(&main_wq); @@ -449,8 +519,8 @@ static void ref_perf_print_module_parms(struct ref_perf_ops *cur_ops, const char *tag) { pr_alert("%s" PERF_FLAG - "--- %s: verbose=%d shutdown=%d holdoff=%d loops=%ld nreaders=%d nruns=%d\n", perf_type, tag, - verbose, shutdown, holdoff, loops, nreaders, nruns); + "--- %s: verbose=%d shutdown=%d holdoff=%d loops=%ld nreaders=%d nruns=%d readdelay=%d\n", perf_type, tag, + verbose, shutdown, holdoff, loops, nreaders, nruns, readdelay); } static void -- cgit v1.2.3 From 4dd72a338a07486823037a6b45334d05192c913a Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 29 May 2020 13:11:26 -0700 Subject: refperf: Adjust refperf.loop default value With the various measurement optimizations, 10,000 loops normally suffices. This commit therefore reduces the refperf.loops default value from 10,000,000 to 10,000. Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 57a750bbcaca..063eeb0473a1 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -61,7 +61,7 @@ torture_param(int, verbose, 0, "Enable verbose debugging printk()s"); torture_param(int, holdoff, IS_BUILTIN(CONFIG_RCU_REF_PERF_TEST) ? 10 : 0, "Holdoff time before test start (s)"); // Number of loops per experiment, all readers execute operations concurrently. -torture_param(long, loops, 10000000, "Number of loops per experiment."); +torture_param(long, loops, 10000, "Number of loops per experiment."); // Number of readers, with -1 defaulting to about 75% of the CPUs. torture_param(int, nreaders, -1, "Number of readers, -1 for 75% of CPUs."); // Number of runs. -- cgit v1.2.3 From 847dd70aa971a67b4dfdb8f131428dfb90d88714 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 29 May 2020 14:24:03 -0700 Subject: doc: Document rcuperf's module parameters This commit adds documentation for the rcuperf module parameters. Cc: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 36 +++++++++++++++++++++++++ 1 file changed, 36 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index fb95fad81c79..20cd00b78fc4 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4407,6 +4407,42 @@ reboot_cpu is s[mp]#### with #### being the processor to be used for rebooting. + refperf.holdoff= [KNL] + Set test-start holdoff period. The purpose of + this parameter is to delay the start of the + test until boot completes in order to avoid + interference. + + refperf.loops= [KNL] + Set the number of loops over the synchronization + primitive under test. Increasing this number + reduces noise due to loop start/end overhead, + but the default has already reduced the per-pass + noise to a handful of picoseconds on ca. 2020 + x86 laptops. + + refperf.nreaders= [KNL] + Set number of readers. The default value of -1 + selects N, where N is roughly 75% of the number + of CPUs. A value of zero is an interesting choice. + + refperf.nruns= [KNL] + Set number of runs, each of which is dumped onto + the console log. + + refperf.readdelay= [KNL] + Set the read-side critical-section duration, + measured in microseconds. + + refperf.shutdown= [KNL] + Shut down the system at the end of the performance + test. This defaults to 1 (shut it down) when + rcuperf is built into the kernel and to 0 (leave + it running) when rcuperf is built as a module. + + refperf.verbose= [KNL] + Enable additional printk() statements. + relax_domain_level= [KNL, SMP] Set scheduler's default relax_domain_level. See Documentation/admin-guide/cgroup-v1/cpusets.rst. -- cgit v1.2.3 From 7c944d7c67daee84e3c756bb74ad2f32b28c41cf Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Fri, 29 May 2020 14:36:26 -0700 Subject: refperf: Work around 64-bit division MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A 64-bit division was introduced in refperf, breaking compilation on all 32-bit architectures: kernel/rcu/refperf.o: in function `main_func': refperf.c:(.text+0x57c): undefined reference to `__aeabi_uldivmod' Fix this by using div_u64 to mark the expensive operation. [ paulmck: Update primitive and format per Nathan Chancellor. ] Fixes: bd5b16d6c88d ("refperf: Allow decimal nanoseconds") Reported-by: kbuild test robot Reported-by: Valdis Klētnieks Acked-by: Randy Dunlap # build-tested Signed-off-by: Arnd Bergmann Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 063eeb0473a1..80d449060bdf 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -478,7 +478,7 @@ static int main_func(void *arg) if (torture_must_stop()) goto end; - result_avg[exp] = 1000 * process_durations(nreaders) / (nreaders * loops); + result_avg[exp] = div_u64(1000 * process_durations(nreaders), nreaders * loops); } // Print the average of all experiments @@ -489,9 +489,13 @@ static int main_func(void *arg) strcat(buf, "Runs\tTime(ns)\n"); for (exp = 0; exp < nruns; exp++) { + u64 avg; + u32 rem; + if (errexit) break; - sprintf(buf1, "%d\t%llu.%03d\n", exp + 1, result_avg[exp] / 1000, (int)(result_avg[exp] % 1000)); + avg = div_u64_rem(result_avg[exp], 1000, &rem); + sprintf(buf1, "%d\t%llu.%03u\n", exp + 1, avg, rem); strcat(buf, buf1); } -- cgit v1.2.3 From 918b351d965560c7902ad482cf87049517843ff2 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 31 May 2020 18:14:57 -0700 Subject: refperf: Change readdelay module parameter to nanoseconds The current units of microseconds are too coarse, so this commit changes the units to nanoseconds. However, ndelay is used only for the nanoseconds with udelay being used for whole microseconds. For example, setting refperf.readdelay=1500 results in a udelay(1) followed by an ndelay(500). Suggested-by: Akira Yokosawa [ paulmck: Abstracted delay per Akira feedback and move from 80 to 100 lines. ] [ paulmck: Fix names as suggested by kbuild test robot. ] Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 36 ++++++++++++++++++++++-------------- 1 file changed, 22 insertions(+), 14 deletions(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 80d449060bdf..49fffb9bce77 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -66,8 +66,8 @@ torture_param(long, loops, 10000, "Number of loops per experiment."); torture_param(int, nreaders, -1, "Number of readers, -1 for 75% of CPUs."); // Number of runs. torture_param(int, nruns, 30, "Number of experiments to run."); -// Reader delay in microseconds, 0 for no delay. -torture_param(int, readdelay, 0, "Read-side delay in microseconds."); +// Reader delay in nanoseconds, 0 for no delay. +torture_param(int, readdelay, 0, "Read-side delay in nanoseconds."); #ifdef MODULE # define REFPERF_SHUTDOWN 0 @@ -111,12 +111,20 @@ struct ref_perf_ops { void (*init)(void); void (*cleanup)(void); void (*readsection)(const int nloops); - void (*delaysection)(const int nloops, const int ndelay); + void (*delaysection)(const int nloops, const int udl, const int ndl); const char *name; }; static struct ref_perf_ops *cur_ops; +static void un_delay(const int udl, const int ndl) +{ + if (udl) + udelay(udl); + if (ndl) + ndelay(ndl); +} + static void ref_rcu_read_section(const int nloops) { int i; @@ -127,13 +135,13 @@ static void ref_rcu_read_section(const int nloops) } } -static void ref_rcu_delay_section(const int nloops, const int ndelay) +static void ref_rcu_delay_section(const int nloops, const int udl, const int ndl) { int i; for (i = nloops; i >= 0; i--) { rcu_read_lock(); - udelay(ndelay); + un_delay(udl, ndl); rcu_read_unlock(); } } @@ -165,14 +173,14 @@ static void srcu_ref_perf_read_section(const int nloops) } } -static void srcu_ref_perf_delay_section(const int nloops, const int ndelay) +static void srcu_ref_perf_delay_section(const int nloops, const int udl, const int ndl) { int i; int idx; for (i = nloops; i >= 0; i--) { idx = srcu_read_lock(srcu_ctlp); - udelay(ndelay); + un_delay(udl, ndl); srcu_read_unlock(srcu_ctlp, idx); } } @@ -197,13 +205,13 @@ static void ref_refcnt_section(const int nloops) } } -static void ref_refcnt_delay_section(const int nloops, const int ndelay) +static void ref_refcnt_delay_section(const int nloops, const int udl, const int ndl) { int i; for (i = nloops; i >= 0; i--) { atomic_inc(&refcnt); - udelay(ndelay); + un_delay(udl, ndl); atomic_dec(&refcnt); } } @@ -233,13 +241,13 @@ static void ref_rwlock_section(const int nloops) } } -static void ref_rwlock_delay_section(const int nloops, const int ndelay) +static void ref_rwlock_delay_section(const int nloops, const int udl, const int ndl) { int i; for (i = nloops; i >= 0; i--) { read_lock(&test_rwlock); - udelay(ndelay); + un_delay(udl, ndl); read_unlock(&test_rwlock); } } @@ -269,13 +277,13 @@ static void ref_rwsem_section(const int nloops) } } -static void ref_rwsem_delay_section(const int nloops, const int ndelay) +static void ref_rwsem_delay_section(const int nloops, const int udl, const int ndl) { int i; for (i = nloops; i >= 0; i--) { down_read(&test_rwsem); - udelay(ndelay); + un_delay(udl, ndl); up_read(&test_rwsem); } } @@ -292,7 +300,7 @@ static void rcu_perf_one_reader(void) if (readdelay <= 0) cur_ops->readsection(loops); else - cur_ops->delaysection(loops, readdelay); + cur_ops->delaysection(loops, readdelay / 1000, readdelay % 1000); } // Reader kthread. Repeatedly does empty RCU read-side -- cgit v1.2.3 From 72bb749e7048d0a8d7663b59ec1a33bd56c51083 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 2 Jun 2020 08:34:41 -0700 Subject: refperf: Add test for RCU Tasks Trace readers. This commit adds testing for RCU Tasks Trace readers to the refperf module. Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 33 +++++++++++++++++++++++++++++++-- 1 file changed, 31 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 49fffb9bce77..da7de9ac548d 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include #include @@ -157,7 +158,6 @@ static struct ref_perf_ops rcu_ops = { .name = "rcu" }; - // Definitions for SRCU ref perf testing. DEFINE_STATIC_SRCU(srcu_refctl_perf); static struct srcu_struct *srcu_ctlp = &srcu_refctl_perf; @@ -192,6 +192,35 @@ static struct ref_perf_ops srcu_ops = { .name = "srcu" }; +// Definitions for RCU Tasks Trace ref perf testing. +static void rcu_trace_ref_perf_read_section(const int nloops) +{ + int i; + + for (i = nloops; i >= 0; i--) { + rcu_read_lock_trace(); + rcu_read_unlock_trace(); + } +} + +static void rcu_trace_ref_perf_delay_section(const int nloops, const int udl, const int ndl) +{ + int i; + + for (i = nloops; i >= 0; i--) { + rcu_read_lock_trace(); + un_delay(udl, ndl); + rcu_read_unlock_trace(); + } +} + +static struct ref_perf_ops rcu_trace_ops = { + .init = rcu_sync_perf_init, + .readsection = rcu_trace_ref_perf_read_section, + .delaysection = rcu_trace_ref_perf_delay_section, + .name = "rcu-trace" +}; + // Definitions for reference count static atomic_t refcnt; @@ -584,7 +613,7 @@ ref_perf_init(void) long i; int firsterr = 0; static struct ref_perf_ops *perf_ops[] = { - &rcu_ops, &srcu_ops, &refcnt_ops, &rwlock_ops, &rwsem_ops, + &rcu_ops, &srcu_ops, &rcu_trace_ops, &refcnt_ops, &rwlock_ops, &rwsem_ops, }; if (!torture_init_begin(perf_type, verbose)) -- cgit v1.2.3 From e13ef442fe522fa1f604efec8c899a0e1fc3d426 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 3 Jun 2020 11:56:34 -0700 Subject: refperf: Add test for RCU Tasks readers This commit adds testing for RCU Tasks readers to the refperf module. This also applies to RCU Rude readers, as both flavors have empty (as in non-existent) read-side markers. Signed-off-by: Paul E. McKenney --- kernel/rcu/refperf.c | 28 +++++++++++++++++++++++++++- 1 file changed, 27 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index da7de9ac548d..2bfdcdcb6bd1 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -192,6 +192,31 @@ static struct ref_perf_ops srcu_ops = { .name = "srcu" }; +// Definitions for RCU Tasks ref perf testing: Empty read markers. +// These definitions also work for RCU Rude readers. +static void rcu_tasks_ref_perf_read_section(const int nloops) +{ + int i; + + for (i = nloops; i >= 0; i--) + continue; +} + +static void rcu_tasks_ref_perf_delay_section(const int nloops, const int udl, const int ndl) +{ + int i; + + for (i = nloops; i >= 0; i--) + un_delay(udl, ndl); +} + +static struct ref_perf_ops rcu_tasks_ops = { + .init = rcu_sync_perf_init, + .readsection = rcu_tasks_ref_perf_read_section, + .delaysection = rcu_tasks_ref_perf_delay_section, + .name = "rcu-tasks" +}; + // Definitions for RCU Tasks Trace ref perf testing. static void rcu_trace_ref_perf_read_section(const int nloops) { @@ -613,7 +638,8 @@ ref_perf_init(void) long i; int firsterr = 0; static struct ref_perf_ops *perf_ops[] = { - &rcu_ops, &srcu_ops, &rcu_trace_ops, &refcnt_ops, &rwlock_ops, &rwsem_ops, + &rcu_ops, &srcu_ops, &rcu_trace_ops, &rcu_tasks_ops, + &refcnt_ops, &rwlock_ops, &rwsem_ops, }; if (!torture_init_begin(perf_type, verbose)) -- cgit v1.2.3 From c7dcf8106f7570b133b05ff68fd4100064965d9d Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 12 Jun 2020 13:11:29 -0700 Subject: rcu-tasks: Fix synchronize_rcu_tasks_trace() header comment The synchronize_rcu_tasks_trace() header comment incorrectly claims that any number of things delimit RCU Tasks Trace read-side critical sections, when in fact only rcu_read_lock_trace() and rcu_read_unlock_trace() do so. This commit therefore fixes this comment, and, while in the area, fixes a typo in the rcu_read_lock_trace() header comment. Reported-by: Alexei Starovoitov Signed-off-by: Paul E. McKenney --- include/linux/rcupdate_trace.h | 4 ++-- kernel/rcu/tasks.h | 9 ++++----- 2 files changed, 6 insertions(+), 7 deletions(-) diff --git a/include/linux/rcupdate_trace.h b/include/linux/rcupdate_trace.h index 4c25a41f8b27..d9015aac78c6 100644 --- a/include/linux/rcupdate_trace.h +++ b/include/linux/rcupdate_trace.h @@ -36,8 +36,8 @@ void rcu_read_unlock_trace_special(struct task_struct *t, int nesting); /** * rcu_read_lock_trace - mark beginning of RCU-trace read-side critical section * - * When synchronize_rcu_trace() is invoked by one task, then that task - * is guaranteed to block until all other tasks exit their read-side + * When synchronize_rcu_tasks_trace() is invoked by one task, then that + * task is guaranteed to block until all other tasks exit their read-side * critical sections. Similarly, if call_rcu_trace() is invoked on one * task while other tasks are within RCU read-side critical sections, * invocation of the corresponding RCU callback is deferred until after diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index ce23f6cc5043..a77298c1d126 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -1118,11 +1118,10 @@ EXPORT_SYMBOL_GPL(call_rcu_tasks_trace); * synchronize_rcu_tasks_trace - wait for a trace rcu-tasks grace period * * Control will return to the caller some time after a trace rcu-tasks - * grace period has elapsed, in other words after all currently - * executing rcu-tasks read-side critical sections have elapsed. These - * read-side critical sections are delimited by calls to schedule(), - * cond_resched_tasks_rcu_qs(), userspace execution, and (in theory, - * anyway) cond_resched(). + * grace period has elapsed, in other words after all currently executing + * rcu-tasks read-side critical sections have elapsed. These read-side + * critical sections are delimited by calls to rcu_read_lock_trace() + * and rcu_read_unlock_trace(). * * This is a very specialized primitive, intended only for a few uses in * tracing and other situations requiring manipulation of function preambles -- cgit v1.2.3 From 8e4ec3d02b549a731c94b4bcddff212bb92cdbaf Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 17 Jun 2020 11:33:54 -0700 Subject: refperf: Rename RCU_REF_PERF_TEST to RCU_REF_SCALE_TEST The old Kconfig option name is all too easy to conflate with the unrelated "perf" feature, so this commit renames RCU_REF_PERF_TEST to RCU_REF_SCALE_TEST. Reported-by: Ingo Molnar Signed-off-by: Paul E. McKenney --- kernel/rcu/Kconfig.debug | 4 ++-- kernel/rcu/Makefile | 2 +- kernel/rcu/refperf.c | 6 +++--- tools/testing/selftests/rcutorture/configs/refperf/CFcommon | 2 +- 4 files changed, 7 insertions(+), 7 deletions(-) diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug index 858765b7f644..3cf6132a4bb9 100644 --- a/kernel/rcu/Kconfig.debug +++ b/kernel/rcu/Kconfig.debug @@ -61,8 +61,8 @@ config RCU_TORTURE_TEST Say M if you want the RCU torture tests to build as a module. Say N if you are unsure. -config RCU_REF_PERF_TEST - tristate "Performance tests for read-side synchronization (RCU and others)" +config RCU_REF_SCALE_TEST + tristate "Scalability tests for read-side synchronization (RCU and others)" depends on DEBUG_KERNEL select TORTURE_TEST select SRCU diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile index ba7d82609cbe..45d562de279a 100644 --- a/kernel/rcu/Makefile +++ b/kernel/rcu/Makefile @@ -12,7 +12,7 @@ obj-$(CONFIG_TREE_SRCU) += srcutree.o obj-$(CONFIG_TINY_SRCU) += srcutiny.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o obj-$(CONFIG_RCU_PERF_TEST) += rcuperf.o -obj-$(CONFIG_RCU_REF_PERF_TEST) += refperf.o +obj-$(CONFIG_RCU_REF_SCALE_TEST) += refperf.o obj-$(CONFIG_TREE_RCU) += tree.o obj-$(CONFIG_TINY_RCU) += tiny.o obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c index 2bfdcdcb6bd1..7c980573acbe 100644 --- a/kernel/rcu/refperf.c +++ b/kernel/rcu/refperf.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0+ // -// Performance test comparing RCU vs other mechanisms +// Scalability test comparing RCU vs other mechanisms // for acquiring references on objects. // // Copyright (C) Google, 2020. @@ -59,7 +59,7 @@ MODULE_PARM_DESC(perf_type, "Type of test (rcu, srcu, refcnt, rwsem, rwlock."); torture_param(int, verbose, 0, "Enable verbose debugging printk()s"); // Wait until there are multiple CPUs before starting test. -torture_param(int, holdoff, IS_BUILTIN(CONFIG_RCU_REF_PERF_TEST) ? 10 : 0, +torture_param(int, holdoff, IS_BUILTIN(CONFIG_RCU_REF_SCALE_TEST) ? 10 : 0, "Holdoff time before test start (s)"); // Number of loops per experiment, all readers execute operations concurrently. torture_param(long, loops, 10000, "Number of loops per experiment."); @@ -656,7 +656,7 @@ ref_perf_init(void) for (i = 0; i < ARRAY_SIZE(perf_ops); i++) pr_cont(" %s", perf_ops[i]->name); pr_cont("\n"); - WARN_ON(!IS_MODULE(CONFIG_RCU_REF_PERF_TEST)); + WARN_ON(!IS_MODULE(CONFIG_RCU_REF_SCALE_TEST)); firsterr = -EINVAL; cur_ops = NULL; goto unwind; diff --git a/tools/testing/selftests/rcutorture/configs/refperf/CFcommon b/tools/testing/selftests/rcutorture/configs/refperf/CFcommon index 8ba5ba207503..a98b58b54bb1 100644 --- a/tools/testing/selftests/rcutorture/configs/refperf/CFcommon +++ b/tools/testing/selftests/rcutorture/configs/refperf/CFcommon @@ -1,2 +1,2 @@ -CONFIG_RCU_REF_PERF_TEST=y +CONFIG_RCU_REF_SCALE_TEST=y CONFIG_PRINTK_TIME=y -- cgit v1.2.3 From 1fbeb3a8c4de29433a8d230ee600b13d369b6c0f Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 17 Jun 2020 11:53:53 -0700 Subject: refperf: Rename refperf.c to refscale.c and change internal names This commit further avoids conflation of refperf with the kernel's perf feature by renaming kernel/rcu/refperf.c to kernel/rcu/refscale.c, and also by similarly renaming the functions and variables inside this file. This has the side effect of changing the names of the kernel boot parameters, so kernel-parameters.txt and ver_functions.sh are also updated. The rcutorture --torture type remains refperf, and this will be addressed in a separate commit. Reported-by: Ingo Molnar Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 17 +- kernel/rcu/Makefile | 2 +- kernel/rcu/refperf.c | 717 --------------------- kernel/rcu/refscale.c | 717 +++++++++++++++++++++ .../rcutorture/configs/refperf/ver_functions.sh | 4 +- 5 files changed, 730 insertions(+), 727 deletions(-) delete mode 100644 kernel/rcu/refperf.c create mode 100644 kernel/rcu/refscale.c diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 20cd00b78fc4..a4e4e0f6a550 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4407,13 +4407,13 @@ reboot_cpu is s[mp]#### with #### being the processor to be used for rebooting. - refperf.holdoff= [KNL] + refscale.holdoff= [KNL] Set test-start holdoff period. The purpose of this parameter is to delay the start of the test until boot completes in order to avoid interference. - refperf.loops= [KNL] + refscale.loops= [KNL] Set the number of loops over the synchronization primitive under test. Increasing this number reduces noise due to loop start/end overhead, @@ -4421,26 +4421,29 @@ noise to a handful of picoseconds on ca. 2020 x86 laptops. - refperf.nreaders= [KNL] + refscale.nreaders= [KNL] Set number of readers. The default value of -1 selects N, where N is roughly 75% of the number of CPUs. A value of zero is an interesting choice. - refperf.nruns= [KNL] + refscale.nruns= [KNL] Set number of runs, each of which is dumped onto the console log. - refperf.readdelay= [KNL] + refscale.readdelay= [KNL] Set the read-side critical-section duration, measured in microseconds. - refperf.shutdown= [KNL] + refscale.scale_type= [KNL] + Specify the read-protection implementation to test. + + refscale.shutdown= [KNL] Shut down the system at the end of the performance test. This defaults to 1 (shut it down) when rcuperf is built into the kernel and to 0 (leave it running) when rcuperf is built as a module. - refperf.verbose= [KNL] + refscale.verbose= [KNL] Enable additional printk() statements. relax_domain_level= diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile index 45d562de279a..95f5117ef8da 100644 --- a/kernel/rcu/Makefile +++ b/kernel/rcu/Makefile @@ -12,7 +12,7 @@ obj-$(CONFIG_TREE_SRCU) += srcutree.o obj-$(CONFIG_TINY_SRCU) += srcutiny.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o obj-$(CONFIG_RCU_PERF_TEST) += rcuperf.o -obj-$(CONFIG_RCU_REF_SCALE_TEST) += refperf.o +obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o obj-$(CONFIG_TREE_RCU) += tree.o obj-$(CONFIG_TINY_RCU) += tiny.o obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o diff --git a/kernel/rcu/refperf.c b/kernel/rcu/refperf.c deleted file mode 100644 index 7c980573acbe..000000000000 --- a/kernel/rcu/refperf.c +++ /dev/null @@ -1,717 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0+ -// -// Scalability test comparing RCU vs other mechanisms -// for acquiring references on objects. -// -// Copyright (C) Google, 2020. -// -// Author: Joel Fernandes - -#define pr_fmt(fmt) fmt - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include "rcu.h" - -#define PERF_FLAG "-ref-perf: " - -#define PERFOUT(s, x...) \ - pr_alert("%s" PERF_FLAG s, perf_type, ## x) - -#define VERBOSE_PERFOUT(s, x...) \ - do { if (verbose) pr_alert("%s" PERF_FLAG s, perf_type, ## x); } while (0) - -#define VERBOSE_PERFOUT_ERRSTRING(s, x...) \ - do { if (verbose) pr_alert("%s" PERF_FLAG "!!! " s, perf_type, ## x); } while (0) - -MODULE_LICENSE("GPL"); -MODULE_AUTHOR("Joel Fernandes (Google) "); - -static char *perf_type = "rcu"; -module_param(perf_type, charp, 0444); -MODULE_PARM_DESC(perf_type, "Type of test (rcu, srcu, refcnt, rwsem, rwlock."); - -torture_param(int, verbose, 0, "Enable verbose debugging printk()s"); - -// Wait until there are multiple CPUs before starting test. -torture_param(int, holdoff, IS_BUILTIN(CONFIG_RCU_REF_SCALE_TEST) ? 10 : 0, - "Holdoff time before test start (s)"); -// Number of loops per experiment, all readers execute operations concurrently. -torture_param(long, loops, 10000, "Number of loops per experiment."); -// Number of readers, with -1 defaulting to about 75% of the CPUs. -torture_param(int, nreaders, -1, "Number of readers, -1 for 75% of CPUs."); -// Number of runs. -torture_param(int, nruns, 30, "Number of experiments to run."); -// Reader delay in nanoseconds, 0 for no delay. -torture_param(int, readdelay, 0, "Read-side delay in nanoseconds."); - -#ifdef MODULE -# define REFPERF_SHUTDOWN 0 -#else -# define REFPERF_SHUTDOWN 1 -#endif - -torture_param(bool, shutdown, REFPERF_SHUTDOWN, - "Shutdown at end of performance tests."); - -struct reader_task { - struct task_struct *task; - int start_reader; - wait_queue_head_t wq; - u64 last_duration_ns; -}; - -static struct task_struct *shutdown_task; -static wait_queue_head_t shutdown_wq; - -static struct task_struct *main_task; -static wait_queue_head_t main_wq; -static int shutdown_start; - -static struct reader_task *reader_tasks; - -// Number of readers that are part of the current experiment. -static atomic_t nreaders_exp; - -// Use to wait for all threads to start. -static atomic_t n_init; -static atomic_t n_started; -static atomic_t n_warmedup; -static atomic_t n_cooleddown; - -// Track which experiment is currently running. -static int exp_idx; - -// Operations vector for selecting different types of tests. -struct ref_perf_ops { - void (*init)(void); - void (*cleanup)(void); - void (*readsection)(const int nloops); - void (*delaysection)(const int nloops, const int udl, const int ndl); - const char *name; -}; - -static struct ref_perf_ops *cur_ops; - -static void un_delay(const int udl, const int ndl) -{ - if (udl) - udelay(udl); - if (ndl) - ndelay(ndl); -} - -static void ref_rcu_read_section(const int nloops) -{ - int i; - - for (i = nloops; i >= 0; i--) { - rcu_read_lock(); - rcu_read_unlock(); - } -} - -static void ref_rcu_delay_section(const int nloops, const int udl, const int ndl) -{ - int i; - - for (i = nloops; i >= 0; i--) { - rcu_read_lock(); - un_delay(udl, ndl); - rcu_read_unlock(); - } -} - -static void rcu_sync_perf_init(void) -{ -} - -static struct ref_perf_ops rcu_ops = { - .init = rcu_sync_perf_init, - .readsection = ref_rcu_read_section, - .delaysection = ref_rcu_delay_section, - .name = "rcu" -}; - -// Definitions for SRCU ref perf testing. -DEFINE_STATIC_SRCU(srcu_refctl_perf); -static struct srcu_struct *srcu_ctlp = &srcu_refctl_perf; - -static void srcu_ref_perf_read_section(const int nloops) -{ - int i; - int idx; - - for (i = nloops; i >= 0; i--) { - idx = srcu_read_lock(srcu_ctlp); - srcu_read_unlock(srcu_ctlp, idx); - } -} - -static void srcu_ref_perf_delay_section(const int nloops, const int udl, const int ndl) -{ - int i; - int idx; - - for (i = nloops; i >= 0; i--) { - idx = srcu_read_lock(srcu_ctlp); - un_delay(udl, ndl); - srcu_read_unlock(srcu_ctlp, idx); - } -} - -static struct ref_perf_ops srcu_ops = { - .init = rcu_sync_perf_init, - .readsection = srcu_ref_perf_read_section, - .delaysection = srcu_ref_perf_delay_section, - .name = "srcu" -}; - -// Definitions for RCU Tasks ref perf testing: Empty read markers. -// These definitions also work for RCU Rude readers. -static void rcu_tasks_ref_perf_read_section(const int nloops) -{ - int i; - - for (i = nloops; i >= 0; i--) - continue; -} - -static void rcu_tasks_ref_perf_delay_section(const int nloops, const int udl, const int ndl) -{ - int i; - - for (i = nloops; i >= 0; i--) - un_delay(udl, ndl); -} - -static struct ref_perf_ops rcu_tasks_ops = { - .init = rcu_sync_perf_init, - .readsection = rcu_tasks_ref_perf_read_section, - .delaysection = rcu_tasks_ref_perf_delay_section, - .name = "rcu-tasks" -}; - -// Definitions for RCU Tasks Trace ref perf testing. -static void rcu_trace_ref_perf_read_section(const int nloops) -{ - int i; - - for (i = nloops; i >= 0; i--) { - rcu_read_lock_trace(); - rcu_read_unlock_trace(); - } -} - -static void rcu_trace_ref_perf_delay_section(const int nloops, const int udl, const int ndl) -{ - int i; - - for (i = nloops; i >= 0; i--) { - rcu_read_lock_trace(); - un_delay(udl, ndl); - rcu_read_unlock_trace(); - } -} - -static struct ref_perf_ops rcu_trace_ops = { - .init = rcu_sync_perf_init, - .readsection = rcu_trace_ref_perf_read_section, - .delaysection = rcu_trace_ref_perf_delay_section, - .name = "rcu-trace" -}; - -// Definitions for reference count -static atomic_t refcnt; - -static void ref_refcnt_section(const int nloops) -{ - int i; - - for (i = nloops; i >= 0; i--) { - atomic_inc(&refcnt); - atomic_dec(&refcnt); - } -} - -static void ref_refcnt_delay_section(const int nloops, const int udl, const int ndl) -{ - int i; - - for (i = nloops; i >= 0; i--) { - atomic_inc(&refcnt); - un_delay(udl, ndl); - atomic_dec(&refcnt); - } -} - -static struct ref_perf_ops refcnt_ops = { - .init = rcu_sync_perf_init, - .readsection = ref_refcnt_section, - .delaysection = ref_refcnt_delay_section, - .name = "refcnt" -}; - -// Definitions for rwlock -static rwlock_t test_rwlock; - -static void ref_rwlock_init(void) -{ - rwlock_init(&test_rwlock); -} - -static void ref_rwlock_section(const int nloops) -{ - int i; - - for (i = nloops; i >= 0; i--) { - read_lock(&test_rwlock); - read_unlock(&test_rwlock); - } -} - -static void ref_rwlock_delay_section(const int nloops, const int udl, const int ndl) -{ - int i; - - for (i = nloops; i >= 0; i--) { - read_lock(&test_rwlock); - un_delay(udl, ndl); - read_unlock(&test_rwlock); - } -} - -static struct ref_perf_ops rwlock_ops = { - .init = ref_rwlock_init, - .readsection = ref_rwlock_section, - .delaysection = ref_rwlock_delay_section, - .name = "rwlock" -}; - -// Definitions for rwsem -static struct rw_semaphore test_rwsem; - -static void ref_rwsem_init(void) -{ - init_rwsem(&test_rwsem); -} - -static void ref_rwsem_section(const int nloops) -{ - int i; - - for (i = nloops; i >= 0; i--) { - down_read(&test_rwsem); - up_read(&test_rwsem); - } -} - -static void ref_rwsem_delay_section(const int nloops, const int udl, const int ndl) -{ - int i; - - for (i = nloops; i >= 0; i--) { - down_read(&test_rwsem); - un_delay(udl, ndl); - up_read(&test_rwsem); - } -} - -static struct ref_perf_ops rwsem_ops = { - .init = ref_rwsem_init, - .readsection = ref_rwsem_section, - .delaysection = ref_rwsem_delay_section, - .name = "rwsem" -}; - -static void rcu_perf_one_reader(void) -{ - if (readdelay <= 0) - cur_ops->readsection(loops); - else - cur_ops->delaysection(loops, readdelay / 1000, readdelay % 1000); -} - -// Reader kthread. Repeatedly does empty RCU read-side -// critical section, minimizing update-side interference. -static int -ref_perf_reader(void *arg) -{ - unsigned long flags; - long me = (long)arg; - struct reader_task *rt = &(reader_tasks[me]); - u64 start; - s64 duration; - - VERBOSE_PERFOUT("ref_perf_reader %ld: task started", me); - set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids)); - set_user_nice(current, MAX_NICE); - atomic_inc(&n_init); - if (holdoff) - schedule_timeout_interruptible(holdoff * HZ); -repeat: - VERBOSE_PERFOUT("ref_perf_reader %ld: waiting to start next experiment on cpu %d", me, smp_processor_id()); - - // Wait for signal that this reader can start. - wait_event(rt->wq, (atomic_read(&nreaders_exp) && smp_load_acquire(&rt->start_reader)) || - torture_must_stop()); - - if (torture_must_stop()) - goto end; - - // Make sure that the CPU is affinitized appropriately during testing. - WARN_ON_ONCE(smp_processor_id() != me); - - WRITE_ONCE(rt->start_reader, 0); - if (!atomic_dec_return(&n_started)) - while (atomic_read_acquire(&n_started)) - cpu_relax(); - - VERBOSE_PERFOUT("ref_perf_reader %ld: experiment %d started", me, exp_idx); - - - // To reduce noise, do an initial cache-warming invocation, check - // in, and then keep warming until everyone has checked in. - rcu_perf_one_reader(); - if (!atomic_dec_return(&n_warmedup)) - while (atomic_read_acquire(&n_warmedup)) - rcu_perf_one_reader(); - // Also keep interrupts disabled. This also has the effect - // of preventing entries into slow path for rcu_read_unlock(). - local_irq_save(flags); - start = ktime_get_mono_fast_ns(); - - rcu_perf_one_reader(); - - duration = ktime_get_mono_fast_ns() - start; - local_irq_restore(flags); - - rt->last_duration_ns = WARN_ON_ONCE(duration < 0) ? 0 : duration; - // To reduce runtime-skew noise, do maintain-load invocations until - // everyone is done. - if (!atomic_dec_return(&n_cooleddown)) - while (atomic_read_acquire(&n_cooleddown)) - rcu_perf_one_reader(); - - if (atomic_dec_and_test(&nreaders_exp)) - wake_up(&main_wq); - - VERBOSE_PERFOUT("ref_perf_reader %ld: experiment %d ended, (readers remaining=%d)", - me, exp_idx, atomic_read(&nreaders_exp)); - - if (!torture_must_stop()) - goto repeat; -end: - torture_kthread_stopping("ref_perf_reader"); - return 0; -} - -static void reset_readers(void) -{ - int i; - struct reader_task *rt; - - for (i = 0; i < nreaders; i++) { - rt = &(reader_tasks[i]); - - rt->last_duration_ns = 0; - } -} - -// Print the results of each reader and return the sum of all their durations. -static u64 process_durations(int n) -{ - int i; - struct reader_task *rt; - char buf1[64]; - char *buf; - u64 sum = 0; - - buf = kmalloc(128 + nreaders * 32, GFP_KERNEL); - if (!buf) - return 0; - buf[0] = 0; - sprintf(buf, "Experiment #%d (Format: :)", - exp_idx); - - for (i = 0; i < n && !torture_must_stop(); i++) { - rt = &(reader_tasks[i]); - sprintf(buf1, "%d: %llu\t", i, rt->last_duration_ns); - - if (i % 5 == 0) - strcat(buf, "\n"); - strcat(buf, buf1); - - sum += rt->last_duration_ns; - } - strcat(buf, "\n"); - - PERFOUT("%s\n", buf); - - kfree(buf); - return sum; -} - -// The main_func is the main orchestrator, it performs a bunch of -// experiments. For every experiment, it orders all the readers -// involved to start and waits for them to finish the experiment. It -// then reads their timestamps and starts the next experiment. Each -// experiment progresses from 1 concurrent reader to N of them at which -// point all the timestamps are printed. -static int main_func(void *arg) -{ - bool errexit = false; - int exp, r; - char buf1[64]; - char *buf; - u64 *result_avg; - - set_cpus_allowed_ptr(current, cpumask_of(nreaders % nr_cpu_ids)); - set_user_nice(current, MAX_NICE); - - VERBOSE_PERFOUT("main_func task started"); - result_avg = kzalloc(nruns * sizeof(*result_avg), GFP_KERNEL); - buf = kzalloc(64 + nruns * 32, GFP_KERNEL); - if (!result_avg || !buf) { - VERBOSE_PERFOUT_ERRSTRING("out of memory"); - errexit = true; - } - if (holdoff) - schedule_timeout_interruptible(holdoff * HZ); - - // Wait for all threads to start. - atomic_inc(&n_init); - while (atomic_read(&n_init) < nreaders + 1) - schedule_timeout_uninterruptible(1); - - // Start exp readers up per experiment - for (exp = 0; exp < nruns && !torture_must_stop(); exp++) { - if (errexit) - break; - if (torture_must_stop()) - goto end; - - reset_readers(); - atomic_set(&nreaders_exp, nreaders); - atomic_set(&n_started, nreaders); - atomic_set(&n_warmedup, nreaders); - atomic_set(&n_cooleddown, nreaders); - - exp_idx = exp; - - for (r = 0; r < nreaders; r++) { - smp_store_release(&reader_tasks[r].start_reader, 1); - wake_up(&reader_tasks[r].wq); - } - - VERBOSE_PERFOUT("main_func: experiment started, waiting for %d readers", - nreaders); - - wait_event(main_wq, - !atomic_read(&nreaders_exp) || torture_must_stop()); - - VERBOSE_PERFOUT("main_func: experiment ended"); - - if (torture_must_stop()) - goto end; - - result_avg[exp] = div_u64(1000 * process_durations(nreaders), nreaders * loops); - } - - // Print the average of all experiments - PERFOUT("END OF TEST. Calculating average duration per loop (nanoseconds)...\n"); - - buf[0] = 0; - strcat(buf, "\n"); - strcat(buf, "Runs\tTime(ns)\n"); - - for (exp = 0; exp < nruns; exp++) { - u64 avg; - u32 rem; - - if (errexit) - break; - avg = div_u64_rem(result_avg[exp], 1000, &rem); - sprintf(buf1, "%d\t%llu.%03u\n", exp + 1, avg, rem); - strcat(buf, buf1); - } - - if (!errexit) - PERFOUT("%s", buf); - - // This will shutdown everything including us. - if (shutdown) { - shutdown_start = 1; - wake_up(&shutdown_wq); - } - - // Wait for torture to stop us - while (!torture_must_stop()) - schedule_timeout_uninterruptible(1); - -end: - torture_kthread_stopping("main_func"); - kfree(result_avg); - kfree(buf); - return 0; -} - -static void -ref_perf_print_module_parms(struct ref_perf_ops *cur_ops, const char *tag) -{ - pr_alert("%s" PERF_FLAG - "--- %s: verbose=%d shutdown=%d holdoff=%d loops=%ld nreaders=%d nruns=%d readdelay=%d\n", perf_type, tag, - verbose, shutdown, holdoff, loops, nreaders, nruns, readdelay); -} - -static void -ref_perf_cleanup(void) -{ - int i; - - if (torture_cleanup_begin()) - return; - - if (!cur_ops) { - torture_cleanup_end(); - return; - } - - if (reader_tasks) { - for (i = 0; i < nreaders; i++) - torture_stop_kthread("ref_perf_reader", - reader_tasks[i].task); - } - kfree(reader_tasks); - - torture_stop_kthread("main_task", main_task); - kfree(main_task); - - // Do perf-type-specific cleanup operations. - if (cur_ops->cleanup != NULL) - cur_ops->cleanup(); - - torture_cleanup_end(); -} - -// Shutdown kthread. Just waits to be awakened, then shuts down system. -static int -ref_perf_shutdown(void *arg) -{ - wait_event(shutdown_wq, shutdown_start); - - smp_mb(); // Wake before output. - ref_perf_cleanup(); - kernel_power_off(); - - return -EINVAL; -} - -static int __init -ref_perf_init(void) -{ - long i; - int firsterr = 0; - static struct ref_perf_ops *perf_ops[] = { - &rcu_ops, &srcu_ops, &rcu_trace_ops, &rcu_tasks_ops, - &refcnt_ops, &rwlock_ops, &rwsem_ops, - }; - - if (!torture_init_begin(perf_type, verbose)) - return -EBUSY; - - for (i = 0; i < ARRAY_SIZE(perf_ops); i++) { - cur_ops = perf_ops[i]; - if (strcmp(perf_type, cur_ops->name) == 0) - break; - } - if (i == ARRAY_SIZE(perf_ops)) { - pr_alert("rcu-perf: invalid perf type: \"%s\"\n", perf_type); - pr_alert("rcu-perf types:"); - for (i = 0; i < ARRAY_SIZE(perf_ops); i++) - pr_cont(" %s", perf_ops[i]->name); - pr_cont("\n"); - WARN_ON(!IS_MODULE(CONFIG_RCU_REF_SCALE_TEST)); - firsterr = -EINVAL; - cur_ops = NULL; - goto unwind; - } - if (cur_ops->init) - cur_ops->init(); - - ref_perf_print_module_parms(cur_ops, "Start of test"); - - // Shutdown task - if (shutdown) { - init_waitqueue_head(&shutdown_wq); - firsterr = torture_create_kthread(ref_perf_shutdown, NULL, - shutdown_task); - if (firsterr) - goto unwind; - schedule_timeout_uninterruptible(1); - } - - // Reader tasks (default to ~75% of online CPUs). - if (nreaders < 0) - nreaders = (num_online_cpus() >> 1) + (num_online_cpus() >> 2); - reader_tasks = kcalloc(nreaders, sizeof(reader_tasks[0]), - GFP_KERNEL); - if (!reader_tasks) { - VERBOSE_PERFOUT_ERRSTRING("out of memory"); - firsterr = -ENOMEM; - goto unwind; - } - - VERBOSE_PERFOUT("Starting %d reader threads\n", nreaders); - - for (i = 0; i < nreaders; i++) { - firsterr = torture_create_kthread(ref_perf_reader, (void *)i, - reader_tasks[i].task); - if (firsterr) - goto unwind; - - init_waitqueue_head(&(reader_tasks[i].wq)); - } - - // Main Task - init_waitqueue_head(&main_wq); - firsterr = torture_create_kthread(main_func, NULL, main_task); - if (firsterr) - goto unwind; - - torture_init_end(); - return 0; - -unwind: - torture_init_end(); - ref_perf_cleanup(); - return firsterr; -} - -module_init(ref_perf_init); -module_exit(ref_perf_cleanup); diff --git a/kernel/rcu/refscale.c b/kernel/rcu/refscale.c new file mode 100644 index 000000000000..d9291f883b54 --- /dev/null +++ b/kernel/rcu/refscale.c @@ -0,0 +1,717 @@ +// SPDX-License-Identifier: GPL-2.0+ +// +// Scalability test comparing RCU vs other mechanisms +// for acquiring references on objects. +// +// Copyright (C) Google, 2020. +// +// Author: Joel Fernandes + +#define pr_fmt(fmt) fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "rcu.h" + +#define SCALE_FLAG "-ref-scale: " + +#define SCALEOUT(s, x...) \ + pr_alert("%s" SCALE_FLAG s, scale_type, ## x) + +#define VERBOSE_SCALEOUT(s, x...) \ + do { if (verbose) pr_alert("%s" SCALE_FLAG s, scale_type, ## x); } while (0) + +#define VERBOSE_SCALEOUT_ERRSTRING(s, x...) \ + do { if (verbose) pr_alert("%s" SCALE_FLAG "!!! " s, scale_type, ## x); } while (0) + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Joel Fernandes (Google) "); + +static char *scale_type = "rcu"; +module_param(scale_type, charp, 0444); +MODULE_PARM_DESC(scale_type, "Type of test (rcu, srcu, refcnt, rwsem, rwlock."); + +torture_param(int, verbose, 0, "Enable verbose debugging printk()s"); + +// Wait until there are multiple CPUs before starting test. +torture_param(int, holdoff, IS_BUILTIN(CONFIG_RCU_REF_SCALE_TEST) ? 10 : 0, + "Holdoff time before test start (s)"); +// Number of loops per experiment, all readers execute operations concurrently. +torture_param(long, loops, 10000, "Number of loops per experiment."); +// Number of readers, with -1 defaulting to about 75% of the CPUs. +torture_param(int, nreaders, -1, "Number of readers, -1 for 75% of CPUs."); +// Number of runs. +torture_param(int, nruns, 30, "Number of experiments to run."); +// Reader delay in nanoseconds, 0 for no delay. +torture_param(int, readdelay, 0, "Read-side delay in nanoseconds."); + +#ifdef MODULE +# define REFSCALE_SHUTDOWN 0 +#else +# define REFSCALE_SHUTDOWN 1 +#endif + +torture_param(bool, shutdown, REFSCALE_SHUTDOWN, + "Shutdown at end of scalability tests."); + +struct reader_task { + struct task_struct *task; + int start_reader; + wait_queue_head_t wq; + u64 last_duration_ns; +}; + +static struct task_struct *shutdown_task; +static wait_queue_head_t shutdown_wq; + +static struct task_struct *main_task; +static wait_queue_head_t main_wq; +static int shutdown_start; + +static struct reader_task *reader_tasks; + +// Number of readers that are part of the current experiment. +static atomic_t nreaders_exp; + +// Use to wait for all threads to start. +static atomic_t n_init; +static atomic_t n_started; +static atomic_t n_warmedup; +static atomic_t n_cooleddown; + +// Track which experiment is currently running. +static int exp_idx; + +// Operations vector for selecting different types of tests. +struct ref_scale_ops { + void (*init)(void); + void (*cleanup)(void); + void (*readsection)(const int nloops); + void (*delaysection)(const int nloops, const int udl, const int ndl); + const char *name; +}; + +static struct ref_scale_ops *cur_ops; + +static void un_delay(const int udl, const int ndl) +{ + if (udl) + udelay(udl); + if (ndl) + ndelay(ndl); +} + +static void ref_rcu_read_section(const int nloops) +{ + int i; + + for (i = nloops; i >= 0; i--) { + rcu_read_lock(); + rcu_read_unlock(); + } +} + +static void ref_rcu_delay_section(const int nloops, const int udl, const int ndl) +{ + int i; + + for (i = nloops; i >= 0; i--) { + rcu_read_lock(); + un_delay(udl, ndl); + rcu_read_unlock(); + } +} + +static void rcu_sync_scale_init(void) +{ +} + +static struct ref_scale_ops rcu_ops = { + .init = rcu_sync_scale_init, + .readsection = ref_rcu_read_section, + .delaysection = ref_rcu_delay_section, + .name = "rcu" +}; + +// Definitions for SRCU ref scale testing. +DEFINE_STATIC_SRCU(srcu_refctl_scale); +static struct srcu_struct *srcu_ctlp = &srcu_refctl_scale; + +static void srcu_ref_scale_read_section(const int nloops) +{ + int i; + int idx; + + for (i = nloops; i >= 0; i--) { + idx = srcu_read_lock(srcu_ctlp); + srcu_read_unlock(srcu_ctlp, idx); + } +} + +static void srcu_ref_scale_delay_section(const int nloops, const int udl, const int ndl) +{ + int i; + int idx; + + for (i = nloops; i >= 0; i--) { + idx = srcu_read_lock(srcu_ctlp); + un_delay(udl, ndl); + srcu_read_unlock(srcu_ctlp, idx); + } +} + +static struct ref_scale_ops srcu_ops = { + .init = rcu_sync_scale_init, + .readsection = srcu_ref_scale_read_section, + .delaysection = srcu_ref_scale_delay_section, + .name = "srcu" +}; + +// Definitions for RCU Tasks ref scale testing: Empty read markers. +// These definitions also work for RCU Rude readers. +static void rcu_tasks_ref_scale_read_section(const int nloops) +{ + int i; + + for (i = nloops; i >= 0; i--) + continue; +} + +static void rcu_tasks_ref_scale_delay_section(const int nloops, const int udl, const int ndl) +{ + int i; + + for (i = nloops; i >= 0; i--) + un_delay(udl, ndl); +} + +static struct ref_scale_ops rcu_tasks_ops = { + .init = rcu_sync_scale_init, + .readsection = rcu_tasks_ref_scale_read_section, + .delaysection = rcu_tasks_ref_scale_delay_section, + .name = "rcu-tasks" +}; + +// Definitions for RCU Tasks Trace ref scale testing. +static void rcu_trace_ref_scale_read_section(const int nloops) +{ + int i; + + for (i = nloops; i >= 0; i--) { + rcu_read_lock_trace(); + rcu_read_unlock_trace(); + } +} + +static void rcu_trace_ref_scale_delay_section(const int nloops, const int udl, const int ndl) +{ + int i; + + for (i = nloops; i >= 0; i--) { + rcu_read_lock_trace(); + un_delay(udl, ndl); + rcu_read_unlock_trace(); + } +} + +static struct ref_scale_ops rcu_trace_ops = { + .init = rcu_sync_scale_init, + .readsection = rcu_trace_ref_scale_read_section, + .delaysection = rcu_trace_ref_scale_delay_section, + .name = "rcu-trace" +}; + +// Definitions for reference count +static atomic_t refcnt; + +static void ref_refcnt_section(const int nloops) +{ + int i; + + for (i = nloops; i >= 0; i--) { + atomic_inc(&refcnt); + atomic_dec(&refcnt); + } +} + +static void ref_refcnt_delay_section(const int nloops, const int udl, const int ndl) +{ + int i; + + for (i = nloops; i >= 0; i--) { + atomic_inc(&refcnt); + un_delay(udl, ndl); + atomic_dec(&refcnt); + } +} + +static struct ref_scale_ops refcnt_ops = { + .init = rcu_sync_scale_init, + .readsection = ref_refcnt_section, + .delaysection = ref_refcnt_delay_section, + .name = "refcnt" +}; + +// Definitions for rwlock +static rwlock_t test_rwlock; + +static void ref_rwlock_init(void) +{ + rwlock_init(&test_rwlock); +} + +static void ref_rwlock_section(const int nloops) +{ + int i; + + for (i = nloops; i >= 0; i--) { + read_lock(&test_rwlock); + read_unlock(&test_rwlock); + } +} + +static void ref_rwlock_delay_section(const int nloops, const int udl, const int ndl) +{ + int i; + + for (i = nloops; i >= 0; i--) { + read_lock(&test_rwlock); + un_delay(udl, ndl); + read_unlock(&test_rwlock); + } +} + +static struct ref_scale_ops rwlock_ops = { + .init = ref_rwlock_init, + .readsection = ref_rwlock_section, + .delaysection = ref_rwlock_delay_section, + .name = "rwlock" +}; + +// Definitions for rwsem +static struct rw_semaphore test_rwsem; + +static void ref_rwsem_init(void) +{ + init_rwsem(&test_rwsem); +} + +static void ref_rwsem_section(const int nloops) +{ + int i; + + for (i = nloops; i >= 0; i--) { + down_read(&test_rwsem); + up_read(&test_rwsem); + } +} + +static void ref_rwsem_delay_section(const int nloops, const int udl, const int ndl) +{ + int i; + + for (i = nloops; i >= 0; i--) { + down_read(&test_rwsem); + un_delay(udl, ndl); + up_read(&test_rwsem); + } +} + +static struct ref_scale_ops rwsem_ops = { + .init = ref_rwsem_init, + .readsection = ref_rwsem_section, + .delaysection = ref_rwsem_delay_section, + .name = "rwsem" +}; + +static void rcu_scale_one_reader(void) +{ + if (readdelay <= 0) + cur_ops->readsection(loops); + else + cur_ops->delaysection(loops, readdelay / 1000, readdelay % 1000); +} + +// Reader kthread. Repeatedly does empty RCU read-side +// critical section, minimizing update-side interference. +static int +ref_scale_reader(void *arg) +{ + unsigned long flags; + long me = (long)arg; + struct reader_task *rt = &(reader_tasks[me]); + u64 start; + s64 duration; + + VERBOSE_SCALEOUT("ref_scale_reader %ld: task started", me); + set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids)); + set_user_nice(current, MAX_NICE); + atomic_inc(&n_init); + if (holdoff) + schedule_timeout_interruptible(holdoff * HZ); +repeat: + VERBOSE_SCALEOUT("ref_scale_reader %ld: waiting to start next experiment on cpu %d", me, smp_processor_id()); + + // Wait for signal that this reader can start. + wait_event(rt->wq, (atomic_read(&nreaders_exp) && smp_load_acquire(&rt->start_reader)) || + torture_must_stop()); + + if (torture_must_stop()) + goto end; + + // Make sure that the CPU is affinitized appropriately during testing. + WARN_ON_ONCE(smp_processor_id() != me); + + WRITE_ONCE(rt->start_reader, 0); + if (!atomic_dec_return(&n_started)) + while (atomic_read_acquire(&n_started)) + cpu_relax(); + + VERBOSE_SCALEOUT("ref_scale_reader %ld: experiment %d started", me, exp_idx); + + + // To reduce noise, do an initial cache-warming invocation, check + // in, and then keep warming until everyone has checked in. + rcu_scale_one_reader(); + if (!atomic_dec_return(&n_warmedup)) + while (atomic_read_acquire(&n_warmedup)) + rcu_scale_one_reader(); + // Also keep interrupts disabled. This also has the effect + // of preventing entries into slow path for rcu_read_unlock(). + local_irq_save(flags); + start = ktime_get_mono_fast_ns(); + + rcu_scale_one_reader(); + + duration = ktime_get_mono_fast_ns() - start; + local_irq_restore(flags); + + rt->last_duration_ns = WARN_ON_ONCE(duration < 0) ? 0 : duration; + // To reduce runtime-skew noise, do maintain-load invocations until + // everyone is done. + if (!atomic_dec_return(&n_cooleddown)) + while (atomic_read_acquire(&n_cooleddown)) + rcu_scale_one_reader(); + + if (atomic_dec_and_test(&nreaders_exp)) + wake_up(&main_wq); + + VERBOSE_SCALEOUT("ref_scale_reader %ld: experiment %d ended, (readers remaining=%d)", + me, exp_idx, atomic_read(&nreaders_exp)); + + if (!torture_must_stop()) + goto repeat; +end: + torture_kthread_stopping("ref_scale_reader"); + return 0; +} + +static void reset_readers(void) +{ + int i; + struct reader_task *rt; + + for (i = 0; i < nreaders; i++) { + rt = &(reader_tasks[i]); + + rt->last_duration_ns = 0; + } +} + +// Print the results of each reader and return the sum of all their durations. +static u64 process_durations(int n) +{ + int i; + struct reader_task *rt; + char buf1[64]; + char *buf; + u64 sum = 0; + + buf = kmalloc(128 + nreaders * 32, GFP_KERNEL); + if (!buf) + return 0; + buf[0] = 0; + sprintf(buf, "Experiment #%d (Format: :)", + exp_idx); + + for (i = 0; i < n && !torture_must_stop(); i++) { + rt = &(reader_tasks[i]); + sprintf(buf1, "%d: %llu\t", i, rt->last_duration_ns); + + if (i % 5 == 0) + strcat(buf, "\n"); + strcat(buf, buf1); + + sum += rt->last_duration_ns; + } + strcat(buf, "\n"); + + SCALEOUT("%s\n", buf); + + kfree(buf); + return sum; +} + +// The main_func is the main orchestrator, it performs a bunch of +// experiments. For every experiment, it orders all the readers +// involved to start and waits for them to finish the experiment. It +// then reads their timestamps and starts the next experiment. Each +// experiment progresses from 1 concurrent reader to N of them at which +// point all the timestamps are printed. +static int main_func(void *arg) +{ + bool errexit = false; + int exp, r; + char buf1[64]; + char *buf; + u64 *result_avg; + + set_cpus_allowed_ptr(current, cpumask_of(nreaders % nr_cpu_ids)); + set_user_nice(current, MAX_NICE); + + VERBOSE_SCALEOUT("main_func task started"); + result_avg = kzalloc(nruns * sizeof(*result_avg), GFP_KERNEL); + buf = kzalloc(64 + nruns * 32, GFP_KERNEL); + if (!result_avg || !buf) { + VERBOSE_SCALEOUT_ERRSTRING("out of memory"); + errexit = true; + } + if (holdoff) + schedule_timeout_interruptible(holdoff * HZ); + + // Wait for all threads to start. + atomic_inc(&n_init); + while (atomic_read(&n_init) < nreaders + 1) + schedule_timeout_uninterruptible(1); + + // Start exp readers up per experiment + for (exp = 0; exp < nruns && !torture_must_stop(); exp++) { + if (errexit) + break; + if (torture_must_stop()) + goto end; + + reset_readers(); + atomic_set(&nreaders_exp, nreaders); + atomic_set(&n_started, nreaders); + atomic_set(&n_warmedup, nreaders); + atomic_set(&n_cooleddown, nreaders); + + exp_idx = exp; + + for (r = 0; r < nreaders; r++) { + smp_store_release(&reader_tasks[r].start_reader, 1); + wake_up(&reader_tasks[r].wq); + } + + VERBOSE_SCALEOUT("main_func: experiment started, waiting for %d readers", + nreaders); + + wait_event(main_wq, + !atomic_read(&nreaders_exp) || torture_must_stop()); + + VERBOSE_SCALEOUT("main_func: experiment ended"); + + if (torture_must_stop()) + goto end; + + result_avg[exp] = div_u64(1000 * process_durations(nreaders), nreaders * loops); + } + + // Print the average of all experiments + SCALEOUT("END OF TEST. Calculating average duration per loop (nanoseconds)...\n"); + + buf[0] = 0; + strcat(buf, "\n"); + strcat(buf, "Runs\tTime(ns)\n"); + + for (exp = 0; exp < nruns; exp++) { + u64 avg; + u32 rem; + + if (errexit) + break; + avg = div_u64_rem(result_avg[exp], 1000, &rem); + sprintf(buf1, "%d\t%llu.%03u\n", exp + 1, avg, rem); + strcat(buf, buf1); + } + + if (!errexit) + SCALEOUT("%s", buf); + + // This will shutdown everything including us. + if (shutdown) { + shutdown_start = 1; + wake_up(&shutdown_wq); + } + + // Wait for torture to stop us + while (!torture_must_stop()) + schedule_timeout_uninterruptible(1); + +end: + torture_kthread_stopping("main_func"); + kfree(result_avg); + kfree(buf); + return 0; +} + +static void +ref_scale_print_module_parms(struct ref_scale_ops *cur_ops, const char *tag) +{ + pr_alert("%s" SCALE_FLAG + "--- %s: verbose=%d shutdown=%d holdoff=%d loops=%ld nreaders=%d nruns=%d readdelay=%d\n", scale_type, tag, + verbose, shutdown, holdoff, loops, nreaders, nruns, readdelay); +} + +static void +ref_scale_cleanup(void) +{ + int i; + + if (torture_cleanup_begin()) + return; + + if (!cur_ops) { + torture_cleanup_end(); + return; + } + + if (reader_tasks) { + for (i = 0; i < nreaders; i++) + torture_stop_kthread("ref_scale_reader", + reader_tasks[i].task); + } + kfree(reader_tasks); + + torture_stop_kthread("main_task", main_task); + kfree(main_task); + + // Do scale-type-specific cleanup operations. + if (cur_ops->cleanup != NULL) + cur_ops->cleanup(); + + torture_cleanup_end(); +} + +// Shutdown kthread. Just waits to be awakened, then shuts down system. +static int +ref_scale_shutdown(void *arg) +{ + wait_event(shutdown_wq, shutdown_start); + + smp_mb(); // Wake before output. + ref_scale_cleanup(); + kernel_power_off(); + + return -EINVAL; +} + +static int __init +ref_scale_init(void) +{ + long i; + int firsterr = 0; + static struct ref_scale_ops *scale_ops[] = { + &rcu_ops, &srcu_ops, &rcu_trace_ops, &rcu_tasks_ops, + &refcnt_ops, &rwlock_ops, &rwsem_ops, + }; + + if (!torture_init_begin(scale_type, verbose)) + return -EBUSY; + + for (i = 0; i < ARRAY_SIZE(scale_ops); i++) { + cur_ops = scale_ops[i]; + if (strcmp(scale_type, cur_ops->name) == 0) + break; + } + if (i == ARRAY_SIZE(scale_ops)) { + pr_alert("rcu-scale: invalid scale type: \"%s\"\n", scale_type); + pr_alert("rcu-scale types:"); + for (i = 0; i < ARRAY_SIZE(scale_ops); i++) + pr_cont(" %s", scale_ops[i]->name); + pr_cont("\n"); + WARN_ON(!IS_MODULE(CONFIG_RCU_REF_SCALE_TEST)); + firsterr = -EINVAL; + cur_ops = NULL; + goto unwind; + } + if (cur_ops->init) + cur_ops->init(); + + ref_scale_print_module_parms(cur_ops, "Start of test"); + + // Shutdown task + if (shutdown) { + init_waitqueue_head(&shutdown_wq); + firsterr = torture_create_kthread(ref_scale_shutdown, NULL, + shutdown_task); + if (firsterr) + goto unwind; + schedule_timeout_uninterruptible(1); + } + + // Reader tasks (default to ~75% of online CPUs). + if (nreaders < 0) + nreaders = (num_online_cpus() >> 1) + (num_online_cpus() >> 2); + reader_tasks = kcalloc(nreaders, sizeof(reader_tasks[0]), + GFP_KERNEL); + if (!reader_tasks) { + VERBOSE_SCALEOUT_ERRSTRING("out of memory"); + firsterr = -ENOMEM; + goto unwind; + } + + VERBOSE_SCALEOUT("Starting %d reader threads\n", nreaders); + + for (i = 0; i < nreaders; i++) { + firsterr = torture_create_kthread(ref_scale_reader, (void *)i, + reader_tasks[i].task); + if (firsterr) + goto unwind; + + init_waitqueue_head(&(reader_tasks[i].wq)); + } + + // Main Task + init_waitqueue_head(&main_wq); + firsterr = torture_create_kthread(main_func, NULL, main_task); + if (firsterr) + goto unwind; + + torture_init_end(); + return 0; + +unwind: + torture_init_end(); + ref_scale_cleanup(); + return firsterr; +} + +module_init(ref_scale_init); +module_exit(ref_scale_cleanup); diff --git a/tools/testing/selftests/rcutorture/configs/refperf/ver_functions.sh b/tools/testing/selftests/rcutorture/configs/refperf/ver_functions.sh index 489f05dd929a..321e82641287 100644 --- a/tools/testing/selftests/rcutorture/configs/refperf/ver_functions.sh +++ b/tools/testing/selftests/rcutorture/configs/refperf/ver_functions.sh @@ -11,6 +11,6 @@ # # Adds per-version torture-module parameters to kernels supporting them. per_version_boot_params () { - echo $1 refperf.shutdown=1 \ - refperf.verbose=1 + echo $1 refscale.shutdown=1 \ + refscale.verbose=1 } -- cgit v1.2.3 From f71d8311ec278525508dac211de700b2b682a15f Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 17 Jun 2020 12:06:47 -0700 Subject: refscale: Change --torture type from refperf to refscale This commit renames the rcutorture config/refperf to config/refscale to further avoid conflation with the Linux kernel's perf feature. Reported-by: Ingo Molnar Signed-off-by: Paul E. McKenney --- .../rcutorture/bin/kvm-recheck-refperf.sh | 71 ---------------------- .../rcutorture/bin/kvm-recheck-refscale.sh | 71 ++++++++++++++++++++++ tools/testing/selftests/rcutorture/bin/kvm.sh | 8 +-- .../selftests/rcutorture/bin/parse-console.sh | 4 +- .../selftests/rcutorture/configs/refperf/CFLIST | 2 - .../selftests/rcutorture/configs/refperf/CFcommon | 2 - .../selftests/rcutorture/configs/refperf/NOPREEMPT | 18 ------ .../selftests/rcutorture/configs/refperf/PREEMPT | 18 ------ .../rcutorture/configs/refperf/ver_functions.sh | 16 ----- .../selftests/rcutorture/configs/refscale/CFLIST | 2 + .../selftests/rcutorture/configs/refscale/CFcommon | 2 + .../rcutorture/configs/refscale/NOPREEMPT | 18 ++++++ .../selftests/rcutorture/configs/refscale/PREEMPT | 18 ++++++ .../rcutorture/configs/refscale/ver_functions.sh | 16 +++++ 14 files changed, 133 insertions(+), 133 deletions(-) delete mode 100755 tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh create mode 100755 tools/testing/selftests/rcutorture/bin/kvm-recheck-refscale.sh delete mode 100644 tools/testing/selftests/rcutorture/configs/refperf/CFLIST delete mode 100644 tools/testing/selftests/rcutorture/configs/refperf/CFcommon delete mode 100644 tools/testing/selftests/rcutorture/configs/refperf/NOPREEMPT delete mode 100644 tools/testing/selftests/rcutorture/configs/refperf/PREEMPT delete mode 100644 tools/testing/selftests/rcutorture/configs/refperf/ver_functions.sh create mode 100644 tools/testing/selftests/rcutorture/configs/refscale/CFLIST create mode 100644 tools/testing/selftests/rcutorture/configs/refscale/CFcommon create mode 100644 tools/testing/selftests/rcutorture/configs/refscale/NOPREEMPT create mode 100644 tools/testing/selftests/rcutorture/configs/refscale/PREEMPT create mode 100644 tools/testing/selftests/rcutorture/configs/refscale/ver_functions.sh diff --git a/tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh b/tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh deleted file mode 100755 index 0e29cfd9986c..000000000000 --- a/tools/testing/selftests/rcutorture/bin/kvm-recheck-refperf.sh +++ /dev/null @@ -1,71 +0,0 @@ -#!/bin/bash -# SPDX-License-Identifier: GPL-2.0+ -# -# Analyze a given results directory for refperf performance measurements. -# -# Usage: kvm-recheck-refperf.sh resdir -# -# Copyright (C) IBM Corporation, 2016 -# -# Authors: Paul E. McKenney - -i="$1" -if test -d "$i" -a -r "$i" -then - : -else - echo Unreadable results directory: $i - exit 1 -fi -PATH=`pwd`/tools/testing/selftests/rcutorture/bin:$PATH; export PATH -. functions.sh - -configfile=`echo $i | sed -e 's/^.*\///'` - -sed -e 's/^\[[^]]*]//' < $i/console.log | tr -d '\015' | -awk -v configfile="$configfile" ' -/^[ ]*Runs Time\(ns\) *$/ { - if (dataphase + 0 == 0) { - dataphase = 1; - # print configfile, $0; - } - next; -} - -/[^ ]*[0-9][0-9]* [0-9][0-9]*\.[0-9][0-9]*$/ { - if (dataphase == 1) { - # print $0; - readertimes[++n] = $2; - sum += $2; - } - next; -} - -{ - if (dataphase == 1) - dataphase == 2; - next; -} - -END { - print configfile " results:"; - newNR = asort(readertimes); - if (newNR <= 0) { - print "No refperf records found???" - exit; - } - medianidx = int(newNR / 2); - if (newNR == medianidx * 2) - medianvalue = (readertimes[medianidx - 1] + readertimes[medianidx]) / 2; - else - medianvalue = readertimes[medianidx]; - points = "Points:"; - for (i = 1; i <= newNR; i++) - points = points " " readertimes[i]; - print points; - print "Average reader duration: " sum / newNR " nanoseconds"; - print "Minimum reader duration: " readertimes[1]; - print "Median reader duration: " medianvalue; - print "Maximum reader duration: " readertimes[newNR]; - print "Computed from refperf printk output."; -}' diff --git a/tools/testing/selftests/rcutorture/bin/kvm-recheck-refscale.sh b/tools/testing/selftests/rcutorture/bin/kvm-recheck-refscale.sh new file mode 100755 index 000000000000..35a463dddffe --- /dev/null +++ b/tools/testing/selftests/rcutorture/bin/kvm-recheck-refscale.sh @@ -0,0 +1,71 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0+ +# +# Analyze a given results directory for refscale performance measurements. +# +# Usage: kvm-recheck-refscale.sh resdir +# +# Copyright (C) IBM Corporation, 2016 +# +# Authors: Paul E. McKenney + +i="$1" +if test -d "$i" -a -r "$i" +then + : +else + echo Unreadable results directory: $i + exit 1 +fi +PATH=`pwd`/tools/testing/selftests/rcutorture/bin:$PATH; export PATH +. functions.sh + +configfile=`echo $i | sed -e 's/^.*\///'` + +sed -e 's/^\[[^]]*]//' < $i/console.log | tr -d '\015' | +awk -v configfile="$configfile" ' +/^[ ]*Runs Time\(ns\) *$/ { + if (dataphase + 0 == 0) { + dataphase = 1; + # print configfile, $0; + } + next; +} + +/[^ ]*[0-9][0-9]* [0-9][0-9]*\.[0-9][0-9]*$/ { + if (dataphase == 1) { + # print $0; + readertimes[++n] = $2; + sum += $2; + } + next; +} + +{ + if (dataphase == 1) + dataphase == 2; + next; +} + +END { + print configfile " results:"; + newNR = asort(readertimes); + if (newNR <= 0) { + print "No refscale records found???" + exit; + } + medianidx = int(newNR / 2); + if (newNR == medianidx * 2) + medianvalue = (readertimes[medianidx - 1] + readertimes[medianidx]) / 2; + else + medianvalue = readertimes[medianidx]; + points = "Points:"; + for (i = 1; i <= newNR; i++) + points = points " " readertimes[i]; + print points; + print "Average reader duration: " sum / newNR " nanoseconds"; + print "Minimum reader duration: " readertimes[1]; + print "Median reader duration: " medianvalue; + print "Maximum reader duration: " readertimes[newNR]; + print "Computed from refscale printk output."; +}' diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh index 48b6a7248f50..ce05db324057 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm.sh @@ -180,14 +180,14 @@ do shift ;; --torture) - checkarg --torture "(suite name)" "$#" "$2" '^\(lock\|rcu\|rcuperf\|refperf\)$' '^--' + checkarg --torture "(suite name)" "$#" "$2" '^\(lock\|rcu\|rcuperf\|refscale\)$' '^--' TORTURE_SUITE=$2 shift - if test "$TORTURE_SUITE" = rcuperf || test "$TORTURE_SUITE" = refperf + if test "$TORTURE_SUITE" = rcuperf || test "$TORTURE_SUITE" = refscale then - # If you really want jitter for refperf or + # If you really want jitter for refscale or # rcuperf, specify it after specifying the rcuperf - # or the refperf. (But why jitter in these cases?) + # or the refscale. (But why jitter in these cases?) jitter=0 fi ;; diff --git a/tools/testing/selftests/rcutorture/bin/parse-console.sh b/tools/testing/selftests/rcutorture/bin/parse-console.sh index 85af11d2d0cb..8cb908fb852b 100755 --- a/tools/testing/selftests/rcutorture/bin/parse-console.sh +++ b/tools/testing/selftests/rcutorture/bin/parse-console.sh @@ -33,8 +33,8 @@ then fi cat /dev/null > $file.diags -# Check for proper termination, except for rcuperf and refperf. -if test "$TORTURE_SUITE" != rcuperf && test "$TORTURE_SUITE" != refperf +# Check for proper termination, except for rcuperf and refscale. +if test "$TORTURE_SUITE" != rcuperf && test "$TORTURE_SUITE" != refscale then # check for abject failure diff --git a/tools/testing/selftests/rcutorture/configs/refperf/CFLIST b/tools/testing/selftests/rcutorture/configs/refperf/CFLIST deleted file mode 100644 index 4d62eb4a39f9..000000000000 --- a/tools/testing/selftests/rcutorture/configs/refperf/CFLIST +++ /dev/null @@ -1,2 +0,0 @@ -NOPREEMPT -PREEMPT diff --git a/tools/testing/selftests/rcutorture/configs/refperf/CFcommon b/tools/testing/selftests/rcutorture/configs/refperf/CFcommon deleted file mode 100644 index a98b58b54bb1..000000000000 --- a/tools/testing/selftests/rcutorture/configs/refperf/CFcommon +++ /dev/null @@ -1,2 +0,0 @@ -CONFIG_RCU_REF_SCALE_TEST=y -CONFIG_PRINTK_TIME=y diff --git a/tools/testing/selftests/rcutorture/configs/refperf/NOPREEMPT b/tools/testing/selftests/rcutorture/configs/refperf/NOPREEMPT deleted file mode 100644 index 1cd25b7314e3..000000000000 --- a/tools/testing/selftests/rcutorture/configs/refperf/NOPREEMPT +++ /dev/null @@ -1,18 +0,0 @@ -CONFIG_SMP=y -CONFIG_PREEMPT_NONE=y -CONFIG_PREEMPT_VOLUNTARY=n -CONFIG_PREEMPT=n -#CHECK#CONFIG_PREEMPT_RCU=n -CONFIG_HZ_PERIODIC=n -CONFIG_NO_HZ_IDLE=y -CONFIG_NO_HZ_FULL=n -CONFIG_RCU_FAST_NO_HZ=n -CONFIG_HOTPLUG_CPU=n -CONFIG_SUSPEND=n -CONFIG_HIBERNATION=n -CONFIG_RCU_NOCB_CPU=n -CONFIG_DEBUG_LOCK_ALLOC=n -CONFIG_PROVE_LOCKING=n -CONFIG_RCU_BOOST=n -CONFIG_DEBUG_OBJECTS_RCU_HEAD=n -CONFIG_RCU_EXPERT=y diff --git a/tools/testing/selftests/rcutorture/configs/refperf/PREEMPT b/tools/testing/selftests/rcutorture/configs/refperf/PREEMPT deleted file mode 100644 index d10bc694f42c..000000000000 --- a/tools/testing/selftests/rcutorture/configs/refperf/PREEMPT +++ /dev/null @@ -1,18 +0,0 @@ -CONFIG_SMP=y -CONFIG_PREEMPT_NONE=n -CONFIG_PREEMPT_VOLUNTARY=n -CONFIG_PREEMPT=y -#CHECK#CONFIG_PREEMPT_RCU=y -CONFIG_HZ_PERIODIC=n -CONFIG_NO_HZ_IDLE=y -CONFIG_NO_HZ_FULL=n -CONFIG_RCU_FAST_NO_HZ=n -CONFIG_HOTPLUG_CPU=n -CONFIG_SUSPEND=n -CONFIG_HIBERNATION=n -CONFIG_RCU_NOCB_CPU=n -CONFIG_DEBUG_LOCK_ALLOC=n -CONFIG_PROVE_LOCKING=n -CONFIG_RCU_BOOST=n -CONFIG_DEBUG_OBJECTS_RCU_HEAD=n -CONFIG_RCU_EXPERT=y diff --git a/tools/testing/selftests/rcutorture/configs/refperf/ver_functions.sh b/tools/testing/selftests/rcutorture/configs/refperf/ver_functions.sh deleted file mode 100644 index 321e82641287..000000000000 --- a/tools/testing/selftests/rcutorture/configs/refperf/ver_functions.sh +++ /dev/null @@ -1,16 +0,0 @@ -#!/bin/bash -# SPDX-License-Identifier: GPL-2.0+ -# -# Torture-suite-dependent shell functions for the rest of the scripts. -# -# Copyright (C) IBM Corporation, 2015 -# -# Authors: Paul E. McKenney - -# per_version_boot_params bootparam-string config-file seconds -# -# Adds per-version torture-module parameters to kernels supporting them. -per_version_boot_params () { - echo $1 refscale.shutdown=1 \ - refscale.verbose=1 -} diff --git a/tools/testing/selftests/rcutorture/configs/refscale/CFLIST b/tools/testing/selftests/rcutorture/configs/refscale/CFLIST new file mode 100644 index 000000000000..4d62eb4a39f9 --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/refscale/CFLIST @@ -0,0 +1,2 @@ +NOPREEMPT +PREEMPT diff --git a/tools/testing/selftests/rcutorture/configs/refscale/CFcommon b/tools/testing/selftests/rcutorture/configs/refscale/CFcommon new file mode 100644 index 000000000000..a98b58b54bb1 --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/refscale/CFcommon @@ -0,0 +1,2 @@ +CONFIG_RCU_REF_SCALE_TEST=y +CONFIG_PRINTK_TIME=y diff --git a/tools/testing/selftests/rcutorture/configs/refscale/NOPREEMPT b/tools/testing/selftests/rcutorture/configs/refscale/NOPREEMPT new file mode 100644 index 000000000000..1cd25b7314e3 --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/refscale/NOPREEMPT @@ -0,0 +1,18 @@ +CONFIG_SMP=y +CONFIG_PREEMPT_NONE=y +CONFIG_PREEMPT_VOLUNTARY=n +CONFIG_PREEMPT=n +#CHECK#CONFIG_PREEMPT_RCU=n +CONFIG_HZ_PERIODIC=n +CONFIG_NO_HZ_IDLE=y +CONFIG_NO_HZ_FULL=n +CONFIG_RCU_FAST_NO_HZ=n +CONFIG_HOTPLUG_CPU=n +CONFIG_SUSPEND=n +CONFIG_HIBERNATION=n +CONFIG_RCU_NOCB_CPU=n +CONFIG_DEBUG_LOCK_ALLOC=n +CONFIG_PROVE_LOCKING=n +CONFIG_RCU_BOOST=n +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n +CONFIG_RCU_EXPERT=y diff --git a/tools/testing/selftests/rcutorture/configs/refscale/PREEMPT b/tools/testing/selftests/rcutorture/configs/refscale/PREEMPT new file mode 100644 index 000000000000..d10bc694f42c --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/refscale/PREEMPT @@ -0,0 +1,18 @@ +CONFIG_SMP=y +CONFIG_PREEMPT_NONE=n +CONFIG_PREEMPT_VOLUNTARY=n +CONFIG_PREEMPT=y +#CHECK#CONFIG_PREEMPT_RCU=y +CONFIG_HZ_PERIODIC=n +CONFIG_NO_HZ_IDLE=y +CONFIG_NO_HZ_FULL=n +CONFIG_RCU_FAST_NO_HZ=n +CONFIG_HOTPLUG_CPU=n +CONFIG_SUSPEND=n +CONFIG_HIBERNATION=n +CONFIG_RCU_NOCB_CPU=n +CONFIG_DEBUG_LOCK_ALLOC=n +CONFIG_PROVE_LOCKING=n +CONFIG_RCU_BOOST=n +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n +CONFIG_RCU_EXPERT=y diff --git a/tools/testing/selftests/rcutorture/configs/refscale/ver_functions.sh b/tools/testing/selftests/rcutorture/configs/refscale/ver_functions.sh new file mode 100644 index 000000000000..321e82641287 --- /dev/null +++ b/tools/testing/selftests/rcutorture/configs/refscale/ver_functions.sh @@ -0,0 +1,16 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0+ +# +# Torture-suite-dependent shell functions for the rest of the scripts. +# +# Copyright (C) IBM Corporation, 2015 +# +# Authors: Paul E. McKenney + +# per_version_boot_params bootparam-string config-file seconds +# +# Adds per-version torture-module parameters to kernels supporting them. +per_version_boot_params () { + echo $1 refscale.shutdown=1 \ + refscale.verbose=1 +} -- cgit v1.2.3 From 7fef6cff8f2814bf8eb632e2bb8f0a987ffd9ece Mon Sep 17 00:00:00 2001 From: Ethon Paul Date: Sat, 18 Apr 2020 19:46:47 +0800 Subject: srcu: Fix a typo in comment "amoritized"->"amortized" This commit fixes a typo in a comment. Signed-off-by: Ethon Paul Signed-off-by: Paul E. McKenney --- kernel/rcu/srcutree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 6d3ef700fb0e..8ff71e5d0fe8 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -766,7 +766,7 @@ static void srcu_flip(struct srcu_struct *ssp) * it, if this function was preempted for enough time for the counters * to wrap, it really doesn't matter whether or not we expedite the grace * period. The extra overhead of a needlessly expedited grace period is - * negligible when amoritized over that time period, and the extra latency + * negligible when amortized over that time period, and the extra latency * of a needlessly non-expedited grace period is similarly negligible. */ static bool srcu_might_be_idle(struct srcu_struct *ssp) -- cgit v1.2.3 From bde50d8ff83e4ce9e576f7c5ba1edb48a3610a5b Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Tue, 26 May 2020 15:41:34 +0200 Subject: srcu: Avoid local_irq_save() before acquiring spinlock_t SRCU disables interrupts to get a stable per-CPU pointer and then acquires the spinlock which is in the per-CPU data structure. The release uses spin_unlock_irqrestore(). While this is correct on a non-RT kernel, this conflicts with the RT semantics because the spinlock is converted to a 'sleeping' spinlock. Sleeping locks can obviously not be acquired with interrupts disabled. Acquire the per-CPU pointer `ssp->sda' without disabling preemption and then acquire the spinlock_t of the per-CPU data structure. The lock will ensure that the data is consistent. The added call to check_init_srcu_struct() is now needed because a statically defined srcu_struct may remain uninitialized until this point and the newly introduced locking operation requires an initialized spinlock_t. This change was tested for four hours with 8*SRCU-N and 8*SRCU-P without causing any warnings. Cc: Lai Jiangshan Cc: "Paul E. McKenney" Cc: Josh Triplett Cc: Steven Rostedt Cc: Mathieu Desnoyers Cc: rcu@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Paul E. McKenney --- kernel/rcu/srcutree.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 8ff71e5d0fe8..c100acf332ed 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -777,14 +777,15 @@ static bool srcu_might_be_idle(struct srcu_struct *ssp) unsigned long t; unsigned long tlast; + check_init_srcu_struct(ssp); /* If the local srcu_data structure has callbacks, not idle. */ - local_irq_save(flags); - sdp = this_cpu_ptr(ssp->sda); + sdp = raw_cpu_ptr(ssp->sda); + spin_lock_irqsave_rcu_node(sdp, flags); if (rcu_segcblist_pend_cbs(&sdp->srcu_cblist)) { - local_irq_restore(flags); + spin_unlock_irqrestore_rcu_node(sdp, flags); return false; /* Callbacks already present, so not idle. */ } - local_irq_restore(flags); + spin_unlock_irqrestore_rcu_node(sdp, flags); /* * No local callbacks, so probabalistically probe global state. @@ -864,9 +865,8 @@ static void __call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp, } rhp->func = func; idx = srcu_read_lock(ssp); - local_irq_save(flags); - sdp = this_cpu_ptr(ssp->sda); - spin_lock_rcu_node(sdp); + sdp = raw_cpu_ptr(ssp->sda); + spin_lock_irqsave_rcu_node(sdp, flags); rcu_segcblist_enqueue(&sdp->srcu_cblist, rhp); rcu_segcblist_advance(&sdp->srcu_cblist, rcu_seq_current(&ssp->srcu_gp_seq)); -- cgit v1.2.3 From 88513ae533756d10358e406743c21e8cf61fb72a Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 28 Apr 2020 14:41:48 -0700 Subject: torture: Remove qemu dependency on EFI firmware On some (probably misconfigured) systems, the torture-test scripting will cause qemu to complain about missing EFI firmware, often because qemu is trying to traverse broken symbolic links to find that firmware. Which is a bit silly given that the default torture-test guest OS has but a single binary for its userspace, and thus is unlikely to do much in the way of networking in any case. This commit therefore avoids such problems by specifying "-net none" to qemu unless the TORTURE_QEMU_INTERACTIVE environment variable is set (for example, by having specified "--interactive" to kvm.sh), in which case "-net nic -net user" is specified to qemu instead. Either choice may be overridden by specifying the "-net" argument of your choice to the kvm.sh "--qemu-args" parameter. Link: https://lore.kernel.org/lkml/20190701141403.GA246562@google.com Reported-by: Joel Fernandes Signed-off-by: Paul E. McKenney Cc: Sebastian Andrzej Siewior --- tools/testing/selftests/rcutorture/bin/functions.sh | 21 ++++++++++++++++++--- .../selftests/rcutorture/bin/kvm-test-1-run.sh | 1 + 2 files changed, 19 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/functions.sh b/tools/testing/selftests/rcutorture/bin/functions.sh index 12810229fddc..436b1542cf27 100644 --- a/tools/testing/selftests/rcutorture/bin/functions.sh +++ b/tools/testing/selftests/rcutorture/bin/functions.sh @@ -215,9 +215,6 @@ identify_qemu_args () { then echo -device spapr-vlan,netdev=net0,mac=$TORTURE_QEMU_MAC echo -netdev bridge,br=br0,id=net0 - elif test -n "$TORTURE_QEMU_INTERACTIVE" - then - echo -net nic -net user fi ;; esac @@ -275,3 +272,21 @@ specify_qemu_cpus () { esac fi } + +# specify_qemu_net qemu-args +# +# Appends a string containing "-net none" to qemu-args, unless the incoming +# qemu-args already contains "-smp" or unless the TORTURE_QEMU_INTERACTIVE +# environment variable is set, in which case the string that is be added is +# instead "-net nic -net user". +specify_qemu_net () { + if echo $1 | grep -q -e -net + then + echo $1 + elif test -n "$TORTURE_QEMU_INTERACTIVE" + then + echo $1 -net nic -net user + else + echo $1 -net none + fi +} diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh index 6ff611c630d1..1b9aebd54cc9 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh @@ -141,6 +141,7 @@ then cpu_count=$TORTURE_ALLOTED_CPUS fi qemu_args="`specify_qemu_cpus "$QEMU" "$qemu_args" "$cpu_count"`" +qemu_args="`specify_qemu_net "$qemu_args"`" # Generate architecture-specific and interaction-specific qemu arguments qemu_args="$qemu_args `identify_qemu_args "$QEMU" "$resdir/console.log"`" -- cgit v1.2.3 From 6582e7f184e49a754ee09c996a886b89113d7354 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 4 May 2020 15:55:47 -0700 Subject: torture: Add script to smoke-test commits in a branch This commit adds a kvm-check-branches.sh script that takes a list of commits and commit ranges and runs a short rcutorture test on all scenarios on each specified commit. A summary is printed at the end, and the script returns success if all rcutorture runs completed without error. Signed-off-by: Paul E. McKenney --- .../selftests/rcutorture/bin/kvm-check-branches.sh | 108 +++++++++++++++++++++ 1 file changed, 108 insertions(+) create mode 100755 tools/testing/selftests/rcutorture/bin/kvm-check-branches.sh diff --git a/tools/testing/selftests/rcutorture/bin/kvm-check-branches.sh b/tools/testing/selftests/rcutorture/bin/kvm-check-branches.sh new file mode 100755 index 000000000000..6e65c134e5f1 --- /dev/null +++ b/tools/testing/selftests/rcutorture/bin/kvm-check-branches.sh @@ -0,0 +1,108 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0+ +# +# Run a group of kvm.sh tests on the specified commits. This currently +# unconditionally does three-minute runs on each scenario in CFLIST, +# taking advantage of all available CPUs and trusting the "make" utility. +# In the short term, adjustments can be made by editing this script and +# CFLIST. If some adjustments appear to have ongoing value, this script +# might grow some command-line arguments. +# +# Usage: kvm-check-branches.sh commit1 commit2..commit3 commit4 ... +# +# This script considers its arguments one at a time. If more elaborate +# specification of commits is needed, please use "git rev-list" to +# produce something that this simple script can understand. The reason +# for retaining the simplicity is that it allows the user to more easily +# see which commit came from which branch. +# +# This script creates a yyyy.mm.dd-hh.mm.ss-group entry in the "res" +# directory. The calls to kvm.sh create the usual entries, but this script +# moves them under the yyyy.mm.dd-hh.mm.ss-group entry, each in its own +# directory numbered in run order, that is, "0001", "0002", and so on. +# For successful runs, the large build artifacts are removed. Doing this +# reduces the disk space required by about two orders of magnitude for +# successful runs. +# +# Copyright (C) Facebook, 2020 +# +# Authors: Paul E. McKenney + +if ! git status > /dev/null 2>&1 +then + echo '!!!' This script needs to run in a git archive. 1>&2 + echo '!!!' Giving up. 1>&2 + exit 1 +fi + +# Remember where we started so that we can get back and the end. +curcommit="`git status | head -1 | awk '{ print $NF }'`" + +nfail=0 +ntry=0 +resdir="tools/testing/selftests/rcutorture/res" +ds="`date +%Y.%m.%d-%H.%M.%S`-group" +if ! test -e $resdir +then + mkdir $resdir || : +fi +mkdir $resdir/$ds +echo Results directory: $resdir/$ds + +KVM="`pwd`/tools/testing/selftests/rcutorture"; export KVM +PATH=${KVM}/bin:$PATH; export PATH +. functions.sh +cpus="`identify_qemu_vcpus`" +echo Using up to $cpus CPUs. + +# Each pass through this loop does one command-line argument. +for gitbr in $@ +do + echo ' --- git branch ' $gitbr + + # Each pass through this loop tests one commit. + for i in `git rev-list "$gitbr"` + do + ntry=`expr $ntry + 1` + idir=`awk -v ntry="$ntry" 'END { printf "%04d", ntry; }' < /dev/null` + echo ' --- commit ' $i from branch $gitbr + date + mkdir $resdir/$ds/$idir + echo $gitbr > $resdir/$ds/$idir/gitbr + echo $i >> $resdir/$ds/$idir/gitbr + + # Test the specified commit. + git checkout $i > $resdir/$ds/$idir/git-checkout.out 2>&1 + echo git checkout return code: $? "(Commit $ntry: $i)" + kvm.sh --cpus $cpus --duration 3 --trust-make > $resdir/$ds/$idir/kvm.sh.out 2>&1 + ret=$? + echo kvm.sh return code $ret for commit $i from branch $gitbr + + # Move the build products to their resting place. + runresdir="`grep -m 1 '^Results directory:' < $resdir/$ds/$idir/kvm.sh.out | sed -e 's/^Results directory://'`" + mv $runresdir $resdir/$ds/$idir + rrd="`echo $runresdir | sed -e 's,^.*/,,'`" + echo Run results: $resdir/$ds/$idir/$rrd + if test "$ret" -ne 0 + then + # Failure, so leave all evidence intact. + nfail=`expr $nfail + 1` + else + # Success, so remove large files to save about 1GB. + ( cd $resdir/$ds/$idir/$rrd; rm -f */vmlinux */bzImage */System.map */Module.symvers ) + fi + done +done +date + +# Go back to the original commit. +git checkout "$curcommit" + +if test $nfail -ne 0 +then + echo '!!! ' $nfail failures in $ntry 'runs!!!' + exit 1 +else + echo No failures in $ntry runs. + exit 0 +fi -- cgit v1.2.3 From d02c6b52d12fa30eeabfaf5aefe12078eacb94b2 Mon Sep 17 00:00:00 2001 From: Zou Wei Date: Mon, 13 Apr 2020 20:02:59 +0800 Subject: locktorture: Use true and false to assign to bool variables This commit fixes the following coccicheck warnings: kernel/locking/locktorture.c:689:6-10: WARNING: Assignment of 0/1 to bool variable kernel/locking/locktorture.c:907:2-20: WARNING: Assignment of 0/1 to bool variable kernel/locking/locktorture.c:938:3-20: WARNING: Assignment of 0/1 to bool variable kernel/locking/locktorture.c:668:2-19: WARNING: Assignment of 0/1 to bool variable kernel/locking/locktorture.c:674:2-19: WARNING: Assignment of 0/1 to bool variable kernel/locking/locktorture.c:634:2-20: WARNING: Assignment of 0/1 to bool variable kernel/locking/locktorture.c:640:2-20: WARNING: Assignment of 0/1 to bool variable Reported-by: Hulk Robot Signed-off-by: Zou Wei Signed-off-by: Paul E. McKenney --- kernel/locking/locktorture.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c index 5efbfc68ce99..8ff6f50e06a0 100644 --- a/kernel/locking/locktorture.c +++ b/kernel/locking/locktorture.c @@ -631,13 +631,13 @@ static int lock_torture_writer(void *arg) cxt.cur_ops->writelock(); if (WARN_ON_ONCE(lock_is_write_held)) lwsp->n_lock_fail++; - lock_is_write_held = 1; + lock_is_write_held = true; if (WARN_ON_ONCE(lock_is_read_held)) lwsp->n_lock_fail++; /* rare, but... */ lwsp->n_lock_acquired++; cxt.cur_ops->write_delay(&rand); - lock_is_write_held = 0; + lock_is_write_held = false; cxt.cur_ops->writeunlock(); stutter_wait("lock_torture_writer"); @@ -665,13 +665,13 @@ static int lock_torture_reader(void *arg) schedule_timeout_uninterruptible(1); cxt.cur_ops->readlock(); - lock_is_read_held = 1; + lock_is_read_held = true; if (WARN_ON_ONCE(lock_is_write_held)) lrsp->n_lock_fail++; /* rare, but... */ lrsp->n_lock_acquired++; cxt.cur_ops->read_delay(&rand); - lock_is_read_held = 0; + lock_is_read_held = false; cxt.cur_ops->readunlock(); stutter_wait("lock_torture_reader"); @@ -686,7 +686,7 @@ static int lock_torture_reader(void *arg) static void __torture_print_stats(char *page, struct lock_stress_stats *statp, bool write) { - bool fail = 0; + bool fail = false; int i, n_stress; long max = 0, min = statp ? statp[0].n_lock_acquired : 0; long long sum = 0; @@ -904,7 +904,7 @@ static int __init lock_torture_init(void) /* Initialize the statistics so that each run gets its own numbers. */ if (nwriters_stress) { - lock_is_write_held = 0; + lock_is_write_held = false; cxt.lwsa = kmalloc_array(cxt.nrealwriters_stress, sizeof(*cxt.lwsa), GFP_KERNEL); @@ -935,7 +935,7 @@ static int __init lock_torture_init(void) } if (nreaders_stress) { - lock_is_read_held = 0; + lock_is_read_held = false; cxt.lrsa = kmalloc_array(cxt.nrealreaders_stress, sizeof(*cxt.lrsa), GFP_KERNEL); -- cgit v1.2.3 From 4a5f133c15b77c4018e8d7996541868ac94afb4f Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 24 Apr 2020 11:21:40 -0700 Subject: rcutorture: Add races with task-exit processing Several variants of Linux-kernel RCU interact with task-exit processing, including preemptible RCU, Tasks RCU, and Tasks Trace RCU. This commit therefore adds testing of this interaction to rcutorture by adding rcutorture.read_exit_burst and rcutorture.read_exit_delay kernel-boot parameters. These kernel parameters control the frequency and spacing of special read-then-exit kthreads that are spawned. [ paulmck: Apply feedback from Dan Carpenter's static checker. ] [ paulmck: Reduce latency to avoid false-positive shutdown hangs. ] Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 14 +++ include/linux/torture.h | 5 ++ kernel/rcu/rcutorture.c | 112 +++++++++++++++++++++++- 3 files changed, 128 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index fb95fad81c79..a0dcc925c8a2 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4258,6 +4258,20 @@ Set time (jiffies) between CPU-hotplug operations, or zero to disable CPU-hotplug testing. + rcutorture.read_exit= [KNL] + Set the number of read-then-exit kthreads used + to test the interaction of RCU updaters and + task-exit processing. + + rcutorture.read_exit_burst= [KNL] + The number of times in a given read-then-exit + episode that a set of read-then-exit kthreads + is spawned. + + rcutorture.read_exit_delay= [KNL] + The delay, in seconds, between successive + read-then-exit testing episodes. + rcutorture.shuffle_interval= [KNL] Set task-shuffle interval (s). Shuffling tasks allows some CPUs to go into dyntick-idle mode diff --git a/include/linux/torture.h b/include/linux/torture.h index 629b66e6c161..7f65bd1dd307 100644 --- a/include/linux/torture.h +++ b/include/linux/torture.h @@ -55,6 +55,11 @@ struct torture_random_state { #define DEFINE_TORTURE_RANDOM_PERCPU(name) \ DEFINE_PER_CPU(struct torture_random_state, name) unsigned long torture_random(struct torture_random_state *trsp); +static inline void torture_random_init(struct torture_random_state *trsp) +{ + trsp->trs_state = 0; + trsp->trs_count = 0; +} /* Task shuffler, which causes CPUs to occasionally go idle. */ void torture_shuffle_task_register(struct task_struct *tp); diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index efb792e13fca..2621a339c8a4 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -109,6 +109,10 @@ torture_param(int, object_debug, 0, torture_param(int, onoff_holdoff, 0, "Time after boot before CPU hotplugs (s)"); torture_param(int, onoff_interval, 0, "Time between CPU hotplugs (jiffies), 0=disable"); +torture_param(int, read_exit_delay, 13, + "Delay between read-then-exit episodes (s)"); +torture_param(int, read_exit_burst, 16, + "# of read-then-exit bursts per episode, zero to disable"); torture_param(int, shuffle_interval, 3, "Number of seconds between shuffles"); torture_param(int, shutdown_secs, 0, "Shutdown time (s), <= zero to disable."); torture_param(int, stall_cpu, 0, "Stall duration (s), zero to disable."); @@ -146,6 +150,7 @@ static struct task_struct *stall_task; static struct task_struct *fwd_prog_task; static struct task_struct **barrier_cbs_tasks; static struct task_struct *barrier_task; +static struct task_struct *read_exit_task; #define RCU_TORTURE_PIPE_LEN 10 @@ -177,6 +182,7 @@ static long n_rcu_torture_boosts; static atomic_long_t n_rcu_torture_timers; static long n_barrier_attempts; static long n_barrier_successes; /* did rcu_barrier test succeed? */ +static unsigned long n_read_exits; static struct list_head rcu_torture_removed; static unsigned long shutdown_jiffies; @@ -1539,10 +1545,11 @@ rcu_torture_stats_print(void) n_rcu_torture_boosts, atomic_long_read(&n_rcu_torture_timers)); torture_onoff_stats(); - pr_cont("barrier: %ld/%ld:%ld\n", + pr_cont("barrier: %ld/%ld:%ld ", data_race(n_barrier_successes), data_race(n_barrier_attempts), data_race(n_rcu_torture_barrier_error)); + pr_cont("read-exits: %ld\n", data_race(n_read_exits)); pr_alert("%s%s ", torture_type, TORTURE_FLAG); if (atomic_read(&n_rcu_torture_mberror) || @@ -1634,7 +1641,8 @@ rcu_torture_print_module_parms(struct rcu_torture_ops *cur_ops, const char *tag) "stall_cpu=%d stall_cpu_holdoff=%d stall_cpu_irqsoff=%d " "stall_cpu_block=%d " "n_barrier_cbs=%d " - "onoff_interval=%d onoff_holdoff=%d\n", + "onoff_interval=%d onoff_holdoff=%d " + "read_exit_delay=%d read_exit_burst=%d\n", torture_type, tag, nrealreaders, nfakewriters, stat_interval, verbose, test_no_idle_hz, shuffle_interval, stutter, irqreader, fqs_duration, fqs_holdoff, fqs_stutter, @@ -1643,7 +1651,8 @@ rcu_torture_print_module_parms(struct rcu_torture_ops *cur_ops, const char *tag) stall_cpu, stall_cpu_holdoff, stall_cpu_irqsoff, stall_cpu_block, n_barrier_cbs, - onoff_interval, onoff_holdoff); + onoff_interval, onoff_holdoff, + read_exit_delay, read_exit_burst); } static int rcutorture_booster_cleanup(unsigned int cpu) @@ -2338,6 +2347,99 @@ static bool rcu_torture_can_boost(void) return true; } +static bool read_exit_child_stop; +static bool read_exit_child_stopped; +static wait_queue_head_t read_exit_wq; + +// Child kthread which just does an rcutorture reader and exits. +static int rcu_torture_read_exit_child(void *trsp_in) +{ + struct torture_random_state *trsp = trsp_in; + + set_user_nice(current, MAX_NICE); + // Minimize time between reading and exiting. + while (!kthread_should_stop()) + schedule_timeout_uninterruptible(1); + (void)rcu_torture_one_read(trsp); + return 0; +} + +// Parent kthread which creates and destroys read-exit child kthreads. +static int rcu_torture_read_exit(void *unused) +{ + int count = 0; + bool errexit = false; + int i; + struct task_struct *tsp; + DEFINE_TORTURE_RANDOM(trs); + + // Allocate and initialize. + set_user_nice(current, MAX_NICE); + VERBOSE_TOROUT_STRING("rcu_torture_read_exit: Start of test"); + + // Each pass through this loop does one read-exit episode. + do { + if (++count > read_exit_burst) { + VERBOSE_TOROUT_STRING("rcu_torture_read_exit: End of episode"); + rcu_barrier(); // Wait for task_struct free, avoid OOM. + for (i = 0; i < read_exit_delay; i++) { + schedule_timeout_uninterruptible(HZ); + if (READ_ONCE(read_exit_child_stop)) + break; + } + if (!READ_ONCE(read_exit_child_stop)) + VERBOSE_TOROUT_STRING("rcu_torture_read_exit: Start of episode"); + count = 0; + } + if (READ_ONCE(read_exit_child_stop)) + break; + // Spawn child. + tsp = kthread_run(rcu_torture_read_exit_child, + &trs, "%s", + "rcu_torture_read_exit_child"); + if (IS_ERR(tsp)) { + VERBOSE_TOROUT_ERRSTRING("out of memory"); + errexit = true; + tsp = NULL; + break; + } + cond_resched(); + kthread_stop(tsp); + n_read_exits ++; + stutter_wait("rcu_torture_read_exit"); + } while (!errexit && !READ_ONCE(read_exit_child_stop)); + + // Clean up and exit. + smp_store_release(&read_exit_child_stopped, true); // After reaping. + smp_mb(); // Store before wakeup. + wake_up(&read_exit_wq); + while (!torture_must_stop()) + schedule_timeout_uninterruptible(1); + torture_kthread_stopping("rcu_torture_read_exit"); + return 0; +} + +static int rcu_torture_read_exit_init(void) +{ + if (read_exit_burst <= 0) + return -EINVAL; + init_waitqueue_head(&read_exit_wq); + read_exit_child_stop = false; + read_exit_child_stopped = false; + return torture_create_kthread(rcu_torture_read_exit, NULL, + read_exit_task); +} + +static void rcu_torture_read_exit_cleanup(void) +{ + if (!read_exit_task) + return; + WRITE_ONCE(read_exit_child_stop, true); + smp_mb(); // Above write before wait. + wait_event(read_exit_wq, smp_load_acquire(&read_exit_child_stopped)); + torture_stop_kthread(rcutorture_read_exit, read_exit_task); +} + static enum cpuhp_state rcutor_hp; static void @@ -2359,6 +2461,7 @@ rcu_torture_cleanup(void) } show_rcu_gp_kthreads(); + rcu_torture_read_exit_cleanup(); rcu_torture_barrier_cleanup(); torture_stop_kthread(rcu_torture_fwd_prog, fwd_prog_task); torture_stop_kthread(rcu_torture_stall, stall_task); @@ -2680,6 +2783,9 @@ rcu_torture_init(void) if (firsterr) goto unwind; firsterr = rcu_torture_barrier_init(); + if (firsterr) + goto unwind; + firsterr = rcu_torture_read_exit_init(); if (firsterr) goto unwind; if (object_debug) -- cgit v1.2.3 From 61251d6899803594a108c3165aeb072c73e09cc8 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 26 Apr 2020 16:48:46 -0700 Subject: torture: Set configfile variable to current scenario The torture-test recheck logic fails to set the configfile variable to the current scenario, so this commit properly initializes this variable. This change isn't critical given that all errors for a given scenario follow that scenario's heading, but it is easier on the eyes to repeat it. And this repetition also prevents confusion as to whether a given message goes with the previous heading or the next one. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/kvm-recheck.sh | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-recheck.sh b/tools/testing/selftests/rcutorture/bin/kvm-recheck.sh index 736f04749b90..2261aa676304 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-recheck.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-recheck.sh @@ -31,6 +31,7 @@ do head -1 $resdir/log fi TORTURE_SUITE="`cat $i/../TORTURE_SUITE`" + configfile=`echo $i | sed -e 's,^.*/,,'` rm -f $i/console.log.*.diags kvm-recheck-${TORTURE_SUITE}.sh $i if test -f "$i/qemu-retval" && test "`cat $i/qemu-retval`" -ne 0 && test "`cat $i/qemu-retval`" -ne 137 -- cgit v1.2.3 From 59359e4f2a0906920389ec1e33296ac9a19178ba Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 26 Apr 2020 16:51:56 -0700 Subject: rcutorture: Handle non-statistic bang-string error messages The current console parsing assumes that console lines containing "!!!" are statistics lines from which it can parse the number of rcutorture too-short grace-period failures. This prints confusing output for other problems, including memory exhaustion. This commit therefore differentiates between these cases and prints an appropriate error string. Signed-off-by: Paul E. McKenney --- .../testing/selftests/rcutorture/bin/parse-console.sh | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/parse-console.sh b/tools/testing/selftests/rcutorture/bin/parse-console.sh index 4bf62d7b1cbc..1c64ca85438c 100755 --- a/tools/testing/selftests/rcutorture/bin/parse-console.sh +++ b/tools/testing/selftests/rcutorture/bin/parse-console.sh @@ -44,11 +44,23 @@ then tail -1 | awk ' { - for (i=NF-8;i<=NF;i++) + normalexit = 1; + for (i=NF-8;i<=NF;i++) { + if (i <= 0 || i !~ /^[0-9]*$/) { + bangstring = $0; + gsub(/^\[[^]]*] /, "", bangstring); + print bangstring; + normalexit = 0; + exit 0; + } sum+=$i; + } } - END { print sum }'` - print_bug $title FAILURE, $nerrs instances + END { + if (normalexit) + print sum " instances" + }'` + print_bug $title FAILURE, $nerrs exit fi -- cgit v1.2.3 From cae7cc6ba5bad320c2055ac54f73affd051e76ca Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 26 Apr 2020 19:20:37 -0700 Subject: rcutorture: NULL rcu_torture_current earlier in cleanup code Currently, the rcu_torture_current variable remains non-NULL until after all readers have stopped. During this time, rcu_torture_stats_print() will think that the test is still ongoing, which can result in confusing dmesg output. This commit therefore NULLs rcu_torture_current immediately after the rcu_torture_writer() kthread has decided to stop, thus informing rcu_torture_stats_print() much sooner. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 2621a339c8a4..59112077a6da 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1172,6 +1172,7 @@ rcu_torture_writer(void *arg) WARN(1, "%s: rtort_pipe_count: %d\n", __func__, rcu_tortures[i].rtort_pipe_count); } } while (!torture_must_stop()); + rcu_torture_current = NULL; // Let stats task know that we are done. /* Reset expediting back to unexpedited. */ if (expediting > 0) expediting = -expediting; @@ -2473,7 +2474,6 @@ rcu_torture_cleanup(void) reader_tasks[i]); kfree(reader_tasks); } - rcu_torture_current = NULL; if (fakewriter_tasks) { for (i = 0; i < nfakewriters; i++) { -- cgit v1.2.3 From d3cb26312ecfdb4ee8dedf931e24e60df1d7fbc9 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 4 May 2020 16:40:53 -0700 Subject: torture: Remove whitespace from identify_qemu_vcpus output The identify_qemu_vcpus bash function can return numbers including whitespace characters, which can be a bit annoying in some bash dollar-sign substitutions. This commit therefore strips all spaces and tabs from the value that identify_qemu_vcpus outputs. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/functions.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/rcutorture/bin/functions.sh b/tools/testing/selftests/rcutorture/bin/functions.sh index 436b1542cf27..51f3464b96d3 100644 --- a/tools/testing/selftests/rcutorture/bin/functions.sh +++ b/tools/testing/selftests/rcutorture/bin/functions.sh @@ -231,7 +231,7 @@ identify_qemu_args () { # Returns the number of virtual CPUs available to the aggregate of the # guest OSes. identify_qemu_vcpus () { - lscpu | grep '^CPU(s):' | sed -e 's/CPU(s)://' + lscpu | grep '^CPU(s):' | sed -e 's/CPU(s)://' -e 's/[ ]*//g' } # print_bug -- cgit v1.2.3 From a3ba4972f2ef8408dcc8a2a3d433621d6c990594 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 4 May 2020 16:41:53 -0700 Subject: torture: Add --allcpus argument to the kvm.sh script Leaving off the kvm.sh script's --cpus argument results in the script testing the scenarios sequentially, which can be quite slow. However, having to specify the actual number of CPUs can be error-prone. This commit therefore adds a --allcpus argument that causes kvm.sh to use all available CPUs. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/kvm.sh | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh index c279cf9cb010..7dbce7a43413 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm.sh @@ -73,6 +73,10 @@ usage () { while test $# -gt 0 do case "$1" in + --allcpus) + cpus=$TORTURE_ALLOTED_CPUS + max_cpus=$TORTURE_ALLOTED_CPUS + ;; --bootargs|--bootarg) checkarg --bootargs "(list of kernel boot arguments)" "$#" "$2" '.*' '^--' TORTURE_BOOTARGS="$2" -- cgit v1.2.3 From 8f43d5911b38f00dfa46169dcb1feb1e101dd906 Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Mon, 1 Jun 2020 19:45:48 +0100 Subject: rcu/rcutorture: Replace 0 with false Coccinelle reports a warning WARNING: Assignment of 0/1 to bool variable The root cause is that the variable lastphase is a bool, but is initialised with integer 0. This commit therefore replaces the 0 with a false. Signed-off-by: Jules Irenge Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 59112077a6da..37455a12898e 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -2185,7 +2185,7 @@ static void rcu_torture_barrier1cb(void *rcu_void) static int rcu_torture_barrier_cbs(void *arg) { long myid = (long)arg; - bool lastphase = 0; + bool lastphase = false; bool newphase; struct rcu_head rcu; -- cgit v1.2.3 From 3e93a51f191aa710760591961240f8910d952b5b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 5 Jun 2020 10:29:28 -0700 Subject: torture: Create qemu-cmd in --buildonly runs One reason to do a --buildonly run is to use the build products elsewhere, for example, to do the actual test on some other system. Part of doing the test is the actual qemu command, which is not currently produced by --buildonly runs. This commit therefore causes --buildonly runs to create this file. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh index 1b9aebd54cc9..064dd735de39 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh @@ -153,6 +153,7 @@ qemu_append="`identify_qemu_append "$QEMU"`" boot_args="`configfrag_boot_params "$boot_args" "$config_template"`" # Generate kernel-version-specific boot parameters boot_args="`per_version_boot_params "$boot_args" $resdir/.config $seconds`" +echo $QEMU $qemu_args -m $TORTURE_QEMU_MEM -kernel $KERNEL -append \"$qemu_append $boot_args\" > $resdir/qemu-cmd if test -n "$TORTURE_BUILDONLY" then @@ -161,7 +162,6 @@ then exit 0 fi echo "NOTE: $QEMU either did not run or was interactive" > $resdir/console.log -echo $QEMU $qemu_args -m $TORTURE_QEMU_MEM -kernel $KERNEL -append \"$qemu_append $boot_args\" > $resdir/qemu-cmd ( $QEMU $qemu_args -m $TORTURE_QEMU_MEM -kernel $KERNEL -append "$qemu_append $boot_args" > $resdir/qemu-output 2>&1 & echo $! > $resdir/qemu_pid; wait `cat $resdir/qemu_pid`; echo $? > $resdir/qemu-retval ) & commandcompleted=0 sleep 10 # Give qemu's pid a chance to reach the file -- cgit v1.2.3 From 6387ecbc94bf5ac07239104b84d2304da6e79b51 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 9 Jun 2020 17:58:30 -0700 Subject: torture: Add a stop-run capability When bisecting RCU issues, it is often the case that the first error in an unsuccessful run will happen quickly, but that a successful run must go on for some time in order to obtain a sufficiently low false-negative error rate. In many cases, a bisection requires multiple concurrent runs, in which case the first failure in any run indicates failure, pure and simple. In such cases, it would speed things up greatly if the first failure terminated all runs. This commit therefore adds scripting that checks for a file named "STOP" in the top-level results directory, terminating the run when it appears. Note that in-progress builds will continue until completion, but future builds and all runs will be cut short. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/jitter.sh | 6 ++++++ tools/testing/selftests/rcutorture/bin/kvm-build.sh | 6 ++++++ tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 13 +++++++++++-- tools/testing/selftests/rcutorture/bin/kvm.sh | 2 ++ 4 files changed, 25 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/jitter.sh b/tools/testing/selftests/rcutorture/bin/jitter.sh index 30cb5b27d32e..188b864bc4bf 100755 --- a/tools/testing/selftests/rcutorture/bin/jitter.sh +++ b/tools/testing/selftests/rcutorture/bin/jitter.sh @@ -46,6 +46,12 @@ do exit 0; fi + # Check for stop request. + if test -f "$TORTURE_STOPFILE" + then + exit 1; + fi + # Set affinity to randomly selected online CPU if cpus=`grep 1 /sys/devices/system/cpu/*/online 2>&1 | sed -e 's,/[^/]*$,,' -e 's/^[^0-9]*//'` diff --git a/tools/testing/selftests/rcutorture/bin/kvm-build.sh b/tools/testing/selftests/rcutorture/bin/kvm-build.sh index 18d6518504ee..115e1822b26f 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-build.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-build.sh @@ -9,6 +9,12 @@ # # Authors: Paul E. McKenney +if test -f "$TORTURE_STOPFILE" +then + echo "kvm-build.sh early exit due to run STOP request" + exit 1 +fi + config_template=${1} if test -z "$config_template" -o ! -f "$config_template" -o ! -r "$config_template" then diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh index 064dd735de39..5ec095da095f 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh @@ -182,7 +182,7 @@ do kruntime=`gawk 'BEGIN { print systime() - '"$kstarttime"' }' < /dev/null` if test -z "$qemu_pid" || kill -0 "$qemu_pid" > /dev/null 2>&1 then - if test $kruntime -ge $seconds + if test $kruntime -ge $seconds -o -f "$TORTURE_STOPFILE" then break; fi @@ -211,10 +211,19 @@ then fi if test $commandcompleted -eq 0 -a -n "$qemu_pid" then - echo Grace period for qemu job at pid $qemu_pid + if ! test -f "$TORTURE_STOPFILE" + then + echo Grace period for qemu job at pid $qemu_pid + fi oldline="`tail $resdir/console.log`" while : do + if test -f "$TORTURE_STOPFILE" + then + echo "PID $qemu_pid killed due to run STOP request" >> $resdir/Warnings 2>&1 + kill -KILL $qemu_pid + break + fi kruntime=`gawk 'BEGIN { print systime() - '"$kstarttime"' }' < /dev/null` if kill -0 $qemu_pid > /dev/null 2>&1 then diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh index 7dbce7a43413..3578c85ea8c4 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm.sh @@ -337,6 +337,8 @@ then mkdir -p "$resdir" || : fi mkdir $resdir/$ds +TORTURE_RESDIR="$resdir/$ds"; export TORTURE_RESDIR +TORTURE_STOPFILE="$resdir/$ds/STOP"; export TORTURE_STOPFILE echo Results directory: $resdir/$ds echo $scriptname $args touch $resdir/$ds/log -- cgit v1.2.3 From bc77a72cd188d44881ee1b9d0a9d65ca8108b508 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 10 Jun 2020 14:08:19 -0700 Subject: torture: Abstract out console-log error detection This commit pulls the simple pattern-based error detection from the console log into a new console-badness.sh file. This will enable future commits to end a run on the first error. Signed-off-by: Paul E. McKenney --- .../testing/selftests/rcutorture/bin/console-badness.sh | 16 ++++++++++++++++ tools/testing/selftests/rcutorture/bin/parse-console.sh | 5 +---- 2 files changed, 17 insertions(+), 4 deletions(-) create mode 100755 tools/testing/selftests/rcutorture/bin/console-badness.sh diff --git a/tools/testing/selftests/rcutorture/bin/console-badness.sh b/tools/testing/selftests/rcutorture/bin/console-badness.sh new file mode 100755 index 000000000000..0e4c0b2eb7f0 --- /dev/null +++ b/tools/testing/selftests/rcutorture/bin/console-badness.sh @@ -0,0 +1,16 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0+ +# +# Scan standard input for error messages, dumping any found to standard +# output. +# +# Usage: console-badness.sh +# +# Copyright (C) 2020 Facebook, Inc. +# +# Authors: Paul E. McKenney + +egrep 'Badness|WARNING:|Warn|BUG|===========|Call Trace:|Oops:|detected stalls on CPUs/tasks:|self-detected stall on CPU|Stall ended before state dump start|\?\?\? Writer stall state|rcu_.*kthread starved for|!!!' | +grep -v 'ODEBUG: ' | +grep -v 'This means that this is a DEBUG kernel and it is' | +grep -v 'Warning: unable to open an initial console' diff --git a/tools/testing/selftests/rcutorture/bin/parse-console.sh b/tools/testing/selftests/rcutorture/bin/parse-console.sh index 1c64ca85438c..98478e12ac3d 100755 --- a/tools/testing/selftests/rcutorture/bin/parse-console.sh +++ b/tools/testing/selftests/rcutorture/bin/parse-console.sh @@ -116,10 +116,7 @@ then fi fi | tee -a $file.diags -egrep 'Badness|WARNING:|Warn|BUG|===========|Call Trace:|Oops:|detected stalls on CPUs/tasks:|self-detected stall on CPU|Stall ended before state dump start|\?\?\? Writer stall state|rcu_.*kthread starved for' < $file | -grep -v 'ODEBUG: ' | -grep -v 'This means that this is a DEBUG kernel and it is' | -grep -v 'Warning: unable to open an initial console' > $T.diags +console-badness.sh < $file > $T.diags if test -s $T.diags then print_warning "Assertion failure in $file $title" -- cgit v1.2.3 From 775227511843202e65a7f194cbf64f38de01f004 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 11 Jun 2020 16:43:14 -0700 Subject: rcutorture: Check for unwatched readers RCU is supposed to be watching all non-idle kernel code and also all softirq handlers. This commit adds some teeth to this statement by adding a WARN_ON_ONCE(). Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 37455a12898e..9c310016585b 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1377,6 +1377,7 @@ static bool rcu_torture_one_read(struct torture_random_state *trsp) struct rt_read_seg *rtrsp1; unsigned long long ts; + WARN_ON_ONCE(!rcu_is_watching()); newstate = rcutorture_extend_mask(readstate, trsp); rcutorture_one_extend(&readstate, newstate, trsp, rtrsp++); started = cur_ops->get_gp_seq(); -- cgit v1.2.3 From 603d11ad6976e1289f19c2a19e2f75a83d0dc296 Mon Sep 17 00:00:00 2001 From: Marco Elver Date: Tue, 16 Jun 2020 11:49:24 +0200 Subject: torture: Pass --kmake-arg to all make invocations We need to pass the arguments provided to --kmake-arg to all make invocations. In particular, the make invocations generating the configs need to see the final make arguments, e.g. if config variables depend on particular variables that are passed to make. For example, when using '--kcsan --kmake-arg CC=clang-11', we would lose CONFIG_KCSAN=y due to 'make oldconfig' not seeing that we want to use a compiler that supports KCSAN. Signed-off-by: Marco Elver Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/configinit.sh | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/configinit.sh b/tools/testing/selftests/rcutorture/bin/configinit.sh index 93e80a42249a..d6e5ce084b1c 100755 --- a/tools/testing/selftests/rcutorture/bin/configinit.sh +++ b/tools/testing/selftests/rcutorture/bin/configinit.sh @@ -32,11 +32,11 @@ if test -z "$TORTURE_TRUST_MAKE" then make clean > $resdir/Make.clean 2>&1 fi -make $TORTURE_DEFCONFIG > $resdir/Make.defconfig.out 2>&1 +make $TORTURE_KMAKE_ARG $TORTURE_DEFCONFIG > $resdir/Make.defconfig.out 2>&1 mv .config .config.sav sh $T/upd.sh < .config.sav > .config cp .config .config.new -yes '' | make oldconfig > $resdir/Make.oldconfig.out 2> $resdir/Make.oldconfig.err +yes '' | make $TORTURE_KMAKE_ARG oldconfig > $resdir/Make.oldconfig.out 2> $resdir/Make.oldconfig.err # verify new config matches specification. configcheck.sh .config $c -- cgit v1.2.3 From 6bcaf2a0876633b6a7c5e70ee88801e16280210a Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 16 Jun 2020 10:02:54 -0700 Subject: torture: Correctly summarize build-only runs Currently, kvm-recheck.sh complains that qemu failed for --buildonly runs, which is sort of true given that qemu can hardly succeed if not invoked in the first place. Nevertheless, this commit swaps the order of checks in kvm-recheck.sh so that --buildonly runs will be summarized more straightforwardly. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/kvm-recheck.sh | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-recheck.sh b/tools/testing/selftests/rcutorture/bin/kvm-recheck.sh index 2261aa676304..357899cfe249 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-recheck.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-recheck.sh @@ -56,15 +56,15 @@ do cat $i/Warnings fi else - if test -f "$i/qemu-cmd" - then - print_bug qemu failed - echo " $i" - elif test -f "$i/buildonly" + if test -f "$i/buildonly" then echo Build-only run, no boot/test configcheck.sh $i/.config $i/ConfigFragment parse-build.sh $i/Make.out $configfile + elif test -f "$i/qemu-cmd" + then + print_bug qemu failed + echo " $i" else print_bug Build failed echo " $i" -- cgit v1.2.3 From 61b77be09e29e6dc152b1984691e5b1708e8a6ac Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 16 Jun 2020 10:38:57 -0700 Subject: torture: Improve diagnostic for KCSAN-incapable compilers Using --kcsan when the compiler does not support KCSAN results in this: :CONFIG_KCSAN=y: improperly set :CONFIG_KCSAN_REPORT_ONCE_IN_MS=100000: improperly set :CONFIG_KCSAN_VERBOSE=y: improperly set :CONFIG_KCSAN_INTERRUPT_WATCHER=y: improperly set Clean KCSAN run in /home/git/linux-rcu/tools/testing/selftests/rcutorture/res/2020.06.16-09.53.16 This is a bit obtuse, so this commit adds checks resulting in this: :CONFIG_KCSAN=y: improperly set :CONFIG_KCSAN_REPORT_ONCE_IN_MS=100000: improperly set :CONFIG_KCSAN_VERBOSE=y: improperly set :CONFIG_KCSAN_INTERRUPT_WATCHER=y: improperly set Compiler or architecture does not support KCSAN! Did you forget to switch your compiler with --kmake-arg CC=? Suggested-by: Marco Elver Signed-off-by: Paul E. McKenney Acked-by: Marco Elver --- tools/testing/selftests/rcutorture/bin/kvm-recheck.sh | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-recheck.sh b/tools/testing/selftests/rcutorture/bin/kvm-recheck.sh index 357899cfe249..840a4679a0d7 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-recheck.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-recheck.sh @@ -44,7 +44,8 @@ do then echo QEMU killed fi - configcheck.sh $i/.config $i/ConfigFragment + configcheck.sh $i/.config $i/ConfigFragment > $T 2>&1 + cat $T if test -r $i/Make.oldconfig.err then cat $i/Make.oldconfig.err @@ -73,7 +74,11 @@ do done if test -f "$rd/kcsan.sum" then - if test -s "$rd/kcsan.sum" + if grep -q CONFIG_KCSAN=y $T + then + echo "Compiler or architecture does not support KCSAN!" + echo Did you forget to switch your compiler with '--kmake-arg CC='? + elif test -s "$rd/kcsan.sum" then echo KCSAN summary in $rd/kcsan.sum else -- cgit v1.2.3 From 9ccba350bd824ecacbfd8965f4f3ac980b96f951 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 16 Jun 2020 11:16:18 -0700 Subject: torture: Add more tracing crib notes to kvm.sh This commit adds a few more hints about how to use tracing as comments at the end of kvm.sh. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/kvm.sh | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh index 3578c85ea8c4..bdfa0c076ae6 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm.sh @@ -503,3 +503,7 @@ fi # Tracing: trace_event=rcu:rcu_grace_period,rcu:rcu_future_grace_period,rcu:rcu_grace_period_init,rcu:rcu_nocb_wake,rcu:rcu_preempt_task,rcu:rcu_unlock_preempted_task,rcu:rcu_quiescent_state_report,rcu:rcu_fqs,rcu:rcu_callback,rcu:rcu_kfree_callback,rcu:rcu_batch_start,rcu:rcu_invoke_callback,rcu:rcu_invoke_kfree_callback,rcu:rcu_batch_end,rcu:rcu_torture_read,rcu:rcu_barrier # Function-graph tracing: ftrace=function_graph ftrace_graph_filter=sched_setaffinity,migration_cpu_stop # Also --kconfig "CONFIG_FUNCTION_TRACER=y CONFIG_FUNCTION_GRAPH_TRACER=y" +# Control buffer size: --bootargs trace_buf_size=3k +# Get trace-buffer dumps on all oopses: --bootargs ftrace_dump_on_oops +# Ditto, but dump only the oopsing CPU: --bootargs ftrace_dump_on_oops=orig_cpu +# Heavy-handed way to also dump on warnings: --bootargs panic_on_warn -- cgit v1.2.3 From 06efa9b4b27f926eeb8c935f430f8557eb8b106e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 16 Jun 2020 14:14:09 -0700 Subject: torture: Add kvm-tranform.sh script for qemu-cmd files This commit adds a script that transforms qemu-cmd files to allow them and the corresponding kernels to be run in contexts other than the one that they were created for, including on systems other than the one that they were built on. For example, this allows the build products from a --buildonly run to be transformed to allow distributed rcutorture testing. Signed-off-by: Paul E. McKenney --- .../selftests/rcutorture/bin/kvm-transform.sh | 51 ++++++++++++++++++++++ 1 file changed, 51 insertions(+) create mode 100755 tools/testing/selftests/rcutorture/bin/kvm-transform.sh diff --git a/tools/testing/selftests/rcutorture/bin/kvm-transform.sh b/tools/testing/selftests/rcutorture/bin/kvm-transform.sh new file mode 100755 index 000000000000..c45a953ef393 --- /dev/null +++ b/tools/testing/selftests/rcutorture/bin/kvm-transform.sh @@ -0,0 +1,51 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0+ +# +# Transform a qemu-cmd file to allow reuse. +# +# Usage: kvm-transform.sh bzImage console.log < qemu-cmd-in > qemu-cmd-out +# +# bzImage: Kernel and initrd from the same prior kvm.sh run. +# console.log: File into which to place console output. +# +# The original qemu-cmd file is provided on standard input. +# The transformed qemu-cmd file is on standard output. +# The transformation assumes that the qemu command is confined to a +# single line. It also assumes no whitespace in filenames. +# +# Copyright (C) 2020 Facebook, Inc. +# +# Authors: Paul E. McKenney + +image="$1" +if test -z "$image" +then + echo Need kernel image file. + exit 1 +fi +consolelog="$2" +if test -z "$consolelog" +then + echo "Need console log file name." + exit 1 +fi + +awk -v image="$image" -v consolelog="$consolelog" ' +{ + line = ""; + for (i = 1; i <= NF; i++) { + if (line == "") + line = $i; + else + line = line " " $i; + if ($i == "-serial") { + i++; + line = line " file:" consolelog; + } + if ($i == "-kernel") { + i++; + line = line " " image; + } + } + print line; +}' -- cgit v1.2.3 From 2102ad290af06119ccfb56ddc3a0e5011a91537e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 16 Jun 2020 15:38:24 -0700 Subject: torture: Dump ftrace at shutdown only if requested If there is a large number of torture tests running concurrently, all of which are dumping large ftrace buffers at shutdown time, the resulting dumping can take a very long time, particularly on systems with rotating-rust storage. This commit therefore adds a default-off torture.ftrace_dump_at_shutdown module parameter that enables shutdown-time ftrace-buffer dumping. Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 7 +++++++ kernel/torture.c | 6 +++++- 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index a0dcc925c8a2..9f11ff80d4ad 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5096,6 +5096,13 @@ Prevent the CPU-hotplug component of torturing until after init has spawned. + torture.ftrace_dump_at_shutdown= [KNL] + Dump the ftrace buffer at torture-test shutdown, + even if there were no errors. This can be a + very costly operation when many torture tests + are running concurrently, especially on systems + with rotating-rust storage. + tp720= [HW,PS2] tpm_suspend_pcr=[HW,TPM] diff --git a/kernel/torture.c b/kernel/torture.c index a1a41484ff6d..1061492f14bd 100644 --- a/kernel/torture.c +++ b/kernel/torture.c @@ -45,6 +45,9 @@ MODULE_AUTHOR("Paul E. McKenney "); static bool disable_onoff_at_boot; module_param(disable_onoff_at_boot, bool, 0444); +static bool ftrace_dump_at_shutdown; +module_param(ftrace_dump_at_shutdown, bool, 0444); + static char *torture_type; static int verbose; @@ -527,7 +530,8 @@ static int torture_shutdown(void *arg) torture_shutdown_hook(); else VERBOSE_TOROUT_STRING("No torture_shutdown_hook(), skipping."); - rcu_ftrace_dump(DUMP_ALL); + if (ftrace_dump_at_shutdown) + rcu_ftrace_dump(DUMP_ALL); kernel_power_off(); /* Shut down the system. */ return 0; } -- cgit v1.2.3 From 316db5897ee5d7408f2adea4d5992ed380316928 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 16 Jun 2020 16:34:52 -0700 Subject: torture: Avoid duplicate specification of qemu command Currently, the qemu command is constructed twice, once to dump it to the qemu-cmd file and again to execute it. This is of course an accident waiting to happen, but is done to ensure that the remainder of the script has an accurate idea of the running qemu command's PID. This commit therefore places both the qemu command and the PID capture into a new temporary file and sources that temporary file. Thus the single construction of the qemu command into the qemu-cmd file suffices for both purposes. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh index 5ec095da095f..484445bd3010 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh @@ -161,8 +161,16 @@ then touch $resdir/buildonly exit 0 fi + +# Decorate qemu-cmd with redirection, backgrounding, and PID capture +sed -e 's/$/ 2>\&1 \&/' < $resdir/qemu-cmd > $T/qemu-cmd +echo 'echo $! > $resdir/qemu_pid' >> $T/qemu-cmd + +# In case qemu refuses to run... echo "NOTE: $QEMU either did not run or was interactive" > $resdir/console.log -( $QEMU $qemu_args -m $TORTURE_QEMU_MEM -kernel $KERNEL -append "$qemu_append $boot_args" > $resdir/qemu-output 2>&1 & echo $! > $resdir/qemu_pid; wait `cat $resdir/qemu_pid`; echo $? > $resdir/qemu-retval ) & + +# Attempt to run qemu +( . $T/qemu-cmd; wait `cat $resdir/qemu_pid`; echo $? > $resdir/qemu-retval ) & commandcompleted=0 sleep 10 # Give qemu's pid a chance to reach the file if test -s "$resdir/qemu_pid" -- cgit v1.2.3 From 7a6bbeaa01f71af2722fd775a4a4ff9593d12838 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 16 Jun 2020 17:07:15 -0700 Subject: torture: Remove obsolete "cd $KVM" In the dim distant past, qemu commands needed to be run from the rcutorture directory, but this is no longer the case. This commit therefore removes the now-useless "cd $KVM" from the kvm-test-1-run.sh script. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 1 - 1 file changed, 1 deletion(-) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh index 484445bd3010..e07779a62634 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh @@ -124,7 +124,6 @@ seconds=$4 qemu_args=$5 boot_args=$6 -cd $KVM kstarttime=`gawk 'BEGIN { print systime() }' < /dev/null` if test -z "$TORTURE_BUILDONLY" then -- cgit v1.2.3