path: root/kernel/sched
Age         Commit message                                              Author
2023-11-14  sched/fair: Fix the decision for load balance  (Keisuke Nishimura)

should_we_balance() is called for the decision to do load-balancing. When sched ticks invoke this function, only one CPU should return true. However, in the current code, two CPUs can return true. The following situation, where b means busy and i means idle, is an example, because CPU 0 and CPU 2 return true:

        [0, 1] [2, 3]
         b  b   i  b

This fix checks if there exists an idle CPU with busy sibling(s) after looking for a CPU on an idle core. If some idle CPUs with busy siblings are found, just the first one should do load-balancing.

Fixes: b1bfeab9b002 ("sched/fair: Consider the idle state of the whole core for load balance")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231031133821.1570861-1-keisuke.nishimura@inria.fr
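[ Editor's note: a minimal standalone C sketch of the fixed selection rule, assuming a simplified model (an idle[] array and two-way SMT cores); names and structure are illustrative assumptions, not the kernel's should_we_balance(). ]

    #include <stdbool.h>
    #include <stddef.h>

    /* idle[i] is true when CPU i is idle; cores are sibling pairs
     * {2k, 2k+1}, as in the [0, 1] [2, 3] example above. */
    static int preferred_balance_cpu(const bool *idle, size_t nr_cpus)
    {
            int fallback = -1;

            for (size_t core = 0; core + 1 < nr_cpus; core += 2) {
                    bool sib0 = idle[core], sib1 = idle[core + 1];

                    if (sib0 && sib1)       /* a fully idle core wins outright */
                            return (int)core;
                    /* remember only the FIRST idle CPU with a busy sibling */
                    if (fallback < 0 && (sib0 || sib1))
                            fallback = sib0 ? (int)core : (int)(core + 1);
            }
            /* no idle core: exactly one CPU (the remembered one) balances */
            return fallback;
    }

For the example above (b b i b), only CPU 2 is returned, so a single CPU ends up doing the load-balancing.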
2023-11-14  sched/eevdf: Fix vruntime adjustment on reweight  (Abel Wu)

vruntime of the (on_rq && !0-lag) entity needs to be adjusted when it gets re-weighted, and the calculations can be simplified based on the fact that re-weighting won't change the w-average of all the entities. Please check the proofs in the comments.

But adjusting vruntime can also cause a position change in the RB-tree, hence requiring a re-queue to fix up, which might be costly. This might be avoided by deferring the adjustment to the time the entity actually leaves the tree (dequeue/pick), but that would negatively affect task selection and is probably not good enough either.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.abel@bytedance.com
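[ Editor's note: a hedged sketch of the arithmetic implied above, assuming the usual EEVDF definition lag = w * (V - v), where V is the weight-averaged vruntime that reweighting leaves unchanged; illustrative signed arithmetic, not the kernel's fixed-point code. ]

    /* Preserve lag across w_old -> w_new with V invariant:
     *   w_old * (V - v) == w_new * (V - v')
     *   =>  v' = V - (V - v) * w_old / w_new
     */
    static long long reweighted_vruntime(long long V, long long v,
                                         long long w_old, long long w_new)
    {
            return V - ((V - v) * w_old) / w_new;
    }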
2023-11-02  Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds)

Pull non-MM updates from Andrew Morton:
"As usual, lots of singleton and doubleton patches all over the tree and there's little I can say which isn't in the individual changelogs.

The lengthier patch series are:

- 'kdump: use generic functions to simplify crashkernel reservation in arch', from Baoquan He. This is mainly cleanups and consolidation of the 'crashkernel=' kernel parameter handling

- After much discussion, David Laight's 'minmax: Relax type checks in min() and max()' is here. Hopefully it reduces some typecasting and the use of min_t() and max_t()

- A group of patches from Oleg Nesterov which clean up and slightly fix our handling of reads from /proc/PID/task/... and which remove task_struct.thread_group"

* tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits)
  scripts/gdb/vmalloc: disable on no-MMU
  scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n
  .mailmap: add address mapping for Tomeu Vizoso
  mailmap: update email address for Claudiu Beznea
  tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions
  .mailmap: map Benjamin Poirier's address
  scripts/gdb: add lx_current support for riscv
  ocfs2: fix a spelling typo in comment
  proc: test ProtectionKey in proc-empty-vm test
  proc: fix proc-empty-vm test with vsyscall
  fs/proc/base.c: remove unneeded semicolon
  do_io_accounting: use sig->stats_lock
  do_io_accounting: use __for_each_thread()
  ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error()
  ocfs2: fix a typo in a comment
  scripts/show_delta: add __main__ judgement before main code
  treewide: mark stuff as __ro_after_init
  fs: ocfs2: check status values
  proc: test /proc/${pid}/statm
  compiler.h: move __is_constexpr() to compiler.h
  ...
2023-11-02  Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds)

Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are included in this merge do the following:

- Kemeng Shi has contributed some compaction maintenance work in the series 'Fixes and cleanups to compaction'

- Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested

- More DAMON/DAMOS maintenance and feature work from SeongJae Park in the following patch series:

	mm/damon: misc fixups for documents, comments and its tracepoint
	mm/damon: add a tracepoint for damos apply target regions
	mm/damon: provide pseudo-moving sum based access rate
	mm/damon: implement DAMOS apply intervals
	mm/damon/core-test: Fix memory leaks in core-test
	mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval

- In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature, to increase the feature's checking coverage: 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory'

- In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code

- Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink'

- David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups'

- Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification'

- Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()'

- In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames

- The series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmemmap optimization code. This provides significant boot time improvements when significant amounts of gigantic pages are in use

- Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code

- Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock'

- In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2

- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance'

- Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says

- In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes it possible for a process to propagate KSM treatment across exec()

- Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT'

- In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans

- Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values'

- In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU

- Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code

- Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result

- In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions

- In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements

- Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things

- Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages'

- In the series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults

- In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code

- Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting'

- Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()'

- Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps'

- Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings'

- Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations'

- Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition'

- As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning'

- Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark

- folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios'

- Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak'

- Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately'

- khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'"

[ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ]

* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
  mm/damon/sysfs: update monitoring target regions for online input commit
  mm/damon/sysfs: remove requested targets when online-commit inputs
  selftests: add a sanity check for zswap
  Documentation: maple_tree: fix word spelling error
  mm/vmalloc: fix the unchecked dereference warning in vread_iter()
  zswap: export compression failure stats
  Documentation: ubsan: drop "the" from article title
  mempolicy: migration attempt to match interleave nodes
  mempolicy: mmap_lock is not needed while migrating folios
  mempolicy: alloc_pages_mpol() for NUMA policy without vma
  mm: add page_rmappable_folio() wrapper
  mempolicy: remove confusing MPOL_MF_LAZY dead code
  mempolicy: mpol_shared_policy_init() without pseudo-vma
  mempolicy trivia: use pgoff_t in shared mempolicy tree
  mempolicy trivia: slightly more consistent naming
  mempolicy trivia: delete those ancient pr_debug()s
  mempolicy: fix migrate_pages(2) syscall return nr_failed
  kernfs: drop shared NUMA mempolicy hooks
  hugetlbfs: drop shared NUMA mempolicy pretence
  mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
  ...
2023-11-01  Merge tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic  (Linus Torvalds)

Pull ia64 removal and asm-generic updates from Arnd Bergmann:

- The ia64 architecture gets its well-earned retirement as planned, now that there is one last (mostly) working release that will be maintained as an LTS kernel.

- The architecture specific system call tables are updated for the added map_shadow_stack() syscall and to remove references to the long-gone sys_lookup_dcookie() syscall.

* tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  hexagon: Remove unusable symbols from the ptrace.h uapi
  asm-generic: Fix spelling of architecture
  arch: Reserve map_shadow_stack() syscall number for all architectures
  syscalls: Cleanup references to sys_lookup_dcookie()
  Documentation: Drop or replace remaining mentions of IA64
  lib/raid6: Drop IA64 support
  Documentation: Drop IA64 from feature descriptions
  kernel: Drop IA64 support from sig_fault handlers
  arch: Remove Itanium (IA-64) architecture
2023-10-31  Merge tag 'pm-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm  (Linus Torvalds)

Pull power management updates from Rafael Wysocki:
"These add new hardware support (new Qualcomm SoC versions in cpufreq, RK3568/RK3588 in devfreq), extend the OPP (operating performance points) framework, improve cpufreq governors, fix issues and clean up code (most of the changes are in cpufreq and devfreq).

Specifics:

- Add support for several Qualcomm SoC versions and other similar changes (Christian Marangi, Dmitry Baryshkov, Luca Weiss, Neil Armstrong, Richard Acayan, Robert Marko, Rohit Agarwal, Stephan Gerhold and Varadarajan Narayanan)

- Clean up the tegra cpufreq driver (Sumit Gupta)

- Use of_property_read_reg() to parse "reg" in pmac32 driver (Rob Herring)

- Add support for TI's am62p5 SoC (Bryan Brattlof)

- Make ARM_BRCMSTB_AVS_CPUFREQ depend on !ARM_SCMI_CPUFREQ (Florian Fainelli)

- Update Kconfig to mention i.MX7 as well (Alexander Stein)

- Revise global turbo disable check in intel_pstate (Srinivas Pandruvada)

- Carry out initialization of sg_cpu in the schedutil cpufreq governor in one loop (Liao Chang)

- Simplify the condition for storing 'down_threshold' in the conservative cpufreq governor (Liao Chang)

- Use fine-grained mutex in the userspace cpufreq governor (Liao Chang)

- Move is_managed indicator in the userspace cpufreq governor into a per-policy structure (Liao Chang)

- Rebuild sched-domains when removing cpufreq driver (Pierre Gondois)

- Fix buffer overflow detection in trans_stats() (Christian Marangi)

- Switch to dev_pm_opp_find_freq_(ceil/floor)_indexed() APIs to support specific devices like UFS which handle multiple clocks through the OPP (Operating Performance Point) framework (Manivannan Sadhasivam)

- Add perf support to the Rockchip DFI (DDR Monitor Module) devfreq-event driver:
    * Generalize rockchip-dfi.c to support new RK3568/RK3588 using different DDR type (Sascha Hauer)
    * Convert DT binding document format to yaml (Sascha Hauer)
    * Add perf support for DFI (a unit suitable for measuring DDR utilization) to rockchip-dfi.c to extend DFI usage (Sascha Hauer)

- Add locking to the OPP handling code in the Mediatek CCI devfreq driver, because the voltage of shared OPP might be changed by multiple drivers (Mark Tseng, Dan Carpenter)

- Use device_get_match_data() in the Samsung Exynos PPMU devfreq-event driver (Rob Herring)

- Extend support for the opp-level beyond required-opps (Ulf Hansson)

- Add dev_pm_opp_find_level_floor() (Krishna chaitanya chundru)

- dt-bindings: Allow opp-peak-kBps for kryo CPUs, support Qualcomm Krait SoCs and document named opp-microvolt property (Bjorn Andersson, Dmitry Baryshkov and Christian Marangi)

- Fix -Wunsequenced warning in _of_add_opp_table_v1() (Nathan Chancellor)

- General cleanup of OPP code (Viresh Kumar)

- Use __get_safe_page() rather than touching the list in hibernation snapshot code (Brian Geffon)

- Fix symbol export for _SIMPLE_ variants of _PM_OPS() (Raag Jadav)

- Clean up sync_read handling in snapshot_write_next() (Brian Geffon)

- Fix kerneldoc comments for swsusp_check() and swsusp_close() to better match code (Christoph Hellwig)

- Downgrade BIOS locked limits pr_warn() in the Intel RAPL power capping driver to pr_debug() (Ville Syrjälä)

- Change the minimum python version for the intel_pstate_tracer utility from 2.7 to 3.6 (Doug Smythies)"

* tag 'pm-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (82 commits)
  dt-bindings: cpufreq: qcom-hw: document SM8650 CPUFREQ Hardware
  cpufreq: arm: Kconfig: Add i.MX7 to supported SoC for ARM_IMX_CPUFREQ_DT
  cpufreq: qcom-nvmem: add support for IPQ8064
  cpufreq: qcom-nvmem: also accept operating-points-v2-krait-cpu
  cpufreq: qcom-nvmem: drop pvs_ver for format a fuses
  dt-bindings: cpufreq: qcom-cpufreq-nvmem: Document krait-cpu
  cpufreq: qcom-nvmem: add support for IPQ6018
  dt-bindings: cpufreq: qcom-cpufreq-nvmem: document IPQ6018
  cpufreq: qcom-nvmem: Add MSM8909
  cpufreq: qcom-nvmem: Simplify driver data allocation
  powercap: intel_rapl: Downgrade BIOS locked limits pr_warn() to pr_debug()
  cpufreq: stats: Fix buffer overflow detection in trans_stats()
  dt-bindings: devfreq: event: rockchip,dfi: Add rk3588 support
  dt-bindings: devfreq: event: rockchip,dfi: Add rk3568 support
  dt-bindings: devfreq: event: convert Rockchip DFI binding to yaml
  PM / devfreq: rockchip-dfi: add support for RK3588
  PM / devfreq: rockchip-dfi: account for multiple DDRMON_CTRL registers
  PM / devfreq: rockchip-dfi: make register stride SoC specific
  PM / devfreq: rockchip-dfi: Add perf support
  PM / devfreq: rockchip-dfi: give variable a better name
  ...
2023-10-30  Merge tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull scheduler updates from Ingo Molnar:

"Fair scheduler (SCHED_OTHER) improvements:

- Remove the old and now unused SIS_PROP code & option
- Scan cluster before LLC in the wake-up path
- Use candidate prev/recent_used CPU if scanning failed for cluster wakeup

NUMA scheduling improvements:

- Improve the VMA access-PID code to better skip/scan VMAs
- Extend tracing to cover VMA-skipping decisions
- Improve/fix the recently introduced sched_numa_find_nth_cpu() code
- Generalize numa_map_to_online_node()

Energy scheduling improvements:

- Remove the EM_MAX_COMPLEXITY limit
- Add tracepoints to track energy computation
- Make the behavior of the 'sched_energy_aware' sysctl more consistent
- Consolidate and clean up access to a CPU's max compute capacity
- Fix uclamp code corner cases

RT scheduling improvements:

- Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates
- Drive the ->rto_mask with rt_rq->pushable_tasks updates

Scheduler scalability improvements:

- Rate-limit updates to tg->load_avg
- On x86 disable IBRS when CPU is offline to improve single-threaded performance
- Micro-optimize in_task() and in_interrupt()
- Micro-optimize the PSI code
- Avoid updating PSI triggers and ->rtpoll_total when there are no state changes

Core scheduler infrastructure improvements:

- Use saved_state to reduce some spurious freezer wakeups
- Bring in a handful of fast-headers improvements to scheduler headers
- Make the scheduler UAPI headers more widely usable by user-space
- Simplify the control flow of scheduler syscalls by using lock guards
- Fix sched_setaffinity() vs. CPU hotplug race

Scheduler debuggability improvements:

- Disallow writing invalid values to sched_rt_period_us
- Fix a race in the rq-clock debugging code triggering warnings
- Fix a warning in the bandwidth distribution code
- Micro-optimize in_atomic_preempt_off() checks
- Enforce that the tasklist_lock is held in for_each_thread()
- Print the TGID in sched_show_task()
- Remove the /proc/sys/kernel/sched_child_runs_first sysctl

... and misc cleanups & fixes"

* tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
  sched/fair: Remove SIS_PROP
  sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
  sched/fair: Scan cluster before scanning LLC in wake-up path
  sched: Add cpus_share_resources API
  sched/core: Fix RQCF_ACT_SKIP leak
  sched/fair: Remove unused 'curr' argument from pick_next_entity()
  sched/nohz: Update comments about NEWILB_KICK
  sched/fair: Remove duplicate #include
  sched/psi: Update poll => rtpoll in relevant comments
  sched: Make PELT acronym definition searchable
  sched: Fix stop_one_cpu_nowait() vs hotplug
  sched/psi: Bail out early from irq time accounting
  sched/topology: Rename 'DIE' domain to 'PKG'
  sched/psi: Delete the 'update_total' function parameter from update_triggers()
  sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
  sched/headers: Remove comment referring to rq::cpu_load, since this has been removed
  sched/numa: Complete scanning of inactive VMAs when there is no alternative
  sched/numa: Complete scanning of partial VMAs regardless of PID activity
  sched/numa: Move up the access pid reset logic
  sched/numa: Trace decisions related to skipping VMAs
  ...
2023-10-30  Merge tag 'locking-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull locking updates from Ingo Molnar:

"Futex improvements:

- Add the 'futex2' syscall ABI, which is an attempt to get away from the multiplex syscall and adds a little room for extensions, while lifting some limitations.
- Fix futex PI recursive rt_mutex waiter state bug
- Fix inter-process shared futexes on no-MMU systems
- Use folios instead of pages

Micro-optimizations of locking primitives:

- Improve arch_spin_value_unlocked() on asm-generic ticket spinlock architectures, to improve lockref code generation
- Improve the x86-32 lockref_get_not_zero() main loop by adding build-time CMPXCHG8B support detection for the relevant lockref code, and by better interfacing the CMPXCHG8B assembly code with the compiler
- Introduce arch_sync_try_cmpxchg() on x86 to improve sync_try_cmpxchg() code generation. Convert some sync_cmpxchg() users to sync_try_cmpxchg().
- Micro-optimize rcuref_put_slowpath()

Locking debuggability improvements:

- Improve CONFIG_DEBUG_RT_MUTEXES=y to have a fast-path as well
- Enforce atomicity of sched_submit_work(), which is de-facto atomic but was un-enforced previously.
- Extend <linux/cleanup.h>'s no_free_ptr() with __must_check semantics
- Fix ww_mutex self-tests
- Clean up const-propagation in <linux/seqlock.h> and simplify the API-instantiation macros a bit

RT locking improvements:

- Provide the rt_mutex_*_schedule() primitives/helpers and use them in the rtmutex code to avoid recursion vs. rtlock on the PI state.
- Add nested blocking lockdep asserts to rt_mutex_lock(), rtlock_lock() and rwbase_read_lock()

.. plus misc fixes & cleanups"

* tag 'locking-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  futex: Don't include process MM in futex key on no-MMU
  locking/seqlock: Fix grammar in comment
  alpha: Fix up new futex syscall numbers
  locking/seqlock: Propagate 'const' pointers within read-only methods, remove forced type casts
  locking/lockdep: Fix string sizing bug that triggers a format-truncation compiler-warning
  locking/seqlock: Change __seqprop() to return the function pointer
  locking/seqlock: Simplify SEQCOUNT_LOCKNAME()
  locking/atomics: Use atomic_try_cmpxchg_release() to micro-optimize rcuref_put_slowpath()
  locking/atomic, xen: Use sync_try_cmpxchg() instead of sync_cmpxchg()
  locking/atomic/x86: Introduce arch_sync_try_cmpxchg()
  locking/atomic: Add generic support for sync_try_cmpxchg() and its fallback
  locking/seqlock: Fix typo in comment
  futex/requeue: Remove unnecessary 'NULL' initialization from futex_proxy_trylock_atomic()
  locking/local, arch: Rewrite local_add_unless() as a static inline function
  locking/debug: Fix debugfs API return value checks to use IS_ERR()
  locking/ww_mutex/test: Make sure we bail out instead of livelock
  locking/ww_mutex/test: Fix potential workqueue corruption
  locking/ww_mutex/test: Use prng instead of rng to avoid hangs at bootup
  futex: Add sys_futex_requeue()
  futex: Add flags2 argument to futex_requeue()
  ...
2023-10-26  Merge branch 'pm-cpufreq'  (Rafael J. Wysocki)

Merge cpufreq updates for 6.7-rc1:

- Add support for several Qualcomm SoC versions and other similar changes (Christian Marangi, Dmitry Baryshkov, Luca Weiss, Neil Armstrong, Richard Acayan, Robert Marko, Rohit Agarwal, Stephan Gerhold and Varadarajan Narayanan).

- Clean up the tegra cpufreq driver (Sumit Gupta).

- Use of_property_read_reg() to parse "reg" in pmac32 driver (Rob Herring).

- Add support for TI's am62p5 SoC (Bryan Brattlof).

- Make ARM_BRCMSTB_AVS_CPUFREQ depend on !ARM_SCMI_CPUFREQ (Florian Fainelli).

- Update Kconfig to mention i.MX7 as well (Alexander Stein).

- Revise global turbo disable check in intel_pstate (Srinivas Pandruvada).

- Carry out initialization of sg_cpu in the schedutil cpufreq governor in one loop (Liao Chang).

- Simplify the condition for storing 'down_threshold' in the conservative cpufreq governor (Liao Chang).

- Use fine-grained mutex in the userspace cpufreq governor (Liao Chang).

- Move is_managed indicator in the userspace cpufreq governor into a per-policy structure (Liao Chang).

- Rebuild sched-domains when removing cpufreq driver (Pierre Gondois).

- Fix buffer overflow detection in trans_stats() (Christian Marangi).

* pm-cpufreq: (32 commits)
  dt-bindings: cpufreq: qcom-hw: document SM8650 CPUFREQ Hardware
  cpufreq: arm: Kconfig: Add i.MX7 to supported SoC for ARM_IMX_CPUFREQ_DT
  cpufreq: qcom-nvmem: add support for IPQ8064
  cpufreq: qcom-nvmem: also accept operating-points-v2-krait-cpu
  cpufreq: qcom-nvmem: drop pvs_ver for format a fuses
  dt-bindings: cpufreq: qcom-cpufreq-nvmem: Document krait-cpu
  cpufreq: qcom-nvmem: add support for IPQ6018
  dt-bindings: cpufreq: qcom-cpufreq-nvmem: document IPQ6018
  cpufreq: qcom-nvmem: Add MSM8909
  cpufreq: qcom-nvmem: Simplify driver data allocation
  cpufreq: stats: Fix buffer overflow detection in trans_stats()
  dt-bindings: cpufreq: cpufreq-qcom-hw: Add SDX75 compatible
  cpufreq: ARM_BRCMSTB_AVS_CPUFREQ cannot be used with ARM_SCMI_CPUFREQ
  cpufreq: ti-cpufreq: Add opp support for am62p5 SoCs
  cpufreq: dt-platdev: add am62p5 to blocklist
  cpufreq: tegra194: remove redundant AND with cpu_online_mask
  cpufreq: tegra194: use refclk delta based loop instead of udelay
  cpufreq: tegra194: save CPU data to avoid repeated SMP calls
  cpufreq: Rebuild sched-domains when removing cpufreq driver
  cpufreq: userspace: Move is_managed indicator into per-policy structure
  ...
2023-10-25  sched/fair: use folio_xchg_last_cpupid() in should_numa_migrate_memory()  (Kefeng Wang)

Convert to use folio_xchg_last_cpupid() in should_numa_migrate_memory().

Link: https://lkml.kernel.org/r/20231018140806.2783514-14-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25  sched/fair: use folio_xchg_access_time() in numa_hint_fault_latency()  (Kefeng Wang)

Convert to use folio_xchg_access_time() in numa_hint_fault_latency().

Link: https://lkml.kernel.org/r/20231018140806.2783514-9-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-24  sched/fair: Remove SIS_PROP  (Peter Zijlstra)

SIS_UTIL seems to work well, let's remove the old thing.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231020134337.GD33965@noisy.programming.kicks-ass.net
2023-10-24  sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup  (Yicong Yang)

Chen Yu reports a hackbench regression of cluster wakeup when hackbench threads equal the CPU number [1]. Analysis shows it's because we wake up more on the target CPU even if the prev_cpu is a good wakeup candidate, which leads to a decrease of the CPU utilization.

Generally if the task's prev_cpu is idle we'll wake up the task on it without scanning. On cluster machines we'll try to wake up the task in the same cluster as the target for better cache affinity, so if the prev_cpu is idle but not sharing the same cluster with the target we'll still try to find an idle CPU within the cluster. This will improve the performance at low loads on cluster machines.

But in the issue above, if the prev_cpu is idle but not in the cluster with the target CPU, we'll try to scan for an idle one in the cluster. Since the system is busy, we're likely to fail the scanning and use the target instead, even if the prev_cpu is idle. This leads to the regression.

This patch solves this in 2 steps:

  o record the prev_cpu/recent_used_cpu if they're good wakeup candidates but not sharing the cluster with the target
  o on scanning failure use the prev_cpu/recent_used_cpu if they're recorded as idle

[1] https://lore.kernel.org/all/ZGzDLuVaHR1PAYDt@chenyu5-mobl1/

Closes: https://lore.kernel.org/all/ZGsLy83wPIpamy6x@chenyu5-mobl1/
Reported-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231019033323.54147-4-yangyicong@huawei.com
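[ Editor's note: a hedged, self-contained C sketch of the two steps above; the data model and names are illustrative assumptions, not the kernel's select_idle_sibling(). ]

    #include <stdbool.h>

    struct cpu { bool idle; int cluster; };

    static int scan_cluster_for_idle(const struct cpu *cpus, int nr, int target)
    {
            for (int i = 0; i < nr; i++)
                    if (cpus[i].idle && cpus[i].cluster == cpus[target].cluster)
                            return i;
            return -1;      /* scanning failed */
    }

    static int cluster_wakeup_cpu(const struct cpu *cpus, int nr, int target,
                                  int prev, int recent_used)
    {
            int candidate = -1;

            /* Step 1: record prev/recent_used if they're idle, good
             * candidates, but outside the target's cluster. */
            if (cpus[prev].idle && cpus[prev].cluster != cpus[target].cluster)
                    candidate = prev;
            else if (cpus[recent_used].idle &&
                     cpus[recent_used].cluster != cpus[target].cluster)
                    candidate = recent_used;

            int cpu = scan_cluster_for_idle(cpus, nr, target);
            if (cpu >= 0)
                    return cpu;

            /* Step 2: on scanning failure, prefer the recorded idle
             * candidate over piling more work onto the target. */
            return candidate >= 0 ? candidate : target;
    }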
2023-10-24  sched/fair: Scan cluster before scanning LLC in wake-up path  (Barry Song)

For platforms having clusters like Kunpeng920, CPUs within the same cluster have lower latency when synchronizing and accessing shared resources like cache. Thus, this patch tries to find an idle cpu within the cluster of the target CPU before scanning the whole LLC, to gain lower latency. This is implemented in 2 steps in select_idle_sibling():

  1. When the prev_cpu/recent_used_cpu are good wakeup candidates, use them if they're sharing a cluster with the target CPU. Otherwise try to scan for an idle CPU in the target's cluster.
  2. Scan the cluster prior to the LLC of the target CPU for an idle CPU to wake up on.

Testing has been done on Kunpeng920 by pinning tasks to one numa and two numa. On Kunpeng920, each numa has 8 clusters and each cluster has 4 CPUs.

With this patch, we noticed enhancement on tbench and netperf within one numa or across two numa, on top of tip-sched-core commit 9b46f1abc6d4 ("sched/debug: Print 'tgid' in sched_show_task()").

tbench results (node 0):
            baseline        patched
  1:        327.2833        372.4623 (  13.80%)
  4:       1320.5933       1479.8833 (  12.06%)
  8:       2638.4867       2921.5267 (  10.73%)
 16:       5282.7133       5891.5633 (  11.53%)
 32:       9810.6733       9877.3400 (   0.68%)
 64:       7408.9367       7447.9900 (   0.53%)
128:       6203.2600       6191.6500 (  -0.19%)

tbench results (node 0-1):
            baseline        patched
  1:        332.0433        372.7223 (  12.25%)
  4:       1325.4667       1477.6733 (  11.48%)
  8:       2622.9433       2897.9967 (  10.49%)
 16:       5218.6100       5878.2967 (  12.64%)
 32:      10211.7000      11494.4000 (  12.56%)
 64:      13313.7333      16740.0333 (  25.74%)
128:      13959.1000      14533.9000 (   4.12%)

netperf results TCP_RR (node 0):
            baseline        patched
  1:      76546.5033      90649.9867 (  18.42%)
  4:      77292.4450      90932.7175 (  17.65%)
  8:      77367.7254      90882.3467 (  17.47%)
 16:      78519.9048      90938.8344 (  15.82%)
 32:      72169.5035      72851.6730 (   0.95%)
 64:      25911.2457      25882.2315 (  -0.11%)
128:      10752.6572      10768.6038 (   0.15%)

netperf results TCP_RR (node 0-1):
            baseline        patched
  1:      76857.6667      90892.2767 (  18.26%)
  4:      78236.6475      90767.3017 (  16.02%)
  8:      77929.6096      90684.1633 (  16.37%)
 16:      77438.5873      90502.5787 (  16.87%)
 32:      74205.6635      88301.5612 (  19.00%)
 64:      69827.8535      71787.6706 (   2.81%)
128:      25281.4366      25771.3023 (   1.94%)

netperf results UDP_RR (node 0):
            baseline        patched
  1:      96869.8400     110800.8467 (  14.38%)
  4:      97744.9750     109680.5425 (  12.21%)
  8:      98783.9863     110409.9637 (  11.77%)
 16:      99575.0235     110636.2435 (  11.11%)
 32:      95044.7250      97622.8887 (   2.71%)
 64:      32925.2146      32644.4991 (  -0.85%)
128:      12859.2343      12824.0051 (  -0.27%)

netperf results UDP_RR (node 0-1):
            baseline        patched
  1:      97202.4733     110190.1200 (  13.36%)
  4:      95954.0558     106245.7258 (  10.73%)
  8:      96277.1958     105206.5304 (   9.27%)
 16:      97692.7810     107927.2125 (  10.48%)
 32:      79999.6702     103550.2999 (  29.44%)
 64:      80592.7413      87284.0856 (   8.30%)
128:      27701.5770      29914.5820 (   7.99%)

Note that neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT branch in the code has not been tested, but it is supposed to work.

Chen Yu also noticed this will improve the performance of tbench and netperf on a 24-CPU Jacobsville machine, where there are 4 CPUs in one cluster sharing L2 cache.

[https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net]

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lkml.kernel.org/r/20231019033323.54147-3-yangyicong@huawei.com
2023-10-24  sched: Add cpus_share_resources API  (Barry Song)

Add the cpus_share_resources() API. This is the preparation for the optimization of select_idle_cpu() on platforms with a cluster scheduler level.

On a machine with clusters, cpus_share_resources() will test whether two CPUs are within the same cluster. On a non-cluster machine it will behave the same as cpus_share_cache(). So we use "resources" here for cache resources.

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20231019033323.54147-2-yangyicong@huawei.com
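[ Editor's note: a hedged sketch of the intended semantics; the real implementation compares per-CPU topology IDs, and these helpers are simplified stand-ins, not the kernel functions. ]

    #include <stdbool.h>

    struct cpu_topo { int llc_id; int cluster_id; }; /* cluster_id < 0: no cluster level */

    static bool share_cache(const struct cpu_topo *t, int a, int b)
    {
            return t[a].llc_id == t[b].llc_id;      /* cpus_share_cache() analogue */
    }

    static bool share_resources(const struct cpu_topo *t, int a, int b)
    {
            /* With a cluster level, test cluster membership; without
             * one, behave exactly like cpus_share_cache(). */
            if (t[a].cluster_id >= 0 && t[b].cluster_id >= 0)
                    return t[a].cluster_id == t[b].cluster_id;
            return share_cache(t, a, b);
    }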
2023-10-24  sched/core: Fix RQCF_ACT_SKIP leak  (Hao Jia)

Igor Raits and Bagas Sanjaya report a RQCF_ACT_SKIP leak warning. This warning may be triggered in the following situation:

    CPU0                                      CPU1

    __schedule()
      *rq->clock_update_flags <<= 1;*         unregister_fair_sched_group()
      pick_next_task_fair+0x4a/0x410            destroy_cfs_bandwidth()
        newidle_balance+0x115/0x3e0               for_each_possible_cpu(i) *i=0*
          rq_unpin_lock(this_rq, rf)              __cfsb_csd_unthrottle()
          raw_spin_rq_unlock(this_rq)
                                                  rq_lock(*CPU0_rq*, &rf)
                                                  rq_clock_start_loop_update()
                                                  rq->clock_update_flags & RQCF_ACT_SKIP <--
          raw_spin_rq_lock(this_rq)

The purpose of RQCF_ACT_SKIP is to skip the rq clock update. That update happens very early in __schedule(), but we clear RQCF_*_SKIP very late, causing the flag to span the gap above and trigger this warning.

In __schedule() we can clear the RQCF_*_SKIP flags immediately after update_rq_clock() to avoid this RQCF_ACT_SKIP leak warning, and set rq->clock_update_flags to RQCF_UPDATED to avoid the rq->clock_update_flags < RQCF_ACT_SKIP warning that may be triggered later.

Fixes: ebb83d84e49b ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")
Closes: https://lore.kernel.org/all/20230913082424.73252-1-jiahao.os@bytedance.com
Reported-by: Igor Raits <igor.raits@gmail.com>
Reported-by: Bagas Sanjaya <bagasdotme@gmail.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/a5dd536d-041a-2ce9-f4b7-64d8d85c86dc@gmail.com
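[ Editor's note: a hedged sketch of the shape of the fix in __schedule(), abridged from the description above; not the exact mainline diff. ]

    /* __schedule() sketch: mark the clock updated as soon as it is,
     * so a remote rq_clock_start_loop_update() taking this rq's lock
     * cannot observe a stale RQCF_ACT_SKIP across the gap. */
    rq_lock(rq, &rf);
    smp_mb__after_spinlock();

    update_rq_clock(rq);
    rq->clock_update_flags = RQCF_UPDATED;  /* clear RQCF_*_SKIP immediately */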
2023-10-23  Merge tag 'v6.6-rc7' into sched/core, to pick up fixes  (Ingo Molnar)

Pick up recent sched/urgent fixes merged upstream.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-20  sched/fair: Remove unused 'curr' argument from pick_next_entity()  (Yiwei Lin)

The 'curr' argument of pick_next_entity() has become unused after the EEVDF changes.

[ mingo: Updated the changelog. ]

Signed-off-by: Yiwei Lin <s921975628@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231020055617.42064-1-s921975628@gmail.com
2023-10-20  sched/nohz: Update comments about NEWILB_KICK  (Joel Fernandes (Google))

How ILB is triggered without IPIs is cryptic. Out of mercy for future code readers, document it in code comments. The comments are derived from a discussion with Vincent in a past review.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231020014031.919742-2-joel@joelfernandes.org
2023-10-18  treewide: mark stuff as __ro_after_init  (Alexey Dobriyan)

__read_mostly predates __ro_after_init. Many variables which are marked __read_mostly should have been __ro_after_init from day 1. Also, mark some stuff as "const" and "__init" while I'm at it.

[akpm@linux-foundation.org: revert sysctl_nr_open_min, sysctl_nr_open_max changes due to arm warning]
[akpm@linux-foundation.org: coding-style cleanups]

Link: https://lkml.kernel.org/r/4f6bb9c0-abba-4ee4-a7aa-89265e886817@p183
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
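[ Editor's note: a hedged illustration with a hypothetical variable; __ro_after_init, __read_mostly, __init and __setup are the real kernel annotations, the rest is made up. ]

    /* Written once during boot and never again: __ro_after_init is the
     * right annotation; __read_mostly would leave it writable forever. */
    static unsigned long my_boot_param __ro_after_init;

    static int __init my_boot_param_setup(char *s)
    {
            my_boot_param = simple_strtoul(s, NULL, 0);
            return 1;
    }
    __setup("my_param=", my_boot_param_setup);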
2023-10-18  sched: remove wait bookmarks  (Matthew Wilcox (Oracle))

There are no users of wait bookmarks left, so simplify the wait code by removing them.

Link: https://lkml.kernel.org/r/20231010035829.544242-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Benjamin Segall <bsegall@google.com>
Cc: Bin Lai <sclaibin@gmail.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18  sched/fair: Remove duplicate #include  (Jiapeng Chong)

./kernel/sched/fair.c: linux/sched/cond_resched.h is included more than once.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231018062759.44375-1-jiapeng.chong@linux.alibaba.com
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6907
2023-10-18  sched/eevdf: Fix heap corruption more  (Peter Zijlstra)

Because someone is a flaming idiot... and forgot we have current as se->on_rq but not actually in the tree itself, and walking rb_parent() on an entry not in the tree is 'funky' and KASAN complains.

Fixes: 8dafa9d0eb1a ("sched/eevdf: Fix min_deadline heap integrity")
Reported-by: 0599jiangyc@gmail.com
Reported-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218020
Link: https://lkml.kernel.org/r/CAJwJo6ZGXO07%3DQvW4fgQfbsDzQPs9xj5sAQ1zp%3DmAyPMNbHYww%40mail.gmail.com
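[ Editor's note: a hedged sketch of the hazard described above, not the mainline diff. The key fact is that cfs_rq->curr is on_rq without being linked into the rbtree, so rb_parent() walks must special-case it. ]

    /* heap fixup sketch: never walk rb_parent() for the current entity */
    static void min_deadline_fixup_sketch(struct cfs_rq *cfs_rq,
                                          struct sched_entity *se)
    {
            if (se == cfs_rq->curr)
                    return;         /* on_rq, but not in the tree */
            /* ... only now is rb_parent(&se->run_node) meaningful ... */
    }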
2023-10-16  sched/numa, mm: make numa migrate functions take a folio  (Kefeng Wang)

The cpupid (or access time) is stored in the head page for THP, so it is safe to make should_numa_migrate_memory() and numa_hint_fault_latency() take a folio. This is in preparation for large folio numa balancing.

Link: https://lkml.kernel.org/r/20230921074417.24004-7-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-16  sched/psi: Update poll => rtpoll in relevant comments  (Fan Yu)

The PSI trigger code is now making a distinction between privileged and unprivileged triggers, after the following commit:

  65457b74aa94 ("sched/psi: Rename existing poll members in preparation")

But some comments have not been modified along with the code, so they need to be updated. This will help readers better understand the code.

Signed-off-by: Fan Yu <fan.yu9@zte.com.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/202310161920399921184@zte.com.cn
2023-10-13  sched: Make PELT acronym definition searchable  (Mathieu Desnoyers)

The PELT acronym definition can be found right at the top of kernel/sched/pelt.c (of course), but it cannot be found through use of:

	grep -r PELT kernel/sched/

Add the acronym "(PELT)" after "Per Entity Load Tracking" at the top of the source file.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231012125824.1260774-1-mathieu.desnoyers@efficios.com
2023-10-13  sched: Fix stop_one_cpu_nowait() vs hotplug  (Peter Zijlstra)

Kuyo reported sporadic failures on a sched_setaffinity() vs CPU hotplug stress-test -- notably affine_move_task() remains stuck in wait_for_completion(), leading to a hung-task detector warning.

Specifically, it was reported that stop_one_cpu_nowait(.fn = migration_cpu_stop) returns false -- this stopper is responsible for the matching complete().

The race scenario is:

    CPU0                                      CPU1

                                              // doing _cpu_down()
    __set_cpus_allowed_ptr()
      task_rq_lock();
                                              takedown_cpu()
                                                stop_machine_cpuslocked(take_cpu_down..)

                                              <PREEMPT: cpu_stopper_thread()
                                                MULTI_STOP_PREPARE
                                                ...
      __set_cpus_allowed_ptr_locked()
        affine_move_task()
          task_rq_unlock();
                                              <PREEMPT: cpu_stopper_thread()
                                                ack_state()
                                                MULTI_STOP_RUN
                                                  take_cpu_down()
                                                    __cpu_disable();
                                                stop_machine_park();
                                                  stopper->enabled = false;
                                              />
                                              />
          stop_one_cpu_nowait(.fn = migration_cpu_stop);
            if (stopper->enabled)             // false!!!

That is, by doing stop_one_cpu_nowait() after dropping the rq-lock, the stopper thread gets a chance to preempt and allows the cpu-down for the target CPU to complete.

OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to issue a wakeup, it must not be run under the scheduler locks.

Solve this apparent contradiction by keeping preemption disabled over the unlock + queue_stopper combination:

	preempt_disable();
	task_rq_unlock(...);
	if (!stop_pending)
		stop_one_cpu_nowait(...)
	preempt_enable();

This respects the lock ordering constraints while still avoiding the above race. That is, if we find the CPU is online under the rq-lock, the targeted stop_one_cpu_nowait() must succeed.

Apply this pattern to all similar stop_one_cpu_nowait() invocations.

Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Reported-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net
2023-10-13  sched/psi: Bail out early from irq time accounting  (Haifeng Xu)

We can bail out early from irq time accounting when PSI is disabled.

Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20230926115722.467833-1-haifeng.xu@shopee.com
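[ Editor's note: a hedged sketch of the early return; psi_disabled is the real static key in kernel/sched/psi.c, but the function shape here is approximate. ]

    /* irq time accounting sketch: do no work at all when PSI is off */
    void psi_account_irqtime_sketch(struct task_struct *task, u32 delta)
    {
            if (static_branch_likely(&psi_disabled))
                    return;         /* bail out before any accounting */

            /* ... per-group IRQ pressure accounting ... */
    }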
2023-10-12  sched/topology: Rename 'DIE' domain to 'PKG'  (Peter Zijlstra)

While reworking the x86 topology code Thomas tripped over creating a 'DIE' domain for the package mask. :-)

Since these names are CONFIG_SCHED_DEBUG=y only, rename them to make the name less ambiguous.

[ Shrikanth Hegde: rename on s390 as well. ]
[ Valentin Schneider: also rename it in the comments. ]
[ mingo: port to recent kernels & find all remaining occurrences. ]

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230712141056.GI3100107@hirez.programming.kicks-ass.net
2023-10-11  sched/psi: Delete the 'update_total' function parameter from update_triggers()  (Yang Yang)

The 'update_total' parameter of update_triggers() is always true after the previous commit:

  80cc1d1d5ee3 ("sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes")

If the 'changed_states & group->rtpoll_states' condition is true, 'new_stall' in update_triggers() will be true, and then 'update_total' should also be true. So update_total is redundant - remove it.

[ mingo: Changelog updates ]

Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/202310101645437859599@zte.com.cn
2023-10-11  sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes  (Yang Yang)

When psimon wakes up and there are no state changes for ->rtpoll_states, it's unnecessary to update triggers and ->rtpoll_total because the pressures being monitored by the user have not changed. This will help to slightly reduce unnecessary computations of PSI.

[ mingo: Changelog updates ]

Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/202310101641075436843@zte.com.cn
2023-10-10  sched/headers: Remove comment referring to rq::cpu_load, since this has been removed  (Colin Ian King)

There is a comment that refers to cpu_load; however, this cpu_load was removed with:

  55627e3cd22c ("sched/core: Remove rq->cpu_load[]")

... back in 2019. The comment does not make sense with respect to this removed array, so remove the comment.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231010155744.1381065-1-colin.i.king@gmail.com
2023-10-10  sched/numa: Complete scanning of inactive VMAs when there is no alternative  (Mel Gorman)

VMAs are skipped if there is no recent fault activity, but this represents a chicken-and-egg problem as there may be no fault activity if the PTEs are never updated to trap NUMA hints. There is an indirect reliance on scanning to be forced early in the lifetime of a task, but this may fail to detect changes in phase behaviour. Force inactive VMAs to be scanned when all other eligible VMAs have been updated within the same scan sequence.

Test results in general look good, with some changes in performance, both negative and positive, depending on whether the additional scanning and faulting was beneficial or not to the workload. The autonuma benchmark workload NUMA01_THREADLOCAL was picked for closer examination. The workload creates two processes with numerous threads and thread-local storage that is zero-filled in a loop. It exercises the corner case where unrelated threads may skip VMAs that are thread-local to another thread, and still has some VMAs that are inactive while the workload executes.

The VMA skipping activity frequency with and without the patch:

	6.6.0-rc2-sched-numabtrace-v1
	=============================
	    649 reason=scan_delay
	  9,094 reason=unsuitable
	 48,915 reason=shared_ro
	143,919 reason=inaccessible
	193,050 reason=pid_inactive

	6.6.0-rc2-sched-numabselective-v1
	=================================
	    146 reason=seq_completed
	    622 reason=ignore_pid_inactive
	    624 reason=scan_delay
	  6,570 reason=unsuitable
	 16,101 reason=shared_ro
	 27,608 reason=inaccessible
	 41,939 reason=pid_inactive

Note that with the patch applied, the PID activity is ignored (ignore_pid_inactive) to ensure a VMA with some activity is completely scanned. In addition, a small number of VMAs are scanned when no other eligible VMA is available during a single scan window (seq_completed). The number of times a VMA is skipped due to no PID activity from the scanning task (pid_inactive) drops dramatically. It is expected that this will increase the number of PTEs updated for NUMA hinting faults as well as hinting faults, but these represent PTEs that would otherwise have been missed. The tradeoff is scan+fault overhead versus improving locality due to migration.

On a 2-socket Cascade Lake test machine, the time to complete the workload is as follows:

	                                  6.6.0-rc2             6.6.0-rc2
	                            sched-numabtrace-v1   sched-numabselective-v1
	Min      elsp-NUMA01_THREADLOCAL  174.22 (  0.00%)     117.64 ( 32.48%)
	Amean    elsp-NUMA01_THREADLOCAL  175.68 (  0.00%)     123.34 * 29.79%*
	Stddev   elsp-NUMA01_THREADLOCAL    1.20 (  0.00%)       4.06 (-238.20%)
	CoeffVar elsp-NUMA01_THREADLOCAL    0.68 (  0.00%)       3.29 (-381.70%)
	Max      elsp-NUMA01_THREADLOCAL  177.18 (  0.00%)     128.03 ( 27.74%)

The time to complete the workload is reduced by almost 30%:

	                         6.6.0-rc2             6.6.0-rc2
	                   sched-numabtrace-v1   sched-numabselective-v1
	Duration User         91201.80                63506.64
	Duration System        2015.53                 1819.78
	Duration Elapsed       1234.77                  868.37

In this specific case, system CPU time was not increased, but it's not universally true. From vmstat, the NUMA scanning and fault activity is as follows:

	                                     6.6.0-rc2             6.6.0-rc2
	                               sched-numabtrace-v1   sched-numabselective-v1
	Ops NUMA base-page range updates     64272.00            26374386.00
	Ops NUMA PTE updates                 36624.00               55538.00
	Ops NUMA PMD updates                    54.00               51404.00
	Ops NUMA hint faults                 15504.00               75786.00
	Ops NUMA hint local faults %         14860.00               56763.00
	Ops NUMA hint local percent             95.85                  74.90
	Ops NUMA pages migrated               1629.00             6469222.00

Both the number of PTE updates and hint faults is dramatically increased. While this is superficially unfortunate, it represents ranges that were simply skipped without the patch. As a result of the scanning and hinting faults, many more pages were also migrated, but as the time to completion is reduced, the overhead is offset by the gain.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Link: https://lore.kernel.org/r/20231010083143.19593-7-mgorman@techsingularity.net
2023-10-10  sched/numa: Complete scanning of partial VMAs regardless of PID activity  (Mel Gorman)

NUMA balancing skips VMAs when the current task has not trapped a NUMA fault within the VMA. If the VMA is skipped then mm->numa_scan_offset advances, and a task that is trapping faults within the VMA may never fully update PTEs within the VMA. Force tasks to update PTEs for partially scanned PTEs. The VMA will be tagged for NUMA hints by some task, but this removes some of the benefit of tracking PID activity within a VMA. A follow-on patch will mitigate this problem.

The test cases and machines evaluated did not trigger the corner case, so the performance results are neutral, with only small changes within the noise from normal test-to-test variance. However, the next patch makes the corner case easier to trigger.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Link: https://lore.kernel.org/r/20231010083143.19593-6-mgorman@techsingularity.net
2023-10-10  sched/numa: Move up the access pid reset logic  (Raghavendra K T)

Recent NUMA hinting faulting activity is reset approximately every VMA_PID_RESET_PERIOD milliseconds. However, if the current task has not accessed a VMA then the reset check is missed and the reset is potentially deferred forever. Check if the PID activity information should be reset before checking if the current task recently trapped a NUMA hinting fault.

[ mgorman@techsingularity.net: Rewrite changelog ]

Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231010083143.19593-5-mgorman@techsingularity.net
2023-10-10  sched/numa: Trace decisions related to skipping VMAs  (Mel Gorman)

NUMA balancing skips or scans VMAs for a variety of reasons. In preparation for completing scans of VMAs regardless of PID access, trace the reasons why a VMA was skipped. In a later patch, the tracing will be used to track if a VMA was forcibly scanned.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231010083143.19593-4-mgorman@techsingularity.net
2023-10-10  sched/numa: Rename vma_numab_state::access_pids[] => ::pids_active[], ::next_pid_reset => ::pids_active_reset  (Mel Gorman)

The access_pids[] field name is somewhat ambiguous as no PIDs are accessed. Similarly, it's not clear that next_pid_reset is related to access_pids[]. Rename the fields to more accurately reflect their purpose.

[ mingo: Rename in the comments too. ]

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231010083143.19593-3-mgorman@techsingularity.net
2023-10-09  Merge tag 'v6.6-rc5' into locking/core, to pick up fixes  (Ingo Molnar)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-09  sched/topology: Move the declaration of 'schedutil_gov' to kernel/sched/sched.h  (Ingo Molnar)

Move it out of the .c file into the shared scheduler-internal header file, to gain type-checking.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20231009060037.170765-3-sshegde@linux.vnet.ibm.com
2023-10-09  sched/topology: Change behaviour of the 'sched_energy_aware' sysctl, based on the platform  (Shrikanth Hegde)

The 'sched_energy_aware' sysctl is available for the admin to disable/enable energy aware scheduling (EAS). EAS is enabled only if a few conditions are met by the platform. They are: asymmetric CPU capacity, no SMT, the schedutil cpufreq governor, frequency-invariant load tracking, etc. A platform may boot without EAS capability, but could gain such capability at runtime, for example by changing/registering the cpufreq governor to schedutil.

At present, though the platform doesn't support EAS, this sysctl returns 1, and it ends up calling build_perf_domains() on a write of 1 and is a NOP when writing 0. That is confusing and unnecessary. The desired behaviour is for this sysctl to enable/disable EAS on supported platforms only. On non-supported platforms, a write to the sysctl returns a not-supported error and a read of the sysctl returns empty. So:

  sched_energy_aware returns empty - EAS is not possible at this moment.
	This includes EAS-capable platforms which have at least one EAS
	condition false during startup, e.g. not using the schedutil
	cpufreq governor.

  sched_energy_aware returns 0 - EAS is supported but disabled by the admin.

  sched_energy_aware returns 1 - EAS is supported and enabled.

The user can find out the reason why EAS is not possible by checking the info messages. sched_is_eas_possible() returns true if the platform can do EAS at this moment.

Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20231009060037.170765-3-sshegde@linux.vnet.ibm.com
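[ Editor's note: a hedged, self-contained sketch of the resulting sysctl semantics; sched_is_eas_possible() is named in the changelog, everything else here (the stub and the handler shape) is an assumption. ]

    #include <errno.h>
    #include <stdbool.h>

    static bool sched_is_eas_possible_stub(void) { return false; }  /* stub */
    static int sysctl_sched_energy_aware = 1;

    static int sched_energy_aware_sketch(bool write, int new_val)
    {
            if (!sched_is_eas_possible_stub())
                    return write ? -EOPNOTSUPP : 0; /* writes rejected, reads empty */

            if (write)
                    sysctl_sched_energy_aware = !!new_val;  /* 0/1 toggles EAS */
            return 0;
    }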
2023-10-09sched/psi: Change update_triggers() to a 'void' functionYang Yang
update_triggers() always returns now + group->rtpoll_min_period, and the return value is only used by psi_rtpoll_work(), so change update_triggers() to a void function and set group->rtpoll_next_update = now + group->rtpoll_min_period directly in the caller. This avoids unnecessary return value passing and simplifies the function. [ mingo: Updated changelog ] Suggested-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/202310092024289721617@zte.com.cn
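A standalone model of the simplification (types and names illustrative, not the kernel's): since the callee's return value was always now + rtpoll_min_period, the caller can compute it directly and the callee can return void.

  #include <stdint.h>
  #include <stdio.h>

  struct group {
      uint64_t rtpoll_min_period;
      uint64_t rtpoll_next_update;
  };

  static void update_triggers(struct group *g, uint64_t now)
  {
      /* ... evaluate triggers against the window ending at 'now' ... */
      (void)g; (void)now;
  }

  static void rtpoll_work(struct group *g, uint64_t now)
  {
      update_triggers(g, now);
      /* formerly: g->rtpoll_next_update = update_triggers(g, now); */
      g->rtpoll_next_update = now + g->rtpoll_min_period;
  }

  int main(void)
  {
      struct group g = { .rtpoll_min_period = 50 };
      rtpoll_work(&g, 1000);
      printf("next update at %llu\n", (unsigned long long)g.rtpoll_next_update);
      return 0;
  }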
2023-10-09sched/topology: Remove the EM_MAX_COMPLEXITY limitPierre Gondois
The Energy Aware Scheduler (EAS) estimates the energy consumption of placing a task on different CPUs. The goal is to minimize this energy consumption. Estimating the energy of different task placements is increasingly complex as the platform grows. To avoid having a slow wake-up path, EAS is only enabled if this complexity is low enough. The current complexity limit was set in:

  b68a4c0dba3b1 ("sched/topology: Disable EAS on inappropriate platforms")

... based on the first implementation of EAS, which was re-computing the power of the whole platform for each task placement scenario, see:

  390031e4c309 ("sched/fair: Introduce an energy estimation helper function")

... but the complexity of EAS was reduced in:

  eb92692b2544d ("sched/fair: Speed-up energy-aware wake-ups")

... and the find_energy_efficient_cpu() (feec) algorithm was updated in:

  3e8c6c9aac42 ("sched/fair: Remove task_util from effective utilization in feec()")

find_energy_efficient_cpu() (feec) is now doing:

  feec()
  \_ for_each_pd(pd) [0]
    // get max_spare_cap_cpu and compute_prev_delta
    \_ for_each_cpu(pd) [1]

    \_ eenv_pd_busy_time(pd) [2]
      \_ for_each_cpu(pd)

    // compute_energy(pd) without the task
    \_ eenv_pd_max_util(pd, -1) [3.0]
      \_ for_each_cpu(pd)
    \_ em_cpu_energy(pd, -1)
      \_ for_each_ps(pd)

    // compute_energy(pd) with the task on prev_cpu
    \_ eenv_pd_max_util(pd, prev_cpu) [3.1]
      \_ for_each_cpu(pd)
    \_ em_cpu_energy(pd, prev_cpu)
      \_ for_each_ps(pd)

    // compute_energy(pd) with the task on max_spare_cap_cpu
    \_ eenv_pd_max_util(pd, max_spare_cap_cpu) [3.2]
      \_ for_each_cpu(pd)
    \_ em_cpu_energy(pd, max_spare_cap_cpu)
      \_ for_each_ps(pd)

[3.1] happens only once since prev_cpu is unique. With the same definitions for nr_pd, nr_cpus and nr_ps, the complexity is of:

  nr_pd * (2 * [nr_cpus in pd] + 2 * ([nr_cpus in pd] + [nr_ps in pd]))
       + ([nr_cpus in pd] + [nr_ps in pd])

   [0]  * (     [1] + [2]      +       [3.0] + [3.2]      ) + [3.1]

  = nr_pd * (4 * [nr_cpus in pd] + 2 * [nr_ps in pd])
       + [nr_cpus in prev pd] + nr_ps

The complexity limit was set to 2048 in:

  b68a4c0dba3b1 ("sched/topology: Disable EAS on inappropriate platforms")

... to make "EAS usable up to 16 CPUs with per-CPU DVFS and less than 8 performance states each". For the same platform, the complexity would actually be of:

  16 * (4 + 2 * 7) + 1 + 7 = 296

Since the limit was introduced, the complexity of EAS has been greatly reduced, so bigger platforms can handle EAS. For instance, a platform with 112 CPUs with 7 performance states each would not reach it:

  112 * (4 + 2 * 7) + 1 + 7 = 2024

To reflect this improvement in the underlying EAS code, remove the EAS complexity check. Note that a limit on the number of CPUs still holds against EM_MAX_NUM_CPUS to avoid overflows during the energy estimation.

[ mingo: Updates to the changelog. ]

Signed-off-by: Pierre Gondois <Pierre.Gondois@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20231009060037.170765-2-sshegde@linux.vnet.ibm.com
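A quick standalone check of the arithmetic above (plain C, not kernel code; like the changelog's examples, it assumes per-CPU DVFS, i.e. one CPU per perf domain):

  #include <stdio.h>

  static long feec_cost(long nr_pd, long nr_cpus_pd, long nr_ps_pd)
  {
      /* nr_pd * (4 * [nr_cpus in pd] + 2 * [nr_ps in pd])
       *       + [nr_cpus in prev pd] + nr_ps               */
      return nr_pd * (4 * nr_cpus_pd + 2 * nr_ps_pd)
             + nr_cpus_pd + nr_ps_pd;
  }

  int main(void)
  {
      printf("16 CPUs:  %ld\n", feec_cost(16, 1, 7));   /* 296  */
      printf("112 CPUs: %ld\n", feec_cost(112, 1, 7));  /* 2024 */
      return 0;
  }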
2023-10-09sched/topology: Consolidate and clean up access to a CPU's max compute capacityVincent Guittot
Remove the rq::cpu_capacity_orig field and use arch_scale_cpu_capacity() instead. The scheduler uses 3 methods to get access to a CPU's max compute capacity:

 - arch_scale_cpu_capacity(cpu), which is the default way to get a CPU's capacity.
 - the cpu_capacity_orig field, which is periodically updated with arch_scale_cpu_capacity().
 - capacity_orig_of(cpu), which encapsulates rq->cpu_capacity_orig.

There is no real need to save the value returned by arch_scale_cpu_capacity() in struct rq. arch_scale_cpu_capacity() returns:

 - either a per_cpu variable,
 - or a const value for systems which have only one capacity.

Remove rq::cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere. No functional changes.

Some performance tests on Arm64:

 - small SMP device (hikey): no noticeable changes
 - HMP device (RB5): hackbench shows minor improvement (1-2%)
 - large SMP (thx2): hackbench and tbench show minor improvement (1%)

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20231009103621.374412-2-vincent.guittot@linaro.org
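The conversion pattern, sketched (kernel-style fragment, not standalone code):

  /* before: read the cached copy */
  unsigned long cap = capacity_orig_of(cpu);        /* rq->cpu_capacity_orig */

  /* after: ask the architecture directly */
  unsigned long cap = arch_scale_cpu_capacity(cpu);

Since arch_scale_cpu_capacity() is either a per-CPU variable read or a constant, dropping the cached copy removes a redundant layer without adding meaningful cost to the readers.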
2023-10-09sched/rt: Change the type of 'sysctl_sched_rt_period' from 'unsigned int' to ↵Yajun Deng
'int' Doing this matches the natural 'int'-based calculus in sched_rt_handler(), and also enables adding a correct upper bounds check on the sysctl interface. [ mingo: Rewrote the changelog. ] Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231008021538.3063250-1-yajun.deng@linux.dev
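A sketch of the kind of sysctl table entry this enables (a hedged fragment, not the exact patch; the bounds are assumed to use the generic SYSCTL_* constants):

  static struct ctl_table sched_rt_sysctls[] = {
      {
          .procname     = "sched_rt_period_us",
          .data         = &sysctl_sched_rt_period,
          .maxlen       = sizeof(int),        /* now a plain 'int' */
          .mode         = 0644,
          .proc_handler = sched_rt_handler,
          .extra1       = SYSCTL_ONE,         /* lower bound */
          .extra2       = SYSCTL_INT_MAX,     /* upper bound */
      },
      {}
  };

With an unsigned type, an INT_MAX upper bound could not be expressed cleanly through the standard bounded integer handlers.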
2023-10-09sched/nohz: Remove unnecessarily complex error handling pattern from ↵Ingo Molnar
find_new_ilb() find_new_ilb() returns nr_cpu_ids on failure - which is the usual cpumask bitops return pattern, but is weird & unnecessary in this context: not only is it a global variable, it is also a +1 out-of-bounds CPU index and has different signedness ... Its only user, kick_ilb(), then checks the return value against nr_cpu_ids to decide whether to return early. There's no other use. So instead of this, use a standard -1 return on failure to find an idle CPU, as the value is handled as a signed integer already. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Link: https://lore.kernel.org/r/20231006102518.2452758-4-mingo@kernel.org
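A standalone model of the convention change (a plain array scan stands in for the kernel's cpumask search):

  #include <stdio.h>

  static int find_new_ilb(const int *idle, int nr_cpus)
  {
      for (int cpu = 0; cpu < nr_cpus; cpu++)
          if (idle[cpu])
              return cpu;
      return -1;                  /* was: return nr_cpu_ids */
  }

  static void kick_ilb(const int *idle, int nr_cpus)
  {
      int ilb_cpu = find_new_ilb(idle, nr_cpus);

      if (ilb_cpu < 0)            /* was: if (ilb_cpu >= nr_cpu_ids) */
          return;
      printf("kicking idle load balancer on CPU %d\n", ilb_cpu);
  }

  int main(void)
  {
      int idle[4] = { 0, 0, 1, 0 };
      kick_ilb(idle, 4);
      return 0;
  }

The failure check becomes a simple sign test instead of a comparison against a global of different signedness.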
2023-10-09sched/nohz: Use consistent variable names in find_new_ilb() and kick_ilb()Ingo Molnar
Use 'ilb_cpu' consistently in both functions. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Link: https://lore.kernel.org/r/20231006102518.2452758-3-mingo@kernel.org
2023-10-09sched/nohz: Update idle load-balancing (ILB) commentsIngo Molnar
 - Fix incorrect/misleading comments,
 - clarify some others,
 - fix typos & grammar,
 - and use more consistent style throughout.

Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Link: https://lore.kernel.org/r/20231006102518.2452758-2-mingo@kernel.org
2023-10-09sched/eevdf: Fix pick_eevdf()Benjamin Segall
The old pick_eevdf() could fail to find the actual earliest eligible deadline when it descended to the right looking for min_deadline, only for the entity carrying that min_deadline to turn out to be ineligible. In that case we need to go back and search through any left branches we skipped, looking for the actual best _eligible_ min_deadline. This is more expensive, but still O(log n), and at worst should only involve descending two branches of the rbtree. I've run this through a userspace stress test (thank you tools/lib/rbtree.c), so hopefully this implementation doesn't miss any corner cases. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/xm261qego72d.fsf_-_@google.com
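A toy standalone model of the two-phase search described above (a plain BST with a cached subtree-minimum deadline stands in for the kernel's augmented rbtree; eligibility is reduced to vruntime <= threshold):

  #include <stddef.h>
  #include <stdio.h>

  struct se {
      long vruntime, deadline, min_deadline;  /* min over this subtree */
      struct se *left, *right;                /* BST keyed on vruntime */
  };

  static struct se *pick_eevdf(struct se *node, long threshold)
  {
      struct se *best = NULL, *best_left = NULL;

      while (node) {
          if (node->vruntime > threshold) {
              node = node->left;      /* node and right side ineligible */
              continue;
          }
          if (!best || node->deadline < best->deadline)
              best = node;
          /* everything in the left branch is eligible too */
          if (node->left) {
              if (!best_left ||
                  node->left->min_deadline < best_left->min_deadline)
                  best_left = node->left;
              if (node->left->min_deadline == node->min_deadline)
                  break;              /* subtree best lives on the left */
          }
          if (node->deadline == node->min_deadline)
              break;                  /* this node is the subtree best */
          node = node->right;
      }
      if (!best_left || (best && best->deadline <= best_left->min_deadline))
          return best;
      /* second pass: best_left is fully eligible, chase min_deadline */
      node = best_left;
      while (node) {
          if (node->deadline == node->min_deadline)
              return node;
          if (node->left &&
              node->left->min_deadline == node->min_deadline)
              node = node->left;
          else
              node = node->right;
      }
      return best;
  }

  int main(void)
  {
      /* a(v=1,d=5) <- b(v=2,d=9) -> c(v=3,d=1); threshold 2 makes c
       * ineligible, so the right answer is a, not b or c. */
      struct se a = { 1, 5, 5, NULL, NULL };
      struct se c = { 3, 1, 1, NULL, NULL };
      struct se b = { 2, 9, 1, &a, &c };
      struct se *p = pick_eevdf(&b, 2);
      printf("picked d=%ld\n", p ? p->deadline : -1);  /* d=5 */
      return 0;
  }

In this example a search that only chased min_deadline would descend right into the ineligible node; the remembered left branch is what recovers the best eligible entity.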
2023-10-09sched/eevdf: Fix min_deadline heap integrityPeter Zijlstra
Marek and Biju reported instances of: "EEVDF scheduling fail, picking leftmost" which Mike correlated with cgroup scheduling and the min_deadline heap getting corrupted; some trace output confirms:

 > And yeah, min_deadline is hosed somehow:
 >
 > validate_cfs_rq: --- /
 > __print_se: ffff88845cf48080 w: 1024 ve: -58857638 lag: 870381 vd: -55861854 vmd: -66302085 E (11372/tr)
 > __print_se: ffff88810d165800 w: 25 ve: -80323686 lag: 22336429 vd: -41496434 vmd: -66302085 E (-1//autogroup-31)
 > __print_se: ffff888108379000 w: 25 ve: 0 lag: -57987257 vd: 114632828 vmd: 114632828 N (-1//autogroup-33)
 > validate_cfs_rq: min_deadline: -55861854 avg_vruntime: -62278313462 / 1074 = -57987256

Turns out that reweight_entity(), which tries really hard to be fast, does not do the normal dequeue+update+enqueue pattern but *does* scale the deadline. However, it then fails to propagate the updated deadline value up the heap.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Reported-by: Biju Das <biju.das.jz@bp.renesas.com> Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Biju Das <biju.das.jz@bp.renesas.com> Tested-by: Mike Galbraith <efault@gmx.de> Link: https://lkml.kernel.org/r/20231006192445.GE743@noisy.programming.kicks-ass.net
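A standalone toy of the invariant at stake (parent pointers stand in for the rbtree; the scaling arithmetic is simplified): after changing a node's deadline in place, the cached subtree minimum must be re-propagated toward the root, or later heap searches see stale values.

  #include <stdio.h>

  struct node {
      long deadline, min_deadline;
      struct node *left, *right, *parent;
  };

  static long min3(long a, long b, long c)
  {
      long m = a < b ? a : b;
      return m < c ? m : c;
  }

  static void propagate(struct node *n)
  {
      /* re-derive min_deadline up to the root after an in-place change */
      for (; n; n = n->parent) {
          long l = n->left  ? n->left->min_deadline  : n->deadline;
          long r = n->right ? n->right->min_deadline : n->deadline;
          n->min_deadline = min3(n->deadline, l, r);
      }
  }

  static void reweight(struct node *n, long old_w, long new_w)
  {
      n->deadline = n->deadline * old_w / new_w; /* scale in place ... */
      propagate(n);    /* ... and fix the heap: this step was missing */
  }

  int main(void)
  {
      struct node a = { 40, 40, NULL, NULL, NULL };
      struct node r = { 30, 30, &a, NULL, NULL };
      a.parent = &r;
      r.min_deadline = min3(r.deadline, a.min_deadline, r.deadline); /* 30 */

      reweight(&a, 1, 2);  /* a.deadline: 40 -> 20 */
      printf("root min_deadline = %ld\n", r.min_deadline); /* 20, not a stale 30 */
      return 0;
  }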
2023-10-07sched/debug: Print 'tgid' in sched_show_task()Yajun Deng
Multiple blocked tasks are printed when the system hangs. They may have the same parent pid, but belong to different task groups. Printing the tgid lets users see at a glance whether these tasks are from the same task group or not. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230720080516.1515297-1-yajun.deng@linux.dev
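The shape of the change, sketched (the format string is abridged and the exact field layout is assumed; task_pid_nr()/task_tgid_nr() are the standard accessors):

  pr_info("task:%-15.15s state:%c stack:%-5lu pid:%-5d tgid:%-5d ppid:%-6d\n",
          p->comm, task_state_to_char(p), free,
          task_pid_nr(p), task_tgid_nr(p), ppid);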