    sched/core: Rewrite and improve select_idle_siblings() (commit 10e2f1ac)
    
    
    select_idle_siblings() is a known pain point for a number of
    workloads; it either does too much or not enough, and sometimes it
    just gets things plain wrong.
    
    This rewrite attempts to address a number of issues (but sadly not
    all).
    
    The current code does an unconditional sched_domain iteration with
    the intent of finding an idle core (on SMT hardware). The problems
    this patch tries to address are:
    
     - it's pointless to look for idle cores if the machine is really
       busy; at that point you're just wasting cycles.
    
     - its behaviour is inconsistent between SMT and !SMT hardware, in
       that !SMT hardware ends up doing a scan for any idle CPU in the
       LLC domain, while SMT hardware does a scan for idle cores and,
       if that fails, falls back to a scan for idle threads on the
       'target' core.
    
    The new code replaces the sched_domain scan with 3 explicit scans:
    
     1) search for an idle core in the LLC
     2) search for an idle CPU in the LLC
     3) search for an idle thread in the 'target' core
    
    where 1 and 3 are conditional on SMT support, and 1 and 2 have
    runtime heuristics to skip the step; a rough sketch of this
    ordering follows below.
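
    As a rough illustration of that ordering -- not the kernel code;
    the scan helpers and the smt_enabled flag below are hypothetical
    stand-ins -- a minimal user-space C sketch:

      #include <stdbool.h>
      #include <stdio.h>

      /* Hypothetical stand-ins for the three scans described above; each
       * returns a CPU number or -1 when nothing idle was found. */
      static int scan_idle_core(int target) { (void)target; return -1; } /* step 1 */
      static int scan_idle_cpu(int target)  { (void)target; return -1; } /* step 2 */
      static int scan_idle_smt(int target)  { (void)target; return -1; } /* step 3 */

      static const bool smt_enabled = true; /* assume SMT hardware here */

      static int pick_idle_sibling(int target)
      {
          int cpu;

          if (smt_enabled) {
              cpu = scan_idle_core(target);  /* 1) idle core in the LLC */
              if (cpu >= 0)
                  return cpu;
          }

          cpu = scan_idle_cpu(target);       /* 2) idle CPU in the LLC */
          if (cpu >= 0)
              return cpu;

          if (smt_enabled) {
              cpu = scan_idle_smt(target);   /* 3) idle thread in 'target' core */
              if (cpu >= 0)
                  return cpu;
          }

          return target;                     /* nothing idle; stay on 'target' */
      }

      int main(void)
      {
          printf("chosen CPU: %d\n", pick_idle_sibling(0));
          return 0;
      }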
    
    Step 1) is conditional on sd_llc_shared->has_idle_cores; when a CPU
    goes idle and sd_llc_shared->has_idle_cores is false, we scan all
    SMT siblings of the CPU going idle and, if they are all idle too,
    set the flag. Similarly, we clear sd_llc_shared->has_idle_cores
    when we fail to find an idle core.
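
    A self-contained sketch of how such a flag could be maintained; the
    cpu_idle[] array, SMT_WIDTH and the function names here are made up
    for illustration, whereas the real state lives in sd_llc_shared and
    the runqueues:

      #include <stdbool.h>
      #include <stdio.h>

      #define NR_CPUS   8
      #define SMT_WIDTH 2   /* pretend: two hardware threads per core */

      /* Stand-ins for sd_llc_shared->has_idle_cores and the per-CPU idle
       * state. */
      static bool has_idle_cores;
      static bool cpu_idle[NR_CPUS];

      /* Are all SMT siblings of @cpu (including @cpu itself) idle? */
      static bool core_is_idle(int cpu)
      {
          int first = cpu - (cpu % SMT_WIDTH);

          for (int sib = first; sib < first + SMT_WIDTH; sib++)
              if (!cpu_idle[sib])
                  return false;
          return true;
      }

      /* Called when @cpu goes idle: if the flag is clear, check the
       * siblings and re-arm it when the whole core just became idle. */
      static void cpu_went_idle(int cpu)
      {
          cpu_idle[cpu] = true;
          if (!has_idle_cores && core_is_idle(cpu))
              has_idle_cores = true;
      }

      /* The idle-core scan of step 1: skip it entirely while the flag is
       * clear, and clear the flag again when a full scan finds nothing. */
      static int scan_for_idle_core(void)
      {
          if (!has_idle_cores)
              return -1;

          for (int cpu = 0; cpu < NR_CPUS; cpu += SMT_WIDTH)
              if (core_is_idle(cpu))
                  return cpu;

          has_idle_cores = false;   /* don't pay for the next failed scan */
          return -1;
      }

      int main(void)
      {
          cpu_went_idle(4);
          cpu_went_idle(5);   /* CPUs 4 and 5 form one fully idle core */
          printf("idle core found at CPU %d\n", scan_for_idle_core());
          return 0;
      }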
    
    Step 2) tracks the average cost of the scan and compares this to
    the average idle time guesstimate for the CPU doing the wakeup.
    There is a significant fudge factor involved to deal with the
    variability of the averages; hackbench in particular was sensitive
    to this.
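
    The comparison could look roughly like the sketch below; the struct,
    the numbers and the factor of 2 are arbitrary stand-ins, since the
    changelog only says there is a "significant fudge factor":

      #include <stdbool.h>
      #include <stdio.h>

      /* Pretend numbers in nanoseconds; the kernel keeps decaying
       * averages per domain and per runqueue. */
      struct scan_stats {
          unsigned long long avg_scan_cost;  /* average cost of one LLC scan */
          unsigned long long avg_idle;       /* idle-time guesstimate of the waking CPU */
      };

      /* Skip the LLC-wide idle-CPU scan when its expected cost outweighs
       * the time the CPU is expected to stay idle; the factor of 2 is an
       * arbitrary stand-in for the real fudge factor. */
      static bool should_scan_llc(const struct scan_stats *s)
      {
          return s->avg_idle > 2 * s->avg_scan_cost;
      }

      int main(void)
      {
          struct scan_stats busy = { .avg_scan_cost = 4000, .avg_idle = 3000 };
          struct scan_stats calm = { .avg_scan_cost = 4000, .avg_idle = 50000 };

          printf("busy box scans the LLC: %s\n", should_scan_llc(&busy) ? "yes" : "no");
          printf("calm box scans the LLC: %s\n", should_scan_llc(&calm) ? "yes" : "no");
          return 0;
      }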
    
    Step 3) is unconditional; we assume (as step 1 also does) that
    scanning all SMT siblings in a core is 'cheap'.
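
    That scan is bounded by the SMT width of a single core, which is
    why it can stay unconditional; a toy version (SMT_WIDTH and the
    idle bitmap are made up for the example):

      #include <stdbool.h>
      #include <stdio.h>

      #define NR_CPUS   8
      #define SMT_WIDTH 2   /* pretend SMT width */

      static bool cpu_idle[NR_CPUS] = { [3] = true };   /* only CPU 3 is idle */

      /* Step 3: only look at the few hardware threads that share the
       * 'target' core. */
      static int scan_target_core(int target)
      {
          int first = target - (target % SMT_WIDTH);

          for (int sib = first; sib < first + SMT_WIDTH; sib++)
              if (cpu_idle[sib])
                  return sib;
          return -1;
      }

      int main(void)
      {
          printf("idle thread near CPU 2: %d\n", scan_target_core(2));
          return 0;
      }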
    
    With this, SMT systems gain step 2, which cures a few benchmarks --
    notably one from Facebook.
    
    One 'feature' of the sched_domain iteration, which we preserve in
    the new code, is that it would start scanning from the 'target'
    CPU, instead of scanning the cpumask in CPU id order. This makes it
    less likely that multiple CPUs in the LLC, all scanning for an idle
    CPU at the same time, gang up and find the same one. The downside
    is that tasks can end up hopping across the LLC for no apparent
    reason.
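
    The wrap-around iteration itself is simple; a stand-alone sketch
    (NR_CPUS and the printout are just for demonstration):

      #include <stdio.h>

      #define NR_CPUS 8   /* pretend LLC size */

      /* Visit every CPU in the LLC, but start at 'target' and wrap
       * around, so concurrent wakeups on different CPUs spread out
       * instead of all converging on the lowest-numbered idle CPU. */
      static void scan_from(int target)
      {
          for (int i = 0; i < NR_CPUS; i++)
              printf("%d ", (target + i) % NR_CPUS);
          printf("\n");
      }

      int main(void)
      {
          scan_from(0);   /* 0 1 2 3 4 5 6 7 */
          scan_from(5);   /* 5 6 7 0 1 2 3 4 */
          return 0;
      }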
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>