This project is mirrored from https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git.
  1. 29 Nov, 2016 1 commit
    • sched/idle: Add support for tasks that inject idle · c1de45ca
      Peter Zijlstra authored
      
      
      Idle injection drivers such as Intel powerclamp and ACPI PAD drivers use
      realtime tasks to take control of the CPU and then inject idle. There are two
      issues with this approach:
      
       1. Low efficiency: the injected idle task is treated as busy, so sched
          ticks do not stop during the injected idle period; the resulting
          unwanted wakeups can cost ~20% of the power savings.
      
       2. Idle accounting: injected idle time is presented to the user as busy.
      
      This patch addresses the issues by introducing a new PF_IDLE flag which
      allows any given task to be treated as an idle task while the flag is set.
      Therefore, idle injection tasks can run through the normal flow of NOHZ
      idle enter/exit to get the correct accounting as well as tick stop when
      possible.
      
      The implication is that the idle task is then no longer limited to PID == 0.
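      
      A minimal sketch (not from the patch itself) of how an idle-injection
      kthread might use this, assuming a play_idle()-style helper that sets
      PF_IDLE on the caller and runs the normal idle loop for roughly the
      requested number of milliseconds:
      
      	static int idle_inject_thread(void *arg)
      	{
      		while (!kthread_should_stop()) {
      			/* Inject ~10ms of real idle via the normal idle path. */
      			play_idle(10);
      			/* Run as an ordinary task for the rest of the period. */
      			schedule_timeout_interruptible(msecs_to_jiffies(40));
      		}
      		return 0;
      	}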
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      c1de45ca
  2. 30 Sep, 2016 11 commits
    • sched/core: Fix set_user_nice() · 49bd21ef
      Peter Zijlstra authored
      
      
      Almost all scheduler functions update state with the following
      pattern:
      
      	if (queued)
      		dequeue_task(rq, p, DEQUEUE_SAVE);
      	if (running)
      		put_prev_task(rq, p);
      
      	/* update state */
      
      	if (queued)
      		enqueue_task(rq, p, ENQUEUE_RESTORE);
      	if (running)
      		set_curr_task(rq, p);
      
      set_user_nice(), however, misses the running part; cure this.
      
      This was found by asserting we never enqueue 'current'.
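      
      A simplified sketch (not the exact upstream diff) of set_user_nice()
      following the full pattern, with locking and load-weight details elided:
      
      	bool queued = task_on_rq_queued(p);
      	bool running = task_current(rq, p);
      
      	if (queued)
      		dequeue_task(rq, p, DEQUEUE_SAVE);
      	if (running)
      		put_prev_task(rq, p);
      
      	p->static_prio = NICE_TO_PRIO(nice);
      	/* ... recompute load weight and priority ... */
      
      	if (queued)
      		enqueue_task(rq, p, ENQUEUE_RESTORE);
      	if (running)
      		set_curr_task(rq, p);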
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      49bd21ef
    • sched/fair: Introduce set_curr_task() helper · b2bf6c31
      Peter Zijlstra authored
      
      
      Now that the ia64-only set_curr_task() symbol is gone, provide a
      helper just like put_prev_task().
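      
      The helper is presumably of this shape (sketch, mirroring put_prev_task()):
      
      	static inline void set_curr_task(struct rq *rq, struct task_struct *curr)
      	{
      		curr->sched_class->set_curr_task(rq);
      	}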
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b2bf6c31
    • sched/core, ia64: Rename set_curr_task() · a458ae2e
      Peter Zijlstra authored
      
      
      Rename the ia64-only set_curr_task() function to free up the name.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a458ae2e
    • sched/core: Fix incorrect utilization accounting when switching to fair class · a399d233
      Vincent Guittot authored
      
      
      When a task switches to the fair scheduling class, the period between now
      and the last update of its utilization is accounted as running time
      whatever happened during this period. This incorrect accounting applies
      to the task and also to the task group branch.
      
      When changing the property of a running task like its list of allowed
      CPUs or its scheduling class, we follow the sequence:
      
       - dequeue task
       - put task
       - change the property
       - set task as current task
       - enqueue task
      
      The end of the sequence doesn't follow the normal sequence (as per
      __schedule()) which is:
      
       - enqueue a task
       - then set the task as current task.
      
      This incorrect ordering is the root cause of the incorrect utilization accounting.
      Update the sequence to follow the right one:
      
       - dequeue task
       - put task
       - change the property
       - enqueue task
       - set task as current task
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: pjt@google.com
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1473666472-13749-8-git-send-email-vincent.guittot@linaro.org
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a399d233
    • sched/core: Optimize SCHED_SMT · 1b568f0a
      Peter Zijlstra authored
      
      
      Avoid pointless SCHED_SMT code when running on !SMT hardware.
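      
      A hedged sketch of the usual pattern for this (helper names here are
      hypothetical; a static key is assumed to be flipped once an SMT sibling
      comes online):
      
      	static DEFINE_STATIC_KEY_FALSE(sched_smt_present);
      
      	/* Flip the key once a CPU with SMT siblings comes online. */
      	static void note_smt_sibling_online(int cpu)
      	{
      		if (cpumask_weight(cpu_smt_mask(cpu)) > 1)
      			static_branch_enable(&sched_smt_present);
      	}
      
      	/* SMT-only code is then guarded by a cheap static branch: */
      	static bool sched_smt_active_hint(void)
      	{
      		return static_branch_likely(&sched_smt_present);
      	}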
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1b568f0a
    • sched/core: Rewrite and improve select_idle_siblings() · 10e2f1ac
      Peter Zijlstra authored
      
      
      select_idle_siblings() is a known pain point for a number of
      workloads; it either does too much or not enough, and sometimes it just
      gets it plain wrong.
      
      This rewrite attempts to address a number of issues (but sadly not
      all).
      
      The current code does an unconditional sched_domain iteration; with
      the intent of finding an idle core (on SMT hardware). The problems
      which this patch tries to address are:
      
       - it's pointless to look for idle cores if the machine is really busy;
         at which point you're just wasting cycles.
      
       - its behaviour is inconsistent between SMT and !SMT hardware in
         that !SMT hardware ends up doing a scan for any idle CPU in the LLC
         domain, while SMT hardware does a scan for idle cores and if that
         fails, falls back to a scan for idle threads on the 'target' core.
      
      The new code replaces the sched_domain scan with 3 explicit scans:
      
       1) search for an idle core in the LLC
       2) search for an idle CPU in the LLC
       3) search for an idle thread in the 'target' core
      
      where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
      heuristics to skip the step.
      
      Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu
      goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
      siblings of the CPU going idle. Similarly, we clear
      sd_llc_shared->has_idle_cores when we fail to find an idle core.
      
      Step 2) tracks the average cost of the scan and compares this to the
      average idle time guesstimate for the CPU doing the wakeup. There is a
      significant fudge factor involved to deal with the variability of the
      averages; hackbench especially was sensitive to this.
      
      Step 3) is unconditional; we assume (also per step 1) that scanning
      all SMT siblings in a core is 'cheap'.
      
      With this; SMT systems gain step 2, which cures a few benchmarks --
      notably one from Facebook.
      
      One 'feature' of the sched_domain iteration, which we preserve in the
      new code, is that it would start scanning from the 'target' CPU,
      instead of scanning the cpumask in CPU id order. This keeps multiple
      CPUs in the LLC that are scanning for an idle CPU from ganging up on
      the same CPU quite as much. The downside is that tasks can end up
      hopping across the LLC for no apparent reason.
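      
      A hedged sketch of the resulting structure (helper names per the steps
      described above; the real code lives in kernel/sched/fair.c and also
      carries the runtime heuristics):
      
      	static int select_idle_sibling_sketch(struct task_struct *p, int target)
      	{
      		struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, target));
      		int cpu;
      
      		if (!sd)
      			return target;
      
      		/* 1) idle core in the LLC (SMT only, gated by has_idle_cores) */
      		cpu = select_idle_core(p, sd, target);
      		if (cpu >= 0)
      			return cpu;
      
      		/* 2) any idle CPU in the LLC, bounded by the avg-cost heuristic */
      		cpu = select_idle_cpu(p, sd, target);
      		if (cpu >= 0)
      			return cpu;
      
      		/* 3) idle SMT sibling of the 'target' core */
      		cpu = select_idle_smt(p, sd, target);
      		if (cpu >= 0)
      			return cpu;
      
      		return target;
      	}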
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      10e2f1ac
    • sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared · 0e369d75
      Peter Zijlstra authored
      
      
      Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
      location into the much more natural sched_domain_shared location.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0e369d75
    • sched/core: Introduce 'struct sched_domain_shared' · 24fc7edb
      Peter Zijlstra authored
      
      
      Since struct sched_domain is strictly per-CPU, introduce a structure
      that is shared between all 'identical' sched_domains.
      
      Limit to SD_SHARE_PKG_RESOURCES domains for now, as we'll only use it
      for shared cache state; if another use comes up later we can easily
      relax this.
      
      While sched_groups are normally shared between CPUs, they are not
      natural to use when we need some shared state on a domain level --
      since that would require the domain to have a parent, which is not a
      given.
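      
      After this commit and the two follow-ups above, the shared structure ends
      up roughly like this (sketch; fields per the descriptions in this series):
      
      	struct sched_domain_shared {
      		atomic_t	ref;		/* shared by all 'identical' domains */
      		atomic_t	nr_busy_cpus;	/* moved from sd->parent->groups->sgc */
      		int		has_idle_cores;	/* hint for the idle-core scan */
      	};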
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      24fc7edb
    • sched/core: Restructure destroy_sched_domain() · 16f3ef46
      Peter Zijlstra authored
      
      
      There is no point in doing a call_rcu() for each domain; only do a
      callback for the root sched domain and clean up the entire set in one
      go.
      
      Also make the entire call chain be called destroy_sched_domain*() to
      remove confusion with the free_sched_domains() call, which does an
      entirely different thing.
      
      Both cpu_attach_domain() callers of destroy_sched_domain() can live
      without the call_rcu() because at those points the sched_domain hasn't
      been published yet.
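      
      A sketch of the resulting shape (assumed from the description above; a
      single RCU callback walks the whole parent chain):
      
      	static void destroy_sched_domains_rcu(struct rcu_head *rcu)
      	{
      		struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
      
      		while (sd) {
      			struct sched_domain *parent = sd->parent;
      
      			destroy_sched_domain(sd);
      			sd = parent;
      		}
      	}
      
      	static void destroy_sched_domains(struct sched_domain *sd)
      	{
      		if (sd)
      			call_rcu(&sd->rcu, destroy_sched_domains_rcu);
      	}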
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      16f3ef46
    • sched/core: Remove unused @cpu argument from destroy_sched_domain*() · f39180ef
      Peter Zijlstra authored
      
      
      Small cleanup; nothing uses the @cpu argument so make it go away.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f39180ef
    • sched/core, x86/topology: Fix NUMA in package topology bug · 8f37961c
      Tim Chen authored
      
      
      Current code can call set_cpu_sibling_map() and invoke sched_set_topology()
      more than once (e.g. on CPU hotplug).  When this happens after
      sched_init_smp() has been called, we lose the NUMA topology extension to
      sched_domain_topology in sched_init_numa().  This results in incorrect
      topology when the sched domain is rebuilt.
      
      This patch fixes the bug and issues a warning if we call sched_set_topology()
      after sched_init_smp().
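      
      The guard is presumably of this shape (sketch only; the exact placement
      may differ, and the in-kernel symbol is spelled set_sched_topology()):
      
      	void set_sched_topology(struct sched_domain_topology_level *tl)
      	{
      		if (WARN_ON_ONCE(sched_smp_initialized))
      			return;
      
      		sched_domain_topology = tl;
      	}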
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bp@suse.de
      Cc: jolsa@redhat.com
      Cc: rjw@rjwysocki.net
      Link: http://lkml.kernel.org/r/1474485552-141429-2-git-send-email-srinivas.pandruvada@linux.intel.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8f37961c
  3. 22 Sep, 2016 5 commits
  4. 16 Sep, 2016 1 commit
    • sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK · 68f24b08
      Andy Lutomirski authored
      
      
      We currently keep every task's stack around until the task_struct
      itself is freed.  This means that we keep the stack allocation alive
      for longer than necessary and that, under load, we free stacks in
      big batches whenever RCU drops the last task reference.  Neither of
      these is good for reuse of cache-hot memory, and freeing in batches
      prevents us from usefully caching small numbers of vmalloced stacks.
      
      On architectures that have thread_info on the stack, we can't easily
      change this, but on architectures that set THREAD_INFO_IN_TASK, we
      can free it as soon as the task is dead.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jann Horn <jann@thejh.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/08ca06cde00ebed0046c5d26cbbf3fbb7ef5b812.1474003868.git.luto@kernel.org
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      68f24b08
  5. 05 Sep, 2016 4 commits
    • sched/debug: Remove several CONFIG_SCHEDSTATS guards · 4fa8d299
      Josh Poimboeuf authored
      
      
      Clean up the sched code by removing several of the CONFIG_SCHEDSTATS
      guards, using schedstat_*() macros where needed.
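      
      Illustrative example of the conversion (not one of the actual hunks): a
      guarded update such as
      
      	#ifdef CONFIG_SCHEDSTATS
      		schedstat_inc(p->se.statistics.nr_wakeups_local);
      	#endif
      
      becomes simply
      
      	schedstat_inc(p->se.statistics.nr_wakeups_local);
      
      with the macro itself compiling away (or branching out) when schedstats
      are disabled.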
      
      Code size:
      
        !CONFIG_SCHEDSTATS defconfig:
      
            text	   data	    bss	     dec	    hex	filename
        10209818	4368184	1105920	15683922	 ef5152	vmlinux.before.nostats
        10209818	4368184	1105920	15683922	 ef5152	vmlinux.after.nostats
      
        CONFIG_SCHEDSTATS defconfig:
      
            text	   data	    bss	    dec	    hex	filename
        10214210	4370040	1105920	15690170	 ef69ba	vmlinux.before.stats
        10214210	4370680	1105920	15690810	 ef6c3a	vmlinux.after.stats
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/e51e0ebe5af95ac295de720dd252e7c0d2142e4a.1466184592.git.jpoimboe@redhat.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4fa8d299
    • sched/debug: Clean up schedstat macros · ae92882e
      Josh Poimboeuf authored
      
      
      The schedstat_*() macros are inconsistent: most of them take a pointer
      and a field which the macro combines, whereas schedstat_set() takes the
      already combined ptr->field.
      
      The already combined ptr->field argument is actually more intuitive and
      easier to use, and there's no reason to require the user to split the
      variable up, so convert the macros to use the combined argument.
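      
      A sketch of the resulting shape of the macros (assuming the existing
      sched_schedstats static key; not the verbatim header):
      
      	#ifdef CONFIG_SCHEDSTATS
      	#define schedstat_enabled()	static_branch_unlikely(&sched_schedstats)
      	#define schedstat_inc(var)	do { if (schedstat_enabled()) { var++; } } while (0)
      	#define schedstat_add(var, amt)	do { if (schedstat_enabled()) { var += (amt); } } while (0)
      	#define schedstat_set(var, val)	do { if (schedstat_enabled()) { var = (val); } } while (0)
      	#else
      	#define schedstat_enabled()	0
      	#define schedstat_inc(var)	do { } while (0)
      	#define schedstat_add(var, amt)	do { } while (0)
      	#define schedstat_set(var, val)	do { } while (0)
      	#endif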
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/54953ca25bb579f3a5946432dee409b0e05222c6.1466184592.git.jpoimboe@redhat.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ae92882e
    • sched/core: Remove duplicated init_task's preempt_notifiers init · efca03ec
      seokhoon.yoon authored
      
      
      init_task's preempt_notifiers is initialized twice:
      
      1) sched_init()
         -> INIT_HLIST_HEAD(&init_task.preempt_notifiers)
      
      2) sched_init()
         -> init_idle(current,) <--- current task is init_task at this time
          -> __sched_fork(,current)
           -> INIT_HLIST_HEAD(&p->preempt_notifiers)
      
      I think the first one is unnecessary, so remove it.
      Signed-off-by: seokhoon.yoon <iamyooon@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1471339568-5790-1-git-send-email-iamyooon@gmail.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      efca03ec
    • sched/core: Fix a race between try_to_wake_up() and a woken up task · 135e8c92
      Balbir Singh authored
      
      
      The origin of the issue I've seen is a missing memory barrier between
      the check for task->state and the check for task->on_rq.
      
      The task being woken up is already awake from a schedule()
      and is doing the following:
      
      	do {
      		schedule()
      		set_current_state(TASK_(UN)INTERRUPTIBLE);
      	} while (!cond);
      
      The waker actually gets stuck doing the following in
      try_to_wake_up():
      
      	while (p->on_cpu)
      		cpu_relax();
      
      Analysis:
      
      The instance I've seen involves the following race:
      
       CPU1					CPU2
      
       while () {
         if (cond)
           break;
         do {
           schedule();
           set_current_state(TASK_UN..)
         } while (!cond);
      					wakeup_routine()
      					  spin_lock_irqsave(wait_lock)
         raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
       }					  try_to_wake_up()
       set_current_state(TASK_RUNNING);	  ..
       list_del(&waiter.list);
      
      CPU2 wakes up CPU1, but before it can get the wait_lock and set
      current state to TASK_RUNNING the following occurs:
      
       CPU3
       wakeup_routine()
       raw_spin_lock_irqsave(wait_lock)
       if (!list_empty)
         wake_up_process()
         try_to_wake_up()
         raw_spin_lock_irqsave(p->pi_lock)
         ..
         if (p->on_rq && ttwu_wakeup())
         ..
         while (p->on_cpu)
           cpu_relax()
         ..
      
      CPU3 tries to wake up the task on CPU1 again, since it finds it on the
      wait_queue. CPU1 is spinning on the wait_lock, but immediately after
      CPU2 released it, CPU3 got it.
      
      CPU3 checks the state of p on CPU1; it is TASK_UNINTERRUPTIBLE and the
      task is spinning on the wait_lock. Interestingly, since p->on_rq is
      checked under pi_lock, I've noticed that try_to_wake_up() finds
      p->on_rq to be 0. This was the most confusing bit of the analysis,
      but p->on_rq is changed under the runqueue lock (rq_lock), so the
      p->on_rq check is not reliable without this fix IMHO. The race is
      visible (based on the analysis) only when ttwu_queue() does a remote
      wakeup via ttwu_queue_remote(), in which case the p->on_rq change is
      not done under the pi_lock.
      
      The result is that after a while the entire system locks up on the
      raw_spin_lock_irqsave(wait_lock) and the holder spins infinitely.
      
      Reproduction of the issue:
      
      The issue can be reproduced after a long run on my system with 80
      threads, after tweaking available memory to a very low value and
      running the stress-ng mmapfork memory test. It usually takes a long
      time to reproduce. I am trying to work on a test case that can
      reproduce the issue faster, but that's work in progress. I am still
      testing the changes in a loop on my system, and the tests seem OK thus
      far.
      
      Big thanks to Benjamin and Nick for helping debug this as well.
      Ben helped catch the missing barrier, Nick caught every missing
      bit in my theory.
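      
      A simplified sketch of the ordering the fix enforces in try_to_wake_up()
      (fragment only; the real patch also documents the pairing barrier):
      
      	raw_spin_lock_irqsave(&p->pi_lock, flags);
      	if (!(p->state & state))
      		goto out;
      
      	/*
      	 * Make sure the p->on_rq load below cannot be reordered before
      	 * the p->state load above, otherwise a wakeup queued remotely
      	 * via ttwu_queue_remote() can be missed.
      	 */
      	smp_rmb();
      
      	if (p->on_rq && ttwu_remote(p, wake_flags))
      		goto stat;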
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      [ Updated comment to clarify matching barriers. Many
        architectures do not have a full barrier in switch_to()
        so that cannot be relied upon. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nicholas Piggin <nicholas.piggin@gmail.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      135e8c92
  6. 24 Aug, 2016 1 commit
  7. 22 Aug, 2016 1 commit
    • sched: Make wake_up_nohz_cpu() handle CPUs going offline · 379d9ecb
      Paul E. McKenney authored
      
      
      Both timers and hrtimers are maintained on the outgoing CPU until
      CPU_DEAD time, at which point they are migrated to a surviving CPU.  If a
      mod_timer() executes between CPU_DYING and CPU_DEAD time, x86 systems
      will splat in native_smp_send_reschedule() when attempting to wake up
      the just-now-offlined CPU, as shown below from a NO_HZ_FULL kernel:
      
      [ 7976.741556] WARNING: CPU: 0 PID: 661 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:125 native_smp_send_reschedule+0x39/0x40
      [ 7976.741595] Modules linked in:
      [ 7976.741595] CPU: 0 PID: 661 Comm: rcu_torture_rea Not tainted 4.7.0-rc2+ #1
      [ 7976.741595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      [ 7976.741595]  0000000000000000 ffff88000002fcc8 ffffffff8138ab2e 0000000000000000
      [ 7976.741595]  0000000000000000 ffff88000002fd08 ffffffff8105cabc 0000007d1fd0ee18
      [ 7976.741595]  0000000000000001 ffff88001fd16d40 ffff88001fd0ee00 ffff88001fd0ee00
      [ 7976.741595] Call Trace:
      [ 7976.741595]  [<ffffffff8138ab2e>] dump_stack+0x67/0x99
      [ 7976.741595]  [<ffffffff8105cabc>] __warn+0xcc/0xf0
      [ 7976.741595]  [<ffffffff8105cb98>] warn_slowpath_null+0x18/0x20
      [ 7976.741595]  [<ffffffff8103cba9>] native_smp_send_reschedule+0x39/0x40
      [ 7976.741595]  [<ffffffff81089bc2>] wake_up_nohz_cpu+0x82/0x190
      [ 7976.741595]  [<ffffffff810d275a>] internal_add_timer+0x7a/0x80
      [ 7976.741595]  [<ffffffff810d3ee7>] mod_timer+0x187/0x2b0
      [ 7976.741595]  [<ffffffff810c89dd>] rcu_torture_reader+0x33d/0x380
      [ 7976.741595]  [<ffffffff810c66f0>] ? sched_torture_read_unlock+0x30/0x30
      [ 7976.741595]  [<ffffffff810c86a0>] ? rcu_bh_torture_read_lock+0x80/0x80
      [ 7976.741595]  [<ffffffff8108068f>] kthread+0xdf/0x100
      [ 7976.741595]  [<ffffffff819dd83f>] ret_from_fork+0x1f/0x40
      [ 7976.741595]  [<ffffffff810805b0>] ? kthread_create_on_node+0x200/0x200
      
      However, in this case, the wakeup is redundant, because the timer
      migration will reprogram timer hardware as needed.  Note that the fact
      that preemption is disabled does not avoid the splat, as the offline
      operation has already passed both the synchronize_sched() and the
      stop_machine() that would be blocked by disabled preemption.
      
      This commit therefore modifies wake_up_nohz_cpu() to avoid attempting
      to wake up offline CPUs.  It also adds a comment stating that the
      caller must tolerate lost wakeups when the target CPU is going offline,
      and suggesting the CPU_DEAD notifier as a recovery mechanism.
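      
      A hedged sketch of the check (placement is illustrative; the actual
      patch may put it in a helper):
      
      	void wake_up_nohz_cpu(int cpu)
      	{
      		/*
      		 * Callers must tolerate losing this wakeup: when the CPU is
      		 * going offline, the CPU_DEAD timer migration reprograms the
      		 * surviving CPU's timer hardware anyway.
      		 */
      		if (!cpu_online(cpu))
      			return;
      
      		if (!wake_up_full_nohz_cpu(cpu))
      			wake_up_idle_cpu(cpu);
      	}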
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      379d9ecb
  8. 18 Aug, 2016 6 commits
  9. 10 Aug, 2016 5 commits
    • sched/debug: Add taint on "BUG: Sleeping function called from invalid context" · f0b22e39
      Vegard Nossum authored
      
      
      Seeing this, it occurs to me that we should probably add a taint here:
      
          BUG: sleeping function called from invalid context at mm/slab.h:388
          in_atomic(): 0, irqs_disabled(): 0, pid: 32211, name: trinity-c3
          Preemption disabled at:[<ffffffff811aaa37>] console_unlock+0x2f7/0x930
      
          CPU: 3 PID: 32211 Comm: trinity-c3 Not tainted 4.7.0-rc7+ #19
                                             ^^^^^^^^^^^
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
           0000000000000000 ffff8800b8a17160 ffffffff81971441 ffff88011a3c4c80
           ffff88011a3c4c80 ffff8800b8a17198 ffffffff81158067 0000000000000de6
           ffff88011a3c4c80 ffffffff8390e07c 0000000000000184 0000000000000000
          Call Trace:
          [...]
      
          BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1309
          in_atomic(): 0, irqs_disabled(): 0, pid: 32211, name: trinity-c3
          Preemption disabled at:[<ffffffff8119db33>] down_trylock+0x13/0x80
      
          CPU: 3 PID: 32211 Comm: trinity-c3 Not tainted 4.7.0-rc7+ #19
                                             ^^^^^^^^^^^
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
           0000000000000000 ffff8800b8a17e08 ffffffff81971441 ffff88011a3c4c80
           ffff88011a3c4c80 ffff8800b8a17e40 ffffffff81158067 0000000000000000
           ffff88011a3c4c80 ffffffff83437b20 000000000000051d 0000000000000000
          Call Trace:
          [...]
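      
      The taint itself is presumably a one-liner next to the splat in
      ___might_sleep(), along the lines of (sketch):
      
      	add_taint(TAINT_WARN, LOCKDEP_STILL_OK);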
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russel <rusty@rustcorp.com.au>
      Link: http://lkml.kernel.org/r/1469216762-19626-1-git-send-email-vegard.nossum@oracle.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f0b22e39
    • sched/debug: Make the "Preemption disabled at ..." message more useful · d1c6d149
      Vegard Nossum authored
      This message is currently really useless since it always prints a value
      that comes from the printk() we just did, e.g.:
      
          BUG: sleeping function called from invalid context at mm/slab.h:388
          in_atomic(): 0, irqs_disabled(): 0, pid: 31996, name: trinity-c1
          Preemption disabled at:[<ffffffff8119db33>] down_trylock+0x13/0x80
      
          BUG: sleeping function called from invalid context at include/linux/freezer.h:56
          in_atomic(): 0, irqs_disabled(): 0, pid: 31996, name: trinity-c1
          Preemption disabled at:[<ffffffff811aaa37>] console_unlock+0x2f7/0x930
      
      Here, both down_trylock() and console_unlock() are somewhere in the
      printk() path.
      
      We should save the value before calling printk() and use the saved value
      instead. That immediately reveals the offending callsite:
      
          BUG: sleeping function called from invalid context at mm/slab.h:388
          in_atomic(): 0, irqs_disabled(): 0, pid: 14971, name: trinity-c2
          Preemption disabled at:[<ffffffff819bcd46>] rhashtable_walk_start+0x46/0x150
      
      Bug report:
      
        http://marc.info/?l=linux-netdev&m=146925979821849&w=2
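      
      A minimal sketch of the approach, assuming the CONFIG_DEBUG_PREEMPT
      recorded caller is read before the first printk() (the exact code in
      ___might_sleep() may differ):
      
      	unsigned long preempt_disable_ip = 0;
      
      	#ifdef CONFIG_DEBUG_PREEMPT
      	/* Save this before printk() itself disables/enables preemption. */
      	preempt_disable_ip = current->preempt_disable_ip;
      	#endif
      
      	printk(KERN_ERR "BUG: sleeping function called from invalid context at %s:%d\n",
      	       file, line);
      	/* ... later, report the saved value instead of re-reading it: */
      	if (preempt_disable_ip)
      		print_ip_sym(preempt_disable_ip);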
      
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russel <rusty@rustcorp.com.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d1c6d149
    • sched/core: Add documentation for 'cookie' argument · 9279e0d2
      Luis de Bethencourt authored
      
      
      Add documentation for the cookie argument in try_to_wake_up_local().
      
      This caused the following warning when building documentation:
      
        kernel/sched/core.c:2088: warning: No description found for parameter 'cookie'
      Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Fixes: e7904a28 ("locking/lockdep, sched/core: Implement a better lock pinning scheme")
      Link: http://lkml.kernel.org/r/1468159226-17674-1-git-send-email-luisbg@osg.samsung.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9279e0d2
    • sched/core: Fix one typo · a1fd4656
      Leo Yan authored
      
      
      Fix one minor typo in the comment: s/targer/target/.
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1470378758-15066-1-git-send-email-leo.yan@linaro.org
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a1fd4656
    • sched/cputime: Mitigate performance regression in times()/clock_gettime() · 6075620b
      Giovanni Gherdovich authored
      Commit:
      
        6e998916 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")
      
      fixed a problem whereby clock_nanosleep() followed by clock_gettime() could
      allow a task to wake early. It addressed the problem by calling the scheduling
      classes update_curr() when the cputimer starts.
      
      Said change induced a considerable performance regression on the syscalls
      times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some
      debuggers and applications that monitor their own performance that
      accidentally depend on the performance of these specific calls.
      
      This patch mitigates the performance loss by prefetching data into the CPU
      cache, as stalls due to cache misses appear to be where most time is spent
      in our benchmarks.
      
      Here are the performance gains of this patch over v4.7-rc7 on a Sandy Bridge
      box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
      variable number of threads, from 2 to 4*num_cpus; the results are in
      seconds and correspond to the average of 10 runs; the percentage gain is
      computed with (before-after)/before so a positive value is an improvement
      (it's faster). The improvement varies between a few percent for 5-20
      threads and more than 10% for 2 or >20 threads.
      
      pound_clock_gettime:
      
          threads       4.7-rc7     patched 4.7-rc7
          [num]         [secs]      [secs (percent)]
            2           3.48        3.06 ( 11.83%)
            5           3.33        3.25 (  2.40%)
            8           3.37        3.26 (  3.30%)
           12           3.32        3.37 ( -1.60%)
           21           4.01        3.90 (  2.74%)
           30           3.63        3.36 (  7.41%)
           48           3.71        3.11 ( 16.27%)
           79           3.75        3.16 ( 15.74%)
          110           3.81        3.25 ( 14.80%)
          128           3.88        3.31 ( 14.76%)
      
      pound_times:
      
          threads       4.7-rc7     patched 4.7-rc7
          [num]         [secs]      [secs (percent)]
            2           3.65        3.25 ( 11.03%)
            5           3.45        3.17 (  7.92%)
            8           3.52        3.22 (  8.69%)
           12           3.29        3.36 ( -2.04%)
           21           4.07        3.92 (  3.78%)
           30           3.87        3.40 ( 12.17%)
           48           3.79        3.16 ( 16.61%)
           79           3.88        3.28 ( 15.42%)
          110           3.90        3.38 ( 13.35%)
          128           4.00        3.38 ( 15.45%)
      
      pound_clock_gettime and pound_times are two benchmarks included in
      the MMTests framework. They launch a given number of threads which
      repeatedly call times() or clock_gettime(). The results above can be
      reproduced by cloning MMTests from github.com and running the "poundtime"
      workload:
      
        $ git clone https://github.com/gormanm/mmtests.git
        $ cd mmtests
        $ cp configs/config-global-dhp__workload_poundtime config
        $ ./run-mmtests.sh --run-monitor $(uname -r)
      
      The above will run "poundtime" measuring the kernel currently running on
      the machine; Once a new kernel is installed and the machine rebooted,
      running again
      
        $ cd mmtests
        $ ./run-mmtests.sh --run-monitor $(uname -r)
      
      will produce results to compare with. A comparison table will be output
      with:
      
        $ cd mmtests/work/log
        $ ../../compare-kernels.sh
      
      the table will contain a lot of entries; grepping for "Amean" (as in
      "arithmetic mean") will give the tables presented above. The source code
      for the two benchmarks is reported at the end of this changelog for
      clarity.
      
      The cache misses addressed by this patch were found using a combination of
      `perf top`, `perf record` and `perf annotate`. The incriminated lines were
      found to be
      
          struct sched_entity *curr = cfs_rq->curr;
      
      and
      
          delta_exec = now - curr->exec_start;
      
      in the function update_curr() from kernel/sched/fair.c. This patch
      prefetches the data from memory just before update_curr() is called in
      the execution paths of interest.
      
      A comparison of the total number of cycles before and after the patch
      follows; the data is obtained using `perf stat -r 10 -ddd <program>`
      running over the same sequence of number of threads used above (a positive
      gain is an improvement):
      
        threads   cycles before                 cycles after                gain
      
          2      19,699,563,964  +-1.19%      17,358,917,517  +-1.85%      11.88%
          5      47,401,089,566  +-2.96%      45,103,730,829  +-0.97%       4.85%
          8      80,923,501,004  +-3.01%      71,419,385,977  +-0.77%      11.74%
         12     112,326,485,473  +-0.47%     110,371,524,403  +-0.47%       1.74%
         21     193,455,574,299  +-0.72%     180,120,667,904  +-0.36%       6.89%
         30     315,073,519,013  +-1.64%     271,222,225,950  +-1.29%      13.92%
         48     321,969,515,332  +-1.48%     273,353,977,321  +-1.16%      15.10%
         79     337,866,003,422  +-0.97%     289,462,481,538  +-1.05%      14.33%
        110     338,712,691,920  +-0.78%     290,574,233,170  +-0.77%      14.21%
        128     348,384,794,006  +-0.50%     292,691,648,206  +-0.66%      15.99%
      
      A comparison of cache miss vs total cache loads ratios, before and after
      the patch (again from the `perf stat -r 10 -ddd <program>` tables):
      
        threads   L1 misses/total*100     L1 misses/total*100            gain
      		         before                   after
            2           7.43  +-4.90%           7.36  +-4.70%           0.94%
            5          13.09  +-4.74%          13.52  +-3.73%          -3.28%
            8          13.79  +-5.61%          12.90  +-3.27%           6.45%
           12          11.57  +-2.44%           8.71  +-1.40%          24.72%
           21          12.39  +-3.92%           9.97  +-1.84%          19.53%
           30          13.91  +-2.53%          11.73  +-2.28%          15.67%
           48          13.71  +-1.59%          12.32  +-1.97%          10.14%
           79          14.44  +-0.66%          13.40  +-1.06%           7.20%
          110          15.86  +-0.50%          14.46  +-0.59%           8.83%
          128          16.51  +-0.32%          15.06  +-0.78%           8.78%
      
      As a final note, the following shows the evolution of performance figures
      in the "poundtime" benchmark and pinpoints commit 6e998916
      ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
      major source of degradation, mostly unaddressed to this day (figures
      expressed in seconds).
      
      pound_clock_gettime:
      
        threads   parent of       6e998916          4.7-rc7
                  6e998916         itself
          2        2.23          3.68 ( -64.56%)        3.48 (-55.48%)
          5        2.83          3.78 ( -33.42%)        3.33 (-17.43%)
          8        2.84          4.31 ( -52.12%)        3.37 (-18.76%)
          12       3.09          3.61 ( -16.74%)        3.32 ( -7.17%)
          21       3.14          4.63 ( -47.36%)        4.01 (-27.71%)
          30       3.28          5.75 ( -75.37%)        3.63 (-10.80%)
          48       3.02          6.05 (-100.56%)        3.71 (-22.99%)
          79       2.88          6.30 (-118.90%)        3.75 (-30.26%)
          110      2.95          6.46 (-119.00%)        3.81 (-29.24%)
          128      3.05          6.42 (-110.08%)        3.88 (-27.04%)
      
      pound_times:
      
        threads   parent of       6e998916          4.7-rc7
                  6e998916         itself
          2        2.27          3.73 ( -64.71%)        3.65 (-61.14%)
          5        2.78          3.77 ( -35.56%)        3.45 (-23.98%)
          8        2.79          4.41 ( -57.71%)        3.52 (-26.05%)
          12       3.02          3.56 ( -17.94%)        3.29 ( -9.08%)
          21       3.10          4.61 ( -48.74%)        4.07 (-31.34%)
          30       3.33          5.75 ( -72.53%)        3.87 (-16.01%)
          48       2.96          6.06 (-105.04%)        3.79 (-28.10%)
          79       2.88          6.24 (-116.83%)        3.88 (-34.81%)
          110      2.98          6.37 (-114.08%)        3.90 (-31.12%)
          128      3.10          6.35 (-104.61%)        4.00 (-28.87%)
      
      The source code of the two benchmarks follows. To compile the two:
      
        NR_THREADS=42
        for FILE in pound_times pound_clock_gettime; do
            gcc -O2 -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE -lrt -lpthread
        done
      
      ==== BEGIN pound_times.c ====
      
      #include <stdio.h>
      #include <pthread.h>
      #include <sys/times.h>
      
      struct tms start;
      
      void *pound (void *threadid)
      {
        struct tms end;
        int oldutime = 0;
        int utime;
        int i;
        for (i = 0; i < 5000000 / NUM_THREADS; i++) {
                times(&end);
                utime = ((int)end.tms_utime - (int)start.tms_utime);
                if (oldutime > utime) {
                  printf("utime decreased, was %d, now %d!\n", oldutime, utime);
                }
                oldutime = utime;
        }
        pthread_exit(NULL);
      }
      
      int main()
      {
        pthread_t th[NUM_THREADS];
        long i;
        times(&start);
        for (i = 0; i < NUM_THREADS; i++) {
          pthread_create (&th[i], NULL, pound, (void *)i);
        }
        pthread_exit(NULL);
        return 0;
      }
      ==== END pound_times.c ====
      
      ==== BEGIN pound_clock_gettime.c ====
      
      #include <stdio.h>
      #include <pthread.h>
      #include <time.h>
      
      void *pound (void *threadid)
      {
      	struct timespec ts;
      	int rc, i;
      	unsigned long prev = 0, this = 0;
      
      	for (i = 0; i < 5000000 / NUM_THREADS; i++) {
      		rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
      		if (rc < 0)
      			perror("clock_gettime");
      		this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
      		if (0 && this < prev)
      			printf("%lu ns timewarp at iteration %d\n", prev - this, i);
      		prev = this;
      	}
      	pthread_exit(NULL);
      }
      
      int main()
      {
      	pthread_t th[NUM_THREADS];
      	long rc, i;
      	pid_t pgid;
      
      	for (i = 0; i < NUM_THREADS; i++) {
      		rc = pthread_create(&th[i], NULL, pound, (void *)i);
      		if (rc < 0)
      			perror("pthread_create");
      	}
      
      	pthread_exit(NULL);
      	return 0;
      }
      ==== END pound_clock_gettime.c ====
      Suggested-by: Mike Galbraith <mgalbraith@suse.de>
      Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6075620b
  10. 13 Jul, 2016 1 commit
  11. 10 Jul, 2016 1 commit
  12. 27 Jun, 2016 3 commits
    • sched/core: Fix sched_getaffinity() return value kerneldoc comment · 599b4840
      Zev Weiss authored
      
      
      The previous version was probably written referencing the man page for
      glibc's wrapper, but the wrapper's behavior differs from that of the
      syscall itself in this case.
      Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1466975603-25408-1-git-send-email-zev@bewilderbeest.net
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      599b4840
    • sched/fair: Reorder cgroup creation code · 8663e24d
      Peter Zijlstra authored
      
      
      A future patch needs rq->lock held _after_ we link the task_group into
      the hierarchy. In order to avoid taking every rq->lock twice, reorder
      things a little and create online_fair_sched_group() to be called
      after we link the task_group.
      
      All this code is still run from css_alloc(), so css_online() isn't in
      fact used for this.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8663e24d
    • sched/fair: Fix PELT integrity for new tasks · 7dc603c9
      Peter Zijlstra authored
      
      
      Vincent and Yuyang found another few scenarios in which entity
      tracking goes wobbly.
      
      The scenarios are basically due to the fact that new tasks are not
      immediately attached and thereby differ from the normal situation -- a
      task is always attached to a cfs_rq load average (such that it
      includes its blocked contribution) and is explicitly
      detached/attached on migration to another cfs_rq.
      
      Scenario 1: switch to fair class
      
        p->sched_class = fair_class;
        if (queued)
          enqueue_task(p);
            ...
              enqueue_entity()
      	  enqueue_entity_load_avg()
      	    migrated = !sa->last_update_time (true)
      	    if (migrated)
      	      attach_entity_load_avg()
        check_class_changed()
          switched_from() (!fair)
          switched_to()   (fair)
            switched_to_fair()
              attach_entity_load_avg()
      
      If @p is a new task that hasn't been fair before, it will have
      !last_update_time and, per the above, end up in
      attach_entity_load_avg() _twice_.
      
      Scenario 2: change between cgroups
      
        sched_move_group(p)
          if (queued)
            dequeue_task()
          task_move_group_fair()
            detach_task_cfs_rq()
              detach_entity_load_avg()
            set_task_rq()
            attach_task_cfs_rq()
              attach_entity_load_avg()
          if (queued)
            enqueue_task();
              ...
                enqueue_entity()
      	    enqueue_entity_load_avg()
      	      migrated = !sa->last_update_time (true)
      	      if (migrated)
      	        attach_entity_load_avg()
      
      As with scenario 1, if @p is a new task, it will have
      !last_update_time and we'll end up in attach_entity_load_avg()
      _twice_.
      
      Furthermore, notice how we do a detach_entity_load_avg() on something
      that wasn't attached to begin with.
      
      As stated above; the problem is that the new task isn't yet attached
      to the load tracking and thereby violates the invariant assumption.
      
      This patch remedies this by ensuring a new task is indeed properly
      attached to the load tracking on creation, through
      post_init_entity_util_avg().
      
      Of course, this isn't entirely as straightforward as one might think,
      since the task is hashed before we call wake_up_new_task() and thus
      can be poked at. We avoid this by adding TASK_NEW and teaching
      cpu_cgroup_can_attach() to refuse such tasks.
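      
      A hedged sketch of the refusal in cpu_cgroup_can_attach() (simplified;
      the real check sits alongside the existing RT/DL checks):
      
      	static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
      	{
      		struct task_struct *task;
      		struct cgroup_subsys_state *css;
      
      		cgroup_taskset_for_each(task, css, tset) {
      			/*
      			 * A freshly forked task is not attached to the load
      			 * tracking yet; moving it now would break the averages.
      			 */
      			if (task->state == TASK_NEW)
      				return -EINVAL;
      		}
      		return 0;
      	}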
      Reported-by: Yuyang Du <yuyang.du@intel.com>
      Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7dc603c9