diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2012-05-22 18:27:32 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2012-05-22 18:27:32 -0700 |
commit | d79ee93de909dfb252279b9a95978bbda9a814a9 (patch) | |
tree | bfccca60fd36259ff4bcc5e78a2c272fbd680065 | |
parent | 2ff2b289a695807e291e1ed9f639d8a3ba5f4254 (diff) | |
parent | 1c2927f18576d65631d8e0ddd19e1d023183222e (diff) |
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
"The biggest change is the cleanup/simplification of the load-balancer:
instead of the current practice of architectures twiddling scheduler
internal data structures and providing the scheduler domains in
colorfully inconsistent ways, we now have generic scheduler code in
kernel/sched/core.c:sched_init_numa() that looks at the architecture's
node_distance() parameters and (while not fully trusting it) deducts a
NUMA topology from it.
This inevitably changes balancing behavior - hopefully for the better.
There are various smaller optimizations, cleanups and fixlets as well"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Taint kernel with TAINT_WARN after sleep-in-atomic bug
sched: Remove stale power aware scheduling remnants and dysfunctional knobs
sched/debug: Fix printing large integers on 32-bit platforms
sched/fair: Improve the ->group_imb logic
sched/nohz: Fix rq->cpu_load[] calculations
sched/numa: Don't scale the imbalance
sched/fair: Revert sched-domain iteration breakage
sched/x86: Rewrite set_cpu_sibling_map()
sched/numa: Fix the new NUMA topology bits
sched/numa: Rewrite the CONFIG_NUMA sched domain support
sched/fair: Propagate 'struct lb_env' usage into find_busiest_group
sched/fair: Add some serialization to the sched_domain load-balance walk
sched/fair: Let minimally loaded cpu balance the group
sched: Change rq->nr_running to unsigned int
x86/numa: Check for nonsensical topologies on real hw as well
x86/numa: Hard partition cpu topology masks on node boundaries
x86/numa: Allow specifying node_distance() for numa=fake
x86/sched: Make mwait_usable() heed to "idle=" kernel parameters properly
sched: Update documentation and comments
sched_rt: Avoid unnecessary dequeue and enqueue of pushable tasks in set_cpus_allowed_rt()
25 files changed, 441 insertions, 1005 deletions
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index e7be75b96e4..5dab36448b4 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -9,31 +9,6 @@ Description: /sys/devices/system/cpu/cpu#/ -What: /sys/devices/system/cpu/sched_mc_power_savings - /sys/devices/system/cpu/sched_smt_power_savings -Date: June 2006 -Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org> -Description: Discover and adjust the kernel's multi-core scheduler support. - - Possible values are: - - 0 - No power saving load balance (default value) - 1 - Fill one thread/core/package first for long running threads - 2 - Also bias task wakeups to semi-idle cpu package for power - savings - - sched_mc_power_savings is dependent upon SCHED_MC, which is - itself architecture dependent. - - sched_smt_power_savings is dependent upon SCHED_SMT, which - is itself architecture dependent. - - The two files are independent of each other. It is possible - that one file may be present without the other. - - Introduced by git commit 5c45bf27. - - What: /sys/devices/system/cpu/kernel_max /sys/devices/system/cpu/offline /sys/devices/system/cpu/online diff --git a/Documentation/scheduler/sched-design-CFS.txt b/Documentation/scheduler/sched-design-CFS.txt index 91ecff07ced..d529e02d928 100644 --- a/Documentation/scheduler/sched-design-CFS.txt +++ b/Documentation/scheduler/sched-design-CFS.txt @@ -130,7 +130,7 @@ CFS implements three scheduling policies: idle timer scheduler in order to avoid to get into priority inversion problems which would deadlock the machine. -SCHED_FIFO/_RR are implemented in sched_rt.c and are as specified by +SCHED_FIFO/_RR are implemented in sched/rt.c and are as specified by POSIX. The command chrt from util-linux-ng 2.13.1.1 can set all of these except @@ -145,9 +145,9 @@ Classes," an extensible hierarchy of scheduler modules. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core code assuming too much about them. -sched_fair.c implements the CFS scheduler described above. +sched/fair.c implements the CFS scheduler described above. -sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than +sched/rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than the previous vanilla scheduler did. It uses 100 runqueues (for all 100 RT priority levels, instead of 140 in the previous scheduler) and it needs no expired array. diff --git a/Documentation/scheduler/sched-domains.txt b/Documentation/scheduler/sched-domains.txt index b7ee379b651..443f0c76bab 100644 --- a/Documentation/scheduler/sched-domains.txt +++ b/Documentation/scheduler/sched-domains.txt @@ -61,10 +61,6 @@ The implementor should read comments in include/linux/sched.h: struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of the specifics and what to tune. -For SMT, the architecture must define CONFIG_SCHED_SMT and provide a -cpumask_t cpu_sibling_map[NR_CPUS], where cpu_sibling_map[i] is the mask of -all "i"'s siblings as well as "i" itself. - Architectures may retain the regular override the default SD_*_INIT flags while using the generic domain builder in kernel/sched.c if they wish to retain the traditional SMT->SMP->NUMA topology (or some subset of that). This diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index 09f646753d1..a2496e449b7 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -70,31 +70,6 @@ void build_cpu_to_node_map(void); .nr_balance_failed = 0, \ } -/* sched_domains SD_NODE_INIT for IA64 NUMA machines */ -#define SD_NODE_INIT (struct sched_domain) { \ - .parent = NULL, \ - .child = NULL, \ - .groups = NULL, \ - .min_interval = 8, \ - .max_interval = 8*(min(num_online_cpus(), 32U)), \ - .busy_factor = 64, \ - .imbalance_pct = 125, \ - .cache_nice_tries = 2, \ - .busy_idx = 3, \ - .idle_idx = 2, \ - .newidle_idx = 0, \ - .wake_idx = 0, \ - .forkexec_idx = 0, \ - .flags = SD_LOAD_BALANCE \ - | SD_BALANCE_NEWIDLE \ - | SD_BALANCE_EXEC \ - | SD_BALANCE_FORK \ - | SD_SERIALIZE, \ - .last_balance = jiffies, \ - .balance_interval = 64, \ - .nr_balance_failed = 0, \ -} - #endif /* CONFIG_NUMA */ #ifdef CONFIG_SMP diff --git a/arch/mips/include/asm/mach-ip27/topology.h b/arch/mips/include/asm/mach-ip27/topology.h index 1b1a7d1632b..b2cf641f206 100644 --- a/arch/mips/include/asm/mach-ip27/topology.h +++ b/arch/mips/include/asm/mach-ip27/topology.h @@ -36,23 +36,6 @@ extern unsigned char __node_distances[MAX_COMPACT_NODES][MAX_COMPACT_NODES]; #define node_distance(from, to) (__node_distances[(from)][(to)]) -/* sched_domains SD_NODE_INIT for SGI IP27 machines */ -#define SD_NODE_INIT (struct sched_domain) { \ - .parent = NULL, \ - .child = NULL, \ - .groups = NULL, \ - .min_interval = 8, \ - .max_interval = 32, \ - .busy_factor = 32, \ - .imbalance_pct = 125, \ - .cache_nice_tries = 1, \ - .flags = SD_LOAD_BALANCE | \ - SD_BALANCE_EXEC, \ - .last_balance = jiffies, \ - .balance_interval = 1, \ - .nr_balance_failed = 0, \ -} - #include <asm-generic/topology.h> #endif /* _ASM_MACH_TOPOLOGY_H */ diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index c97185885c6..852ed1b384f 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -18,12 +18,6 @@ struct device_node; */ #define RECLAIM_DISTANCE 10 -/* - * Avoid creating an extra level of balancing (SD_ALLNODES) on the largest - * POWER7 boxes which have a maximum of 32 nodes. - */ -#define SD_NODES_PER_DOMAIN 32 - #include <asm/mmzone.h> static inline int cpu_to_node(int cpu) @@ -51,36 +45,6 @@ static inline int pcibus_to_node(struct pci_bus *bus) cpu_all_mask : \ cpumask_of_node(pcibus_to_node(bus))) -/* sched_domains SD_NODE_INIT for PPC64 machines */ -#define SD_NODE_INIT (struct sched_domain) { \ - .min_interval = 8, \ - .max_interval = 32, \ - .busy_factor = 32, \ - .imbalance_pct = 125, \ - .cache_nice_tries = 1, \ - .busy_idx = 3, \ - .idle_idx = 1, \ - .newidle_idx = 0, \ - .wake_idx = 0, \ - .forkexec_idx = 0, \ - \ - .flags = 1*SD_LOAD_BALANCE \ - | 0*SD_BALANCE_NEWIDLE \ - | 1*SD_BALANCE_EXEC \ - | 1*SD_BALANCE_FORK \ - | 0*SD_BALANCE_WAKE \ - | 1*SD_WAKE_AFFINE \ - | 0*SD_PREFER_LOCAL \ - | 0*SD_SHARE_CPUPOWER \ - | 0*SD_POWERSAVINGS_BALANCE \ - | 0*SD_SHARE_PKG_RESOURCES \ - | 1*SD_SERIALIZE \ - | 0*SD_PREFER_SIBLING \ - , \ - .last_balance = jiffies, \ - .balance_interval = 1, \ -} - extern int __node_distance(int, int); #define node_distance(a, b) __node_distance(a, b) diff --git a/arch/sh/include/asm/topology.h b/arch/sh/include/asm/topology.h index 88e734069fa..b0a282d65f6 100644 --- a/arch/sh/include/asm/topology.h +++ b/arch/sh/include/asm/topology.h @@ -3,31 +3,6 @@ #ifdef CONFIG_NUMA -/* sched_domains SD_NODE_INIT for sh machines */ -#define SD_NODE_INIT (struct sched_domain) { \ - .parent = NULL, \ - .child = NULL, \ - .groups = NULL, \ - .min_interval = 8, \ - .max_interval = 32, \ - .busy_factor = 32, \ - .imbalance_pct = 125, \ - .cache_nice_tries = 2, \ - .busy_idx = 3, \ - .idle_idx = 2, \ - .newidle_idx = 0, \ - .wake_idx = 0, \ - .forkexec_idx = 0, \ - .flags = SD_LOAD_BALANCE \ - | SD_BALANCE_FORK \ - | SD_BALANCE_EXEC \ - | SD_BALANCE_NEWIDLE \ - | SD_SERIALIZE, \ - .last_balance = jiffies, \ - .balance_interval = 1, \ - .nr_balance_failed = 0, \ -} - #define cpu_to_node(cpu) ((void)(cpu),0) #define parent_node(node) ((void)(node),0) diff --git a/arch/sparc/include/asm/topology_64.h b/arch/sparc/include/asm/topology_64.h index 8b9c556d630..1754390a426 100644 --- a/arch/sparc/include/asm/topology_64.h +++ b/arch/sparc/include/asm/topology_64.h @@ -31,25 +31,6 @@ static inline int pcibus_to_node(struct pci_bus *pbus) cpu_all_mask : \ cpumask_of_node(pcibus_to_node(bus))) -#define SD_NODE_INIT (struct sched_domain) { \ - .min_interval = 8, \ - .max_interval = 32, \ - .busy_factor = 32, \ - .imbalance_pct = 125, \ - .cache_nice_tries = 2, \ - .busy_idx = 3, \ - .idle_idx = 2, \ - .newidle_idx = 0, \ - .wake_idx = 0, \ - .forkexec_idx = 0, \ - .flags = SD_LOAD_BALANCE \ - | SD_BALANCE_FORK \ - | SD_BALANCE_EXEC \ - | SD_SERIALIZE, \ - .last_balance = jiffies, \ - .balance_interval = 1, \ -} - #else /* CONFIG_NUMA */ #include <asm-generic/topology.h> diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index 6fdd0c86019..7a7ce390534 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -78,32 +78,6 @@ static inline const struct cpumask *cpumask_of_node(int node) .balance_interval = 32, \ } -/* sched_domains SD_NODE_INIT for TILE architecture */ -#define SD_NODE_INIT (struct sched_domain) { \ - .min_interval = 16, \ - .max_interval = 512, \ - .busy_factor = 32, \ - .imbalance_pct = 125, \ - .cache_nice_tries = 1, \ - .busy_idx = 3, \ - .idle_idx = 1, \ - .newidle_idx = 2, \ - .wake_idx = 1, \ - .flags = 1*SD_LOAD_BALANCE \ - | 1*SD_BALANCE_NEWIDLE \ - | 1*SD_BALANCE_EXEC \ - | 1*SD_BALANCE_FORK \ - | 0*SD_BALANCE_WAKE \ - | 0*SD_WAKE_AFFINE \ - | 0*SD_PREFER_LOCAL \ - | 0*SD_SHARE_CPUPOWER \ - | 0*SD_SHARE_PKG_RESOURCES \ - | 1*SD_SERIALIZE \ - , \ - .last_balance = jiffies, \ - .balance_interval = 128, \ -} - /* By definition, we create nodes based on online memory. */ #define node_has_online_mem(nid) 1 diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index b9676ae37ad..095b21507b6 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -92,44 +92,6 @@ extern void setup_node_to_cpumask_map(void); #define pcibus_to_node(bus) __pcibus_to_node(bus) -#ifdef CONFIG_X86_32 -# define SD_CACHE_NICE_TRIES 1 -# define SD_IDLE_IDX 1 -#else -# define SD_CACHE_NICE_TRIES 2 -# define SD_IDLE_IDX 2 -#endif - -/* sched_domains SD_NODE_INIT for NUMA machines */ -#define SD_NODE_INIT (struct sched_domain) { \ - .min_interval = 8, \ - .max_interval = 32, \ - .busy_factor = 32, \ - .imbalance_pct = 125, \ - .cache_nice_tries = SD_CACHE_NICE_TRIES, \ - .busy_idx = 3, \ - .idle_idx = SD_IDLE_IDX, \ - .newidle_idx = 0, \ - .wake_idx = 0, \ - .forkexec_idx = 0, \ - \ - .flags = 1*SD_LOAD_BALANCE \ - | 1*SD_BALANCE_NEWIDLE \ - | 1*SD_BALANCE_EXEC \ - | 1*SD_BALANCE_FORK \ - | 0*SD_BALANCE_WAKE \ - | 1*SD_WAKE_AFFINE \ - | 0*SD_PREFER_LOCAL \ - | 0*SD_SHARE_CPUPOWER \ - | 0*SD_POWERSAVINGS_BALANCE \ - | 0*SD_SHARE_PKG_RESOURCES \ - | 1*SD_SERIALIZE \ - | 0*SD_PREFER_SIBLING \ - , \ - .last_balance = jiffies, \ - .balance_interval = 1, \ -} - extern int __node_distance(int, int); #define node_distance(a, b) __node_distance(a, b) diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index dc8ca8ea78c..8040b752ee4 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -582,9 +582,17 @@ int mwait_usable(const struct cpuinfo_x86 *c) { u32 eax, ebx, ecx, edx; + /* Use mwait if idle=mwait boot option is given */ if (boot_option_idle_override == IDLE_FORCE_MWAIT) return 1; + /* + * Any idle= boot option other than idle=mwait means that we must not + * use mwait. Eg: idle=halt or idle=poll or idle=nomwait + */ + if (boot_option_idle_override != IDLE_NO_OVERRIDE) + return 0; + if (c->cpuid_level < MWAIT_INFO) return 0; diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 3acaf51dfdd..433529e29be 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -299,59 +299,90 @@ void __cpuinit smp_store_cpu_info(int id) identify_secondary_cpu(c); } -static void __cpuinit link_thread_siblings(int cpu1, int cpu2) +static bool __cpuinit +topology_sane(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o, const char *name) { - cpumask_set_cpu(cpu1, cpu_sibling_mask(cpu2)); - cpumask_set_cpu(cpu2, cpu_sibling_mask(cpu1)); - cpumask_set_cpu(cpu1, cpu_core_mask(cpu2)); - cpumask_set_cpu(cpu2, cpu_core_mask(cpu1)); - cpumask_set_cpu(cpu1, cpu_llc_shared_mask(cpu2)); - cpumask_set_cpu(cpu2, cpu_llc_shared_mask(cpu1)); + int cpu1 = c->cpu_index, cpu2 = o->cpu_index; + + return !WARN_ONCE(cpu_to_node(cpu1) != cpu_to_node(cpu2), + "sched: CPU #%d's %s-sibling CPU #%d is not on the same node! " + "[node: %d != %d]. Ignoring dependency.\n", + cpu1, name, cpu2, cpu_to_node(cpu1), cpu_to_node(cpu2)); } +#define link_mask(_m, c1, c2) \ +do { \ + cpumask_set_cpu((c1), cpu_##_m##_mask(c2)); \ + cpumask_set_cpu((c2), cpu_##_m##_mask(c1)); \ +} while (0) + +static bool __cpuinit match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o) +{ + if (cpu_has(c, X86_FEATURE_TOPOEXT)) { + int cpu1 = c->cpu_index, cpu2 = o->cpu_index; + + if (c->phys_proc_id == o->phys_proc_id && + per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2) && + c->compute_unit_id == o->compute_unit_id) + return topology_sane(c, o, "smt"); + + } else if (c->phys_proc_id == o->phys_proc_id && + c->cpu_core_id == o->cpu_core_id) { + return topology_sane(c, o, "smt"); + } + + return false; +} + +static bool __cpuinit match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o) +{ + int cpu1 = c->cpu_index, cpu2 = o->cpu_index; + + if (per_cpu(cpu_llc_id, cpu1) != BAD_APICID && + per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2)) + return topology_sane(c, o, "llc"); + + return false; +} + +static bool __cpuinit match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o) +{ + if (c->phys_proc_id == o->phys_proc_id) + return topology_sane(c, o, "mc"); + + return false; +} void __cpuinit set_cpu_sibling_map(int cpu) { - int i; + bool has_mc = boot_cpu_data.x86_max_cores > 1; + bool has_smt = smp_num_siblings > 1; struct cpuinfo_x86 *c = &cpu_data(cpu); + struct cpuinfo_x86 *o; + int i; cpumask_set_cpu(cpu, cpu_sibling_setup_mask); - if (smp_num_siblings > 1) { - for_each_cpu(i, cpu_sibling_setup_mask) { - struct cpuinfo_x86 *o = &cpu_data(i); - - if (cpu_has(c, X86_FEATURE_TOPOEXT)) { - if (c->phys_proc_id == o->phys_proc_id && - per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i) && - c->compute_unit_id == o->compute_unit_id) - link_thread_siblings(cpu, i); - } else if (c->phys_proc_id == o->phys_proc_id && - c->cpu_core_id == o->cpu_core_id) { - link_thread_siblings(cpu, i); - } - } - } else { + if (!has_smt && !has_mc) { cpumask_set_cpu(cpu, cpu_sibling_mask(cpu)); - } - - cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu)); - - if (__this_cpu_read(cpu_info.x86_max_cores) == 1) { - cpumask_copy(cpu_core_mask(cpu), cpu_sibling_mask(cpu)); + cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu)); + cpumask_set_cpu(cpu, cpu_core_mask(cpu)); c->booted_cores = 1; return; } for_each_cpu(i, cpu_sibling_setup_mask) { - if (per_cpu(cpu_llc_id, cpu) != BAD_APICID && - per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i)) { - cpumask_set_cpu(i, cpu_llc_shared_mask(cpu)); - cpumask_set_cpu(cpu, cpu_llc_shared_mask(i)); - } - if (c->phys_proc_id == cpu_data(i).phys_proc_id) { - cpumask_set_cpu(i, cpu_core_mask(cpu)); - cpumask_set_cpu(cpu, cpu_core_mask(i)); + o = &cpu_data(i); + + if ((i == cpu) || (has_smt && match_smt(c, o))) + link_mask(sibling, cpu, i); + + if ((i == cpu) || (has_mc && match_llc(c, o))) + link_mask(llc_shared, cpu, i); + + if ((i == cpu) || (has_mc && match_mc(c, o))) { + link_mask(core, cpu, i); + /* * Does this new cpu bringup a new core? */ @@ -382,8 +413,7 @@ const struct cpumask *cpu_coregroup_mask(int cpu) * For perf, we return last level cache shared map. * And for power savings, we return cpu_core_map */ - if ((sched_mc_power_savings || sched_smt_power_savings) && - !(cpu_has(c, X86_FEATURE_AMD_DCM))) + if (!(cpu_has(c, X86_FEATURE_AMD_DCM))) return cpu_core_mask(cpu); else return cpu_llc_shared_mask(cpu); diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c index 53489ff6bf8..871dd886817 100644 --- a/arch/x86/mm/numa_emulation.c +++ b/arch/x86/mm/numa_emulation.c @@ -339,9 +339,11 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt) } else { unsigned long n; - n = simple_strtoul(emu_cmdline, NULL, 0); + n = simple_strtoul(emu_cmdline, &emu_cmdline, 0); ret = split_nodes_interleave(&ei, &pi, 0, max_addr, n); } + if (*emu_cmdline == ':') + emu_cmdline++; if (ret < 0) goto no_emu; @@ -418,7 +420,9 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt) int physj = emu_nid_to_phys[j]; int dist; - if (physi >= numa_dist_cnt || physj >= numa_dist_cnt) + if (get_option(&emu_cmdline, &dist) == 2) + ; + else if (physi >= numa_dist_cnt || physj >= numa_dist_cnt) dist = physi == physj ? LOCAL_DISTANCE : REMOTE_DISTANCE; else diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index adf937bf409..63452943abd 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -330,8 +330,4 @@ void __init cpu_dev_init(void) panic("Failed to register CPU subsystem"); cpu_dev_register_generic(); - -#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT) - sched_create_sysfs_power_savings_entries(cpu_subsys.dev_root); -#endif } diff --git a/include/linux/cpu.h b/include/linux/cpu.h index ee28844ae68..7230bb59a06 100644 --- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -36,8 +36,6 @@ extern void cpu_remove_dev_attr(struct device_attribute *attr); extern int cpu_add_dev_attr_group(struct attribute_group *attrs); extern void cpu_remove_dev_attr_group(struct attribute_group *attrs); -extern int sched_create_sysfs_power_savings_entries(struct device *dev); - #ifdef CONFIG_HOTPLUG_CPU extern void unregister_cpu(struct cpu *cpu); extern ssize_t arch_cpu_probe(const char *, size_t); diff --git a/include/linux/sched.h b/include/linux/sched.h index f774d88cd0a..28fa9d02fd5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -855,61 +855,14 @@ enum cpu_idle_type { #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define SD_PREFER_LOCAL 0x0040 /* Prefer to keep tasks local to this domain */ #define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ -#define SD_POWERSAVINGS_BALANCE 0x0100 /* Balance for power savings */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ #define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */ #define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */ #define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */ -enum powersavings_balance_level { - POWERSAVINGS_BALANCE_NONE = 0, /* No power saving load balance */ - POWERSAVINGS_BALANCE_BASIC, /* Fill one thread/core/package - * first for long running threads - */ - POWERSAVINGS_BALANCE_WAKEUP, /* Also bias task wakeups to semi-idle - * cpu package for power savings - */ - MAX_POWERSAVINGS_BALANCE_LEVELS -}; - -extern int sched_mc_power_savings, sched_smt_power_savings; - -static inline int sd_balance_for_mc_power(void) -{ - if (sched_smt_power_savings) - return SD_POWERSAVINGS_BALANCE; - - if (!sched_mc_power_savings) - return SD_PREFER_SIBLING; - - return 0; -} - -static inline int sd_balance_for_package_power(void) -{ - if (sched_mc_power_savings | sched_smt_power_savings) - return SD_POWERSAVINGS_BALANCE; - - return SD_PREFER_SIBLING; -} - extern int __weak arch_sd_sibiling_asym_packing(void); -/* - * Optimise SD flags for power savings: - * SD_BALANCE_NEWIDLE helps aggressive task consolidation and power savings. - * Keep default SD flags if sched_{smt,mc}_power_saving=0 - */ - -static inline int sd_power_saving_flags(void) -{ - if (sched_mc_power_savings | sched_smt_power_savings) - return SD_BALANCE_NEWIDLE; - - return 0; -} - struct sched_group_power { atomic_t ref; /* @@ -1962,7 +1915,7 @@ static inline int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask) */ extern unsigned long long notrace sched_clock(void); /* - * See the comment in kernel/sched_clock.c + * See the comment in kernel/sched/clock.c */ extern u64 cpu_clock(int cpu); extern u64 local_clock(void); diff --git a/include/linux/topology.h b/include/linux/topology.h index 9dc427cdb6f..e91cd43394d 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -70,7 +70,6 @@ int arch_update_cpu_topology(void); * Below are the 3 major initializers used in building sched_domains: * SD_SIBLING_INIT, for SMT domains * SD_CPU_INIT, for SMP domains - * SD_NODE_INIT, for NUMA domains * * Any architecture that cares to do any tuning to these values should do so * by defining their own arch-specific initializer in include/asm/topology.h. @@ -99,7 +98,6 @@ int arch_update_cpu_topology(void); | 0*SD_BALANCE_WAKE \ | 1*SD_WAKE_AFFINE \ | 1*SD_SHARE_CPUPOWER \ - | 0*SD_POWERSAVINGS_BALANCE \ | 1*SD_SHARE_PKG_RESOURCES \ | 0*SD_SERIALIZE \ | 0*SD_PREFER_SIBLING \ @@ -135,8 +133,6 @@ int arch_update_cpu_topology(void); | 0*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \ | 0*SD_SERIALIZE \ - | sd_balance_for_mc_power() \ - | sd_power_saving_flags() \ , \ .last_balance = jiffies, \ .balance_interval = 1, \ @@ -168,56 +164,18 @@ int arch_update_cpu_topology(void); | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ | 0*SD_SERIALIZE \ - | sd_balance_for_package_power() \ - | sd_power_saving_flags() \ , \ .last_balance = jiffies, \ .balance_interval = 1, \ } #endif -/* sched_domains SD_ALLNODES_INIT for NUMA machines */ -#define SD_ALLNODES_INIT (struct sched_domain) { \ - .min_interval = 64, \ - .max_interval = 64*num_online_cpus(), \ - .busy_factor = 128, \ - .imbalance_pct = 133, \ - .cache_nice_tries = 1, \ - .busy_idx = 3, \ - .idle_idx = 3, \ - .flags = 1*SD_LOAD_BALANCE \ - | 1*SD_BALANCE_NEWIDLE \ - | 0*SD_BALANCE_EXEC \ - | 0*SD_BALANCE_FORK \ - | 0*SD_BALANCE_WAKE \ - | 0*SD_WAKE_AFFINE \ - | 0*SD_SHARE_CPUPOWER \ - | 0*SD_POWERSAVINGS_BALANCE \ - | 0*SD_SHARE_PKG_RESOURCES \ - | 1*SD_SERIALIZE \ - | 0*SD_PREFER_SIBLING \ - , \ - .last_balance = jiffies, \ - .balance_interval = 64, \ -} - -#ifndef SD_NODES_PER_DOMAIN -#define SD_NODES_PER_DOMAIN 16 -#endif - #ifdef CONFIG_SCHED_BOOK #ifndef SD_BOOK_INIT #error Please define an appropriate SD_BOOK_INIT in include/asm/topology.h!!! #endif #endif /* CONFIG_SCHED_BOOK */ -#ifdef CONFIG_NUMA -#ifndef SD_NODE_INIT -#error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!! -#endif - -#endif /* CONFIG_NUMA */ - #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID DECLARE_PER_CPU(int, numa_node); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d2e2e173d8f..d833cc94eed 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -693,8 +693,6 @@ int tg_nop(struct task_group *tg, void *data) } #endif -void update_cpu_load(struct rq *this_rq); - static void set_load_weight(struct task_struct *p) { int prio = p->static_prio - MAX_RT_PRIO; @@ -2481,22 +2479,13 @@ decay_load_missed(unsigned long load, unsigned long missed_updates, int idx) * scheduler tick (TICK_NSEC). With tickless idle this will not be called * every tick. We fix it up based on jiffies. */ -void update_cpu_load(struct rq *this_rq) +static void __update_cpu_load(struct rq *this_rq, unsigned long this_load, + unsigned long pending_updates) { - unsigned long this_load = this_rq->load.weight; - unsigned long curr_jiffies = jiffies; - unsigned long pending_updates; int i, scale; this_rq->nr_load_updates++; - /* Avoid repeated calls on same jiffy, when moving in and out of idle */ - if (curr_jiffies == this_rq->last_load_update_tick) - return; - - pending_updates = curr_jiffies - this_rq->last_load_update_tick; - this_rq->last_load_update_tick = curr_jiffies; - /* Update our load: */ this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */ for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) { @@ -2521,9 +2510,45 @@ void update_cpu_load(struct rq *this_rq) sched_avg_update(this_rq); } +/* + * Called from nohz_idle_balance() to update the load ratings before doing the + * idle balance. + */ +void update_idle_cpu_load(struct rq *this_rq) +{ + unsigned long curr_jiffies = jiffies; + unsigned long load = this_rq->load.weight; + unsigned long pending_updates; + + /* + * Bloody broken means of dealing with nohz, but better than nothing.. + * jiffies is updated by one cpu, another cpu can drift wrt the jiffy + * update and see 0 difference the one time and 2 the next, even though + * we ticked at roughtly the same rate. + * + * Hence we only use this from nohz_idle_balance() and skip this + * nonsense when called from the scheduler_tick() since that's + * guaranteed a stable rate. + */ + if (load || curr_jiffies == this_rq->last_load_update_tick) + return; + + pending_updates = curr_jiffies - this_rq->last_load_update_tick; + thi |