aboutsummaryrefslogtreecommitdiff
path: root/Documentation/power
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/power')
-rw-r--r--Documentation/power/00-INDEX6
-rw-r--r--Documentation/power/basic-pm-debugging.txt12
-rw-r--r--Documentation/power/devices.txt51
-rw-r--r--Documentation/power/freezing-of-tasks.txt5
-rw-r--r--Documentation/power/interface.txt4
-rw-r--r--Documentation/power/notifiers.txt6
-rw-r--r--Documentation/power/opp.txt163
-rw-r--r--Documentation/power/pm_qos_interface.txt104
-rw-r--r--Documentation/power/power_supply_class.txt8
-rw-r--r--Documentation/power/powercap/powercap.txt236
-rw-r--r--Documentation/power/runtime_pm.txt104
-rw-r--r--Documentation/power/states.txt97
-rw-r--r--Documentation/power/suspend-and-cpuhotplug.txt2
-rw-r--r--Documentation/power/swsusp.txt20
-rw-r--r--Documentation/power/video_extension.txt37
15 files changed, 607 insertions, 248 deletions
diff --git a/Documentation/power/00-INDEX b/Documentation/power/00-INDEX
index a4d682f5423..ad04cc8097e 100644
--- a/Documentation/power/00-INDEX
+++ b/Documentation/power/00-INDEX
@@ -4,6 +4,8 @@ apm-acpi.txt
- basic info about the APM and ACPI support.
basic-pm-debugging.txt
- Debugging suspend and resume
+charger-manager.txt
+ - Battery charger management.
devices.txt
- How drivers interact with system-wide power management
drivers-testing.txt
@@ -22,6 +24,8 @@ pm_qos_interface.txt
- info on Linux PM Quality of Service interface
power_supply_class.txt
- Tells userspace about battery, UPS, AC or DC power supply properties
+runtime_pm.txt
+ - Power management framework for I/O devices.
s2ram.txt
- How to get suspend to ram working (and debug it when it isn't)
states.txt
@@ -38,7 +42,5 @@ tricks.txt
- How to trick software suspend (to disk) into working when it isn't
userland-swsusp.txt
- Experimental implementation of software suspend in userspace
-video_extension.txt
- - ACPI video extensions
video.txt
- Video issues during resume from suspend
diff --git a/Documentation/power/basic-pm-debugging.txt b/Documentation/power/basic-pm-debugging.txt
index 262acf56fa7..edeecd447d2 100644
--- a/Documentation/power/basic-pm-debugging.txt
+++ b/Documentation/power/basic-pm-debugging.txt
@@ -172,14 +172,14 @@ you can boot the kernel with the 'no_console_suspend' parameter and try to log
kernel messages using the serial console. This may provide you with some
information about the reasons of the suspend (resume) failure. Alternatively,
it may be possible to use a FireWire port for debugging with firescope
-(ftp://ftp.firstfloor.org/pub/ak/firescope/). On x86 it is also possible to
+(http://v3.sk/~lkundrak/firescope/). On x86 it is also possible to
use the PM_TRACE mechanism documented in Documentation/power/s2ram.txt .
2. Testing suspend to RAM (STR)
To verify that the STR works, it is generally more convenient to use the s2ram
tool available from http://suspend.sf.net and documented at
-http://en.opensuse.org/SDB:Suspend_to_RAM.
+http://en.opensuse.org/SDB:Suspend_to_RAM (S2RAM_LINK).
Namely, after writing "freezer", "devices", "platform", "processors", or "core"
into /sys/power/pm_test (available if the kernel is compiled with
@@ -194,10 +194,10 @@ Among other things, the testing with the help of /sys/power/pm_test may allow
you to identify drivers that fail to suspend or resume their devices. They
should be unloaded every time before an STR transition.
-Next, you can follow the instructions at http://en.opensuse.org/s2ram to test
-the system, but if it does not work "out of the box", you may need to boot it
-with "init=/bin/bash" and test s2ram in the minimal configuration. In that
-case, you may be able to search for failing drivers by following the procedure
+Next, you can follow the instructions at S2RAM_LINK to test the system, but if
+it does not work "out of the box", you may need to boot it with
+"init=/bin/bash" and test s2ram in the minimal configuration. In that case,
+you may be able to search for failing drivers by following the procedure
analogous to the one described in section 1. If you find some failing drivers,
you will have to unload them every time before an STR transition (ie. before
you run s2ram), and please report the problems with them.
diff --git a/Documentation/power/devices.txt b/Documentation/power/devices.txt
index 504dfe4d52e..d172bce0fd4 100644
--- a/Documentation/power/devices.txt
+++ b/Documentation/power/devices.txt
@@ -2,6 +2,7 @@ Device Power Management
Copyright (c) 2010-2011 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
Copyright (c) 2010 Alan Stern <stern@rowland.harvard.edu>
+Copyright (c) 2014 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Most of the code in Linux is device drivers, so most of the Linux power
@@ -268,7 +269,7 @@ situations.
System Power Management Phases
------------------------------
Suspending or resuming the system is done in several phases. Different phases
-are used for standby or memory sleep states ("suspend-to-RAM") and the
+are used for freeze, standby, and memory sleep states ("suspend-to-RAM") and the
hibernation state ("suspend-to-disk"). Each phase involves executing callbacks
for every device before the next phase begins. Not all busses or classes
support all these callbacks and not all drivers use all the callbacks. The
@@ -309,7 +310,8 @@ execute the corresponding method from dev->driver->pm instead if there is one.
Entering System Suspend
-----------------------
-When the system goes into the standby or memory sleep state, the phases are:
+When the system goes into the freeze, standby or memory sleep state,
+the phases are:
prepare, suspend, suspend_late, suspend_noirq.
@@ -325,6 +327,20 @@ When the system goes into the standby or memory sleep state, the phases are:
driver in some way for the upcoming system power transition, but it
should not put the device into a low-power state.
+ For devices supporting runtime power management, the return value of the
+ prepare callback can be used to indicate to the PM core that it may
+ safely leave the device in runtime suspend (if runtime-suspended
+ already), provided that all of the device's descendants are also left in
+ runtime suspend. Namely, if the prepare callback returns a positive
+ number and that happens for all of the descendants of the device too,
+ and all of them (including the device itself) are runtime-suspended, the
+ PM core will skip the suspend, suspend_late and suspend_noirq suspend
+ phases as well as the resume_noirq, resume_early and resume phases of
+ the following system resume for all of these devices. In that case,
+ the complete callback will be called directly after the prepare callback
+ and is entirely responsible for bringing the device back to the
+ functional state as appropriate.
+
2. The suspend methods should quiesce the device to stop it from performing
I/O. They also may save the device registers and put it into the
appropriate low-power state, depending on the bus type the device is on,
@@ -368,7 +384,7 @@ the devices that were suspended.
Leaving System Suspend
----------------------
-When resuming from standby or memory sleep, the phases are:
+When resuming from freeze, standby or memory sleep, the phases are:
resume_noirq, resume_early, resume, complete.
@@ -390,7 +406,7 @@ When resuming from standby or memory sleep, the phases are:
the resume methods. This generally involves undoing the actions of the
preceding suspend_late phase.
- 3 The resume methods should bring the the device back to its operating
+ 3 The resume methods should bring the device back to its operating
state, so that it can perform normal I/O. This generally involves
undoing the actions of the suspend phase.
@@ -399,12 +415,23 @@ When resuming from standby or memory sleep, the phases are:
the resume callbacks occur; it's not necessary to wait until the
complete phase.
+ Moreover, if the preceding prepare callback returned a positive number,
+ the device may have been left in runtime suspend throughout the whole
+ system suspend and resume (the suspend, suspend_late, suspend_noirq
+ phases of system suspend and the resume_noirq, resume_early, resume
+ phases of system resume may have been skipped for it). In that case,
+ the complete callback is entirely responsible for bringing the device
+ back to the functional state after system suspend if necessary. [For
+ example, it may need to queue up a runtime resume request for the device
+ for this purpose.] To check if that is the case, the complete callback
+ can consult the device's power.direct_complete flag. Namely, if that
+ flag is set when the complete callback is being run, it has been called
+ directly after the preceding prepare and special action may be required
+ to make the device work correctly afterward.
+
At the end of these phases, drivers should be as functional as they were before
suspending: I/O can be performed using DMA and IRQs, and the relevant clocks are
-gated on. Even if the device was in a low-power state before the system sleep
-because of runtime power management, afterwards it should be back in its
-full-power state. There are multiple reasons why it's best to do this; they are
-discussed in more detail in Documentation/power/runtime_pm.txt.
+gated on.
However, the details here may again be platform-specific. For example,
some systems support multiple "run" states, and the mode in effect at
@@ -433,8 +460,8 @@ the system log.
Entering Hibernation
--------------------
-Hibernating the system is more complicated than putting it into the standby or
-memory sleep state, because it involves creating and saving a system image.
+Hibernating the system is more complicated than putting it into the other
+sleep states, because it involves creating and saving a system image.
Therefore there are more phases for hibernation, with a different set of
callbacks. These phases always run after tasks have been frozen and memory has
been freed.
@@ -485,8 +512,8 @@ image forms an atomic snapshot of the system state.
At this point the system image is saved, and the devices then need to be
prepared for the upcoming system shutdown. This is much like suspending them
-before putting the system into the standby or memory sleep state, and the phases
-are similar.
+before putting the system into the freeze, standby or memory sleep state,
+and the phases are similar.
9. The prepare phase is discussed above.
diff --git a/Documentation/power/freezing-of-tasks.txt b/Documentation/power/freezing-of-tasks.txt
index 6ec291ea1c7..85894d83b35 100644
--- a/Documentation/power/freezing-of-tasks.txt
+++ b/Documentation/power/freezing-of-tasks.txt
@@ -223,3 +223,8 @@ since they ask the freezer to skip freezing this task, since it is anyway
only after the entire suspend/hibernation sequence is complete.
So, to summarize, use [un]lock_system_sleep() instead of directly using
mutex_[un]lock(&pm_mutex). That would prevent freezing failures.
+
+V. Miscellaneous
+/sys/power/pm_freeze_timeout controls how long it will cost at most to freeze
+all user space processes or all freezable kernel threads, in unit of millisecond.
+The default value is 20000, with range of unsigned integer.
diff --git a/Documentation/power/interface.txt b/Documentation/power/interface.txt
index c537834af00..f1f0f59a7c4 100644
--- a/Documentation/power/interface.txt
+++ b/Documentation/power/interface.txt
@@ -7,8 +7,8 @@ running. The interface exists in /sys/power/ directory (assuming sysfs
is mounted at /sys).
/sys/power/state controls system power state. Reading from this file
-returns what states are supported, which is hard-coded to 'standby'
-(Power-On Suspend), 'mem' (Suspend-to-RAM), and 'disk'
+returns what states are supported, which is hard-coded to 'freeze',
+'standby' (Power-On Suspend), 'mem' (Suspend-to-RAM), and 'disk'
(Suspend-to-Disk).
Writing to this file one of those strings causes the system to
diff --git a/Documentation/power/notifiers.txt b/Documentation/power/notifiers.txt
index c2a4a346c0d..a81fa254303 100644
--- a/Documentation/power/notifiers.txt
+++ b/Documentation/power/notifiers.txt
@@ -15,8 +15,10 @@ A suspend/hibernation notifier may be used for this purpose.
The subsystems or drivers having such needs can register suspend notifiers that
will be called upon the following events by the PM core:
-PM_HIBERNATION_PREPARE The system is going to hibernate or suspend, tasks will
- be frozen immediately.
+PM_HIBERNATION_PREPARE The system is going to hibernate, tasks will be frozen
+ immediately. This is different from PM_SUSPEND_PREPARE
+ below because here we do additional work between notifiers
+ and drivers freezing.
PM_POST_HIBERNATION The system memory state has been restored from a
hibernation image or an error occurred during
diff --git a/Documentation/power/opp.txt b/Documentation/power/opp.txt
index 3035d00757a..a9adad828cd 100644
--- a/Documentation/power/opp.txt
+++ b/Documentation/power/opp.txt
@@ -1,6 +1,5 @@
-*=============*
-* OPP Library *
-*=============*
+Operating Performance Points (OPP) Library
+==========================================
(C) 2009-2010 Nishanth Menon <nm@ti.com>, Texas Instruments Incorporated
@@ -11,23 +10,38 @@ Contents
3. OPP Search Functions
4. OPP Availability Control Functions
5. OPP Data Retrieval Functions
-6. Cpufreq Table Generation
-7. Data Structures
+6. Data Structures
1. Introduction
===============
+1.1 What is an Operating Performance Point (OPP)?
+
Complex SoCs of today consists of a multiple sub-modules working in conjunction.
In an operational system executing varied use cases, not all modules in the SoC
need to function at their highest performing frequency all the time. To
facilitate this, sub-modules in a SoC are grouped into domains, allowing some
-domains to run at lower voltage and frequency while other domains are loaded
-more. The set of discrete tuples consisting of frequency and voltage pairs that
+domains to run at lower voltage and frequency while other domains run at
+voltage/frequency pairs that are higher.
+
+The set of discrete tuples consisting of frequency and voltage pairs that
the device will support per domain are called Operating Performance Points or
OPPs.
+As an example:
+Let us consider an MPU device which supports the following:
+{300MHz at minimum voltage of 1V}, {800MHz at minimum voltage of 1.2V},
+{1GHz at minimum voltage of 1.3V}
+
+We can represent these as three OPPs as the following {Hz, uV} tuples:
+{300000000, 1000000}
+{800000000, 1200000}
+{1000000000, 1300000}
+
+1.2 Operating Performance Points Library
+
OPP library provides a set of helper functions to organize and query the OPP
information. The library is located in drivers/base/power/opp.c and the header
-is located in include/linux/opp.h. OPP library can be enabled by enabling
+is located in include/linux/pm_opp.h. OPP library can be enabled by enabling
CONFIG_PM_OPP from power management menuconfig menu. OPP library depends on
CONFIG_PM as certain SoCs such as Texas Instrument's OMAP framework allows to
optionally boot at a certain OPP without needing cpufreq.
@@ -56,14 +70,13 @@ operations until that OPP could be re-enabled if possible.
OPP library facilitates this concept in it's implementation. The following
operational functions operate only on available opps:
-opp_find_freq_{ceil, floor}, opp_get_voltage, opp_get_freq, opp_get_opp_count
-and opp_init_cpufreq_table
+opp_find_freq_{ceil, floor}, dev_pm_opp_get_voltage, dev_pm_opp_get_freq, dev_pm_opp_get_opp_count
-opp_find_freq_exact is meant to be used to find the opp pointer which can then
-be used for opp_enable/disable functions to make an opp available as required.
+dev_pm_opp_find_freq_exact is meant to be used to find the opp pointer which can then
+be used for dev_pm_opp_enable/disable functions to make an opp available as required.
WARNING: Users of OPP library should refresh their availability count using
-get_opp_count if opp_enable/disable functions are invoked for a device, the
+get_opp_count if dev_pm_opp_enable/disable functions are invoked for a device, the
exact mechanism to trigger these or the notification mechanism to other
dependent subsystems such as cpufreq are left to the discretion of the SoC
specific framework which uses the OPP library. Similar care needs to be taken
@@ -81,24 +94,23 @@ using RCU read locks. The opp_find_freq_{exact,ceil,floor},
opp_get_{voltage, freq, opp_count} fall into this category.
opp_{add,enable,disable} are updaters which use mutex and implement it's own
-RCU locking mechanisms. opp_init_cpufreq_table acts as an updater and uses
-mutex to implment RCU updater strategy. These functions should *NOT* be called
-under RCU locks and other contexts that prevent blocking functions in RCU or
-mutex operations from working.
+RCU locking mechanisms. These functions should *NOT* be called under RCU locks
+and other contexts that prevent blocking functions in RCU or mutex operations
+from working.
2. Initial OPP List Registration
================================
-The SoC implementation calls opp_add function iteratively to add OPPs per
+The SoC implementation calls dev_pm_opp_add function iteratively to add OPPs per
device. It is expected that the SoC framework will register the OPP entries
optimally- typical numbers range to be less than 5. The list generated by
registering the OPPs is maintained by OPP library throughout the device
operation. The SoC framework can subsequently control the availability of the
-OPPs dynamically using the opp_enable / disable functions.
+OPPs dynamically using the dev_pm_opp_enable / disable functions.
-opp_add - Add a new OPP for a specific domain represented by the device pointer.
+dev_pm_opp_add - Add a new OPP for a specific domain represented by the device pointer.
The OPP is defined using the frequency and voltage. Once added, the OPP
is assumed to be available and control of it's availability can be done
- with the opp_enable/disable functions. OPP library internally stores
+ with the dev_pm_opp_enable/disable functions. OPP library internally stores
and manages this information in the opp struct. This function may be
used by SoC framework to define a optimal list as per the demands of
SoC usage environment.
@@ -109,7 +121,7 @@ opp_add - Add a new OPP for a specific domain represented by the device pointer.
soc_pm_init()
{
/* Do things */
- r = opp_add(mpu_dev, 1000000, 900000);
+ r = dev_pm_opp_add(mpu_dev, 1000000, 900000);
if (!r) {
pr_err("%s: unable to register mpu opp(%d)\n", r);
goto no_cpufreq;
@@ -128,44 +140,44 @@ functions return the matching pointer representing the opp if a match is
found, else returns error. These errors are expected to be handled by standard
error checks such as IS_ERR() and appropriate actions taken by the caller.
-opp_find_freq_exact - Search for an OPP based on an *exact* frequency and
+dev_pm_opp_find_freq_exact - Search for an OPP based on an *exact* frequency and
availability. This function is especially useful to enable an OPP which
is not available by default.
Example: In a case when SoC framework detects a situation where a
higher frequency could be made available, it can use this function to
- find the OPP prior to call the opp_enable to actually make it available.
+ find the OPP prior to call the dev_pm_opp_enable to actually make it available.
rcu_read_lock();
- opp = opp_find_freq_exact(dev, 1000000000, false);
+ opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false);
rcu_read_unlock();
/* dont operate on the pointer.. just do a sanity check.. */
if (IS_ERR(opp)) {
pr_err("frequency not disabled!\n");
/* trigger appropriate actions.. */
} else {
- opp_enable(dev,1000000000);
+ dev_pm_opp_enable(dev,1000000000);
}
NOTE: This is the only search function that operates on OPPs which are
not available.
-opp_find_freq_floor - Search for an available OPP which is *at most* the
+dev_pm_opp_find_freq_floor - Search for an available OPP which is *at most* the
provided frequency. This function is useful while searching for a lesser
match OR operating on OPP information in the order of decreasing
frequency.
Example: To find the highest opp for a device:
freq = ULONG_MAX;
rcu_read_lock();
- opp_find_freq_floor(dev, &freq);
+ dev_pm_opp_find_freq_floor(dev, &freq);
rcu_read_unlock();
-opp_find_freq_ceil - Search for an available OPP which is *at least* the
+dev_pm_opp_find_freq_ceil - Search for an available OPP which is *at least* the
provided frequency. This function is useful while searching for a
higher match OR operating on OPP information in the order of increasing
frequency.
Example 1: To find the lowest opp for a device:
freq = 0;
rcu_read_lock();
- opp_find_freq_ceil(dev, &freq);
+ dev_pm_opp_find_freq_ceil(dev, &freq);
rcu_read_unlock();
Example 2: A simplified implementation of a SoC cpufreq_driver->target:
soc_cpufreq_target(..)
@@ -173,7 +185,7 @@ opp_find_freq_ceil - Search for an available OPP which is *at least* the
/* Do stuff like policy checks etc. */
/* Find the best frequency match for the req */
rcu_read_lock();
- opp = opp_find_freq_ceil(dev, &freq);
+ opp = dev_pm_opp_find_freq_ceil(dev, &freq);
rcu_read_unlock();
if (!IS_ERR(opp))
soc_switch_to_freq_voltage(freq);
@@ -193,34 +205,34 @@ as thermal considerations (e.g. don't use OPPx until the temperature drops).
WARNING: Do not use these functions in interrupt context.
-opp_enable - Make a OPP available for operation.
+dev_pm_opp_enable - Make a OPP available for operation.
Example: Lets say that 1GHz OPP is to be made available only if the
SoC temperature is lower than a certain threshold. The SoC framework
implementation might choose to do something as follows:
if (cur_temp < temp_low_thresh) {
/* Enable 1GHz if it was disabled */
rcu_read_lock();
- opp = opp_find_freq_exact(dev, 1000000000, false);
+ opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false);
rcu_read_unlock();
/* just error check */
if (!IS_ERR(opp))
- ret = opp_enable(dev, 1000000000);
+ ret = dev_pm_opp_enable(dev, 1000000000);
else
goto try_something_else;
}
-opp_disable - Make an OPP to be not available for operation
+dev_pm_opp_disable - Make an OPP to be not available for operation
Example: Lets say that 1GHz OPP is to be disabled if the temperature
exceeds a threshold value. The SoC framework implementation might
choose to do something as follows:
if (cur_temp > temp_high_thresh) {
/* Disable 1GHz if it was enabled */
rcu_read_lock();
- opp = opp_find_freq_exact(dev, 1000000000, true);
+ opp = dev_pm_opp_find_freq_exact(dev, 1000000000, true);
rcu_read_unlock();
/* just error check */
if (!IS_ERR(opp))
- ret = opp_disable(dev, 1000000000);
+ ret = dev_pm_opp_disable(dev, 1000000000);
else
goto try_something_else;
}
@@ -232,7 +244,7 @@ information from the OPP structure is necessary. Once an OPP pointer is
retrieved using the search functions, the following functions can be used by SoC
framework to retrieve the information represented inside the OPP layer.
-opp_get_voltage - Retrieve the voltage represented by the opp pointer.
+dev_pm_opp_get_voltage - Retrieve the voltage represented by the opp pointer.
Example: At a cpufreq transition to a different frequency, SoC
framework requires to set the voltage represented by the OPP using
the regulator framework to the Power Management chip providing the
@@ -241,15 +253,15 @@ opp_get_voltage - Retrieve the voltage represented by the opp pointer.
{
/* do things */
rcu_read_lock();
- opp = opp_find_freq_ceil(dev, &freq);
- v = opp_get_voltage(opp);
+ opp = dev_pm_opp_find_freq_ceil(dev, &freq);
+ v = dev_pm_opp_get_voltage(opp);
rcu_read_unlock();
if (v)
regulator_set_voltage(.., v);
/* do other things */
}
-opp_get_freq - Retrieve the freq represented by the opp pointer.
+dev_pm_opp_get_freq - Retrieve the freq represented by the opp pointer.
Example: Lets say the SoC framework uses a couple of helper functions
we could pass opp pointers instead of doing additional parameters to
handle quiet a bit of data parameters.
@@ -258,8 +270,8 @@ opp_get_freq - Retrieve the freq represented by the opp pointer.
/* do things.. */
max_freq = ULONG_MAX;
rcu_read_lock();
- max_opp = opp_find_freq_floor(dev,&max_freq);
- requested_opp = opp_find_freq_ceil(dev,&freq);
+ max_opp = dev_pm_opp_find_freq_floor(dev,&max_freq);
+ requested_opp = dev_pm_opp_find_freq_ceil(dev,&freq);
if (!IS_ERR(max_opp) && !IS_ERR(requested_opp))
r = soc_test_validity(max_opp, requested_opp);
rcu_read_unlock();
@@ -267,25 +279,25 @@ opp_get_freq - Retrieve the freq represented by the opp pointer.
}
soc_test_validity(..)
{
- if(opp_get_voltage(max_opp) < opp_get_voltage(requested_opp))
+ if(dev_pm_opp_get_voltage(max_opp) < dev_pm_opp_get_voltage(requested_opp))
return -EINVAL;
- if(opp_get_freq(max_opp) < opp_get_freq(requested_opp))
+ if(dev_pm_opp_get_freq(max_opp) < dev_pm_opp_get_freq(requested_opp))
return -EINVAL;
/* do things.. */
}
-opp_get_opp_count - Retrieve the number of available opps for a device
+dev_pm_opp_get_opp_count - Retrieve the number of available opps for a device
Example: Lets say a co-processor in the SoC needs to know the available
frequencies in a table, the main processor can notify as following:
soc_notify_coproc_available_frequencies()
{
/* Do things */
rcu_read_lock();
- num_available = opp_get_opp_count(dev);
+ num_available = dev_pm_opp_get_opp_count(dev);
speeds = kzalloc(sizeof(u32) * num_available, GFP_KERNEL);
/* populate the table in increasing order */
freq = 0;
- while (!IS_ERR(opp = opp_find_freq_ceil(dev, &freq))) {
+ while (!IS_ERR(opp = dev_pm_opp_find_freq_ceil(dev, &freq))) {
speeds[i] = freq;
freq++;
i++;
@@ -296,34 +308,7 @@ opp_get_opp_count - Retrieve the number of available opps for a device
/* Do other things */
}
-6. Cpufreq Table Generation
-===========================
-opp_init_cpufreq_table - cpufreq framework typically is initialized with
- cpufreq_frequency_table_cpuinfo which is provided with the list of
- frequencies that are available for operation. This function provides
- a ready to use conversion routine to translate the OPP layer's internal
- information about the available frequencies into a format readily
- providable to cpufreq.
-
- WARNING: Do not use this function in interrupt context.
-
- Example:
- soc_pm_init()
- {
- /* Do things */
- r = opp_init_cpufreq_table(dev, &freq_table);
- if (!r)
- cpufreq_frequency_table_cpuinfo(policy, freq_table);
- /* Do other things */
- }
-
- NOTE: This function is available only if CONFIG_CPU_FREQ is enabled in
- addition to CONFIG_PM as power management feature is required to
- dynamically scale voltage and frequency in a system.
-
-opp_free_cpufreq_table - Free up the table allocated by opp_init_cpufreq_table
-
-7. Data Structures
+6. Data Structures
==================
Typically an SoC contains multiple voltage domains which are variable. Each
domain is represented by a device pointer. The relationship to OPP can be
@@ -343,16 +328,16 @@ accessed by various functions as described above. However, the structures
representing the actual OPPs and domains are internal to the OPP library itself
to allow for suitable abstraction reusable across systems.
-struct opp - The internal data structure of OPP library which is used to
+struct dev_pm_opp - The internal data structure of OPP library which is used to
represent an OPP. In addition to the freq, voltage, availability
information, it also contains internal book keeping information required
for the OPP library to operate on. Pointer to this structure is
provided back to the users such as SoC framework to be used as a
identifier for OPP in the interactions with OPP layer.
- WARNING: The struct opp pointer should not be parsed or modified by the
- users. The defaults of for an instance is populated by opp_add, but the
- availability of the OPP can be modified by opp_enable/disable functions.
+ WARNING: The struct dev_pm_opp pointer should not be parsed or modified by the
+ users. The defaults of for an instance is populated by dev_pm_opp_add, but the
+ availability of the OPP can be modified by dev_pm_opp_enable/disable functions.
struct device - This is used to identify a domain to the OPP layer. The
nature of the device and it's implementation is left to the user of
@@ -362,19 +347,19 @@ Overall, in a simplistic view, the data structure operations is represented as
following:
Initialization / modification:
- +-----+ /- opp_enable
-opp_add --> | opp | <-------
- | +-----+ \- opp_disable
+ +-----+ /- dev_pm_opp_enable
+dev_pm_opp_add --> | opp | <-------
+ | +-----+ \- dev_pm_opp_disable
\-------> domain_info(device)
Search functions:
- /-- opp_find_freq_ceil ---\ +-----+
-domain_info<---- opp_find_freq_exact -----> | opp |
- \-- opp_find_freq_floor ---/ +-----+
+ /-- dev_pm_opp_find_freq_ceil ---\ +-----+
+domain_info<---- dev_pm_opp_find_freq_exact -----> | opp |
+ \-- dev_pm_opp_find_freq_floor ---/ +-----+
Retrieval functions:
-+-----+ /- opp_get_voltage
++-----+ /- dev_pm_opp_get_voltage
| opp | <---
-+-----+ \- opp_get_freq
++-----+ \- dev_pm_opp_get_freq
-domain_info <- opp_get_opp_count
+domain_info <- dev_pm_opp_get_opp_count
diff --git a/Documentation/power/pm_qos_interface.txt b/Documentation/power/pm_qos_interface.txt
index 79a2a58425e..a5da5c7e712 100644
--- a/Documentation/power/pm_qos_interface.txt
+++ b/Documentation/power/pm_qos_interface.txt
@@ -7,7 +7,7 @@ one of the parameters.
Two different PM QoS frameworks are available:
1. PM QoS classes for cpu_dma_latency, network_latency, network_throughput.
2. the per-device PM QoS framework provides the API to manage the per-device latency
-constraints.
+constraints and PM QoS flags.
Each parameters have defined units:
* latency: usec
@@ -86,15 +86,21 @@ To remove the user mode request for a target value simply close the device
node.
-2. PM QoS per-device latency framework
+2. PM QoS per-device latency and flags framework
-For each device a list of performance requests is maintained along with
-an aggregated target value. The aggregated target value is updated with
-changes to the request list or elements of the list. Typically the
-aggregated target value is simply the max or min of the request values held
-in the parameter list elements.
-Note: the aggregated target value is implemented as an atomic variable so that
-reading the aggregated value does not require any locking mechanism.
+For each device, there are three lists of PM QoS requests. Two of them are
+maintained along with the aggregated targets of resume latency and active
+state latency tolerance (in microseconds) and the third one is for PM QoS flags.
+Values are updated in response to changes of the request list.
+
+The target values of resume latency and active state latency tolerance are
+simply the minimum of the request values held in the parameter list elements.
+The PM QoS flags aggregate value is a gather (bitwise OR) of all list elements'
+values. Two device PM QoS flags are defined currently: PM_QOS_FLAG_NO_POWER_OFF
+and PM_QOS_FLAG_REMOTE_WAKEUP.
+
+Note: The aggregated target values are implemented in such a way that reading
+the aggregated value does not require any locking mechanism.
From kernel mode the use of this interface is the following:
@@ -119,6 +125,40 @@ the request.
s32 dev_pm_qos_read_value(device):
Returns the aggregated value for a given device's constraints list.
+enum pm_qos_flags_status dev_pm_qos_flags(device, mask)
+Check PM QoS flags of the given device against the given mask of flags.
+The meaning of the return values is as follows:
+ PM_QOS_FLAGS_ALL: All flags from the mask are set
+ PM_QOS_FLAGS_SOME: Some flags from the mask are set
+ PM_QOS_FLAGS_NONE: No flags from the mask are set
+ PM_QOS_FLAGS_UNDEFINED: The device's PM QoS structure has not been
+ initialized or the list of requests is empty.
+
+int dev_pm_qos_add_ancestor_request(dev, handle, type, value)
+Add a PM QoS request for the first direct ancestor of the given device whose
+power.ignore_children flag is unset (for DEV_PM_QOS_RESUME_LATENCY requests)
+or whose power.set_latency_tolerance callback pointer is not NULL (for
+DEV_PM_QOS_LATENCY_TOLERANCE requests).
+
+int dev_pm_qos_expose_latency_limit(device, value)
+Add a request to the device's PM QoS list of resume latency constraints and
+create a sysfs attribute pm_qos_resume_latency_us under the device's power
+directory allowing user space to manipulate that request.
+
+void dev_pm_qos_hide_latency_limit(device)
+Drop the request added by dev_pm_qos_expose_latency_limit() from the device's
+PM QoS list of resume latency constraints and remove sysfs attribute
+pm_qos_resume_latency_us from the device's power directory.
+
+int dev_pm_qos_expose_flags(device, value)
+Add a request to the device's PM QoS list of flags and create sysfs attributes
+pm_qos_no_power_off and pm_qos_remote_wakeup under the device's power directory
+allowing user space to change these flags' value.
+
+void dev_pm_qos_hide_flags(device)
+Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS list
+of flags and remove sysfs attributes pm_qos_no_power_off and pm_qos_remote_wakeup
+under the device's power directory.
Notification mechanisms:
The per-device PM QoS framework has 2 different and distinct notification trees:
@@ -127,7 +167,7 @@ a per-device notification tree and a global notification tree.
int dev_pm_qos_add_notifier(device, notifier):
Adds a notification callback function for the device.
The callback is called when the aggregated value of the device constraints list
-is changed.
+is changed (for resume latency device PM QoS only).
int dev_pm_qos_remove_notifier(device, notifier):
Removes the notification callback function for the device.
@@ -135,14 +175,48 @@ Removes the notification callback function for the device.
int dev_pm_qos_add_global_notifier(notifier):
Adds a notification callback function in the global notification tree of the
framework.
-The callback is called when the aggregated value for any device is changed.
+The callback is called when the aggregated value for any device is changed
+(for resume latency device PM QoS only).
int dev_pm_qos_remove_global_notifier(notifier):
Removes the notification callback function from the global notification tree
of the framework.
-From user mode:
-No API for user space access to the per-device latency constraints is provided
-yet - still under discussion.
-
+Active state latency tolerance
+
+This device PM QoS type is used to support systems in which hardware may switch
+to energy-saving operation modes on the fly. In those systems, if the operation
+mode chosen by the hardware attempts to save energy in an overly aggressive way,
+it may cause excess latencies to be visible to software, causing it to miss
+certain protocol requirements or target frame or sample rates etc.
+
+If there is a latency tolerance control mechanism for a given device available
+to software, the .set_latency_tolerance callback in that device's dev_pm_info
+structure should be populated. The routine pointed to by it is should implement
+whatever is necessary to transfer the effective requirement value to the
+hardware.
+
+Whenever the effective latency tolerance changes for the device, its
+.set_latency_tolerance() callback will be executed and the effective value will
+be passed to it. If that value is negative, which means that the list of
+latency tolerance requirements for the device is empty, the callback is expected
+to switch the underlying hardware latency tolerance control mechanism to an
+autonomous mode if available. If that value is PM_QOS_LATENCY_ANY, in turn, and
+the hardware supports a special "no requirement" setting, the callback is
+expected to use it. That allows software to prevent the hardware from
+automatically updating the device's latency tolerance in response to its power
+state changes (e.g. during transitions from D3cold to D0), which generally may
+be done in the autonomous latency tolerance control mode.
+
+If .set_latency_tolerance() is present for the device, sysfs attribute
+pm_qos_latency_tolerance_us will be present in the devivce's power directory.
+Then, user space can use that attribute to specify its latency tolerance
+requirement for the device, if any. Writing "any" to it means "no requirement,
+but do not let the hardware control latency tolerance" and writing "auto" to it
+allows the hardware to be switched to the autonomous mode if there are no other
+requirements from the kernel side in the device's list.
+
+Kernel code can use the functions described above along with the
+DEV_PM_QOS_LATENCY_TOLERANCE device PM QoS type to add, remove and update
+latency tolerance requirements for devices.
diff --git a/Documentation/power/power_supply_class.txt b/Documentation/power/power_supply_class.txt
index 3f10b39b034..89a8816990f 100644
--- a/Documentation/power/power_supply_class.txt
+++ b/Documentation/power/power_supply_class.txt
@@ -135,11 +135,11 @@ CAPACITY_LEVEL - capacity level. This corresponds to
POWER_SUPPLY_CAPACITY_LEVEL_*.
TEMP - temperature of the power supply.
-TEMP_ALERT_MIN - minimum battery temperature alert value in milli centigrade.
-TEMP_ALERT_MAX - maximum battery temperature alert value in milli centigrade.
+TEMP_ALERT_MIN - minimum battery temperature alert.
+TEMP_ALERT_MAX - maximum battery temperature alert.
TEMP_AMBIENT - ambient temperature.
-TEMP_AMBIENT_ALERT_MIN - minimum ambient temperature alert value in milli centigrade.
-TEMP_AMBIENT_ALERT_MAX - maximum ambient temperature alert value in milli centigrade.
+TEMP_AMBIENT_ALERT_MIN - minimum ambient temperature alert.
+TEMP_AMBIENT_ALERT_MAX - maximum ambient temperature alert.
TIME_TO_EMPTY - seconds left for battery to be considered empty (i.e.
while battery powers a load)
diff --git a/Documentation/power/powercap/powercap.txt b/Documentation/power/powercap/powercap.txt
new file mode 100644
index 00000000000..1e6ef164e07
--- /dev/null
+++ b/Documentation/power/powercap/powercap.txt
@@ -0,0 +1,236 @@
+Power Capping Framework
+==================================
+
+The power capping framework provides a consistent interface between the kernel
+and the user space that allows power capping drivers to expose the settings to
+user space in a uniform way.
+
+Terminology
+=========================
+The framework exposes power capping devices to user space via sysfs in the
+form of a tree of objects. The objects at the root level of the tree represent
+'control types', which correspond to different methods of power capping. For
+example, the intel-rapl control type represents the Intel "Running Average
+Power Limit" (RAPL) technology, whereas the 'idle-injection' control type
+corresponds to the use of idle injection for controlling power.
+
+Power zones represent different parts of the system, which can be controlled and
+monitored using the power capping method determined by the control type the
+given zone belongs to. They each contain attributes for monitoring power, as
+well as controls represented in the form of power constraints. If the parts of
+the system represented by different power zones are hierarchical (that is, one
+bigger part consists of multiple smaller parts that each have their own power
+controls), those power zones may also be organized in a hierarchy with one
+parent power zone containing multiple subzones and so on to reflect the power
+control topology of the system. In that case, it is possible to apply power
+capping to a set of devices together using the parent power zone and if more
+fine grained control is required, it can be applied through the subzones.
+
+
+Example sysfs interface tree:
+
+/sys/devices/virtual/powercap
+??? intel-rapl
+ ??? intel-rapl:0
+ ?   ??? constraint_0_name
+ ?   ??? constraint_0_power_limit_uw
+ ?   ??? constraint_0_time_window_us
+ ?   ??? constraint_1_name
+ ?   ??? constraint_1_power_limit_uw
+ ?   ??? constraint_1_time_window_us
+ ?   ??? device -> ../../intel-rapl
+ ?   ??? energy_uj
+ ?   ??? intel-rapl:0:0
+ ?   ?   ??? constraint_0_name
+ ?   ?   ??? constraint_0_power_limit_uw
+ ?   ?   ??? constraint_0_time_window_us
+ ?   ?   ??? constraint_1_name
+ ?   ?   ??? constraint_1_power_limit_uw
+ ?   ?   ??? constraint_1_time_window_us
+ ?   ?   ??? device -> ../../intel-rapl:0
+ ?   ?   ??? energy_uj
+ ?   ?   ??? max_energy_range_uj
+ ?   ?   ??? name
+ ?   ?   ??? enabled
+ ?   ?   ??? power
+ ?   ?   ?   ??? async
+ ?   ?   ?   []
+ ?   ?   ??? subsystem -> ../../../../../../class/power_cap
+ ?   ?   ??? uevent
+ ?   ??? intel-rapl:0:1
+ ?   ?   ??? constraint_0_name
+ ?   ?   ??? constraint_0_power_limit_uw
+ ?   ?   ??? constraint_0_time_window_us
+ ?   ?   ??? constraint_1_name
+ ?   ?   ??? constraint_1_power_limit_uw
+ ?   ?   ??? constraint_1_time_window_us
+ ?   ?   ??? device -> ../../intel-rapl:0
+ ?   ?   ??? energy_uj
+ ?   ?   ??? max_energy_range_uj
+ ?   ?   ??? name
+ ?   ?   ??? enabled
+ ?   ?   ??? power
+ ?   ?   ?   ??? async
+ ?   ?   ?   []
+ ?   ?   ??? subsystem -> ../../../../../../class/power_cap
+ ?   ?   ??? uevent
+ ?   ??? max_energy_range_uj
+ ?   ??? max_power_range_uw
+ ?   ??? name
+ ?   ??? enabled
+ ?   ??? power
+ ?   ?   ??? async
+ ?   ?   []
+ ?   ??? subsystem -> ../../../../../class/power_cap
+ ?   ??? enabled
+ ?   ??? uevent
+ ??? intel-rapl:1
+ ?   ??? constraint_0_name
+ ?   ??? constraint_0_power_limit_uw
+ ?   ??? constraint_0_time_window_us
+ ?   ??? constraint_1_name
+ ?   ??? constraint_1_power_limit_uw
+ ?   ??? constraint_1_time_window_us
+ ?   ??? device -> ../../intel-rapl
+ ?   ??? energy_uj
+ ?   ??? intel-rapl:1:0
+ ?   ?   ??? constraint_0_name
+ ?   ?   ??? constraint_0_power_limit_uw
+ ?   ?   ??? constraint_0_time_window_us
+ ?   ?   ??? constraint_1_name
+ ?   ?   ??? constraint_1_power_limit_uw
+ ?   ?   ??? constraint_1_time_window_us
+ ?   ?   ??? device -> ../../intel-rapl:1
+ ?   ?   ??? energy_uj
+ ?   ?   ??? max_energy_range_uj
+ ?   ?   ??? name
+ ?   ?   ??? enabled
+ ?   ?   ??? power
+ ?   ?   ?   ??? async
+ ?   ?   ?   []
+ ?   ?   ??? subsystem -> ../../../../../../class/power_cap
+ ?   ?   ??? uevent
+ ?   ??? intel-rapl:1:1
+ ?   ?   ??? constraint_0_name
+ ?   ?   ??? constraint_0_power_limit_uw
+ ?   ?   ??? constraint_0_time_window_us
+ ?   ?   ??? constraint_1_name
+ ?   ?   ??? constraint_1_power_limit_uw
+ ?   ?   ??? constraint_1_time_window_us
+ ?   ?   ??? device -> ../../intel-rapl:1
+ ?   ?   ??? energy_uj
+ ?   ?   ??? max_energy_range_uj
+ ?   ?   ??? name
+ ?   ?   ??? enabled
+ ?   ?   ??? power
+ ?   ?   ?   ??? async
+ ?   ?   ?   []
+ ?   ?   ??? subsystem -> ../../../../../../class/power_cap
+ ?   ?   ??? uevent
+ ?   ??? max_energy_range_uj
+ ?   ??? max_power_range_uw
+ ?   ??? name
+ ?   ??? enabled
+ ?   ??? power
+ ?   ?   ??? async
+ ?   ?   []
+ ?   ??? subsystem -> ../../../../../class/power_cap
+ ?   ??? uevent
+ ??? power
+ ?   ??? async
+ ?   []
+ ??? subsystem -> ../../../../class/power_cap
+ ??? enabled
+ ??? uevent
+
+The above example illustrates a case in which the Intel RAPL technology,
+available in Intel® IA-64 and IA-32 Processor Architectures, is used. There is one
+control type called intel-rapl which contains two power zones, intel-rapl:0 and
+intel-rapl:1, representing CPU packages. Each of these power zones contains
+two subzones, intel-rapl:j:0 and intel-rapl:j:1 (j = 0, 1), representing the
+"core" and the "uncore" parts of the given CPU package, respectively. All of
+the zones and subzones contain energy monitoring attributes (energy_uj,
+max_energy_range_uj) and constraint attributes (constraint_*) allowing controls
+to be applied (the constraints in the 'package' power zones apply to the whole
+CPU packages and the subzone constraints only apply to the respective parts of
+the given package individually). Since Intel RAPL doesn't provide instantaneous
+power value, there is no power_uw attribute.
+
+In addition to that, each power zone contains a name attribute, allowing the
+part of the system represented by that zone to be identified.
+For example:
+
+cat /sys/class/power_cap/intel-rapl/intel-rapl:0/name
+package-0
+
+The Intel RAPL technology allows two constraints, short term and long term,
+with two different time windows to be applied to each power zone. Thus for
+each zone there are 2 attributes representing the constraint names, 2 power
+limits and 2 attributes representing the sizes of the time windows. Such that,
+constraint_j_* attributes correspond to the jth constraint (j = 0,1).
+
+For example:
+ constraint_0_name
+ constraint_0_power_limit_uw
+ constraint_0_time_window_us
+ constraint_1_name
+ constraint_1_power_limit_uw
+ constraint_1_time_window_us
+
+Power Zone Attributes
+=================================
+Monitoring attributes
+----------------------
+
+energy_uj (rw): Current energy counter in micro joules. Write "0" to reset.
+If the counter can not be reset, then this attribute is read only.
+
+max_energy_range_uj (ro): Range of the above energy counter in micro-joules.
+
+power_uw (ro): Current power in micro watts.
+
+max_power_range_uw (ro): Range of the above power value in micro-watts.
+
+name (ro): Name of this power zone.
+
+It is possible that some domains have both power ranges and energy counter ranges;
+however, only one is mandatory.
+
+Constraints
+----------------
+constraint_X_power_limit_uw (rw): Power limit in micro watts, which should be
+applicable for the time window specified by "constraint_X_time_window_us".
+
+constraint_X_time_window_us (rw): Time window in micro seconds.
+
+constraint_X_name (ro): An optional name of the constraint
+
+constraint_X_max_power_uw(ro): Maximum allowed power in micro watts.
+
+constraint_X_min_power_uw(ro): Minimum allowed power in micro watts.
+
+constraint_X_max_time_window_us(ro): Maximum allowed time window in micro seconds.
+
+constraint_X_min_time_window_us(ro): Minimum allowed time window in micro seconds.
+
+Except power_limit_uw and time_window_us other fields are optional.
+
+Common zone and control type attributes
+----------------------------------------
+enabled (rw): Enable/Disable controls at zone level or for all zones using
+a control type.
+
+Power Cap Client Driver Interface
+==================================
+The API summary:
+
+Call powercap_register_control_type() to register control type object.
+Call powercap_register_zone() to register a power zone (under a given
+control type), either as a top-level power zone or as a subzone of another
+power zone registered earlier.
+The number of constraints in a power zone and the corresponding callbacks have
+to be defined prior to calling powercap_register_zone() to register that zone.
+
+To Free a power zone call powercap_unregister_zone().
+To free a control type object call powercap_unregister_control_type().
+Detailed API can be generated using kernel-doc on include/linux/powercap.h.
diff --git a/Documentation/power/runtime_pm.txt b/Documentation/power/runtime_pm.txt
index 03591a750f9..f32ce541957 100644
--- a/Documentation/power/runtime_pm.txt
+++ b/Documentation/power/runtime_pm.txt
@@ -2,6 +2,7 @@ Runtime Power Management Framework for I/O Devices
(C) 2009-2011 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
(C) 2010 Alan Stern <stern@rowland.harvard.edu>
+(C) 2014 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
1. Introduction
@@ -144,8 +145,14 @@ The action performed by the idle callback is totally dependent on the subsystem
(or driver) in question, but the expected and recommended action is to check
if the device can be suspended (i.e. if all of the conditions necessary for
suspending the device are satisfied) and to queue up a suspend request for the
-device in that case. The value returned by this callback is ignored by the PM
-core.
+device in that case. If there is no idle callback, or if the callback returns
+0, then the PM core will attempt to carry out a runtime suspend of the device,
+also respecting devices configured for autosuspend. In essence this means a
+call to pm_runtime_autosuspend() (do note that drivers needs to update the
+device last busy mark, pm_runtime_mark_last_busy(), to control the delay under
+this circumstance). To prevent this (for example, if the callback routine has
+started a delayed suspend), the routine must return a non-zero value. Negative
+error return codes are ignored by the PM core.
The helper functions provided by the PM core, described in Section 4, guarantee
that the following constraints are met with respect to runtime PM callbacks for
@@ -226,7 +233,7 @@ defined in include/linux/pm.h:
equal to zero); the initial value of it is 1 (i.e. runtime PM is
initially disabled for all devices)
- unsigned int runtime_error;
+ int runtime_error;
- if set, there was a fatal error (one of the callbacks returned error code
as described in Section 2), so the helper funtions will not work until
this flag is cleared; this is the error code returned by the failing
@@ -301,9 +308,10 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h:
removing the device from device hierarchy
int pm_runtime_idle(struct device *dev);
- - execute the subsystem-level idle callback for the device; returns 0 on
- success or error code on failure, where -EINPROGRESS means that
- ->runtime_idle() is already being executed
+ - execute the subsystem-level idle callback for the device; returns an
+ error code on failure, where -EINPROGRESS means that ->runtime_idle() is
+ already being executed; if there is no callback or the callback returns 0
+ then run pm_runtime_autosuspend(dev) and return its result
int pm_runtime_suspend(struct device *dev);
- execute the subsystem-level suspend callback for the device; returns 0 on
@@ -394,11 +402,11 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h:
int pm_runtime_disable(struct device *dev);
- increment the device's 'power.disable_depth' field (if the value of that
field was previously zero, this prevents subsystem-level runtime PM
- callbacks from being run for the device), make sure that all of the pending
- runtime PM operations on the device are either completed or canceled;
- returns 1 if there was a resume request pending and it was necessary to
- execute the subsystem-level resume callback for the device to satisfy that
- request, otherwise 0 is returned
+ callbacks from being run for the device), make sure that all of the
+ pending runtime PM operations on the device are either completed or
+ canceled; returns 1 if there was a resume request pending and it was
+ necessary to execute the subsystem-level resume callback for the device
+ to satisfy that request, otherwise 0 is returned
int pm_runtime_barrier(struct device *dev);
- check if there's a resume request pending for the device and resume it
@@ -426,6 +434,10 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h:
'power.runtime_error' is set or 'power.disable_depth' is greater than
zero)
+ bool pm_runtime_active(struct device *dev);
+ - return true if the device's runtime PM status is 'active' or its
+ 'power.disable_depth' field is not equal to zero, or false otherwise
+
bool pm_runtime_suspended(struct device *dev);
- return true if the device's runtime PM status is 'suspended' and its
'power.disable_depth' field is equal to zero, or false otherwise
@@ -433,6 +445,10 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h:
bool pm_runtime_status_suspended(struct device *dev);
- return true if the device's runtime PM status is 'suspended'
+ bool pm_runtime_suspended_if_enabled(struct device *dev);
+ - return true if the device's runtime PM status is 'suspended' and its
+ 'power.disable_depth' field is equal to 1
+
void pm_runtime_allow(struct device *dev);
- set the power.runtime_auto flag for the device and decrease its usage
counter (used by the /sys/devices/.../power/control interface to
@@ -536,13 +552,11 @@ helper functions described in Section 4. In that case, pm_runtime_resume()
should be used. Of course, for this purpose the device's runtime PM has to be
enabled earlier by calling pm_runtime_enable().
-If the device bus type's or driver's ->probe() callback runs
-pm_runtime_suspend() or pm_runtime_idle() or their asynchronous counterparts,
-they will fail returning -EAGAIN, because the device's usage counter is
-incremented by the driver core before executing ->probe(). Still, it may be
-desirable to suspend the device as soon as ->probe() has finished, so the driver
-core uses pm_runtime_put_sync() to invoke the subsystem-level idle callback for
-the device at that time.
+It may be desirable to suspend the device once ->probe() has finished.
+Therefore the driver core uses the asyncronous pm_request_idle() to submit a
+request to execute the subsystem-level idle callback for the device at that
+time. A driver that makes use of the runtime autosuspend feature, may want to
+update the last busy mark before returning from ->probe().
Moreover, the driver core prevents runtime PM callbacks from racing with the bus
notifier callback in __device_release_driver(), which is necessary, because the
@@ -635,19 +649,33 @@ place (in particular, if the system is not waking up from hibernation), it may
be more efficient to leave the devices that had been suspended before the system
suspend began in the suspended state.
+To this end, the PM core provides a mechanism allowing some coordination between
+different levels of device hierarchy. Namely, if a system suspend .prepare()
+callback returns a positive number for a device, that indicates to the PM core
+that the device appears to be runtime-suspended and its state is fine, so it
+may be left in runtime suspend provided that all of its descendants are also
+left in runtime suspend. If that happens, the PM core will not execute any
+system suspend and resume callbacks for all of those devices, except for the
+complete callback, which is then entirely responsible for handling the device
+as appropriate. This only applies to system suspend transitions that are not
+related to hibernation (see Documentation/power/devices.txt for more
+information).
+
The PM core does its best to reduce the probability of race conditions between
the runtime PM and system suspend/resume (and hibernation) callbacks by carrying
out the following operations:
- * During system suspend it calls pm_runtime_get_noresume() and
- pm_runtime_barrier() for every device right before executing the
- subsystem-level .suspend() callback for it. In addition to that it calls
- __pm_runtime_disable() with 'false' as the second argument for every device
- right before executing the subsystem-level .suspend_late() callback for it.
-
- * During system resume it calls pm_runtime_enable() and pm_runtime_put_sync()
- for every device right after executing the subsystem-level .resume_early()
- callback and right after executing the subsystem-level .resume() callback
+ * During system suspend pm_runtime_get_noresume() is called for every device
+ right before executing the subsystem-level .prepare() callback for it and
+ pm_runtime_barrier() is called for every device right before executing the
+ subsystem-level .suspend() callback for it. In addition to that the PM core
+ calls __pm_runtime_disable() with 'false' as the second argument for every
+ device right before executing the subsystem-level .suspend_late() callback
+ for it.
+
+ * During system resume pm_runtime_enable() and pm_runtime_put() are called for
+ every device right after executing the subsystem-level .resume_early()
+ callback and right after executing the subsystem-level .complete() callback
for it, respectively.
7. Generic subsystem callbacks
@@ -656,18 +684,13 @@ Subsystems may wish to conserve code space by using the set of generic power
management callbacks provided by the PM core, defined in
driver/base/power/generic_ops.c:
- int pm_generic_runtime_idle(struct device *dev);
- - invoke the ->runtime_idle() callback provided by the driver of this
- device, if defined, and call pm_runtime_suspend() for this device if the
- return value is 0 or the callback is not defined
-
int pm_generic_runtime_suspend(struct device *dev);
- invoke the ->runtime_suspend() callback provided by the driver of this
- device and return its result, or return -EINVAL if not defined
+ device and return its result, or return 0 if not defined
int pm_generic_runtime_resume(struct device *dev);
- invoke the ->runtime_resume() callback provided by the driver of this
- device and return its result, or return -EINVAL if not defined
+ device and return its result, or return 0 if not defined
int pm_generic_suspend(struct device *dev);
- if the device has not been suspended at run time, invoke the ->suspend()
@@ -723,15 +746,12 @@ driver/base/power/generic_ops.c:
int pm_generic_restore_noirq(struct device *dev);
- invoke the ->restore_noirq() callback provided by the device's driver
-These functions can be assigned to the ->runtime_idle(), ->runtime_suspend(),
+These functions are the defaults used by the PM core, if a subsystem doesn't
+provide its own callbacks for ->runtime_idle(), ->runtime_suspend(),
->runtime_resume(), ->suspend(), ->suspend_noirq(), ->resume(),
->resume_noirq(), ->freeze(), ->freeze_noirq(), ->thaw(), ->thaw_noirq(),
-->poweroff(), ->poweroff_noirq(), ->restore(), ->restore_noirq() callback
-pointers in the subsystem-level dev_pm_ops structures.
-
-If a subsystem wishes to use all of them at the same time, it can simply assign
-the GENERIC_SUBSYS_PM_OPS macro, defined in include/linux/pm.h, to its
-dev_pm_ops structure pointer.
+->poweroff(), ->poweroff_noirq(), ->restore(), ->restore_noirq() in the
+subsystem-level dev_pm_ops structure.
Device drivers that wish to use the same function as a system suspend, freeze,
poweroff and runtime suspend callback, and similarly for system resume, thaw,
@@ -869,7 +889,7 @@ Here is a schematic pseudo-code example:
foo->is_suspended = 0;
pm_runtime_mark_last_busy(&foo->dev);
if (foo->num_pending_requests > 0)
- foo_process_requests(foo);
+ foo_process_next_request(foo);
unlock(&foo->private_lock);
return 0;
}
diff --git a/Documentation/power/states.txt b/Documentation/power/states.txt
index 4416b28630d..50f3ef9177c 100644
--- a/Documentation/power/states.txt
+++ b/Documentation/power/states.txt
@@ -1,54 +1,87 @@
+System Power Management Sleep States
-System Power Management States
+(C) 2014 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+The kernel supports up to four system sleep states generically, although three
+of them depend on the platform support code to implement the low-level details
+for each state.
-The kernel supports three power management states generically, though
-each is dependent on platform support code to implement the low-level
-details for each state. This file describes each state, what they are
-commonly called, what ACPI state they map to, and what string to write
-to /sys/power/state to enter that state
+The states are represented by strings that can be read or written to the
+/sys/power/state file. Those strings may be "mem", "standby", "freeze" and
+"disk", where the last one always represents hibernation (Suspend-To-Disk) and
+the meaning of the remaining ones depends on the relative_sleep_states command
+line argument.
+
+For relative_sleep_states=1, the strings "mem", "standby" and "freeze" label the
+available non-hibernation sleep states from the deepest to the shallowest,
+respectively. In that case, "mem" is always present in /sys/power/state,
+because there is at least one non-hibernation sleep state in every system. If
+the given system supports two non-hibernation sleep states, "standby" is present
+in /sys/power/state in addition to "mem". If the system supports three
+non-hibernation sleep states, "freeze" will be present in /sys/power/state in
+addition to "mem" and "standby".
+
+For relative_sleep_states=0, which is the default, the following descriptions
+apply.
+
+state: Suspend-To-Idle
+ACPI state: S0
+Label: "freeze"
+
+This state is a generic, pure software, light-weight, system sleep state.
+It allows more energy to be saved relative to runtime idle by freezing user
+space and putting all I/O devices into low-power states (possibly
+lower-power than available at run time), such that the processors can
+spend more time in their idle states.
+
+This state can be used for platforms without Power-On Suspend/Suspend-to-RAM
+support, or it can be used in addition to Suspend-to-RAM (memory sleep)
+to provide reduced resume latency. It is always supported.
State: Standby / Power-On Suspend
ACPI State: S1
-String: "standby"
+Label: "standby"
-This state offers minimal, though real, power savings, while providing
-a very low-latency transition back to a working system. No operating
-state is lost (the CPU retains power), so the system easily starts up
+This state, if supported, offers moderate, though real, power savings, while
+providing a relatively low-latency transition back to a working system. No
+operating state is lost (the CPU retains power), so the system easily starts up
again where it left off.
-We try to put devices in a low-power state equivalent to D1, which
-also offers low power savings, but low resume latency. Not all devices
-support D1, and those that don't are left on.
-
-A transition from Standby to the On state should take about 1-2
-seconds.
+In addition to freezing user space and putting all I/O devices into low-power
+states, which is done for Suspend-To-Idle too, nonboot CPUs are taken offline
+and all low-level system functions are suspended during transitions into this
+state. For this reason, it should allow more energy to be saved relative to
+Suspend-To-Idle, but the resume latency will generally be greater than for that
+state.
State: Suspend-to-RAM
ACPI State: S3
-String: "mem"
+Label: "mem"
-This state offers significant power savings as everything in the
-system is put into a low-power state, except for memory, which is
-placed in self-refresh mode to retain its contents.
+This state, if supported, offers significant power savings as everything in the
+system is put into a low-power state, except for memory, which should be placed
+into the self-refresh mode to retain its contents. All of the steps carried out
+when entering Power-On Suspend are also carried out during transitions to STR.
+Additional operations may take place depending on the platform capabilities. In
+particular, on ACPI systems the kernel passes control to the BIOS (platform
+firmware) as the last step during STR transitions and that usually results in
+powering down some more low-level components that aren't directly controlled by
+the kernel.
-System and device state is saved and kept in memory. All devices are
-suspended and put into D3. In many cases, all peripheral buses lose
-power when entering STR, so devices must be able to handle the
-transition back to the On state.
+System and device state is saved and kept in memory. All devices are suspended
+and put into low-power states. In many cases, all peripheral buses lose power
+when entering STR, so devices must be able to handle the transition back to the
+"on" state.
-For at least ACPI, STR requires some minimal boot-strapping code to
-resume the system from STR. This may be true on other platforms.
-
-A transition from Suspend-to-RAM to the On state should take about
-3-5 seconds.
+For at least ACPI, STR requires some minimal boot-strapping code to resume the
+system from it. This may be the case on other platforms too.
State: Suspend-to-disk
ACPI State: S4
-String: "disk"
+Label: "disk"
This state offers the greatest power savings, and can be used even in
the absence of low-level platform support for power management. This
@@ -74,7 +107,3 @@ low-power state (like ACPI S4), or it may simply power down. Powering
down offers greater savings, and allows this mechanism to work on any
system. However, entering a real low-power state allows the user to
trigger wake up events (e.g. pressing a key or opening a laptop lid).
-
-A transition from Suspend-to-Disk to the On state should take about 30
-seconds, though it's typically a bit more with the current
-implementation.
diff --git a/Documentation/power/suspend-and-cpuhotplug.txt b/Documentation/power/suspend-and-cpuhotplug.txt
index e13dafc8e8f..2850df3bf95 100644
--- a/Documentation/power/suspend-and-cpuhotplug.txt
+++ b/Documentation/power/suspend-and-cpuhotplug.txt
@@ -1,6 +1,6 @@
Interaction of Suspend code (S3) with the CPU hotplug infrastructure
- (C) 2011 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
+ (C) 2011 - 2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
I. How does the regular CPU hotplug code differ from how the Suspend-to-RAM
diff --git a/Documentation/power/swsusp.txt b/Documentation/power/swsusp.txt
index 0b4b63e7e9b..f732a8321e8 100644
--- a/Documentation/power/swsusp.txt
+++ b/Documentation/power/swsusp.txt
@@ -50,6 +50,19 @@ echo N > /sys/power/image_size
before suspend (it is limited to 500 MB by default).
+. The resume process checks for the presence of the resume device,
+if found, it then checks the contents for the hibernation image signature.
+If both are found, it resumes the hibernation image.
+
+. The resume process may be triggered in two ways:
+ 1) During lateinit: If resume=/dev/your_swap_partition is specified on
+ the kernel command line, lateinit runs the resume process. If the
+ resume device has not been probed yet, the resume process fails and
+ bootup continues.
+ 2) Manually from an initrd or initramfs: May be run from
+ the init script by using the /sys/power/resume file. It is vital
+ that this be done prior to remounting any filesystems (even as
+ read-only) otherwise data may be corrupted.
Article about goals and implementation of Software Suspend for Linux
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -207,7 +220,10 @@ Q: After resuming, system is paging heavily, leading to very bad interactivity.
A: Try running
-cat `cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u` > /dev/null
+cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u | while read file
+do
+ test -f "$file" && cat "$file" > /dev/null
+done
after resume. swapoff -a; swapon -a may also be useful.
@@ -326,7 +342,7 @@ Q: How can distributions ship a swsusp-supporting kernel with modular
disk drivers (especially SATA)?
A: Well, it can be done, load the drivers, then do echo into
-/sys/power/disk/resume file from initrd. Be sure not to mount
+/sys/power/resume file from initrd. Be sure not to mount
anything, not even read-only mount, or you are going to lose your
data.
diff --git a/Documentation/power/video_extension.txt b/Documentation/power/video_extension.txt
deleted file mode 100644
index b2f9b1598ac..00000000000
--- a/Documentation/power/video_extension.txt
+++ /dev/null
@@ -1,37 +0,0 @@
-ACPI video extensions
-~~~~~~~~~~~~~~~~~~~~~
-
-This driver implement the ACPI Extensions For Display Adapters for
-integrated graphics devices on motherboard, as specified in ACPI 2.0
-Specification, Appendix B, allowing to perform some basic control like
-defining the video POST device, retrieving EDID information or to
-setup a video output, etc. Note that this is an ref. implementation
-only. It may or may not work for your integrated video device.
-
-Interfaces exposed to userland through /proc/acpi/video:
-
-VGA/info : display the supported video bus device capability like Video ROM, CRT/LCD/TV.
-VGA/ROM : Used to get a copy of the display devices' ROM data (up to 4k).
-VGA/POST_info : Used to determine what options are implemented.
-VGA/POST : Used to get/set POST device.
-VGA/DOS : Used to get/set ownership of output switching:
- Please refer ACPI spec B.4.1 _DOS
-VGA/CRT : CRT output
-VGA/LCD : LCD output
-VGA/TVO : TV output
-VGA/*/brightness : Used to get/set brightness of output device
-
-Notify event through /proc/acpi/event:
-
-#define ACPI_VIDEO_NOTIFY_SWITCH 0x80
-#define ACPI_VIDEO_NOTIFY_PROBE 0x81
-#define ACPI_VIDEO_NOTIFY_CYCLE 0x82
-#define ACPI_VIDEO_NOTIFY_NEXT_OUTPUT 0x83
-#define ACPI_VIDEO_NOTIFY_PREV_OUTPUT 0x84
-
-#define ACPI_VIDEO_NOTIFY_CYCLE_BRIGHTNESS 0x82
-#define ACPI_VIDEO_NOTIFY_INC_BRIGHTNESS 0x83
-#define ACPI_VIDEO_NOTIFY_DEC_BRIGHTNESS 0x84
-#define ACPI_VIDEO_NOTIFY_ZERO_BRIGHTNESS 0x85
-#define ACPI_VIDEO_NOTIFY_DISPLAY_OFF 0x86
-