20 files changed, 1793 insertions, 558 deletions
diff --git a/Documentation/power/00-INDEX b/Documentation/power/00-INDEX
index 45e9d4a9128..ad04cc8097e 100644
--- a/Documentation/power/00-INDEX
+++ b/Documentation/power/00-INDEX
@@ -4,6 +4,8 @@ apm-acpi.txt
 	- basic info about the APM and ACPI support.
 basic-pm-debugging.txt
 	- Debugging suspend and resume
+charger-manager.txt
+	- Battery charger management.
 devices.txt
 	- How drivers interact with system-wide power management
 drivers-testing.txt
@@ -22,10 +24,14 @@ pm_qos_interface.txt
 	- info on Linux PM Quality of Service interface
 power_supply_class.txt
 	- Tells userspace about battery, UPS, AC or DC power supply properties
+runtime_pm.txt
+	- Power management framework for I/O devices.
 s2ram.txt
 	- How to get suspend to ram working (and debug it when it isn't)
 states.txt
 	- System power management states
+suspend-and-cpuhotplug.txt
+	- Explains the interaction between Suspend-to-RAM (S3) and CPU hotplug
 swsusp-and-swap-files.txt
 	- Using swap files with software suspend (to disk)
 swsusp-dmcrypt.txt
@@ -36,7 +42,5 @@ tricks.txt
 	- How to trick software suspend (to disk) into working when it isn't
 userland-swsusp.txt
 	- Experimental implementation of software suspend in userspace
-video_extension.txt
-	- ACPI video extensions
 video.txt
 	- Video issues during resume from suspend
diff --git a/Documentation/power/basic-pm-debugging.txt b/Documentation/power/basic-pm-debugging.txt
index ddd78172ef7..edeecd447d2 100644
--- a/Documentation/power/basic-pm-debugging.txt
+++ b/Documentation/power/basic-pm-debugging.txt
@@ -15,7 +15,7 @@ test at least a couple of times in a row for confidence.  [This is necessary,
 because some problems only show up on a second attempt at suspending and
 resuming the system.]  Moreover, hibernating in the "reboot" and "shutdown"
 modes causes the PM core to skip some platform-related callbacks which on ACPI
-systems might be necessary to make hibernation work.  Thus, if you machine fails
+systems might be necessary to make hibernation work.  Thus, if your machine fails
 to hibernate or resume in the "reboot" mode, you should try the "platform" mode:
 
 # echo platform > /sys/power/disk
@@ -172,14 +172,14 @@ you can boot the kernel with the 'no_console_suspend' parameter and try to log
 kernel messages using the serial console.  This may provide you with some
 information about the reasons of the suspend (resume) failure.  Alternatively,
 it may be possible to use a FireWire port for debugging with firescope
-(ftp://ftp.firstfloor.org/pub/ak/firescope/).  On x86 it is also possible to
-use the PM_TRACE mechanism documented in Documentation/s2ram.txt .
+(http://v3.sk/~lkundrak/firescope/).  On x86 it is also possible to
+use the PM_TRACE mechanism documented in Documentation/power/s2ram.txt .
 
 2. Testing suspend to RAM (STR)
 
 To verify that the STR works, it is generally more convenient to use the s2ram
 tool available from http://suspend.sf.net and documented at
-http://en.opensuse.org/SDB:Suspend_to_RAM.
+http://en.opensuse.org/SDB:Suspend_to_RAM (S2RAM_LINK).
 
 Namely, after writing "freezer", "devices", "platform", "processors", or "core"
 into /sys/power/pm_test (available if the kernel is compiled with
@@ -194,10 +194,34 @@ Among other things, the testing with the help of /sys/power/pm_test may allow
 you to identify drivers that fail to suspend or resume their devices.  They
 should be unloaded every time before an STR transition.
 
-Next, you can follow the instructions at http://en.opensuse.org/s2ram to test
-the system, but if it does not work "out of the box", you may need to boot it
-with "init=/bin/bash" and test s2ram in the minimal configuration.  In that
-case, you may be able to search for failing drivers by following the procedure
+Next, you can follow the instructions at S2RAM_LINK to test the system, but if
+it does not work "out of the box", you may need to boot it with
+"init=/bin/bash" and test s2ram in the minimal configuration.  In that case,
+you may be able to search for failing drivers by following the procedure
 analogous to the one described in section 1.  If you find some failing drivers,
 you will have to unload them every time before an STR transition (ie. before
 you run s2ram), and please report the problems with them.
+
+There is a debugfs entry which shows the suspend to RAM statistics. Here is an
+example of its output.
+	# mount -t debugfs none /sys/kernel/debug
+	# cat /sys/kernel/debug/suspend_stats
+	success: 20
+	fail: 5
+	failed_freeze: 0
+	failed_prepare: 0
+	failed_suspend: 5
+	failed_suspend_noirq: 0
+	failed_resume: 0
+	failed_resume_noirq: 0
+	failures:
+	  last_failed_dev:	alarm
+				adc
+	  last_failed_errno:	-16
+				-16
+	  last_failed_step:	suspend
+				suspend
+Field success means the success number of suspend to RAM, and field fail means
+the failure number. Others are the failure number of different steps of suspend
+to RAM. suspend_stats just lists the last 2 failed devices, error number and
+failed step of suspend.
diff --git a/Documentation/power/charger-manager.txt b/Documentation/power/charger-manager.txt
new file mode 100644
index 00000000000..b4f7f4b23f6
--- /dev/null
+++ b/Documentation/power/charger-manager.txt
@@ -0,0 +1,200 @@
+Charger Manager
+	(C) 2011 MyungJoo Ham <myungjoo.ham@samsung.com>, GPL
+
+Charger Manager provides in-kernel battery charger management that
+requires temperature monitoring during suspend-to-RAM state
+and where each battery may have multiple chargers attached and the userland
+wants to look at the aggregated information of the multiple chargers.
+
+Charger Manager is a platform_driver with power-supply-class entries.
+An instance of Charger Manager (a platform-device created with Charger-Manager)
+represents an independent battery with chargers. If there are multiple
+batteries with their own chargers acting independently in a system,
+the system may need multiple instances of Charger Manager.
+
+1. Introduction
+===============
+
+Charger Manager supports the following:
+
+* Support for multiple chargers (e.g., a device with USB, AC, and solar panels)
+	A system may have multiple chargers (or power sources) and some of
+	they may be activated at the same time. Each charger may have its
+	own power-supply-class and each power-supply-class can provide
+	different information about the battery status. This framework
+	aggregates charger-related information from multiple sources and
+	shows combined information as a single power-supply-class.
+
+* Support for in suspend-to-RAM polling (with suspend_again callback)
+	While the battery is being charged and the system is in suspend-to-RAM,
+	we may need to monitor the battery health by looking at the ambient or
+	battery temperature. We can accomplish this by waking up the system
+	periodically. However, such a method wakes up devices unncessary for
+	monitoring the battery health and tasks, and user processes that are
+	supposed to be kept suspended. That, in turn, incurs unnecessary power
+	consumption and slow down charging process. Or even, such peak power
+	consumption can stop chargers in the middle of charging
+	(external power input < device power consumption), which not
+	only affects the charging time, but the lifespan of the battery.
+
+	Charger Manager provides a function "cm_suspend_again" that can be
+	used as suspend_again callback of platform_suspend_ops. If the platform
+	requires tasks other than cm_suspend_again, it may implement its own
+	suspend_again callback that calls cm_suspend_again in the middle.
+	Normally, the platform will need to resume and suspend some devices
+	that are used by Charger Manager.
+
+* Support for premature full-battery event handling
+	If the battery voltage drops by "fullbatt_vchkdrop_uV" after
+	"fullbatt_vchkdrop_ms" from the full-battery event, the framework
+	restarts charging. This check is also performed while suspended by
+	setting wakeup time accordingly and using suspend_again.
+
+* Support for uevent-notify
+	With the charger-related events, the device sends
+	notification to users with UEVENT.
+
+2. Global Charger-Manager Data related with suspend_again
+========================================================
+In order to setup Charger Manager with suspend-again feature
+(in-suspend monitoring), the user should provide charger_global_desc
+with setup_charger_manager(struct charger_global_desc *).
+This charger_global_desc data for in-suspend monitoring is global
+as the name suggests. Thus, the user needs to provide only once even
+if there are multiple batteries. If there are multiple batteries, the
+multiple instances of Charger Manager share the same charger_global_desc
+and it will manage in-suspend monitoring for all instances of Charger Manager.
+
+The user needs to provide all the three entries properly in order to activate
+in-suspend monitoring:
+
+struct charger_global_desc {
+
+char *rtc_name;
+	: The name of rtc (e.g., "rtc0") used to wakeup the system from
+	suspend for Charger Manager. The alarm interrupt (AIE) of the rtc
+	should be able to wake up the system from suspend. Charger Manager
+	saves and restores the alarm value and use the previously-defined
+	alarm if it is going to go off earlier than Charger Manager so that
+	Charger Manager does not interfere with previously-defined alarms.
+
+bool (*rtc_only_wakeup)(void);
+	: This callback should let CM know whether
+	the wakeup-from-suspend is caused only by the alarm of "rtc" in the
+	same struct. If there is any other wakeup source triggered the
+	wakeup, it should return false. If the "rtc" is the only wakeup
+	reason, it should return true.
+
+bool assume_timer_stops_in_suspend;
+	: if true, Charger Manager assumes that
+	the timer (CM uses jiffies as timer) stops during suspend. Then, CM
+	assumes that the suspend-duration is same as the alarm length.
+};
+
+3. How to setup suspend_again
+=============================
+Charger Manager provides a function "extern bool cm_suspend_again(void)".
+When cm_suspend_again is called, it monitors every battery. The suspend_ops
+callback of the system's platform_suspend_ops can call cm_suspend_again
+function to know whether Charger Manager wants to suspend again or not.
+If there are no other devices or tasks that want to use suspend_again
+feature, the platform_suspend_ops may directly refer to cm_suspend_again
+for its suspend_again callback.
+
+The cm_suspend_again() returns true (meaning "I want to suspend again")
+if the system was woken up by Charger Manager and the polling
+(in-suspend monitoring) results in "normal".
+
+4. Charger-Manager Data (struct charger_desc)
+=============================================
+For each battery charged independently from other batteries (if a series of
+batteries are charged by a single charger, they are counted as one independent
+battery), an instance of Charger Manager is attached to it.
+
+struct charger_desc {
+
+char *psy_name;
+	: The power-supply-class name of the battery. Default is
+	"battery" if psy_name is NULL. Users can access the psy entries
+	at "/sys/class/power_supply/[psy_name]/".
+
+enum polling_modes polling_mode;
+	: CM_POLL_DISABLE: do not poll this battery.
+	  CM_POLL_ALWAYS: always poll this battery.
+	  CM_POLL_EXTERNAL_POWER_ONLY: poll this battery if and only if
+				       an external power source is attached.
+	  CM_POLL_CHARGING_ONLY: poll this battery if and only if the
+				 battery is being charged.
+
+unsigned int fullbatt_vchkdrop_ms;
+unsigned int fullbatt_vchkdrop_uV;
+	: If both have non-zero values, Charger Manager will check the
+	battery voltage drop fullbatt_vchkdrop_ms after the battery is fully
+	charged. If the voltage drop is over fullbatt_vchkdrop_uV, Charger
+	Manager will try to recharge the battery by disabling and enabling
+	chargers. Recharge with voltage drop condition only (without delay
+	condition) is needed to be implemented with hardware interrupts from
+	fuel gauges or charger devices/chips.
+
+unsigned int fullbatt_uV;
+	: If specified with a non-zero value, Charger Manager assumes
+	that the battery is full (capacity = 100) if the battery is not being
+	charged and the battery voltage is equal to or greater than
+	fullbatt_uV.
+
+unsigned int polling_interval_ms;
+	: Required polling interval in ms. Charger Manager will poll
+	this battery every polling_interval_ms or more frequently.
+
+enum data_source battery_present;
+	: CM_BATTERY_PRESENT: assume that the battery exists.
+	CM_NO_BATTERY: assume that the battery does not exists.
+	CM_FUEL_GAUGE: get battery presence information from fuel gauge.
+	CM_CHARGER_STAT: get battery presence from chargers.
+
+char **psy_charger_stat;
+	: An array ending with NULL that has power-supply-class names of
+	chargers. Each power-supply-class should provide "PRESENT" (if
+	battery_present is "CM_CHARGER_STAT"), "ONLINE" (shows whether an
+	external power source is attached or not), and "STATUS" (shows whether
+	the battery is {"FULL" or not FULL} or {"FULL", "Charging",
+	"Discharging", "NotCharging"}).
+
+int num_charger_regulators;
+struct regulator_bulk_data *charger_regulators;
+	: Regulators representing the chargers in the form for
+	regulator framework's bulk functions.
+
+char *psy_fuel_gauge;
+	: Power-supply-class name of the fuel gauge.
+
+int (*temperature_out_of_range)(int *mC);
+bool measure_battery_temp;
+	: This callback returns 0 if the temperature is safe for charging,
+	a positive number if it is too hot to charge, and a negative number
+	if it is too cold to charge. With the variable mC, the callback returns
+	the temperature in 1/1000 of centigrade.
+	The source of temperature can be battery or ambient one according to
+	the value of measure_battery_temp.
+};
+
+5. Notify Charger-Manager of charger events: cm_notify_event()
+=========================================================
+If there is an charger event is required to notify
+Charger Manager, a charger device driver that triggers the event can call
+cm_notify_event(psy, type, msg) to notify the corresponding Charger Manager.
+In the function, psy is the charger driver's power_supply pointer, which is
+associated with Charger-Manager. The parameter "type"
+is the same as irq's type (enum cm_event_types). The event message "msg" is
+optional and is effective only if the event type is "UNDESCRIBED" or "OTHERS".
+
+6. Other Considerations
+=======================
+
+At the charger/battery-related events such as battery-pulled-out,
+charger-pulled-out, charger-inserted, DCIN-over/under-voltage, charger-stopped,
+and others critical to chargers, the system should be configured to wake up.
+At least the following should wake up the system from a suspend:
+a) charger-on/off b) external-power-in/out c) battery-in/out (while charging)
+
+It is usually accomplished by configuring the PMIC as a wakeup source.
diff --git a/Documentation/power/devices.txt b/Documentation/power/devices.txt
index 57080cd7457..d172bce0fd4 100644
--- a/Documentation/power/devices.txt
+++ b/Documentation/power/devices.txt
@@ -1,7 +1,8 @@
 Device Power Management
 
-Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+Copyright (c) 2010-2011 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
 Copyright (c) 2010 Alan Stern <stern@rowland.harvard.edu>
+Copyright (c) 2014 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
 
 
 Most of the code in Linux is device drivers, so most of the Linux power
@@ -96,6 +97,12 @@ struct dev_pm_ops {
 	int (*thaw)(struct device *dev);
 	int (*poweroff)(struct device *dev);
 	int (*restore)(struct device *dev);
+	int (*suspend_late)(struct device *dev);
+	int (*resume_early)(struct device *dev);
+	int (*freeze_late)(struct device *dev);
+	int (*thaw_early)(struct device *dev);
+	int (*poweroff_late)(struct device *dev);
+	int (*restore_early)(struct device *dev);
 	int (*suspend_noirq)(struct device *dev);
 	int (*resume_noirq)(struct device *dev);
 	int (*freeze_noirq)(struct device *dev);
@@ -123,9 +130,12 @@ please refer directly to the source code for more information about it.
 Subsystem-Level Methods
 -----------------------
 The core methods to suspend and resume devices reside in struct dev_pm_ops
-pointed to by the pm member of struct bus_type, struct device_type and
-struct class.  They are mostly of interest to the people writing infrastructure
-for buses, like PCI or USB, or device type and device class drivers.
+pointed to by the ops member of struct dev_pm_domain, or by the pm member of
+struct bus_type, struct device_type and struct class.  They are mostly of
+interest to the people writing infrastructure for platforms and buses, like PCI
+or USB, or device type and device class drivers.  They also are relevant to the
+writers of device drivers whose subsystems (PM domains, device types, device
+classes and bus types) don't provide all power management methods.
 
 Bus drivers implement these methods as appropriate for the hardware and the
 drivers using it; PCI works differently from USB, and so on.  Not many people
@@ -139,39 +149,57 @@ sequencing in the driver model tree.
 
 /sys/devices/.../power/wakeup files
 -----------------------------------
-All devices in the driver model have two flags to control handling of wakeup
-events (hardware signals that can force the device and/or system out of a low
-power state).  These flags are initialized by bus or device driver code using
+All device objects in the driver model contain fields that control the handling
+of system wakeup events (hardware signals that can force the system out of a
+sleep state).  These fields are initialized by bus or device driver code using
 device_set_wakeup_capable() and device_set_wakeup_enable(), defined in
 include/linux/pm_wakeup.h.
 
-The "can_wakeup" flag just records whether the device (and its driver) can
+The "power.can_wakeup" flag just records whether the device (and its driver) can
 physically support wakeup events.  The device_set_wakeup_capable() routine
-affects this flag.  The "should_wakeup" flag controls whether the device should
-try to use its wakeup mechanism.  device_set_wakeup_enable() affects this flag;
-for the most part drivers should not change its value.  The initial value of
-should_wakeup is supposed to be false for the majority of devices; the major
-exceptions are power buttons, keyboards, and Ethernet adapters whose WoL
-(wake-on-LAN) feature has been set up with ethtool.
+affects this flag.  The "power.wakeup" field is a pointer to an object of type
+struct wakeup_source used for controlling whether or not the device should use
+its system wakeup mechanism and for notifying the PM core of system wakeup
+events signaled by the device.  This object is only present for wakeup-capable
+devices (i.e. devices whose "can_wakeup" flags are set) and is created (or
+removed) by device_set_wakeup_capable().
 
 Whether or not a device is capable of issuing wakeup events is a hardware
 matter, and the kernel is responsible for keeping track of it.  By contrast,
 whether or not a wakeup-capable device should issue wakeup events is a policy
 decision, and it is managed by user space through a sysfs attribute: the
-power/wakeup file.  User space can write the strings "enabled" or "disabled" to
-set or clear the should_wakeup flag, respectively.  Reads from the file will
-return the corresponding string if can_wakeup is true, but if can_wakeup is
-false then reads will return an empty string, to indicate that the device
-doesn't support wakeup events.  (But even though the file appears empty, writes
-will still affect the should_wakeup flag.)
-
-The device_may_wakeup() routine returns true only if both flags are set.
-Drivers should check this routine when putting devices in a low-power state
-during a system sleep transition, to see whether or not to enable the devices'
-wakeup mechanisms.  However for runtime power management, wakeup events should
-be enabled whenever the device and driver both support them, regardless of the
-should_wakeup flag.
-
+"power/wakeup" file.  User space can write the strings "enabled" or "disabled"
+to it to indicate whether or not, respectively, the device is supposed to signal
+system wakeup.  This file is only present if the "power.wakeup" object exists
+for the given device and is created (or removed) along with that object, by
+device_set_wakeup_capable().  Reads from the file will return the corresponding
+string.
+
+The "power/wakeup" file is supposed to contain the "disabled" string initially
+for the majority of devices; the major exceptions are power buttons, keyboards,
+and Ethernet adapters whose WoL (wake-on-LAN) feature has been set up with
+ethtool.  It should also default to "enabled" for devices that don't generate
+wakeup requests on their own but merely forward wakeup requests from one bus to
+another (like PCI Express ports).
+
+The device_may_wakeup() routine returns true only if the "power.wakeup" object
+exists and the corresponding "power/wakeup" file contains the string "enabled".
+This information is used by subsystems, like the PCI bus type code, to see
+whether or not to enable the devices' wakeup mechanisms.  If device wakeup
+mechanisms are enabled or disabled directly by drivers, they also should use
+device_may_wakeup() to decide what to do during a system sleep transition.
+Device drivers, however, are not supposed to call device_set_wakeup_enable()
+directly in any case.
+
+It ought to be noted that system wakeup is conceptually different from "remote
+wakeup" used by runtime power management, although it may be supported by the
+same physical mechanism.  Remote wakeup is a feature allowing devices in
+low-power states to trigger specific interrupts to signal conditions in which
+they should be put into the full-power state.  Those interrupts may or may not
+be used to signal system wakeup events, depending on the hardware design.  On
+some systems it is impossible to trigger them from system sleep states.  In any
+case, remote wakeup should always be enabled for runtime power management for
+all devices and drivers that support it.
 
 /sys/devices/.../power/control files
 ------------------------------------
@@ -241,41 +269,51 @@ situations.
 System Power Management Phases
 ------------------------------
 Suspending or resuming the system is done in several phases.  Different phases
-are used for standby or memory sleep states ("suspend-to-RAM") and the
+are used for freeze, standby, and memory sleep states ("suspend-to-RAM") and the
 hibernation state ("suspend-to-disk").  Each phase involves executing callbacks
 for every device before the next phase begins.  Not all busses or classes
 support all these callbacks and not all drivers use all the callbacks.  The
 various phases always run after tasks have been frozen and before they are
 unfrozen.  Furthermore, the *_noirq phases run at a time when IRQ handlers have
-been disabled (except for those marked with the IRQ_WAKEUP flag).
+been disabled (except for those marked with the IRQF_NO_SUSPEND flag).
+
+All phases use PM domain, bus, type, class or driver callbacks (that is, methods
+defined in dev->pm_domain->ops, dev->bus->pm, dev->type->pm, dev->class->pm or
+dev->driver->pm).  These callbacks are regarded by the PM core as mutually
+exclusive.  Moreover, PM domain callbacks always take precedence over all of the
+other callbacks and, for example, type callbacks take precedence over bus, class
+and driver callbacks.  To be precise, the following rules are used to determine
+which callback to execute in the given phase:
+
+    1.	If dev->pm_domain is present, the PM core will choose the callback
+	included in dev->pm_domain->ops for execution
+
+    2.	Otherwise, if both dev->type and dev->type->pm are present, the callback
+	included in dev->type->pm will be chosen for execution.
 
-Most phases use bus, type, and class callbacks (that is, methods defined in
-dev->bus->pm, dev->type->pm, and dev->class->pm).  The prepare and complete
-phases are exceptions; they use only bus callbacks.  When multiple callbacks
-are used in a phase, they are invoked in the order: <class, type, bus> during
-power-down transitions and in the opposite order during power-up transitions.
-For example, during the suspend phase the PM core invokes
+    3.	Otherwise, if both dev->class and dev->class->pm are present, the
+	callback included in dev->class->pm will be chosen for execution.
 
-	dev->class->pm.suspend(dev);
-	dev->type->pm.suspend(dev);
-	dev->bus->pm.suspend(dev);
+    4.	Otherwise, if both dev->bus and dev->bus->pm are present, the callback
+	included in dev->bus->pm will be chosen for execution.
 
-before moving on to the next device, whereas during the resume phase the core
-invokes
+This allows PM domains and device types to override callbacks provided by bus
+types or device classes if necessary.
 
-	dev->bus->pm.resume(dev);
-	dev->type->pm.resume(dev);
-	dev->class->pm.resume(dev);
+The PM domain, type, class and bus callbacks may in turn invoke device- or
+driver-specific methods stored in dev->driver->pm, but they don't have to do
+that.
 
-These callbacks may in turn invoke device- or driver-specific methods stored in
-dev->driver->pm, but they don't have to.
+If the subsystem callback chosen for execution is not present, the PM core will
+execute the corresponding method from dev->driver->pm instead if there is one.
 
 
 Entering System Suspend
 -----------------------
-When the system goes into the standby or memory sleep state, the phases are:
+When the system goes into the freeze, standby or memory sleep state,
+the phases are:
 
-		prepare, suspend, suspend_noirq.
+		prepare, suspend, suspend_late, suspend_noirq.
 
     1.	The prepare phase is meant to prevent races by preventing new devices
 	from being registered; the PM core would never know that all the
@@ -284,18 +322,36 @@ When the system goes into the standby or memory sleep state, the phases are:
 	time.)  Unlike the other suspend-related phases, during the prepare
 	phase the device tree is traversed top-down.
 
-	The prepare phase uses only a bus callback.  After the callback method
-	returns, no new children may be registered below the device.  The method
-	may also prepare the device or driver in some way for the upcoming
-	system power transition, but it should not put the device into a
-	low-power state.
+	After the prepare callback method returns, no new children may be
+	registered below the device.  The method may also prepare the device or
+	driver in some way for the upcoming system power transition, but it
+	should not put the device into a low-power state.
+
+	For devices supporting runtime power management, the return value of the
+	prepare callback can be used to indicate to the PM core that it may
+	safely leave the device in runtime suspend (if runtime-suspended
+	already), provided that all of the device's descendants are also left in
+	runtime suspend.  Namely, if the prepare callback returns a positive
+	number and that happens for all of the descendants of the device too,
+	and all of them (including the device itself) are runtime-suspended, the
+	PM core will skip the suspend, suspend_late and	suspend_noirq suspend
+	phases as well as the resume_noirq, resume_early and resume phases of
+	the following system resume for all of these devices.	In that case,
+	the complete callback will be called directly after the prepare callback
+	and is entirely responsible for bringing the device back to the
+	functional state as appropriate.
 
     2.	The suspend methods should quiesce the device to stop it from performing
 	I/O.  They also may save the device registers and put it into the
 	appropriate low-power state, depending on the bus type the device is on,
 	and they may enable wakeup events.
 
-    3.	The suspend_noirq phase occurs after IRQ handlers have been disabled,
+    3	For a number of devices it is convenient to split suspend into the
+	"quiesce device" and "save device state" phases, in which cases
+	suspend_late is meant to do the latter.  It is always executed after
+	runtime power management has been disabled for all devices.
+
+    4.	The suspend_noirq phase occurs after IRQ handlers have been disabled,
 	which means that the driver's interrupt handler will not be called while
 	the callback method is running.  The methods should save the values of
 	the device's registers that weren't saved previously and finally put the
@@ -328,9 +384,9 @@ the devices that were suspended.
 
 Leaving System Suspend
 ----------------------
-When resuming from standby or memory sleep, the phases are:
+When resuming from freeze, standby or memory sleep, the phases are:
 
-		resume_noirq, resume, complete.
+		resume_noirq, resume_early, resume, complete.
 
     1.	The resume_noirq callback methods should perform any actions needed
 	before the driver's interrupt handlers are invoked.  This generally
@@ -346,21 +402,36 @@ When resuming from standby or memory sleep, the phases are:
 	device driver's ->pm.resume_noirq() method to perform device-specific
 	actions.
 
-    2.	The resume methods should bring the the device back to its operating
+    2.	The resume_early methods should prepare devices for the execution of
+	the resume methods.  This generally involves undoing the actions of the
+	preceding suspend_late phase.
+
+    3	The resume methods should bring the device back to its operating
 	state, so that it can perform normal I/O.  This generally involves
 	undoing the actions of the suspend phase.
 
-    3.	The complete phase uses only a bus callback.  The method should undo the
-	actions of the prepare phase.  Note, however, that new children may be
-	registered below the device as soon as the resume callbacks occur; it's
-	not necessary to wait until the complete phase.
+    4.	The complete phase should undo the actions of the prepare phase.  Note,
+	however, that new children may be registered below the device as soon as
+	the resume callbacks occur; it's not necessary to wait until the
+	complete phase.
+
+	Moreover, if the preceding prepare callback returned a positive number,
+	the device may have been left in runtime suspend throughout the whole
+	system suspend and resume (the suspend, suspend_late, suspend_noirq
+	phases of system suspend and the resume_noirq, resume_early, resume
+	phases of system resume may have been skipped for it).  In that case,
+	the complete callback is entirely responsible for bringing the device
+	back to the functional state after system suspend if necessary.  [For
+	example, it may need to queue up a runtime resume request for the device
+	for this purpose.]  To check if that is the case, the complete callback
+	can consult the device's power.direct_complete flag.  Namely, if that
+	flag is set when the complete callback is being run, it has been called
+	directly after the preceding prepare and special action may be required
+	to make the device work correctly afterward.
 
 At the end of these phases, drivers should be as functional as they were before
 suspending: I/O can be performed using DMA and IRQs, and the relevant clocks are
-gated on.  Even if the device was in a low-power state before the system sleep
-because of runtime power management, afterwards it should be back in its
-full-power state.  There are multiple reasons why it's best to do this; they are
-discussed in more detail in Documentation/power/runtime_pm.txt.
+gated on.
 
 However, the details here may again be platform-specific.  For example,
 some systems support multiple "run" states, and the mode in effect at
@@ -372,7 +443,7 @@ Drivers need to be able to handle hardware which has been reset since the
 suspend methods were called, for example by complete reinitialization.
 This may be the hardest part, and the one most protected by NDA'd documents
 and chip errata.  It's simplest if the hardware state hasn't changed since
-the suspend was carried out, but that can't be guaranteed (in fact, it ususally
+the suspend was carried out, but that can't be guaranteed (in fact, it usually
 is not the case).
 
 Drivers must also be prepared to notice that the device has been removed
@@ -389,8 +460,8 @@ the system log.
 
 Entering Hibernation
 --------------------
-Hibernating the system is more complicated than putting it into the standby or
-memory sleep state, because it involves creating and saving a system image.
+Hibernating the system is more complicated than putting it into the other
+sleep states, because it involves creating and saving a system image.
 Therefore there are more phases for hibernation, with a different set of
 callbacks.  These phases always run after tasks have been frozen and memory has
 been freed.
@@ -400,8 +471,8 @@ an image of the system memory while everything is stable, reactivate all
 devices (thaw), write the image to permanent storage, and finally shut down the
 system (poweroff).  The phases used to accomplish this are:
 
-	prepare, freeze, freeze_noirq, thaw_noirq, thaw, complete,
-	prepare, poweroff, poweroff_noirq
+	prepare, freeze, freeze_late, freeze_noirq, thaw_noirq, thaw_early,
+	thaw, complete, prepare, poweroff, poweroff_late, poweroff_noirq
 
     1.	The prepare phase is discussed in the "Entering System Suspend" section
 	above.
@@ -412,7 +483,11 @@ system (poweroff).  The phases used to accomplish this are:
 	save time it's best not to do so.  Also, the device should not be
 	prepared to generate wakeup events.
 
-    3.	The freeze_noirq phase is analogous to the suspend_noirq phase discussed
+    3.	The freeze_late phase is analogous to the suspend_late phase described
+	above, except that the device should not be put in a low-power state and
+	should not be allowed to generate wakeup events by it.
+
+    4.	The freeze_noirq phase is analogous to the suspend_noirq phase discussed
 	above, except again that the device should not be put in a low-power
 	state and should not be allowed to generate wakeup events.
 
@@ -420,32 +495,39 @@ At this point the system image is created.  All devices should be inactive and
 the contents of memory should remain undisturbed while this happens, so that the
 image forms an atomic snapshot of the system state.
 
-    4.	The thaw_noirq phase is analogous to the resume_noirq phase discussed
+    5.	The thaw_noirq phase is analogous to the resume_noirq phase discussed
 	above.  The main difference is that its methods can assume the device is
 	in the same state as at the end of the freeze_noirq phase.
 
-    5.	The thaw phase is analogous to the resume phase discussed above.  Its
+    6.	The thaw_early phase is analogous to the resume_early phase described
+	above.  Its methods should undo the actions of the preceding
+	freeze_late, if necessary.
+
+    7.	The thaw phase is analogous to the resume phase discussed above.  Its
 	methods should bring the device back to an operating state, so that it
 	can be used for saving the image if necessary.
 
-    6.	The complete phase is discussed in the "Leaving System Suspend" section
+    8.	The complete phase is discussed in the "Leaving System Suspend" section
 	above.
 
 At this point the system image is saved, and the devices then need to be
 prepared for the upcoming system shutdown.  This is much like suspending them
-before putting the system into the standby or memory sleep state, and the phases
-are similar.
+before putting the system into the freeze, standby or memory sleep state,
+and the phases are similar.
 
-    7.	The prepare phase is discussed above.
+    9.	The prepare phase is discussed above.
 
-    8.	The poweroff phase is analogous to the suspend phase.
+    10.	The poweroff phase is analogous to the suspend phase.
 
-    9.	The poweroff_noirq phase is analogous to the suspend_noirq phase.
+    11.	The poweroff_late phase is analogous to the suspend_late phase.
 
-The poweroff and poweroff_noirq callbacks should do essentially the same things
-as the suspend and suspend_noirq callbacks.  The only notable difference is that
-they need not store the device register values, because the registers should
-already have been stored during the freeze or freeze_noirq phases.
+    12.	The poweroff_noirq phase is analogous to the suspend_noirq phase.
+
+The poweroff, poweroff_late and poweroff_noirq callbacks should do essentially
+the same things as the suspend, suspend_late and suspend_noirq callbacks,
+respectively.  The only notable difference is that they need not store the
+device register values, because the registers should already have been stored
+during the freeze, freeze_late or freeze_noirq phases.
 
 
 Leaving Hibernation
@@ -489,54 +571,62 @@ To achieve this, the image kernel must restore the devices' pre-hibernation
 functionality.  The operation is much like waking up from the memory sleep
 state, although it involves different phases:
 
-	restore_noirq, restore, complete
+	restore_noirq, restore_early, restore, complete
 
     1.	The restore_noirq phase is analogous to the resume_noirq phase.
 
-    2.	The restore phase is analogous to the resume phase.
-
-    3.	The complete phase is discussed above.
-
-The main difference from resume[_noirq] is that restore[_noirq] must assume the
-device has been accessed and reconfigured by the boot loader or the boot kernel.
-Consequently the state of the device may be different from the state remembered
-from the freeze and freeze_noirq phases.  The device may even need to be reset
-and completely re-initialized.  In many cases this difference doesn't matter, so
-the resume[_noirq] and restore[_norq] method pointers can be set to the same
-routines.  Nevertheless, different callback pointers are used in case there is a
-situation where it actually matters.
-
-
-System Devices
---------------
-System devices (sysdevs) follow a slightly different API, which can be found in
-
-	include/linux/sysdev.h
-	drivers/base/sys.c
-
-System devices will be suspended with interrupts disabled, and after all other
-devices have been suspended.  On resume, they will be resumed before any other
-devices, and also with interrupts disabled.  These things occur in special
-"sysdev_driver" phases, which affect only system devices.
-
-Thus, after the suspend_noirq (or freeze_noirq or poweroff_noirq) phase, when
-the non-boot CPUs are all offline and IRQs are disabled on the remaining online
-CPU, then a sysdev_driver.suspend phase is carried out, and the system enters a
-sleep state (or a system image is created).  During resume (or after the image
-has been created or loaded) a sysdev_driver.resume phase is carried out, IRQs
-are enabled on the only online CPU, the non-boot CPUs are enabled, and the
-resume_noirq (or thaw_noirq or restore_noirq) phase begins.
-
-Code to actually enter and exit the system-wide low power state sometimes
-involves hardware details that are only known to the boot firmware, and
-may leave a CPU running software (from SRAM or flash memory) that monitors
-the system and manages its wakeup sequence.
+    2.	The restore_early phase is analogous to the resume_early phase.
+
+    3.	The restore phase is analogous to the resume phase.
+
+    4.	The complete phase is discussed above.
+
+The main difference from resume[_early|_noirq] is that restore[_early|_noirq]
+must assume the device has been accessed and reconfigured by the boot loader or
+the boot kernel.  Consequently the state of the device may be different from the
+state remembered from the freeze, freeze_late and freeze_noirq phases.  The
+device may even need to be reset and completely re-initialized.  In many cases
+this difference doesn't matter, so the resume[_early|_noirq] and
+restore[_early|_norq] method pointers can be set to the same routines.
+Nevertheless, different callback pointers are used in case there is a situation
+where it actually does matter.
+
+
+Device Power Management Domains
+-------------------------------
+Sometimes devices share reference clocks or other power resources.  In those
+cases it generally is not possible to put devices into low-power states
+individually.  Instead, a set of devices sharing a power resource can be put
+into a low-power state together at the same time by turning off the shared
+power resource.  Of course, they also need to be put into the full-power state
+together, by turning the shared power resource on.  A set of devices with this
+property is often referred to as a power domain.
+
+Support for power domains is provided through the pm_domain field of struct
+device.  This field is a pointer to an object of type struct dev_pm_domain,
+defined in include/linux/pm.h, providing a set of power management callbacks
+analogous to the subsystem-level and device driver callbacks that are executed
+for the given device during all power transitions, instead of the respective
+subsystem-level callbacks.  Specifically, if a device's pm_domain pointer is
+not NULL, the ->suspend() callback from the object pointed to by it will be
+executed instead of its subsystem's (e.g. bus type's) ->suspend() callback and
+analogously for all of the remaining callbacks.  In other words, power
+management domain callbacks, if defined for the given device, always take
+precedence over the callbacks provided by the device's subsystem (e.g. bus
+type).
+
+The support for device power management domains is only relevant to platforms
+needing to use the same device driver power management callbacks in many
+different power domain configurations and wanting to avoid incorporating the
+support for power domains into subsystem-level callbacks, for example by
+modifying the platform bus type.  Other platforms need not implement it or take
+it into account in any way.
 
 
 Device Low Power (suspend) States
 ---------------------------------
 Device low-power states aren't standard.  One device might only handle
-"on" and "off, while another might support a dozen different versions of
+"on" and "off", while another might support a dozen different versions of
 "on" (how many engines are active?), plus a state that gets back to "on"
 faster than from a full "off".
 
@@ -601,7 +691,7 @@ state temporarily, for example so that its system wakeup capability can be
 disabled.  This all depends on the hardware and the design of the subsystem and
 device driver in question.
 
-During system-wide resume from a sleep state it's best to put devices into the
-full-power state, as explained in Documentation/power/runtime_pm.txt.  Refer to
-that document for more information regarding this particular issue as well as
+During system-wide resume from a sleep state it's easiest to put devices into
+the full-power state, as explained in Documentation/power/runtime_pm.txt.  Refer
+to that document for more information regarding this particular issue as well as
 for information on the device runtime power management framework in general.
diff --git a/Documentation/power/drivers-testing.txt b/Documentation/power/drivers-testing.txt
index 7f7a737f7f9..638afdf4d6b 100644
--- a/Documentation/power/drivers-testing.txt
+++ b/Documentation/power/drivers-testing.txt
@@ -23,10 +23,10 @@ Once you have resolved the suspend/resume-related problems with your test system
 without the new driver, you are ready to test it:
 
 a) Build the driver as a module, load it and try the test modes of hibernation
-   (see: Documents/power/basic-pm-debugging.txt, 1).
+   (see: Documentation/power/basic-pm-debugging.txt, 1).
 
 b) Load the driver and attempt to hibernate in the "reboot", "shutdown" and
-   "platform" modes (see: Documents/power/basic-pm-debugging.txt, 1).
+   "platform" modes (see: Documentation/power/basic-pm-debugging.txt, 1).
 
 c) Compile the driver directly into the kernel and try the test modes of
    hibernation.
@@ -34,12 +34,12 @@ c) Compile the driver directly into the kernel and try the test modes of
 d) Attempt to hibernate with the driver compiled directly into the kernel
    in the "reboot", "shutdown" and "platform" modes.
 
-e) Try the test modes of suspend (see: Documents/power/basic-pm-debugging.txt,
+e) Try the test modes of suspend (see: Documentation/power/basic-pm-debugging.txt,
    2).  [As far as the STR tests are concerned, it should not matter whether or
    not the driver is built as a module.]
 
 f) Attempt to suspend to RAM using the s2ram tool with the driver loaded
-   (see: Documents/power/basic-pm-debugging.txt, 2).
+   (see: Documentation/power/basic-pm-debugging.txt, 2).
 
 Each of the above tests should be repeated several times and the STD tests
 should be mixed with the STR tests.  If any of them fails, the driver cannot be
diff --git a/Documentation/power/freezing-of-tasks.txt b/Documentation/power/freezing-of-tasks.txt
index 38b57248fd6..85894d83b35 100644
--- a/Documentation/power/freezing-of-tasks.txt
+++ b/Documentation/power/freezing-of-tasks.txt
@@ -9,7 +9,7 @@ architectures).
 
 II. How does it work?
 
-There are four per-task flags used for that, PF_NOFREEZE, PF_FROZEN, TIF_FREEZE
+There are three per-task flags used for that, PF_NOFREEZE, PF_FROZEN
 and PF_FREEZER_SKIP (the last one is auxiliary).  The tasks that have
 PF_NOFREEZE unset (all user space processes and some kernel threads) are
 regarded as 'freezable' and treated in a special way before the system enters a
@@ -17,30 +17,31 @@ suspend state as well as before a hibernation image is created (in what follows
 we only consider hibernation, but the description also applies to suspend).
 
 Namely, as the first step of the hibernation procedure the function
-freeze_processes() (defined in kernel/power/process.c) is called.  It executes
-try_to_freeze_tasks() that sets TIF_FREEZE for all of the freezable tasks and
-either wakes them up, if they are kernel threads, or sends fake signals to them,
-if they are user space processes.  A task that has TIF_FREEZE set, should react
-to it by calling the function called refrigerator() (defined in
-kernel/power/process.c), which sets the task's PF_FROZEN flag, changes its state
-to TASK_UNINTERRUPTIBLE and makes it loop until PF_FROZEN is cleared for it.
-Then, we say that the task is 'frozen' and therefore the set of functions
-handling this mechanism is referred to as 'the freezer' (these functions are
-defined in kernel/power/process.c and include/linux/freezer.h).  User space
-processes are generally frozen before kernel threads.
-
-It is not recommended to call refrigerator() directly.  Instead, it is
-recommended to use the try_to_freeze() function (defined in
-include/linux/freezer.h), that checks the task's TIF_FREEZE flag and makes the
-task enter refrigerator() if the flag is set.
+freeze_processes() (defined in kernel/power/process.c) is called.  A system-wide
+variable system_freezing_cnt (as opposed to a per-task flag) is used to indicate
+whether the system is to undergo a freezing operation. And freeze_processes()
+sets this variable.  After this, it executes try_to_freeze_tasks() that sends a
+fake signal to all user space processes, and wakes up all the kernel threads.
+All freezable tasks must react to that by calling try_to_freeze(), which
+results in a call to __refrigerator() (defined in kernel/freezer.c), which sets
+the task's PF_FROZEN flag, changes its state to TASK_UNINTERRUPTIBLE and makes
+it loop until PF_FROZEN is cleared for it. Then, we say that the task is
+'frozen' and therefore the set of functions handling this mechanism is referred
+to as 'the freezer' (these functions are defined in kernel/power/process.c,
+kernel/freezer.c & include/linux/freezer.h). User space processes are generally
+frozen before kernel threads.
+
+__refrigerator() must not be called directly.  Instead, use the
+try_to_freeze() function (defined in include/linux/freezer.h), that checks
+if the task is to be frozen and makes the task enter __refrigerator().
 
 For user space processes try_to_freeze() is called automatically from the
 signal-handling code, but the freezable kernel threads need to call it
 explicitly in suitable places or use the wait_event_freezable() or
 wait_event_freezable_timeout() macros (defined in include/linux/freezer.h)
-that combine interruptible sleep with checking if TIF_FREEZE is set and calling
-try_to_freeze().  The main loop of a freezable kernel thread may look like the
-following one:
+that combine interruptible sleep with checking if the task is to be frozen and
+calling try_to_freeze().  The main loop of a freezable kernel thread may look
+like the following one:
 
 	set_freezable();
 	do {
@@ -53,7 +54,7 @@ following one:
 (from drivers/usb/core/hub.c::hub_thread()).
 
 If a freezable kernel thread fails to call try_to_freeze() after the freezer has
-set TIF_FREEZE for it, the freezing of tasks will fail and the entire
+initiated a freezing operation, the freezing of tasks will fail and the entire
 hibernation operation will be cancelled.  For this reason, freezable kernel
 threads must call try_to_freeze() somewhere or use one of the
 wait_event_freezable() and wait_event_freezable_timeout() macros.
@@ -61,13 +62,34 @@ wait_event_freezable() and wait_event_freezable_timeout() macros.
 After the system memory state has been restored from a hibernation image and
 devices have been reinitialized, the function thaw_processes() is called in
 order to clear the PF_FROZEN flag for each frozen task.  Then, the tasks that
-have been frozen leave refrigerator() and continue running.
+have been frozen leave __refrigerator() and continue running.
+
+
+Rationale behind the functions dealing with freezing and thawing of tasks:
+-------------------------------------------------------------------------
+
+freeze_processes():
+  - freezes only userspace tasks
+
+freeze_kernel_threads():
+  - freezes all tasks (including kernel threads) because we can't freeze
+    kernel threads without freezing userspace tasks
+
+thaw_kernel_threads():
+  - thaws only kernel threads; this is particularly useful if we need to do
+    anything special in between thawing of kernel threads and thawing of
+    userspace tasks, or if we want to postpone the thawing of userspace tasks
+
+thaw_processes():
+  - thaws all tasks (including kernel threads) because we can't thaw userspace
+    tasks without thawing kernel threads
+
 
 III. Which kernel threads are freezable?
 
 Kernel threads are not freezable by default.  However, a kernel thread may clear
 PF_NOFREEZE for itself by calling set_freezable() (the resetting of PF_NOFREEZE
-directly is strongly discouraged).  From this point it is regarded as freezable
+directly is not allowed).  From this point it is regarded as freezable
 and must call try_to_freeze() in a suitable place.
 
 IV. Why do we do that?
@@ -95,7 +117,7 @@ after the memory for the image has been freed, we don't want tasks to allocate
 additional memory and we prevent them from doing that by freezing them earlier.
 [Of course, this also means that device drivers should not allocate substantial
 amounts of memory from their .suspend() callbacks before hibernation, but this
-is e separate issue.]
+is a separate issue.]
 
 3. The third reason is to prevent user space processes and some kernel threads
 from interfering with the suspending and resuming of devices.  A user space
@@ -120,10 +142,10 @@ So in practice, the 'at all' may become a 'why freeze kernel threads?' and
 freezing user threads I don't find really objectionable."
 
 Still, there are kernel threads that may want to be freezable.  For example, if
-a kernel that belongs to a device driver accesses the device directly, it in
-principle needs to know when the device is suspended, so that it doesn't try to
-access it at that time.  However, if the kernel thread is freezable, it will be
-frozen before the driver's .suspend() callback is executed and it will be
+a kernel thread that belongs to a device driver accesses the device directly, it
+in principle needs to know when the device is suspended, so that it doesn't try
+to access it at that time.  However, if the kernel thread is freezable, it will
+be frozen before the driver's .suspend() callback is executed and it will be
 thawed after the driver's .resume() callback has run, so it won't be accessing
 the device while it's suspended.
 
@@ -176,3 +198,33 @@ tasks, since it generally exists anyway.
 A driver must have all firmwares it may need in RAM before suspend() is called.
 If keeping them is not practical, for example due to their size, they must be
 requested early enough using the suspend notifier API described in notifiers.txt.
+
+VI. Are there any precautions to be taken to prevent freezing failures?
+
+Yes, there are.
+
+First of all, grabbing the 'pm_mutex' lock to mutually exclude a piece of code
+from system-wide sleep such as suspend/hibernation is not encouraged.
+If possible, that piece of code must instead hook onto the suspend/hibernation
+notifiers to achieve mutual exclusion. Look at the CPU-Hotplug code
+(kernel/cpu.c) for an example.
+
+However, if that is not feasible, and grabbing 'pm_mutex' is deemed necessary,
+it is strongly discouraged to directly call mutex_[un]lock(&pm_mutex) since
+that could lead to freezing failures, because if the suspend/hibernate code
+successfully acquired the 'pm_mutex' lock, and hence that other entity failed
+to acquire the lock, then that task would get blocked in TASK_UNINTERRUPTIBLE
+state. As a consequence, the freezer would not be able to freeze that task,
+leading to freezing failure.
+
+However, the [un]lock_system_sleep() APIs are safe to use in this scenario,
+since they ask the freezer to skip freezing this task, since it is anyway
+"frozen enough" as it is blocked on 'pm_mutex', which will be released
+only after the entire suspend/hibernation sequence is complete.
+So, to summarize, use [un]lock_system_sleep() instead of directly using
+mutex_[un]lock(&pm_mutex). That would prevent freezing failures.
+
+V. Miscellaneous
+/sys/power/pm_freeze_timeout controls how long it will cost at most to freeze
+all user space processes or all freezable kernel threads, in unit of millisecond.
+The default value is 20000, with range of unsigned integer.
diff --git a/Documentation/power/interface.txt b/Documentation/power/interface.txt
index c537834af00..f1f0f59a7c4 100644
--- a/Documentation/power/interface.txt
+++ b/Documentation/power/interface.txt
@@ -7,8 +7,8 @@ running. The interface exists in /sys/power/ directory (assuming sysfs
 is mounted at /sys). 
 
 /sys/power/state controls system power state. Reading from this file
-returns what states are supported, which is hard-coded to 'standby'
-(Power-On Suspend), 'mem' (Suspend-to-RAM), and 'disk'
+returns what states are supported, which is hard-coded to 'freeze',
+'standby' (Power-On Suspend), 'mem' (Suspend-to-RAM), and 'disk'
 (Suspend-to-Disk). 
 
 Writing to this file one of those strings causes the system to
diff --git a/Documentation/power/notifiers.txt b/Documentation/power/notifiers.txt
index ae1b7ec0768..a81fa254303 100644
--- a/Documentation/power/notifiers.txt
+++ b/Documentation/power/notifiers.txt
@@ -1,46 +1,43 @@
 Suspend notifiers
-	(C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL
-
-There are some operations that device drivers may want to carry out in their
-.suspend() routines, but shouldn't, because they can cause the hibernation or
-suspend to fail. For example, a driver may want to allocate a substantial amount
-of memory (like 50 MB) in .suspend(), but that shouldn't be done after the
-swsusp's memory shrinker has run.
-
-Also, there may be some operations, that subsystems want to carry out before a
-hibernation/suspend or after a restore/resume, requiring the system to be fully
-functional, so the drivers' .suspend() and .resume() routines are not suitable
-for this purpose.  For example, device drivers may want to upload firmware to
-their devices after a restore from a hibernation image, but they cannot do it by
-calling request_firmware() from their .resume() routines (user land processes
-are frozen at this point).  The solution may be to load the firmware into
-memory before processes are frozen and upload it from there in the .resume()
-routine.  Of course, a hibernation notifier may be used for this purpose.
-
-The subsystems that have such needs can register suspend notifiers that will be
-called upon the following events by the suspend core:
-
-PM_HIBERNATION_PREPARE	The system is going to hibernate or suspend, tasks will
-			be frozen immediately.
+	(C) 2007-2011 Rafael J. Wysocki <rjw@sisk.pl>, GPL
+
+There are some operations that subsystems or drivers may want to carry out
+before hibernation/suspend or after restore/resume, but they require the system
+to be fully functional, so the drivers' and subsystems' .suspend() and .resume()
+or even .prepare() and .complete() callbacks are not suitable for this purpose.
+For example, device drivers may want to upload firmware to their devices after
+resume/restore, but they cannot do it by calling request_firmware() from their
+.resume() or .complete() routines (user land processes are frozen at these
+points).  The solution may be to load the firmware into memory before processes
+are frozen and upload it from there in the .resume() routine.
+A suspend/hibernation notifier may be used for this purpose.
+
+The subsystems or drivers having such needs can register suspend notifiers that
+will be called upon the following events by the PM core:
+
+PM_HIBERNATION_PREPARE	The system is going to hibernate, tasks will be frozen
+			immediately. This is different from PM_SUSPEND_PREPARE
+			below because here we do additional work between notifiers
+			and drivers freezing.
 
 PM_POST_HIBERNATION	The system memory state has been restored from a
-			hibernation image or an error occured during the
-			hibernation.  Device drivers' .resume() callbacks have
+			hibernation image or an error occurred during
+			hibernation.  Device drivers' restore callbacks have
 			been executed and tasks have been thawed.
 
 PM_RESTORE_PREPARE	The system is going to restore a hibernation image.
-			If all goes well the restored kernel will issue a
+			If all goes well, the restored kernel will issue a
 			PM_POST_HIBERNATION notification.
 
-PM_POST_RESTORE		An error occurred during the hibernation restore.
-			Device drivers' .resume() callbacks have been executed
+PM_POST_RESTORE		An error occurred during restore from hibernation.
+			Device drivers' restore callbacks have been executed
 			and tasks have been thawed.
 
-PM_SUSPEND_PREPARE	The system is preparing for a suspend.
+PM_SUSPEND_PREPARE	The system is preparing for suspend.
 
-PM_POST_SUSPEND		The system has just resumed or an error occured during
-			the suspend.	Device drivers' .resume() callbacks have
-			been executed and tasks have been thawed.
+PM_POST_SUSPEND		The system has just resumed or an error occurred during
+			suspend.  Device drivers' resume callbacks have been
+			executed and tasks have been thawed.
 
 It is generally assumed that whatever the notifiers do for
 PM_HIBERNATION_PREPARE, should be undone for PM_POST_HIBERNATION.  Analogously,
diff --git a/Documentation/power/opp.txt b/Documentation/power/opp.txt
index 44d87ad3cea..a9adad828cd 100644
--- a/Documentation/power/opp.txt
+++ b/Documentation/power/opp.txt
@@ -1,6 +1,5 @@
-*=============*
-* OPP Library *
-*=============*
+Operating Performance Points (OPP) Library
+==========================================
 
 (C) 2009-2010 Nishanth Menon <nm@ti.com>, Texas Instruments Incorporated
 
@@ -11,23 +10,38 @@ Contents
 3. OPP Search Functions
 4. OPP Availability Control Functions
 5. OPP Data Retrieval Functions
-6. Cpufreq Table Generation
-7. Data Structures
+6. Data Structures
 
 1. Introduction
 ===============
+1.1 What is an Operating Performance Point (OPP)?
+
 Complex SoCs of today consists of a multiple sub-modules working in conjunction.
 In an operational system executing varied use cases, not all modules in the SoC
 need to function at their highest performing frequency all the time. To
 facilitate this, sub-modules in a SoC are grouped into domains, allowing some
-domains to run at lower voltage and frequency while other domains are loaded
-more. The set of discrete tuples consisting of frequency and voltage pairs that
+domains to run at lower voltage and frequency while other domains run at
+voltage/frequency pairs that are higher.
+
+The set of discrete tuples consisting of frequency and voltage pairs that
 the device will support per domain are called Operating Performance Points or
 OPPs.
 
+As an example:
+Let us consider an MPU device which supports the following:
+{300MHz at minimum voltage of 1V}, {800MHz at minimum voltage of 1.2V},
+{1GHz at minimum voltage of 1.3V}
+
+We can represent these as three OPPs as the following {Hz, uV} tuples:
+{300000000, 1000000}
+{800000000, 1200000}
+{1000000000, 1300000}
+
+1.2 Operating Performance Points Library
+
 OPP library provides a set of helper functions to organize and query the OPP
 information. The library is located in drivers/base/power/opp.c and the header
-is located in include/linux/opp.h. OPP library can be enabled by enabling
+is located in include/linux/pm_opp.h. OPP library can be enabled by enabling
 CONFIG_PM_OPP from power management menuconfig menu. OPP library depends on
 CONFIG_PM as certain SoCs such as Texas Instrument's OMAP framework allows to
 optionally boot at a certain OPP without needing cpufreq.
@@ -37,6 +51,9 @@ Typical usage of the OPP library is as follows:
 SoC framework	-> modifies on required cases certain OPPs	-> OPP layer
 		-> queries to search/retrieve information	->
 
+Architectures that provide a SoC framework for OPP should select ARCH_HAS_OPP
+to make the OPP layer available.
+
 OPP layer expects each domain to be represented by a unique device pointer. SoC
 framework registers a set of initial OPPs per device with the OPP layer. This
 list is expected to be an optimally small number typically around 5 per device.
@@ -53,14 +70,13 @@ operations until that OPP could be re-enabled if possible.
 
 OPP library facilitates this concept in it's implementation. The following
 operational functions operate only on available opps:
-opp_find_freq_{ceil, floor}, opp_get_voltage, opp_get_freq, opp_get_opp_count
-and opp_init_cpufreq_table
+opp_find_freq_{ceil, floor}, dev_pm_opp_get_voltage, dev_pm_opp_get_freq, dev_pm_opp_get_opp_count
 
-opp_find_freq_exact is meant to be used to find the opp pointer which can then
-be used for opp_enable/disable functions to make an opp available as required.
+dev_pm_opp_find_freq_exact is meant to be used to find the opp pointer which can then
+be used for dev_pm_opp_enable/disable functions to make an opp available as required.
 
 WARNING: Users of OPP library should refresh their availability count using
-get_opp_count if opp_enable/disable functions are invoked for a device, the
+get_opp_count if dev_pm_opp_enable/disable functions are invoked for a device, the
 exact mechanism to trigger these or the notification mechanism to other
 dependent subsystems such as cpufreq are left to the discretion of the SoC
 specific framework which uses the OPP library. Similar care needs to be taken
@@ -78,24 +94,23 @@ using RCU read locks. The opp_find_freq_{exact,ceil,floor},
 opp_get_{voltage, freq, opp_count} fall into this category.
 
 opp_{add,enable,disable} are updaters which use mutex and implement it's own
-RCU locking mechanisms. opp_init_cpufreq_table acts as an updater and uses
-mutex to implment RCU updater strategy. These functions should *NOT* be called
-under RCU locks and other contexts that prevent blocking functions in RCU or
-mutex operations from working.
+RCU locking mechanisms. These functions should *NOT* be called under RCU locks
+and other contexts that prevent blocking functions in RCU or mutex operations
+from working.
 
 2. Initial OPP List Registration
 ================================
-The SoC implementation calls opp_add function iteratively to add OPPs per
+The SoC implementation calls dev_pm_opp_add function iteratively to add OPPs per
 device. It is expected that the SoC framework will register the OPP entries
 optimally- typical numbers range to be less than 5. The list generated by
 registering the OPPs is maintained by OPP library throughout the device
 operation. The SoC framework can subsequently control the availability of the
-OPPs dynamically using the opp_enable / disable functions.
+OPPs dynamically using the dev_pm_opp_enable / disable functions.
 
-opp_add - Add a new OPP for a specific domain represented by the device pointer.
+dev_pm_opp_add - Add a new OPP for a specific domain represented by the device pointer.
 	The OPP is defined using the frequency and voltage. Once added, the OPP
 	is assumed to be available and control of it's availability can be done
-	with the opp_enable/disable functions. OPP library internally stores
+	with the dev_pm_opp_enable/disable functions. OPP library internally stores
 	and manages this information in the opp struct. This function may be
 	used by SoC framework to define a optimal list as per the demands of
 	SoC usage environment.
@@ -106,7 +121,7 @@ opp_add - Add a new OPP for a specific domain represented by the device pointer.
 	 soc_pm_init()
 	 {
 		/* Do things */
-		r = opp_add(mpu_dev, 1000000, 900000);
+		r = dev_pm_opp_add(mpu_dev, 1000000, 900000);
 		if (!r) {
 			pr_err("%s: unable to register mpu opp(%d)\n", r);
 			goto no_cpufreq;
@@ -125,44 +140,44 @@ functions return the matching pointer representing the opp if a match is
 found, else returns error. These errors are expected to be handled by standard
 error checks such as IS_ERR() and appropriate actions taken by the caller.
 
-opp_find_freq_exact - Search for an OPP based on an *exact* frequency and
+dev_pm_opp_find_freq_exact - Search for an OPP based on an *exact* frequency and
 	availability. This function is especially useful to enable an OPP which
 	is not available by default.
 	Example: In a case when SoC framework detects a situation where a
 	higher frequency could be made available, it can use this function to
-	find the OPP prior to call the opp_enable to actually make it available.
+	find the OPP prior to call the dev_pm_opp_enable to actually make it available.
 	 rcu_read_lock();
-	 opp = opp_find_freq_exact(dev, 1000000000, false);
+	 opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false);
 	 rcu_read_unlock();
 	 /* dont operate on the pointer.. just do a sanity check.. */
 	 if (IS_ERR(opp)) {
 		pr_err("frequency not disabled!\n");
 		/* trigger appropriate actions.. */
 	 } else {
-		opp_enable(dev,1000000000);
+		dev_pm_opp_enable(dev,1000000000);
 	 }
 
 	NOTE: This is the only search function that operates on OPPs which are
 	not available.
 
-opp_find_freq_floor - Search for an available OPP which is *at most* the
+dev_pm_opp_find_freq_floor - Search for an available OPP which is *at most* the
 	provided frequency. This function is useful while searching for a lesser
 	match OR operating on OPP information in the order of decreasing
 	frequency.
 	Example: To find the highest opp for a device:
 	 freq = ULONG_MAX;
 	 rcu_read_lock();
-	 opp_find_freq_floor(dev, &freq);
+	 dev_pm_opp_find_freq_floor(dev, &freq);
 	 rcu_read_unlock();
 
-opp_find_freq_ceil - Search for an available OPP which is *at least* the
+dev_pm_opp_find_freq_ceil - Search for an available OPP which is *at least* the
 	provided frequency. This function is useful while searching for a
 	higher match OR operating on OPP information in the order of increasing
 	frequency.
 	Example 1: To find the lowest opp for a device:
 	 freq = 0;
 	 rcu_read_lock();
-	 opp_find_freq_ceil(dev, &freq);
+	 dev_pm_opp_find_freq_ceil(dev, &freq);
 	 rcu_read_unlock();
 	Example 2: A simplified implementation of a SoC cpufreq_driver->target:
 	 soc_cpufreq_target(..)
@@ -170,12 +185,12 @@ opp_find_freq_ceil - Search for an available OPP which is *at least* the
 		/* Do stuff like policy checks etc. */
 		/* Find the best frequency match for the req */
 		rcu_read_lock();
-		opp = opp_find_freq_ceil(dev, &freq);
+		opp = dev_pm_opp_find_freq_ceil(dev, &freq);
 		rcu_read_unlock();
 		if (!IS_ERR(opp))
 			soc_switch_to_freq_voltage(freq);
 		else
-			/* do something when we cant satisfy the req */
+			/* do something when we can't satisfy the req */
 		/* do other stuff */
 	 }
 
@@ -190,34 +205,34 @@ as thermal considerations (e.g. don't use OPPx until the temperature drops).
 
 WARNING: Do not use these functions in interrupt context.
 
-opp_enable - Make a OPP available for operation.
+dev_pm_opp_enable - Make a OPP available for operation.
 	Example: Lets say that 1GHz OPP is to be made available only if the
 	SoC temperature is lower than a certain threshold. The SoC framework
 	implementation might choose to do something as follows:
 	 if (cur_temp < temp_low_thresh) {
 		/* Enable 1GHz if it was disabled */
 		rcu_read_lock();
-		opp = opp_find_freq_exact(dev, 1000000000, false);
+		opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false);
 		rcu_read_unlock();
 		/* just error check */
 		if (!IS_ERR(opp))
-			ret = opp_enable(dev, 1000000000);
+			ret = dev_pm_opp_enable(dev, 1000000000);
 		else
 			goto try_something_else;
 	 }
 
-opp_disable - Make an OPP to be not available for operation
+dev_pm_opp_disable - Make an OPP to be not available for operation
 	Example: Lets say that 1GHz OPP is to be disabled if the temperature
 	exceeds a threshold value. The SoC framework implementation might
 	choose to do something as follows:
 	 if (cur_temp > temp_high_thresh) {
 		/* Disable 1GHz if it was enabled */
 		rcu_read_lock();
-		opp = opp_find_freq_exact(dev, 1000000000, true);
+		opp = dev_pm_opp_find_freq_exact(dev, 1000000000, true);
 		rcu_read_unlock();
 		/* just error check */
 		if (!IS_ERR(opp))
-			ret = opp_disable(dev, 1000000000);
+			ret = dev_pm_opp_disable(dev, 1000000000);
 		else
 			goto try_something_else;
 	 }
@@ -229,7 +244,7 @@ information from the OPP structure is necessary. Once an OPP pointer is
 retrieved using the search functions, the following functions can be used by SoC
 framework to retrieve the information represented inside the OPP layer.
 
-opp_get_voltage - Retrieve the voltage represented by the opp pointer.
+dev_pm_opp_get_voltage - Retrieve the voltage represented by the opp pointer.
 	Example: At a cpufreq transition to a different frequency, SoC
 	framework requires to set the voltage represented by the OPP using
 	the regulator framework to the Power Management chip providing the
@@ -238,15 +253,15 @@ opp_get_voltage - Retrieve the voltage represented by the opp pointer.
 	 {
 		/* do things */
 		rcu_read_lock();
-		opp = opp_find_freq_ceil(dev, &freq);
-		v = opp_get_voltage(opp);
+		opp = dev_pm_opp_find_freq_ceil(dev, &freq);
+		v = dev_pm_opp_get_voltage(opp);
 		rcu_read_unlock();
 		if (v)
 			regulator_set_voltage(.., v);
 		/* do other things */
 	 }
 
-opp_get_freq - Retrieve the freq represented by the opp pointer.
+dev_pm_opp_get_freq - Retrieve the freq represented by the opp pointer.
 	Example: Lets say the SoC framework uses a couple of helper functions
 	we could pass opp pointers instead of doing additional parameters to
 	handle quiet a bit of data parameters.
@@ -255,8 +270,8 @@ opp_get_freq - Retrieve the freq represented by the opp pointer.
 		/* do things.. */
 		 max_freq = ULONG_MAX;
 		 rcu_read_lock();
-		 max_opp = opp_find_freq_floor(dev,&max_freq);
-		 requested_opp = opp_find_freq_ceil(dev,&freq);
+		 max_opp = dev_pm_opp_find_freq_floor(dev,&max_freq);
+		 requested_opp = dev_pm_opp_find_freq_ceil(dev,&freq);
 		 if (!IS_ERR(max_opp) && !IS_ERR(requested_opp))
 			r = soc_test_validity(max_opp, requested_opp);
 		 rcu_read_unlock();
@@ -264,25 +279,25 @@ opp_get_freq - Retrieve the freq represented by the opp pointer.
 	 }
 	 soc_test_validity(..)
 	 {
-		 if(opp_get_voltage(max_opp) < opp_get_voltage(requested_opp))
+		 if(dev_pm_opp_get_voltage(max_opp) < dev_pm_opp_get_voltage(requested_opp))
 			 return -EINVAL;
-		 if(opp_get_freq(max_opp) < opp_get_freq(requested_opp))
+		 if(dev_pm_opp_get_freq(max_opp) < dev_pm_opp_get_freq(requested_opp))
 			 return -EINVAL;
 		/* do things.. */
 	 }
 
-opp_get_opp_count - Retrieve the number of available opps for a device
+dev_pm_opp_get_opp_count - Retrieve the number of available opps for a device
 	Example: Lets say a co-processor in the SoC needs to know the available
 	frequencies in a table, the main processor can notify as following:
 	 soc_notify_coproc_available_frequencies()
 	 {
 		/* Do things */
 		rcu_read_lock();
-		num_available = opp_get_opp_count(dev);
+		num_available = dev_pm_opp_get_opp_count(dev);
 		speeds = kzalloc(sizeof(u32) * num_available, GFP_KERNEL);
 		/* populate the table in increasing order */
 		freq = 0;
-		while (!IS_ERR(opp = opp_find_freq_ceil(dev, &freq))) {
+		while (!IS_ERR(opp = dev_pm_opp_find_freq_ceil(dev, &freq))) {
 			speeds[i] = freq;
 			freq++;
 			i++;
@@ -293,32 +308,7 @@ opp_get_opp_count - Retrieve the number of available opps for a device
 		/* Do other things */
 	 }
 
-6. Cpufreq Table Generation
-===========================
-opp_init_cpufreq_table - cpufreq framework typically is initialized with
-	cpufreq_frequency_table_cpuinfo which is provided with the list of
-	frequencies that are available for operation. This function provides
-	a ready to use conversion routine to translate the OPP layer's internal
-	information about the available frequencies into a format readily
-	providable to cpufreq.
-
-	WARNING: Do not use this function in interrupt context.
-
-	Example:
-	 soc_pm_init()
-	 {
-		/* Do things */
-		r = opp_init_cpufreq_table(dev, &freq_table);
-		if (!r)
-			cpufreq_frequency_table_cpuinfo(policy, freq_table);
-		/* Do other things */
-	 }
-
-	NOTE: This function is available only if CONFIG_CPU_FREQ is enabled in
-	addition to CONFIG_PM as power management feature is required to
-	dynamically scale voltage and frequency in a system.
-
-7. Data Structures
+6. Data Structures
 ==================
 Typically an SoC contains multiple voltage domains which are variable. Each
 domain is represented by a device pointer. The relationship to OPP can be
@@ -338,16 +328,16 @@ accessed by various functions as described above. However, the structures
 representing the actual OPPs and domains are internal to the OPP library itself
 to allow for suitable abstraction reusable across systems.
 
-struct opp - The internal data structure of OPP library which is used to
+struct dev_pm_opp - The internal data structure of OPP library which is used to
 	represent an OPP. In addition to the freq, voltage, availability
 	information, it also contains internal book keeping information required
 	for the OPP library to operate on.  Pointer to this structure is
 	provided back to the users such as SoC framework to be used as a
 	identifier for OPP in the interactions with OPP layer.
 
-	WARNING: The struct opp pointer should not be parsed or modified by the
-	users. The defaults of for an instance is populated by opp_add, but the
-	availability of the OPP can be modified by opp_enable/disable functions.
+	WARNING: The struct dev_pm_opp pointer should not be parsed or modified by the
+	users. The defaults of for an instance is populated by dev_pm_opp_add, but the
+	availability of the OPP can be modified by dev_pm_opp_enable/disable functions.
 
 struct device - This is used to identify a domain to the OPP layer. The
 	nature of the device and it's implementation is left to the user of
@@ -357,19 +347,19 @@ Overall, in a simplistic view, the data structure operations is represented as
 following:
 
 Initialization / modification:
-            +-----+        /- opp_enable
-opp_add --> | opp | <-------
-  |         +-----+        \- opp_disable
+            +-----+        /- dev_pm_opp_enable
+dev_pm_opp_add --> | opp | <-------
+  |         +-----+        \- dev_pm_opp_disable
   \-------> domain_info(device)
 
 Search functions:
-             /-- opp_find_freq_ceil  ---\   +-----+
-domain_info<---- opp_find_freq_exact -----> | opp |
-             \-- opp_find_freq_floor ---/   +-----+
+             /-- dev_pm_opp_find_freq_ceil  ---\   +-----+
+domain_info<---- dev_pm_opp_find_freq_exact -----> | opp |
+             \-- dev_pm_opp_find_freq_floor ---/   +-----+
 
 Retrieval functions:
-+-----+     /- opp_get_voltage
++-----+     /- dev_pm_opp_get_voltage
 | opp | <---
-+-----+     \- opp_get_freq
++-----+     \- dev_pm_opp_get_freq
 
-domain_info <- opp_get_opp_count
+domain_info <- dev_pm_opp_get_opp_count
diff --git a/Documentation/power/pm_qos_interface.txt b/Documentation/power/pm_qos_interface.txt
index bfed898a03f..a5da5c7e712 100644
--- a/Documentation/power/pm_qos_interface.txt
+++ b/Documentation/power/pm_qos_interface.txt
@@ -4,14 +4,19 @@ This interface provides a kernel and user mode interface for registering
 performance expectations by drivers, subsystems and user space applications on
 one of the parameters.
 
-Currently we have {cpu_dma_latency, network_latency, network_throughput} as the
-initial set of pm_qos parameters.
+Two different PM QoS frameworks are available:
+1. PM QoS classes for cpu_dma_latency, network_latency, network_throughput.
+2. the per-device PM QoS framework provides the API to manage the per-device latency
+constraints and PM QoS flags.
 
 Each parameters have defined units:
  * latency: usec
  * timeout: usec
  * throughput: kbs (kilo bit / sec)
 
+
+1. PM QoS framework
+
 The infrastructure exposes multiple misc device nodes one per implemented
 parameter.  The set of parameters implement is defined by pm_qos_power_init()
 and pm_qos_params.h.  This is done because having the available parameters
@@ -23,14 +28,18 @@ an aggregated target value.  The aggregated target value is updated with
 changes to the request list or elements of the list.  Typically the
 aggregated target value is simply the max or min of the request values held
 in the parameter list elements.
+Note: the aggregated target value is implemented as an atomic variable so that
+reading the aggregated value does not require any locking mechanism.
+
 
 From kernel mode the use of this interface is simple:
 
-handle = pm_qos_add_request(param_class, target_value):
-Will insert an element into the list for that identified PM_QOS class with the
+void pm_qos_add_request(handle, param_class, target_value):
+Will insert an element into the list for that identified PM QoS class with the
 target value.  Upon change to this list the new target is recomputed and any
 registered notifiers are called only if the target value is now different.
-Clients of pm_qos need to save the returned handle.
+Clients of pm_qos need to save the returned handle for future use in other
+pm_qos API functions.
 
 void pm_qos_update_request(handle, new_target_value):
 Will update the list element pointed to by the handle with the new target value
@@ -42,6 +51,20 @@ Will remove the element.  After removal it will update the aggregate target and
 call the notification tree if the target was changed as a result of removing
 the request.
 
+int pm_qos_request(param_class):
+Returns the aggregated value for a given PM QoS class.
+
+int pm_qos_request_active(handle):
+Returns if the request is still active, i.e. it has not been removed from a
+PM QoS class constraints list.
+
+int pm_qos_add_notifier(param_class, notifier):
+Adds a notification callback function to the PM QoS class. The callback is
+called when the aggregated value for the PM QoS class is changed.
+
+int pm_qos_remove_notifier(int param_class, notifier):
+Removes the notification callback function for the PM QoS class.
+
 
 From user mode:
 Only processes can register a pm_qos request.  To provide for automatic
@@ -63,4 +86,137 @@ To remove the user mode request for a target value simply close the device
 node.
 
 
+2. PM QoS per-device latency and flags framework
+
+For each device, there are three lists of PM QoS requests. Two of them are
+maintained along with the aggregated targets of resume latency and active
+state latency tolerance (in microseconds) and the third one is for PM QoS flags.
+Values are updated in response to changes of the request list.
+
+The target values of resume latency and active state latency tolerance are
+simply the minimum of the request values held in the parameter list elements.
+The PM QoS flags aggregate value is a gather (bitwise OR) of all list elements'
+values.  Two device PM QoS flags are defined currently: PM_QOS_FLAG_NO_POWER_OFF
+and PM_QOS_FLAG_REMOTE_WAKEUP.
+
+Note: The aggregated target values are implemented in such a way that reading
+the aggregated value does not require any locking mechanism.
+
+
+From kernel mode the use of this interface is the following:
+
+int dev_pm_qos_add_request(device, handle, type, value):
+Will insert an element into the list for that identified device with the
+target value.  Upon change to this list the new target is recomputed and any
+registered notifiers are called only if the target value is now different.
+Clients of dev_pm_qos need to save the handle for future use in other
+dev_pm_qos API functions.
+
+int dev_pm_qos_update_request(handle, new_value):
+Will update the list element pointed to by the handle with the new target value
+and recompute the new aggregated target, calling the notification trees if the
+target is changed.
+
+int dev_pm_qos_remove_request(handle):
+Will remove the element.  After removal it will update the aggregate target and
+call the notification trees if the target was changed as a result of removing
+the request.
 
+s32 dev_pm_qos_read_value(device):
+Returns the aggregated value for a given device's constraints list.
+
+enum pm_qos_flags_status dev_pm_qos_flags(device, mask)
+Check PM QoS flags of the given device against the given mask of flags.
+The meaning of the return values is as follows:
+	PM_QOS_FLAGS_ALL: All flags from the mask are set
+	PM_QOS_FLAGS_SOME: Some flags from the mask are set
+	PM_QOS_FLAGS_NONE: No flags from the mask are set
+	PM_QOS_FLAGS_UNDEFINED: The device's PM QoS structure has not been
+			initialized or the list of requests is empty.
+
+int dev_pm_qos_add_ancestor_request(dev, handle, type, value)
+Add a PM QoS request for the first direct ancestor of the given device whose
+power.ignore_children flag is unset (for DEV_PM_QOS_RESUME_LATENCY requests)
+or whose power.set_latency_tolerance callback pointer is not NULL (for
+DEV_PM_QOS_LATENCY_TOLERANCE requests).
+
+int dev_pm_qos_expose_latency_limit(device, value)
+Add a request to the device's PM QoS list of resume latency constraints and
+create a sysfs attribute pm_qos_resume_latency_us under the device's power
+directory allowing user space to manipulate that request.
+
+void dev_pm_qos_hide_latency_limit(device)
+Drop the request added by dev_pm_qos_expose_latency_limit() from the device's
+PM QoS list of resume latency constraints and remove sysfs attribute
+pm_qos_resume_latency_us from the device's power directory.
+
+int dev_pm_qos_expose_flags(device, value)
+Add a request to the device's PM QoS list of flags and create sysfs attributes
+pm_qos_no_power_off and pm_qos_remote_wakeup under the device's power directory
+allowing user space to change these flags' value.
+
+void dev_pm_qos_hide_flags(device)
+Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS list
+of flags and remove sysfs attributes pm_qos_no_power_off and pm_qos_remote_wakeup
+under the device's power directory.
+
+Notification mechanisms:
+The per-device PM QoS framework has 2 different and distinct notification trees:
+a per-device notification tree and a global notification tree.
+
+int dev_pm_qos_add_notifier(device, notifier):
+Adds a notification callback function for the device.
+The callback is called when the aggregated value of the device constraints list
+is changed (for resume latency device PM QoS only).
+
+int dev_pm_qos_remove_notifier(device, notifier):
+Removes the notification callback function for the device.
+
+int dev_pm_qos_add_global_notifier(notifier):
+Adds a notification callback function in the global notification tree of the
+framework.
+The callback is called when the aggregated value for any device is changed
+(for resume latency device PM QoS only).
+
+int dev_pm_qos_remove_global_notifier(notifier):
+Removes the notification callback function from the global notification tree
+of the framework.
+
+
+Active state latency tolerance
+
+This device PM QoS type is used to support systems in which hardware may switch
+to energy-saving operation modes on the fly.  In those systems, if the operation
+mode chosen by the hardware attempts to save energy in an overly aggressive way,
+it may cause excess latencies to be visible to software, causing it to miss
+certain protocol requirements or target frame or sample rates etc.
+
+If there is a latency tolerance control mechanism for a given device available
+to software, the .set_latency_tolerance callback in that device's dev_pm_info
+structure should be populated.  The routine pointed to by it is should implement
+whatever is necessary to transfer the effective requirement value to the
+hardware.
+
+Whenever the effective latency tolerance changes for the device, its
+.set_latency_tolerance() callback will be executed and the effective value will
+be passed to it.  If that value is negative, which means that the list of
+latency tolerance requirements for the device is empty, the callback is expected
+to switch the underlying hardware latency tolerance control mechanism to an
+autonomous mode if available.  If that value is PM_QOS_LATENCY_ANY, in turn, and
+the hardware supports a special "no requirement" setting, the callback is
+expected to use it.  That allows software to prevent the hardware from
+automatically updating the device's latency tolerance in response to its power
+state changes (e.g. during transitions from D3cold to D0), which generally may
+be done in the autonomous latency tolerance control mode.
+
+If .set_latency_tolerance() is present for the device, sysfs attribute
+pm_qos_latency_tolerance_us will be present in the devivce's power directory.
+Then, user space can use that attribute to specify its latency tolerance
+requirement for the device, if any.  Writing "any" to it means "no requirement,
+but do not let the hardware control latency tolerance" and writing "auto" to it
+allows the hardware to be switched to the autonomous mode if there are no other
+requirements from the kernel side in the device's list.
+
+Kernel code can use the functions described above along with the
+DEV_PM_QOS_LATENCY_TOLERANCE device PM QoS type to add, remove and update
+latency tolerance requirements for devices.
diff --git a/Documentation/power/power_supply_class.txt b/Documentation/power/power_supply_class.txt
index 9f16c5178b6..89a8816990f 100644
--- a/Documentation/power/power_supply_class.txt
+++ b/Documentation/power/power_supply_class.txt
@@ -81,9 +81,14 @@ This defines trickle and fast charges.  For batteries that
 are already charged or discharging, 'n/a' can be displayed (or
 'unknown', if the status is not known).
 
+AUTHENTIC - indicates the power supply (battery or charger) connected
+to the platform is authentic(1) or non authentic(0).
+
 HEALTH - represents health of the battery, values corresponds to
 POWER_SUPPLY_HEALTH_*, defined in battery.h.
 
+VOLTAGE_OCV - open circuit voltage of the battery.
+
 VOLTAGE_MAX_DESIGN, VOLTAGE_MIN_DESIGN - design values for maximal and
 minimal power supply voltages. Maximal/minimal means values of voltages
 when battery considered "full"/"empty" at normal conditions. Yes, there is
@@ -110,14 +115,31 @@ CHARGE_COUNTER - the current charge counter (in µAh).  This could easily
 be negative; there is no empty or full value.  It is only useful for
 relative, time-based measurements.
 
+CONSTANT_CHARGE_CURRENT - constant charge current programmed by charger.
+CONSTANT_CHARGE_CURRENT_MAX - maximum charge current supported by the
+power supply object.
+
+CONSTANT_CHARGE_VOLTAGE - constant charge voltage programmed by charger.
+CONSTANT_CHARGE_VOLTAGE_MAX - maximum charge voltage supported by the
+power supply object.
+
+CHARGE_CONTROL_LIMIT - current charge control limit setting
+CHARGE_CONTROL_LIMIT_MAX - maximum charge control limit setting
+
 ENERGY_FULL, ENERGY_EMPTY - same as above but for energy.
 
 CAPACITY - capacity in percents.
+CAPACITY_ALERT_MIN - minimum capacity alert value in percents.
+CAPACITY_ALERT_MAX - maximum capacity alert value in percents.
 CAPACITY_LEVEL - capacity level. This corresponds to
 POWER_SUPPLY_CAPACITY_LEVEL_*.
 
 TEMP - temperature of the power supply.
+TEMP_ALERT_MIN - minimum battery temperature alert.
+TEMP_ALERT_MAX - maximum battery temperature alert.
 TEMP_AMBIENT - ambient temperature.
+TEMP_AMBIENT_ALERT_MIN - minimum ambient temperature alert.
+TEMP_AMBIENT_ALERT_MAX - maximum ambient temperature alert.
 
 TIME_TO_EMPTY - seconds left for battery to be considered empty (i.e.
 while battery powers a load)
diff --git a/Documentation/power/powercap/powercap.txt b/Documentation/power/powercap/powercap.txt
new file mode 100644
index 00000000000..1e6ef164e07
--- /dev/null
+++ b/Documentation/power/powercap/powercap.txt
@@ -0,0 +1,236 @@
+Power Capping Framework
+==================================
+
+The power capping framework provides a consistent interface between the kernel
+and the user space that allows power capping drivers to expose the settings to
+user space in a uniform way.
+
+Terminology
+=========================
+The framework exposes power capping devices to user space via sysfs in the
+form of a tree of objects. The objects at the root level of the tree represent
+'control types', which correspond to different methods of power capping.  For
+example, the intel-rapl control type represents the Intel "Running Average
+Power Limit" (RAPL) technology, whereas the 'idle-injection' control type
+corresponds to the use of idle injection for controlling power.
+
+Power zones represent different parts of the system, which can be controlled and
+monitored using the power capping method determined by the control type the
+given zone belongs to. They each contain attributes for monitoring power, as
+well as controls represented in the form of power constraints.  If the parts of
+the system represented by different power zones are hierarchical (that is, one
+bigger part consists of multiple smaller parts that each have their own power
+controls), those power zones may also be organized in a hierarchy with one
+parent power zone containing multiple subzones and so on to reflect the power
+control topology of the system.  In that case, it is possible to apply power
+capping to a set of devices together using the parent power zone and if more
+fine grained control is required, it can be applied through the subzones.
+
+
+Example sysfs interface tree:
+
+/sys/devices/virtual/powercap
+??? intel-rapl
+    ??? intel-rapl:0
+    ?   ??? constraint_0_name
+    ?   ??? constraint_0_power_limit_uw
+    ?   ??? constraint_0_time_window_us
+    ?   ??? constraint_1_name
+    ?   ??? constraint_1_power_limit_uw
+    ?   ??? constraint_1_time_window_us
+    ?   ??? device -> ../../intel-rapl
+    ?   ??? energy_uj
+    ?   ??? intel-rapl:0:0
+    ?   ?   ??? constraint_0_name
+    ?   ?   ??? constraint_0_power_limit_uw
+    ?   ?   ??? constraint_0_time_window_us
+    ?   ?   ??? constraint_1_name
+    ?   ?   ??? constraint_1_power_limit_uw
+    ?   ?   ??? constraint_1_time_window_us
+    ?   ?   ??? device -> ../../intel-rapl:0
+    ?   ?   ??? energy_uj
+    ?   ?   ??? max_energy_range_uj
+    ?   ?   ??? name
+    ?   ?   ??? enabled
+    ?   ?   ??? power
+    ?   ?   ?   ??? async
+    ?   ?   ?   []
+    ?   ?   ??? subsystem -> ../../../../../../class/power_cap
+    ?   ?   ??? uevent
+    ?   ??? intel-rapl:0:1
+    ?   ?   ??? constraint_0_name
+    ?   ?   ??? constraint_0_power_limit_uw
+    ?   ?   ??? constraint_0_time_window_us
+    ?   ?   ??? constraint_1_name
+    ?   ?   ??? constraint_1_power_limit_uw
+    ?   ?   ??? constraint_1_time_window_us
+    ?   ?   ??? device -> ../../intel-rapl:0
+    ?   ?   ??? energy_uj
+    ?   ?   ??? max_energy_range_uj
+    ?   ?   ??? name
+    ?   ?   ??? enabled
+    ?   ?   ??? power
+    ?   ?   ?   ??? async
+    ?   ?   ?   []
+    ?   ?   ??? subsystem -> ../../../../../../class/power_cap
+    ?   ?   ??? uevent
+    ?   ??? max_energy_range_uj
+    ?   ??? max_power_range_uw
+    ?   ??? name
+    ?   ??? enabled
+    ?   ??? power
+    ?   ?   ??? async
+    ?   ?   []
+    ?   ??? subsystem -> ../../../../../class/power_cap
+    ?   ??? enabled
+    ?   ??? uevent
+    ??? intel-rapl:1
+    ?   ??? constraint_0_name
+    ?   ??? constraint_0_power_limit_uw
+    ?   ??? constraint_0_time_window_us
+    ?   ??? constraint_1_name
+    ?   ??? constraint_1_power_limit_uw
+    ?   ??? constraint_1_time_window_us
+    ?   ??? device -> ../../intel-rapl
+    ?   ??? energy_uj
+    ?   ??? intel-rapl:1:0
+    ?   ?   ??? constraint_0_name
+    ?   ?   ??? constraint_0_power_limit_uw
+    ?   ?   ??? constraint_0_time_window_us
+    ?   ?   ??? constraint_1_name
+    ?   ?   ??? constraint_1_power_limit_uw
+    ?   ?   ??? constraint_1_time_window_us
+    ?   ?   ??? device -> ../../intel-rapl:1
+    ?   ?   ??? energy_uj
+    ?   ?   ??? max_energy_range_uj
+    ?   ?   ??? name
+    ?   ?   ??? enabled
+    ?   ?   ??? power
+    ?   ?   ?   ??? async
+    ?   ?   ?   []
+    ?   ?   ??? subsystem -> ../../../../../../class/power_cap
+    ?   ?   ??? uevent
+    ?   ??? intel-rapl:1:1
+    ?   ?   ??? constraint_0_name
+    ?   ?   ??? constraint_0_power_limit_uw
+    ?   ?   ??? constraint_0_time_window_us
+    ?   ?   ??? constraint_1_name
+    ?   ?   ??? constraint_1_power_limit_uw
+    ?   ?   ??? constraint_1_time_window_us
+    ?   ?   ??? device -> ../../intel-rapl:1
+    ?   ?   ??? energy_uj
+    ?   ?   ??? max_energy_range_uj
+    ?   ?   ??? name
+    ?   ?   ??? enabled
+    ?   ?   ??? power
+    ?   ?   ?   ??? async
+    ?   ?   ?   []
+    ?   ?   ??? subsystem -> ../../../../../../class/power_cap
+    ?   ?   ??? uevent
+    ?   ??? max_energy_range_uj
+    ?   ??? max_power_range_uw
+    ?   ??? name
+    ?   ??? enabled
+    ?   ??? power
+    ?   ?   ??? async
+    ?   ?   []
+    ?   ??? subsystem -> ../../../../../class/power_cap
+    ?   ??? uevent
+    ??? power
+    ?   ??? async
+    ?   []
+    ??? subsystem -> ../../../../class/power_cap
+    ??? enabled
+    ??? uevent
+
+The above example illustrates a case in which the Intel RAPL technology,
+available in Intel® IA-64 and IA-32 Processor Architectures, is used. There is one
+control type called intel-rapl which contains two power zones, intel-rapl:0 and
+intel-rapl:1, representing CPU packages.  Each of these power zones contains
+two subzones, intel-rapl:j:0 and intel-rapl:j:1 (j = 0, 1), representing the
+"core" and the "uncore" parts of the given CPU package, respectively.  All of
+the zones and subzones contain energy monitoring attributes (energy_uj,
+max_energy_range_uj) and constraint attributes (constraint_*) allowing controls
+to be applied (the constraints in the 'package' power zones apply to the whole
+CPU packages and the subzone constraints only apply to the respective parts of
+the given package individually). Since Intel RAPL doesn't provide instantaneous
+power value, there is no power_uw attribute.
+
+In addition to that, each power zone contains a name attribute, allowing the
+part of the system represented by that zone to be identified.
+For example:
+
+cat /sys/class/power_cap/intel-rapl/intel-rapl:0/name
+package-0
+
+The Intel RAPL technology allows two constraints, short term and long term,
+with two different time windows to be applied to each power zone.  Thus for
+each zone there are 2 attributes representing the constraint names, 2 power
+limits and 2 attributes representing the sizes of the time windows. Such that,
+constraint_j_* attributes correspond to the jth constraint (j = 0,1).
+
+For example:
+	constraint_0_name
+	constraint_0_power_limit_uw
+	constraint_0_time_window_us
+	constraint_1_name
+	constraint_1_power_limit_uw
+	constraint_1_time_window_us
+
+Power Zone Attributes
+=================================
+Monitoring attributes
+----------------------
+
+energy_uj (rw): Current energy counter in micro joules. Write "0" to reset.
+If the counter can not be reset, then this attribute is read only.
+
+max_energy_range_uj (ro): Range of the above energy counter in micro-joules.
+
+power_uw (ro): Current power in micro watts.
+
+max_power_range_uw (ro): Range of the above power value in micro-watts.
+
+name (ro): Name of this power zone.
+
+It is possible that some domains have both power ranges and energy counter ranges;
+however, only one is mandatory.
+
+Constraints
+----------------
+constraint_X_power_limit_uw (rw): Power limit in micro watts, which should be
+applicable for the time window specified by "constraint_X_time_window_us".
+
+constraint_X_time_window_us (rw): Time window in micro seconds.
+
+constraint_X_name (ro): An optional name of the constraint
+
+constraint_X_max_power_uw(ro): Maximum allowed power in micro watts.
+
+constraint_X_min_power_uw(ro): Minimum allowed power in micro watts.
+
+constraint_X_max_time_window_us(ro): Maximum allowed time window in micro seconds.
+
+constraint_X_min_time_window_us(ro): Minimum allowed time window in micro seconds.
+
+Except power_limit_uw and time_window_us other fields are optional.
+
+Common zone and control type attributes
+----------------------------------------
+enabled (rw): Enable/Disable controls at zone level or for all zones using
+a control type.
+
+Power Cap Client Driver Interface
+==================================
+The API summary:
+
+Call powercap_register_control_type() to register control type object.
+Call powercap_register_zone() to register a power zone (under a given
+control type), either as a top-level power zone or as a subzone of another
+power zone registered earlier.
+The number of constraints in a power zone and the corresponding callbacks have
+to be defined prior to calling powercap_register_zone() to register that zone.
+
+To Free a power zone call powercap_unregister_zone().
+To free a control type object call powercap_unregister_control_type().
+Detailed API can be generated using kernel-doc on include/linux/powercap.h.
diff --git a/Documentation/power/regulator/machine.txt b/Documentation/power/regulator/machine.txt
index bdec39b9bd7..ce63af0a8e3 100644
--- a/Documentation/power/regulator/machine.txt
+++ b/Documentation/power/regulator/machine.txt
@@ -16,7 +16,7 @@ initialisation code by creating a struct regulator_consumer_supply for
 each regulator.
 
 struct regulator_consumer_supply {
-	struct device *dev;	/* consumer */
+	const char *dev_name;	/* consumer dev_name() */
 	const char *supply;	/* consumer supply - e.g. "vcc" */
 };
 
@@ -24,13 +24,13 @@ e.g. for the machine above
 
 static struct regulator_consumer_supply regulator1_consumers[] = {
 {
-	.dev	= &platform_consumerB_device.dev,
-	.supply	= "Vcc",
+	.dev_name	= "dev_name(consumer B)",
+	.supply		= "Vcc",
 },};
 
 static struct regulator_consumer_supply regulator2_consumers[] = {
 {
-	.dev	= &platform_consumerA_device.dev,
+	.dev	= "dev_name(consumer A"),
 	.supply	= "Vcc",
 },};
 
@@ -43,6 +43,7 @@ to their supply regulator :-
 
 static struct regulator_init_data regulator1_data = {
 	.constraints = {
+		.name = "Regulator-1",
 		.min_uV = 3300000,
 		.max_uV = 3300000,
 		.valid_modes_mask = REGULATOR_MODE_NORMAL,
@@ -51,13 +52,19 @@ static struct regulator_init_data regulator1_data = {
 	.consumer_supplies = regulator1_consumers,
 };
 
+The name field should be set to something that is usefully descriptive
+for the board for configuration of supplies for other regulators and
+for use in logging and other diagnostic output.  Normally the name
+used for the supply rail in the schematic is a good choice.  If no
+name is provided then the subsystem will choose one.
+
 Regulator-1 supplies power to Regulator-2. This relationship must be registered
 with the core so that Regulator-1 is also enabled when Consumer A enables its
-supply (Regulator-2). The supply regulator is set by the supply_regulator_dev
-field below:-
+supply (Regulator-2). The supply regulator is set by the supply_regulator
+field below and co:-
 
 static struct regulator_init_data regulator2_data = {
-	.supply_regulator_dev = &platform_regulator1_device.dev,
+	.supply_regulator = "Regulator-1",
 	.constraints = {
 		.min_uV = 1800000,
 		.max_uV = 2000000,
diff --git a/Documentation/power/regulator/regulator.txt b/Documentation/power/regulator/regulator.txt
index 3f8b528f237..13902778ae4 100644
--- a/Documentation/power/regulator/regulator.txt
+++ b/Documentation/power/regulator/regulator.txt
@@ -11,8 +11,7 @@ Registration
 Drivers can register a regulator by calling :-
 
 struct regulator_dev *regulator_register(struct regulator_desc *regulator_desc,
-	struct device *dev, struct regulator_init_data *init_data,
-	void *driver_data);
+					 const struct regulator_config *config);
 
 This will register the regulators capabilities and operations to the regulator
 core.
diff --git a/Documentation/power/runtime_pm.txt b/Documentation/power/runtime_pm.txt
index 489e9bacd16..f32ce541957 100644
--- a/Documentation/power/runtime_pm.txt
+++ b/Documentation/power/runtime_pm.txt
@@ -1,39 +1,40 @@
-Run-time Power Management Framework for I/O Devices
+Runtime Power Management Framework for I/O Devices
 
-(C) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+(C) 2009-2011 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
 (C) 2010 Alan Stern <stern@rowland.harvard.edu>
+(C) 2014 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
 
 1. Introduction
 
-Support for run-time power management (run-time PM) of I/O devices is provided
+Support for runtime power management (runtime PM) of I/O devices is provided
 at the power management core (PM core) level by means of:
 
 * The power management workqueue pm_wq in which bus types and device drivers can
   put their PM-related work items.  It is strongly recommended that pm_wq be
-  used for queuing all work items related to run-time PM, because this allows
+  used for queuing all work items related to runtime PM, because this allows
   them to be synchronized with system-wide power transitions (suspend to RAM,
   hibernation and resume from system sleep states).  pm_wq is declared in
   include/linux/pm_runtime.h and defined in kernel/power/main.c.
 
-* A number of run-time PM fields in the 'power' member of 'struct device' (which
+* A number of runtime PM fields in the 'power' member of 'struct device' (which
   is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
-  be used for synchronizing run-time PM operations with one another.
+  be used for synchronizing runtime PM operations with one another.
 
-* Three device run-time PM callbacks in 'struct dev_pm_ops' (defined in
+* Three device runtime PM callbacks in 'struct dev_pm_ops' (defined in
   include/linux/pm.h).
 
 * A set of helper functions defined in drivers/base/power/runtime.c that can be
-  used for carrying out run-time PM operations in such a way that the
+  used for carrying out runtime PM operations in such a way that the
   synchronization between them is taken care of by the PM core.  Bus types and
   device drivers are encouraged to use these functions.
 
-The run-time PM callbacks present in 'struct dev_pm_ops', the device run-time PM
+The runtime PM callbacks present in 'struct dev_pm_ops', the device runtime PM
 fields of 'struct dev_pm_info' and the core helper functions provided for
-run-time PM are described below.
+runtime PM are described below.
 
-2. Device Run-time PM Callbacks
+2. Device Runtime PM Callbacks
 
-There are three device run-time PM callbacks defined in 'struct dev_pm_ops':
+There are three device runtime PM callbacks defined in 'struct dev_pm_ops':
 
 struct dev_pm_ops {
 	...
@@ -43,85 +44,119 @@ struct dev_pm_ops {
 	...
 };
 
-The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks are
-executed by the PM core for either the bus type, or device type (if the bus
-type's callback is not defined), or device class (if the bus type's and device
-type's callbacks are not defined) of given device.  The bus type, device type
-and device class callbacks are referred to as subsystem-level callbacks in what
-follows.
+The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks
+are executed by the PM core for the device's subsystem that may be either of
+the following:
 
-The subsystem-level suspend callback is _entirely_ _responsible_ for handling
-the suspend of the device as appropriate, which may, but need not include
-executing the device driver's own ->runtime_suspend() callback (from the
+  1. PM domain of the device, if the device's PM domain object, dev->pm_domain,
+     is present.
+
+  2. Device type of the device, if both dev->type and dev->type->pm are present.
+
+  3. Device class of the device, if both dev->class and dev->class->pm are
+     present.
+
+  4. Bus type of the device, if both dev->bus and dev->bus->pm are present.
+
+If the subsystem chosen by applying the above rules doesn't provide the relevant
+callback, the PM core will invoke the corresponding driver callback stored in
+dev->driver->pm directly (if present).
+
+The PM core always checks which callback to use in the order given above, so the
+priority order of callbacks from high to low is: PM domain, device type, class
+and bus type.  Moreover, the high-priority one will always take precedence over
+a low-priority one.  The PM domain, bus type, device type and class callbacks
+are referred to as subsystem-level callbacks in what follows.
+
+By default, the callbacks are always invoked in process context with interrupts
+enabled.  However, the pm_runtime_irq_safe() helper function can be used to tell
+the PM core that it is safe to run the ->runtime_suspend(), ->runtime_resume()
+and ->runtime_idle() callbacks for the given device in atomic context with
+interrupts disabled.  This implies that the callback routines in question must
+not block or sleep, but it also means that the synchronous helper functions
+listed at the end of Section 4 may be used for that device within an interrupt
+handler or generally in an atomic context.
+
+The subsystem-level suspend callback, if present, is _entirely_ _responsible_
+for handling the suspend of the device as appropriate, which may, but need not
+include executing the device driver's own ->runtime_suspend() callback (from the
 PM core's point of view it is not necessary to implement a ->runtime_suspend()
 callback in a device driver as long as the subsystem-level suspend callback
 knows what to do to handle the device).
 
-  * Once the subsystem-level suspend callback has completed successfully
-    for given device, the PM core regards the device as suspended, which need
-    not mean that the device has been put into a low power state.  It is
-    supposed to mean, however, that the device will not process data and will
-    not communicate with the CPU(s) and RAM until the subsystem-level resume
-    callback is executed for it.  The run-time PM status of a device after
-    successful execution of the subsystem-level suspend callback is 'suspended'.
-
-  * If the subsystem-level suspend callback returns -EBUSY or -EAGAIN,
-    the device's run-time PM status is 'active', which means that the device
-    _must_ be fully operational afterwards.
-
-  * If the subsystem-level suspend callback returns an error code different
-    from -EBUSY or -EAGAIN, the PM core regards this as a fatal error and will
-    refuse to run the helper functions described in Section 4 for the device,
-    until the status of it is directly set either to 'active', or to 'suspended'
-    (the PM core provides special helper functions for this purpose).
-
-In particular, if the driver requires remote wake-up capability (i.e. hardware
+  * Once the subsystem-level suspend callback (or the driver suspend callback,
+    if invoked directly) has completed successfully for the given device, the PM
+    core regards the device as suspended, which need not mean that it has been
+    put into a low power state.  It is supposed to mean, however, that the
+    device will not process data and will not communicate with the CPU(s) and
+    RAM until the appropriate resume callback is executed for it.  The runtime
+    PM status of a device after successful execution of the suspend callback is
+    'suspended'.
+
+  * If the suspend callback returns -EBUSY or -EAGAIN, the device's runtime PM
+    status remains 'active', which means that the device _must_ be fully
+    operational afterwards.
+
+  * If the suspend callback returns an error code different from -EBUSY and
+    -EAGAIN, the PM core regards this as a fatal error and will refuse to run
+    the helper functions described in Section 4 for the device until its status
+    is directly set to  either'active', or 'suspended' (the PM core provides
+    special helper functions for this purpose).
+
+In particular, if the driver requires remote wakeup capability (i.e. hardware
 mechanism allowing the device to request a change of its power state, such as
 PCI PME) for proper functioning and device_run_wake() returns 'false' for the
 device, then ->runtime_suspend() should return -EBUSY.  On the other hand, if
-device_run_wake() returns 'true' for the device and the device is put into a low
-power state during the execution of the subsystem-level suspend callback, it is
-expected that remote wake-up will be enabled for the device.  Generally, remote
-wake-up should be enabled for all input devices put into a low power state at
-run time.
-
-The subsystem-level resume callback is _entirely_ _responsible_ for handling the
-resume of the device as appropriate, which may, but need not include executing
-the device driver's own ->runtime_resume() callback (from the PM core's point of
-view it is not necessary to implement a ->runtime_resume() callback in a device
-driver as long as the subsystem-level resume callback knows what to do to handle
-the device).
-
-  * Once the subsystem-level resume callback has completed successfully, the PM
-    core regards the device as fully operational, which means that the device
-    _must_ be able to complete I/O operations as needed.  The run-time PM status
-    of the device is then 'active'.
-
-  * If the subsystem-level resume callback returns an error code, the PM core
-    regards this as a fatal error and will refuse to run the helper functions
-    described in Section 4 for the device, until its status is directly set
-    either to 'active' or to 'suspended' (the PM core provides special helper
-    functions for this purpose).
-
-The subsystem-level idle callback is executed by the PM core whenever the device
-appears to be idle, which is indicated to the PM core by two counters, the
-device's usage counter and the counter of 'active' children of the device.
+device_run_wake() returns 'true' for the device and the device is put into a
+low-power state during the execution of the suspend callback, it is expected
+that remote wakeup will be enabled for the device.  Generally, remote wakeup
+should be enabled for all input devices put into low-power states at run time.
+
+The subsystem-level resume callback, if present, is _entirely_ _responsible_ for
+handling the resume of the device as appropriate, which may, but need not
+include executing the device driver's own ->runtime_resume() callback (from the
+PM core's point of view it is not necessary to implement a ->runtime_resume()
+callback in a device driver as long as the subsystem-level resume callback knows
+what to do to handle the device).
+
+  * Once the subsystem-level resume callback (or the driver resume callback, if
+    invoked directly) has completed successfully, the PM core regards the device
+    as fully operational, which means that the device _must_ be able to complete
+    I/O operations as needed.  The runtime PM status of the device is then
+    'active'.
+
+  * If the resume callback returns an error code, the PM core regards this as a
+    fatal error and will refuse to run the helper functions described in Section
+    4 for the device, until its status is directly set to either 'active', or
+    'suspended' (by means of special helper functions provided by the PM core
+    for this purpose).
+
+The idle callback (a subsystem-level one, if present, or the driver one) is
+executed by the PM core whenever the device appears to be idle, which is
+indicated to the PM core by two counters, the device's usage counter and the
+counter of 'active' children of the device.
 
   * If any of these counters is decreased using a helper function provided by
     the PM core and it turns out to be equal to zero, the other counter is
     checked.  If that counter also is equal to zero, the PM core executes the
-    subsystem-level idle callback with the device as an argument.
+    idle callback with the device as its argument.
 
-The action performed by a subsystem-level idle callback is totally dependent on
-the subsystem in question, but the expected and recommended action is to check
+The action performed by the idle callback is totally dependent on the subsystem
+(or driver) in question, but the expected and recommended action is to check
 if the device can be suspended (i.e. if all of the conditions necessary for
 suspending the device are satisfied) and to queue up a suspend request for the
-device in that case.  The value returned by this callback is ignored by the PM
-core.
+device in that case.  If there is no idle callback, or if the callback returns
+0, then the PM core will attempt to carry out a runtime suspend of the device,
+also respecting devices configured for autosuspend.  In essence this means a
+call to pm_runtime_autosuspend() (do note that drivers needs to update the
+device last busy mark, pm_runtime_mark_last_busy(), to control the delay under
+this circumstance).  To prevent this (for example, if the callback routine has
+started a delayed suspend), the routine must return a non-zero value.  Negative
+error return codes are ignored by the PM core.
 
 The helper functions provided by the PM core, described in Section 4, guarantee
-that the following constraints are met with respect to the bus type's run-time
-PM callbacks:
+that the following constraints are met with respect to runtime PM callbacks for
+one device:
 
 (1) The callbacks are mutually exclusive (e.g. it is forbidden to execute
     ->runtime_suspend() in parallel with ->runtime_resume() or with another
@@ -132,7 +167,7 @@ PM callbacks:
 
 (2) ->runtime_idle() and ->runtime_suspend() can only be executed for 'active'
     devices (i.e. the PM core will only execute ->runtime_idle() or
-    ->runtime_suspend() for the devices the run-time PM status of which is
+    ->runtime_suspend() for the devices the runtime PM status of which is
     'active').
 
 (3) ->runtime_idle() and ->runtime_suspend() can only be executed for a device
@@ -141,7 +176,7 @@ PM callbacks:
     flag of which is set.
 
 (4) ->runtime_resume() can only be executed for 'suspended' devices  (i.e. the
-    PM core will only execute ->runtime_resume() for the devices the run-time
+    PM core will only execute ->runtime_resume() for the devices the runtime
     PM status of which is 'suspended').
 
 Additionally, the helper functions provided by the PM core obey the following
@@ -161,9 +196,9 @@ rules:
     scheduled requests to execute the other callbacks for the same device,
     except for scheduled autosuspends.
 
-3. Run-time PM Device Fields
+3. Runtime PM Device Fields
 
-The following device run-time PM fields are present in 'struct dev_pm_info', as
+The following device runtime PM fields are present in 'struct dev_pm_info', as
 defined in include/linux/pm.h:
 
   struct timer_list suspend_timer;
@@ -195,10 +230,10 @@ defined in include/linux/pm.h:
 
   unsigned int disable_depth;
     - used for disabling the helper funcions (they work normally if this is
-      equal to zero); the initial value of it is 1 (i.e. run-time PM is
+      equal to zero); the initial value of it is 1 (i.e. runtime PM is
       initially disabled for all devices)
 
-  unsigned int runtime_error;
+  int runtime_error;
     - if set, there was a fatal error (one of the callbacks returned error code
       as described in Section 2), so the helper funtions will not work until
       this flag is cleared; this is the error code returned by the failing
@@ -219,10 +254,10 @@ defined in include/linux/pm.h:
       suspend to complete; means "start a resume as soon as you've suspended"
 
   unsigned int run_wake;
-    - set if the device is capable of generating run-time wake-up events
+    - set if the device is capable of generating runtime wake-up events
 
   enum rpm_status runtime_status;
-    - the run-time PM status of the device; this field's initial value is
+    - the runtime PM status of the device; this field's initial value is
       RPM_SUSPENDED, which means that each device is initially regarded by the
       PM core as 'suspended', regardless of its real hardware status
 
@@ -233,10 +268,14 @@ defined in include/linux/pm.h:
       and pm_runtime_forbid() helper functions
 
   unsigned int no_callbacks;
-    - indicates that the device does not use the run-time PM callbacks (see
+    - indicates that the device does not use the runtime PM callbacks (see
       Section 8); it may be modified only by the pm_runtime_no_callbacks()
       helper function
 
+  unsigned int irq_safe;
+    - indicates that the ->runtime_suspend() and ->runtime_resume() callbacks
+      will be invoked with the spinlock held and interrupts disabled
+
   unsigned int use_autosuspend;
     - indicates that the device's driver supports delayed autosuspend (see
       Section 9); it may be modified only by the
@@ -256,28 +295,30 @@ defined in include/linux/pm.h:
 
 All of the above fields are members of the 'power' member of 'struct device'.
 
-4. Run-time PM Device Helper Functions
+4. Runtime PM Device Helper Functions
 
-The following run-time PM helper functions are defined in
+The following runtime PM helper functions are defined in
 drivers/base/power/runtime.c and include/linux/pm_runtime.h:
 
   void pm_runtime_init(struct device *dev);
-    - initialize the device run-time PM fields in 'struct dev_pm_info'
+    - initialize the device runtime PM fields in 'struct dev_pm_info'
 
   void pm_runtime_remove(struct device *dev);
-    - make sure that the run-time PM of the device will be disabled after
+    - make sure that the runtime PM of the device will be disabled after
       removing the device from device hierarchy
 
   int pm_runtime_idle(struct device *dev);
-    - execute the subsystem-level idle callback for the device; returns 0 on
-      success or error code on failure, where -EINPROGRESS means that
-      ->runtime_idle() is already being executed
+    - execute the subsystem-level idle callback for the device; returns an
+      error code on failure, where -EINPROGRESS means that ->runtime_idle() is
+      already being executed; if there is no callback or the callback returns 0
+      then run pm_runtime_autosuspend(dev) and return its result
 
   int pm_runtime_suspend(struct device *dev);
     - execute the subsystem-level suspend callback for the device; returns 0 on
-      success, 1 if the device's run-time PM status was already 'suspended', or
+      success, 1 if the device's runtime PM status was already 'suspended', or
       error code on failure, where -EAGAIN or -EBUSY means it is safe to attempt
-      to suspend the device again in future
+      to suspend the device again in future and -EACCES means that
+      'power.disable_depth' is different from 0
 
   int pm_runtime_autosuspend(struct device *dev);
     - same as pm_runtime_suspend() except that the autosuspend delay is taken
@@ -287,10 +328,11 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h:
 
   int pm_runtime_resume(struct device *dev);
     - execute the subsystem-level resume callback for the device; returns 0 on
-      success, 1 if the device's run-time PM status was already 'active' or
+      success, 1 if the device's runtime PM status was already 'active' or
       error code on failure, where -EAGAIN means it may be safe to attempt to
       resume the device again in future, but 'power.runtime_error' should be
-      checked additionally
+      checked additionally, and -EACCES means that 'power.disable_depth' is
+      different from 0
 
   int pm_request_idle(struct device *dev);
     - submit a request to execute the subsystem-level idle callback for the
@@ -307,7 +349,7 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h:
       device in future, where 'delay' is the time to wait before queuing up a
       suspend work item in pm_wq, in milliseconds (if 'delay' is zero, the work
       item is queued up immediately); returns 0 on success, 1 if the device's PM
-      run-time status was already 'suspended', or error code if the request
+      runtime status was already 'suspended', or error code if the request
       hasn't been scheduled (or queued up if 'delay' is 0); if the execution of
       ->runtime_suspend() is already scheduled and not yet expired, the new
       value of 'delay' will be used as the time to wait
@@ -315,7 +357,7 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h:
   int pm_request_resume(struct device *dev);
     - submit a request to execute the subsystem-level resume callback for the
       device (the request is represented by a work item in pm_wq); returns 0 on
-      success, 1 if the device's run-time PM status was already 'active', or
+      success, 1 if the device's runtime PM status was already 'active', or
       error code if the request hasn't been queued up
 
   void pm_runtime_get_noresume(struct device *dev);
@@ -344,27 +386,41 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h:
     - decrement the device's usage counter; if the result is 0 then run
       pm_runtime_idle(dev) and return its result
 
+  int pm_runtime_put_sync_suspend(struct device *dev);
+    - decrement the device's usage counter; if the result is 0 then run
+      pm_runtime_suspend(dev) and return its result
+
   int pm_runtime_put_sync_autosuspend(struct device *dev);
     - decrement the device's usage counter; if the result is 0 then run
       pm_runtime_autosuspend(dev) and return its result
 
   void pm_runtime_enable(struct device *dev);
-    - enable the run-time PM helper functions to run the device bus type's
-      run-time PM callbacks described in Section 2
+    - decrement the device's 'power.disable_depth' field; if that field is equal
+      to zero, the runtime PM helper functions can execute subsystem-level
+      callbacks described in Section 2 for the device
 
   int pm_runtime_disable(struct device *dev);
-    - prevent the run-time PM helper functions from running subsystem-level
-      run-time PM callbacks for the device, make sure that all of the pending
-      run-time PM operations on the device are either completed or canceled;
-      returns 1 if there was a resume request pending and it was necessary to
-      execute the subsystem-level resume callback for the device to satisfy that
-      request, otherwise 0 is returned
+    - increment the device's 'power.disable_depth' field (if the value of that
+      field was previously zero, this prevents subsystem-level runtime PM
+      callbacks from being run for the device), make sure that all of the
+      pending runtime PM operations on the device are either completed or
+      canceled; returns 1 if there was a resume request pending and it was
+      necessary to execute the subsystem-level resume callback for the device
+      to satisfy that request, otherwise 0 is returned
+
+  int pm_runtime_barrier(struct device *dev);
+    - check if there's a resume request pending for the device and resume it
+      (synchronously) in that case, cancel any other pending runtime PM requests
+      regarding it and wait for all runtime PM operations on it in progress to
+      complete; returns 1 if there was a resume request pending and it was
+      necessary to execute the subsystem-level resume callback for the device to
+      satisfy that request, otherwise 0 is returned
 
   void pm_suspend_ignore_children(struct device *dev, bool enable);
     - set/unset the power.ignore_children flag of the device
 
   int pm_runtime_set_active(struct device *dev);
-    - clear the device's 'power.runtime_error' flag, set the device's run-time
+    - clear the device's 'power.runtime_error' flag, set the device's runtime
       PM status to 'active' and update its parent's counter of 'active'
       children as appropriate (it is only valid to use this function if
       'power.runtime_error' is set or 'power.disable_depth' is greater than
@@ -372,15 +428,26 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h:
       which is not active and the 'power.ignore_children' flag of which is unset
 
   void pm_runtime_set_suspended(struct device *dev);
-    - clear the device's 'power.runtime_error' flag, set the device's run-time
+    - clear the device's 'power.runtime_error' flag, set the device's runtime
       PM status to 'suspended' and update its parent's counter of 'active'
       children as appropriate (it is only valid to use this function if
       'power.runtime_error' is set or 'power.disable_depth' is greater than
       zero)
 
+  bool pm_runtime_active(struct device *dev);
+    - return true if the device's runtime PM status is 'active' or its
+      'power.disable_depth' field is not equal to zero, or false otherwise
+
   bool pm_runtime_suspended(struct device *dev);
-    - return true if the device's runtime PM status is 'suspended', or false
-      otherwise
+    - return true if the device's runtime PM status is 'suspended' and its
+      'power.disable_depth' field is equal to zero, or false otherwise
+
+  bool pm_runtime_status_suspended(struct device *dev);
+    - return true if the device's runtime PM status is 'suspended'
+
+  bool pm_runtime_suspended_if_enabled(struct device *dev);
+    - return true if the device's runtime PM status is 'suspended' and its
+      'power.disable_depth' field is equal to 1
 
   void pm_runtime_allow(struct device *dev);
     - set the power.runtime_auto flag for the device and decrease its usage
@@ -393,10 +460,14 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h:
       effectively prevent the device from being power managed at run time)
 
   void pm_runtime_no_callbacks(struct device *dev);
-    - set the power.no_callbacks flag for the device and remove the run-time
+    - set the power.no_callbacks flag for the device and remove the runtime
       PM attributes from /sys/devices/.../power (or prevent them from being
       added when the device is registered)
 
+  void pm_runtime_irq_safe(struct device *dev);
+    - set the power.irq_safe flag for the device, causing the runtime-PM
+      callbacks to be invoked with interrupts off
+
   void pm_runtime_mark_last_busy(struct device *dev);
     - set the power.last_busy field to the current time
 
@@ -408,7 +479,7 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h:
 
   void pm_runtime_set_autosuspend_delay(struct device *dev, int delay);
     - set the power.autosuspend_delay value to 'delay' (expressed in
-      milliseconds); if 'delay' is negative then run-time suspends are
+      milliseconds); if 'delay' is negative then runtime suspends are
       prevented
 
   unsigned long pm_runtime_autosuspend_expiration(struct device *dev);
@@ -438,76 +509,102 @@ pm_runtime_suspended()
 pm_runtime_mark_last_busy()
 pm_runtime_autosuspend_expiration()
 
-5. Run-time PM Initialization, Device Probing and Removal
+If pm_runtime_irq_safe() has been called for a device then the following helper
+functions may also be used in interrupt context:
+
+pm_runtime_idle()
+pm_runtime_suspend()
+pm_runtime_autosuspend()
+pm_runtime_resume()
+pm_runtime_get_sync()
+pm_runtime_put_sync()
+pm_runtime_put_sync_suspend()
+pm_runtime_put_sync_autosuspend()
+
+5. Runtime PM Initialization, Device Probing and Removal
 
-Initially, the run-time PM is disabled for all devices, which means that the
-majority of the run-time PM helper funtions described in Section 4 will return
+Initially, the runtime PM is disabled for all devices, which means that the
+majority of the runtime PM helper funtions described in Section 4 will return
 -EAGAIN until pm_runtime_enable() is called for the device.
 
-In addition to that, the initial run-time PM status of all devices is
+In addition to that, the initial runtime PM status of all devices is
 'suspended', but it need not reflect the actual physical state of the device.
 Thus, if the device is initially active (i.e. it is able to process I/O), its
-run-time PM status must be changed to 'active', with the help of
+runtime PM status must be changed to 'active', with the help of
 pm_runtime_set_active(), before pm_runtime_enable() is called for the device.
 
-However, if the device has a parent and the parent's run-time PM is enabled,
+However, if the device has a parent and the parent's runtime PM is enabled,
 calling pm_runtime_set_active() for the device will affect the parent, unless
 the parent's 'power.ignore_children' flag is set.  Namely, in that case the
 parent won't be able to suspend at run time, using the PM core's helper
 functions, as long as the child's status is 'active', even if the child's
-run-time PM is still disabled (i.e. pm_runtime_enable() hasn't been called for
+runtime PM is still disabled (i.e. pm_runtime_enable() hasn't been called for
 the child yet or pm_runtime_disable() has been called for it).  For this reason,
 once pm_runtime_set_active() has been called for the device, pm_runtime_enable()
-should be called for it too as soon as reasonably possible or its run-time PM
+should be called for it too as soon as reasonably possible or its runtime PM
 status should be changed back to 'suspended' with the help of
 pm_runtime_set_suspended().
 
-If the default initial run-time PM status of the device (i.e. 'suspended')
+If the default initial runtime PM status of the device (i.e. 'suspended')
 reflects the actual state of the device, its bus type's or its driver's
 ->probe() callback will likely need to wake it up using one of the PM core's
 helper functions described in Section 4.  In that case, pm_runtime_resume()
-should be used.  Of course, for this purpose the device's run-time PM has to be
+should be used.  Of course, for this purpose the device's runtime PM has to be
 enabled earlier by calling pm_runtime_enable().
 
-If the device bus type's or driver's ->probe() or ->remove() callback runs
-pm_runtime_suspend() or pm_runtime_idle() or their asynchronous counterparts,
-they will fail returning -EAGAIN, because the device's usage counter is
-incremented by the core before executing ->probe() and ->remove().  Still, it
-may be desirable to suspend the device as soon as ->probe() or ->remove() has
-finished, so the PM core uses pm_runtime_idle_sync() to invoke the
-subsystem-level idle callback for the device at that time.
+It may be desirable to suspend the device once ->probe() has finished.
+Therefore the driver core uses the asyncronous pm_request_idle() to submit a
+request to execute the subsystem-level idle callback for the device at that
+time.  A driver that makes use of the runtime autosuspend feature, may want to
+update the last busy mark before returning from ->probe().
+
+Moreover, the driver core prevents runtime PM callbacks from racing with the bus
+notifier callback in __device_release_driver(), which is necessary, because the
+notifier is used by some subsystems to carry out operations affecting the
+runtime PM functionality.  It does so by calling pm_runtime_get_sync() before
+driver_sysfs_remove() and the BUS_NOTIFY_UNBIND_DRIVER notifications.  This
+resumes the device if it's in the suspended state and prevents it from
+being suspended again while those routines are being executed.
+
+To allow bus types and drivers to put devices into the suspended state by
+calling pm_runtime_suspend() from their ->remove() routines, the driver core
+executes pm_runtime_put_sync() after running the BUS_NOTIFY_UNBIND_DRIVER
+notifications in __device_release_driver().  This requires bus types and
+drivers to make their ->remove() callbacks avoid races with runtime PM directly,
+but also it allows of more flexibility in the handling of devices during the
+removal of their drivers.
 
 The user space can effectively disallow the driver of the device to power manage
 it at run time by changing the value of its /sys/devices/.../power/control
 attribute to "on", which causes pm_runtime_forbid() to be called.  In principle,
 this mechanism may also be used by the driver to effectively turn off the
-run-time power management of the device until the user space turns it on.
-Namely, during the initialization the driver can make sure that the run-time PM
+runtime power management of the device until the user space turns it on.
+Namely, during the initialization the driver can make sure that the runtime PM
 status of the device is 'active' and call pm_runtime_forbid().  It should be
 noted, however, that if the user space has already intentionally changed the
 value of /sys/devices/.../power/control to "auto" to allow the driver to power
 manage the device at run time, the driver may confuse it by using
 pm_runtime_forbid() this way.
 
-6. Run-time PM and System Sleep
+6. Runtime PM and System Sleep
 
-Run-time PM and system sleep (i.e., system suspend and hibernation, also known
+Runtime PM and system sleep (i.e., system suspend and hibernation, also known
 as suspend-to-RAM and suspend-to-disk) interact with each other in a couple of
 ways.  If a device is active when a system sleep starts, everything is
 straightforward.  But what should happen if the device is already suspended?
 
-The device may have different wake-up settings for run-time PM and system sleep.
-For example, remote wake-up may be enabled for run-time suspend but disallowed
+The device may have different wake-up settings for runtime PM and system sleep.
+For example, remote wake-up may be enabled for runtime suspend but disallowed
 for system sleep (device_may_wakeup(dev) returns 'false').  When this happens,
 the subsystem-level system suspend callback is responsible for changing the
 device's wake-up setting (it may leave that to the device driver's system
 suspend routine).  It may be necessary to resume the device and suspend it again
 in order to do so.  The same is true if the driver uses different power levels
-or other settings for run-time suspend and system sleep.
+or other settings for runtime suspend and system sleep.
 
-During system resume, devices generally should be brought back to full power,
-even if they were suspended before the system sleep began.  There are several
-reasons for this, including:
+During system resume, the simplest approach is to bring all devices back to full
+power, even if they had been suspended before the system suspend began.  There
+are several reasons for this, including:
 
   * The device might need to switch power levels, wake-up settings, etc.
 
@@ -522,22 +619,64 @@ reasons for this, including:
   * The device might need to be reset.
 
   * Even though the device was suspended, if its usage counter was > 0 then most
-    likely it would need a run-time resume in the near future anyway.
+    likely it would need a runtime resume in the near future anyway.
 
-  * Always going back to full power is simplest.
-
-If the device was suspended before the sleep began, then its run-time PM status
-will have to be updated to reflect the actual post-system sleep status.  The way
-to do this is:
+If the device had been suspended before the system suspend began and it's
+brought back to full power during resume, then its runtime PM status will have
+to be updated to reflect the actual post-system sleep status.  The way to do
+this is:
 
 	pm_runtime_disable(dev);
 	pm_runtime_set_active(dev);
 	pm_runtime_enable(dev);
 
-The PM core always increments the run-time usage counter before calling the
-->prepare() callback and decrements it after calling the ->complete() callback.
-Hence disabling run-time PM temporarily like this will not cause any run-time
-suspend callbacks to be lost.
+The PM core always increments the runtime usage counter before calling the
+->suspend() callback and decrements it after calling the ->resume() callback.
+Hence disabling runtime PM temporarily like this will not cause any runtime
+suspend attempts to be permanently lost.  If the usage count goes to zero
+following the return of the ->resume() callback, the ->runtime_idle() callback
+will be invoked as usual.
+
+On some systems, however, system sleep is not entered through a global firmware
+or hardware operation.  Instead, all hardware components are put into low-power
+states directly by the kernel in a coordinated way.  Then, the system sleep
+state effectively follows from the states the hardware components end up in
+and the system is woken up from that state by a hardware interrupt or a similar
+mechanism entirely under the kernel's control.  As a result, the kernel never
+gives control away and the states of all devices during resume are precisely
+known to it.  If that is the case and none of the situations listed above takes
+place (in particular, if the system is not waking up from hibernation), it may
+be more efficient to leave the devices that had been suspended before the system
+suspend began in the suspended state.
+
+To this end, the PM core provides a mechanism allowing some coordination between
+different levels of device hierarchy.  Namely, if a system suspend .prepare()
+callback returns a positive number for a device, that indicates to the PM core
+that the device appears to be runtime-suspended and its state is fine, so it
+may be left in runtime suspend provided that all of its descendants are also
+left in runtime suspend.  If that happens, the PM core will not execute any
+system suspend and resume callbacks for all of those devices, except for the
+complete callback, which is then entirely responsible for handling the device
+as appropriate.  This only applies to system suspend transitions that are not
+related to hibernation (see Documentation/power/devices.txt for more
+information).
+
+The PM core does its best to reduce the probability of race conditions between
+the runtime PM and system suspend/resume (and hibernation) callbacks by carrying
+out the following operations:
+
+  * During system suspend pm_runtime_get_noresume() is called for every device
+    right before executing the subsystem-level .prepare() callback for it and
+    pm_runtime_barrier() is called for every device right before executing the
+    subsystem-level .suspend() callback for it.  In addition to that the PM core
+    calls  __pm_runtime_disable() with 'false' as the second argument for every
+    device right before executing the subsystem-level .suspend_late() callback
+    for it.
+
+  * During system resume pm_runtime_enable() and pm_runtime_put() are called for
+    every device right after executing the subsystem-level .resume_early()
+    callback and right after executing the subsystem-level .complete() callback
+    for it, respectively.
 
 7. Generic subsystem callbacks
 
@@ -545,58 +684,78 @@ Subsystems may wish to conserve code space by using the set of generic power
 management callbacks provided by the PM core, defined in
 driver/base/power/generic_ops.c:
 
-  int pm_generic_runtime_idle(struct device *dev);
-    - invoke the ->runtime_idle() callback provided by the driver of this
-      device, if defined, and call pm_runtime_suspend() for this device if the
-      return value is 0 or the callback is not defined
-
   int pm_generic_runtime_suspend(struct device *dev);
     - invoke the ->runtime_suspend() callback provided by the driver of this
-      device and return its result, or return -EINVAL if not defined
+      device and return its result, or return 0 if not defined
 
   int pm_generic_runtime_resume(struct device *dev);
     - invoke the ->runtime_resume() callback provided by the driver of this
-      device and return its result, or return -EINVAL if not defined
+      device and return its result, or return 0 if not defined
 
   int pm_generic_suspend(struct device *dev);
     - if the device has not been suspended at run time, invoke the ->suspend()
       callback provided by its driver and return its result, or return 0 if not
       defined
 
+  int pm_generic_suspend_noirq(struct device *dev);
+    - if pm_runtime_suspended(dev) returns "false", invoke the ->suspend_noirq()
+      callback provided by the device's driver and return its result, or return
+      0 if not defined
+
   int pm_generic_resume(struct device *dev);
     - invoke the ->resume() callback provided by the driver of this device and,
       if successful, change the device's runtime PM status to 'active'
 
+  int pm_generic_resume_noirq(struct device *dev);
+    - invoke the ->resume_noirq() callback provided by the driver of this device
+
   int pm_generic_freeze(struct device *dev);
     - if the device has not been suspended at run time, invoke the ->freeze()
       callback provided by its driver and return its result, or return 0 if not
       defined
 
+  int pm_generic_freeze_noirq(struct device *dev);
+    - if pm_runtime_suspended(dev) returns "false", invoke the ->freeze_noirq()
+      callback provided by the device's driver and return its result, or return
+      0 if not defined
+
   int pm_generic_thaw(struct device *dev);
     - if the device has not been suspended at run time, invoke the ->thaw()
       callback provided by its driver and return its result, or return 0 if not
       defined
 
+  int pm_generic_thaw_noirq(struct device *dev);
+    - if pm_runtime_suspended(dev) returns "false", invoke the ->thaw_noirq()
+      callback provided by the device's driver and return its result, or return
+      0 if not defined
+
   int pm_generic_poweroff(struct device *dev);
     - if the device has not been suspended at run time, invoke the ->poweroff()
       callback provided by its driver and return its result, or return 0 if not
       defined
 
+  int pm_generic_poweroff_noirq(struct device *dev);
+    - if pm_runtime_suspended(dev) returns "false", run the ->poweroff_noirq()
+      callback provided by the device's driver and return its result, or return
+      0 if not defined
+
   int pm_generic_restore(struct device *dev);
     - invoke the ->restore() callback provided by the driver of this device and,
       if successful, change the device's runtime PM status to 'active'
 
-These functions can be assigned to the ->runtime_idle(), ->runtime_suspend(),
-->runtime_resume(), ->suspend(), ->resume(), ->freeze(), ->thaw(), ->poweroff(),
-or ->restore() callback pointers in the subsystem-level dev_pm_ops structures.
+  int pm_generic_restore_noirq(struct device *dev);
+    - invoke the ->restore_noirq() callback provided by the device's driver
 
-If a subsystem wishes to use all of them at the same time, it can simply assign
-the GENERIC_SUBSYS_PM_OPS macro, defined in include/linux/pm.h, to its
-dev_pm_ops structure pointer.
+These functions are the defaults used by the PM core, if a subsystem doesn't
+provide its own callbacks for ->runtime_idle(), ->runtime_suspend(),
+->runtime_resume(), ->suspend(), ->suspend_noirq(), ->resume(),
+->resume_noirq(), ->freeze(), ->freeze_noirq(), ->thaw(), ->thaw_noirq(),
+->poweroff(), ->poweroff_noirq(), ->restore(), ->restore_noirq() in the
+subsystem-level dev_pm_ops structure.
 
 Device drivers that wish to use the same function as a system suspend, freeze,
-poweroff and run-time suspend callback, and similarly for system resume, thaw,
-restore, and run-time resume, can achieve this with the help of the
+poweroff and runtime suspend callback, and similarly for system resume, thaw,
+restore, and runtime resume, can achieve this with the help of the
 UNIVERSAL_DEV_PM_OPS macro defined in include/linux/pm.h (possibly setting its
 last argument to NULL).
 
@@ -606,7 +765,7 @@ Some "devices" are only logical sub-devices of their parent and cannot be
 power-managed on their own.  (The prototype example is a USB interface.  Entire
 USB devices can go into low-power mode or send wake-up requests, but neither is
 possible for individual interfaces.)  The drivers for these devices have no
-need of run-time PM callbacks; if the callbacks did exist, ->runtime_suspend()
+need of runtime PM callbacks; if the callbacks did exist, ->runtime_suspend()
 and ->runtime_resume() would always return 0 without doing anything else and
 ->runtime_idle() would always call pm_runtime_suspend().
 
@@ -614,7 +773,7 @@ Subsystems can tell the PM core about these devices by calling
 pm_runtime_no_callbacks().  This should be done after the device structure is
 initialized and before it is registered (although after device registration is
 also okay).  The routine will set the device's power.no_callbacks flag and
-prevent the non-debugging run-time PM sysfs attributes from being created.
+prevent the non-debugging runtime PM sysfs attributes from being created.
 
 When power.no_callbacks is set, the PM core will not invoke the
 ->runtime_idle(), ->runtime_suspend(), or ->runtime_resume() callbacks.
@@ -622,7 +781,7 @@ Instead it will assume that suspends and resumes always succeed and that idle
 devices should be suspended.
 
 As a consequence, the PM core will never directly inform the device's subsystem
-or driver about run-time power changes.  Instead, the driver for the device's
+or driver about runtime power changes.  Instead, the driver for the device's
 parent must take responsibility for telling the device's driver when the
 parent's power state changes.
 
@@ -633,13 +792,13 @@ A device should be put in a low-power state only when there's some reason to
 think it will remain in that state for a substantial time.  A common heuristic
 says that a device which hasn't been used for a while is liable to remain
 unused; following this advice, drivers should not allow devices to be suspended
-at run-time until they have been inactive for some minimum period.  Even when
+at runtime until they have been inactive for some minimum period.  Even when
 the heuristic ends up being non-optimal, it will still prevent devices from
 "bouncing" too rapidly between low-power and full-power states.
 
 The term "autosuspend" is an historical remnant.  It doesn't mean that the
 device is automatically suspended (the subsystem or driver still has to call
-the appropriate PM routines); rather it means that run-time suspends will
+the appropriate PM routines); rather it means that runtime suspends will
 automatically be delayed until the desired period of inactivity has elapsed.
 
 Inactivity is determined based on the power.last_busy field.  Drivers should
@@ -665,6 +824,16 @@ will behave normally, not taking the autosuspend delay into account.
 Similarly, if the power.use_autosuspend field isn't set then the autosuspend
 helper functions will behave just like the non-autosuspend counterparts.
 
+Under some circumstances a driver or subsystem may want to prevent a device
+from autosuspending immediately, even though the usage counter is zero and the
+autosuspend delay time has expired.  If the ->runtime_suspend() callback
+returns -EAGAIN or -EBUSY, and if the next autosuspend delay expiration time is
+in the future (as it normally would be if the callback invoked
+pm_runtime_mark_last_busy()), the PM core will automatically reschedule the
+autosuspend.  The ->runtime_suspend() callback can't do this rescheduling
+itself because no suspend requests of any kind are accepted while the device is
+suspending (i.e., while the callback is running).
+
 The implementation is well suited for asynchronous use in interrupt contexts.
 However such use inevitably involves races, because the PM core can't
 synchronize ->runtime_suspend() callbacks with the arrival of I/O requests.
@@ -720,7 +889,7 @@ Here is a schematic pseudo-code example:
 		foo->is_suspended = 0;
 		pm_runtime_mark_last_busy(&foo->dev);
 		if (foo->num_pending_requests > 0)
-			foo_process_requests(foo);
+			foo_process_next_request(foo);
 		unlock(&foo->private_lock);
 		return 0;
 	}
diff --git a/Documentation/power/states.txt b/Documentation/power/states.txt
index 34800cc521b..50f3ef9177c 100644
--- a/Documentation/power/states.txt
+++ b/Documentation/power/states.txt
@@ -1,54 +1,87 @@
+System Power Management Sleep States
 
-System Power Management States
+(C) 2014 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
 
+The kernel supports up to four system sleep states generically, although three
+of them depend on the platform support code to implement the low-level details
+for each state.
 
-The kernel supports three power management states generically, though
-each is dependent on platform support code to implement the low-level
-details for each state. This file describes each state, what they are
-commonly called, what ACPI state they map to, and what string to write
-to /sys/power/state to enter that state
+The states are represented by strings that can be read or written to the
+/sys/power/state file.  Those strings may be "mem", "standby", "freeze" and
+"disk", where the last one always represents hibernation (Suspend-To-Disk) and
+the meaning of the remaining ones depends on the relative_sleep_states command
+line argument.
+
+For relative_sleep_states=1, the strings "mem", "standby" and "freeze" label the
+available non-hibernation sleep states from the deepest to the shallowest,
+respectively.  In that case, "mem" is always present in /sys/power/state,
+because there is at least one non-hibernation sleep state in every system.  If
+the given system supports two non-hibernation sleep states, "standby" is present
+in /sys/power/state in addition to "mem".  If the system supports three
+non-hibernation sleep states, "freeze" will be present in /sys/power/state in
+addition to "mem" and "standby".
+
+For relative_sleep_states=0, which is the default, the following descriptions
+apply.
+
+state:		Suspend-To-Idle
+ACPI state:	S0
+Label:		"freeze"
+
+This state is a generic, pure software, light-weight, system sleep state.
+It allows more energy to be saved relative to runtime idle by freezing user
+space and putting all I/O devices into low-power states (possibly
+lower-power than available at run time), such that the processors can
+spend more time in their idle states.
+
+This state can be used for platforms without Power-On Suspend/Suspend-to-RAM
+support, or it can be used in addition to Suspend-to-RAM (memory sleep)
+to provide reduced resume latency.  It is always supported.
 
 
 State:		Standby / Power-On Suspend
 ACPI State:	S1
-String:		"standby"
+Label:		"standby"
 
-This state offers minimal, though real, power savings, while providing
-a very low-latency transition back to a working system. No operating
-state is lost (the CPU retains power), so the system easily starts up
+This state, if supported, offers moderate, though real, power savings, while
+providing a relatively low-latency transition back to a working system.  No
+operating state is lost (the CPU retains power), so the system easily starts up
 again where it left off. 
 
-We try to put devices in a low-power state equivalent to D1, which
-also offers low power savings, but low resume latency. Not all devices
-support D1, and those that don't are left on. 
-
-A transition from Standby to the On state should take about 1-2
-seconds. 
+In addition to freezing user space and putting all I/O devices into low-power
+states, which is done for Suspend-To-Idle too, nonboot CPUs are taken offline
+and all low-level system functions are suspended during transitions into this
+state.  For this reason, it should allow more energy to be saved relative to
+Suspend-To-Idle, but the resume latency will generally be greater than for that
+state.
 
 
 State:		Suspend-to-RAM
 ACPI State:	S3
-String:		"mem"
+Label:		"mem"
 
-This state offers significant power savings as everything in the
-system is put into a low-power state, except for memory, which is
-placed in self-refresh mode to retain its contents. 
+This state, if supported, offers significant power savings as everything in the
+system is put into a low-power state, except for memory, which should be placed
+into the self-refresh mode to retain its contents.  All of the steps carried out
+when entering Power-On Suspend are also carried out during transitions to STR.
+Additional operations may take place depending on the platform capabilities.  In
+particular, on ACPI systems the kernel passes control to the BIOS (platform
+firmware) as the last step during STR transitions and that usually results in
+powering down some more low-level components that aren't directly controlled by
+the kernel.
 
-System and device state is saved and kept in memory. All devices are
-suspended and put into D3. In many cases, all peripheral buses lose
-power when entering STR, so devices must be able to handle the
-transition back to the On state. 
+System and device state is saved and kept in memory.  All devices are suspended
+and put into low-power states.  In many cases, all peripheral buses lose power
+when entering STR, so devices must be able to handle the transition back to the
+"on" state.
 
-For at least ACPI, STR requires some minimal boot-strapping code to
-resume the system from STR. This may be true on other platforms. 
-
-A transition from Suspend-to-RAM to the On state should take about
-3-5 seconds. 
+For at least ACPI, STR requires some minimal boot-strapping code to resume the
+system from it.  This may be the case on other platforms too.
 
 
 State:		Suspend-to-disk
 ACPI State:	S4
-String:		"disk"
+Label:		"disk"
 
 This state offers the greatest power savings, and can be used even in
 the absence of low-level platform support for power management. This
@@ -62,19 +95,15 @@ setup via another operating system for it to use. Despite the
 inconvenience, this method requires minimal work by the kernel, since
 the firmware will also handle restoring memory contents on resume. 
 
-For suspend-to-disk, a mechanism called swsusp called 'swsusp' (Swap
-Suspend) is used to write memory contents to free swap space.
-swsusp has some restrictive requirements, but should work in most
-cases. Some, albeit outdated, documentation can be found in
-Documentation/power/swsusp.txt. Alternatively, userspace can do most
-of the actual suspend to disk work, see userland-swsusp.txt.
+For suspend-to-disk, a mechanism called 'swsusp' (Swap Suspend) is used
+to write memory contents to free swap space. swsusp has some restrictive
+requirements, but should work in most cases. Some, albeit outdated,
+documentation can be found in Documentation/power/swsusp.txt.
+Alternatively, userspace can do most of the actual suspend to disk work,
+see userland-swsusp.txt.
 
 Once memory state is written to disk, the system may either enter a
 low-power state (like ACPI S4), or it may simply power down. Powering
 down offers greater savings, and allows this mechanism to work on any
 system. However, entering a real low-power state allows the user to
 trigger wake up events (e.g. pressing a key or opening a laptop lid).
-
-A transition from Suspend-to-Disk to the On state should take about 30
-seconds, though it's typically a bit more with the current
-implementation. 
diff --git a/Documentation/power/suspend-and-cpuhotplug.txt b/Documentation/power/suspend-and-cpuhotplug.txt
new file mode 100644
index 00000000000..2850df3bf95
--- /dev/null
+++ b/Documentation/power/suspend-and-cpuhotplug.txt
@@ -0,0 +1,275 @@
+Interaction of Suspend code (S3) with the CPU hotplug infrastructure
+
+     (C) 2011 - 2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
+
+
+I. How does the regular CPU hotplug code differ from how the Suspend-to-RAM
+   infrastructure uses it internally? And where do they share common code?
+
+Well, a picture is worth a thousand words... So ASCII art follows :-)
+
+[This depicts the current design in the kernel, and focusses only on the
+interactions involving the freezer and CPU hotplug and also tries to explain
+the locking involved. It outlines the notifications involved as well.
+But please note that here, only the call paths are illustrated, with the aim
+of describing where they take different paths and where they share code.
+What happens when regular CPU hotplug and Suspend-to-RAM race with each other
+is not depicted here.]
+
+On a high level, the suspend-resume cycle goes like this:
+
+|Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw |
+|tasks |    |     cpus      |    |          |    |     cpus     |    |tasks|
+
+
+More details follow:
+
+                                Suspend call path
+                                -----------------
+
+                                  Write 'mem' to
+                                /sys/power/state
+                                    sysfs file
+                                        |
+                                        v
+                               Acquire pm_mutex lock
+                                        |
+                                        v
+                             Send PM_SUSPEND_PREPARE
+                                   notifications
+                                        |
+                                        v
+                                   Freeze tasks
+                                        |
+                                        |
+                                        v
+                              disable_nonboot_cpus()
+                                   /* start */
+                                        |
+                                        v
+                            Acquire cpu_add_remove_lock
+                                        |
+                                        v
+                             Iterate over CURRENTLY
+                                   online CPUs
+                                        |
+                                        |
+                                        |                ----------
+                                        v                          | L
+             ======>               _cpu_down()                     |
+            |              [This takes cpuhotplug.lock             |
+  Common    |               before taking down the CPU             |
+   code     |               and releases it when done]             | O
+            |            While it is at it, notifications          |
+            |            are sent when notable events occur,       |
+             ======>     by running all registered callbacks.      |
+                                        |                          | O
+                                        |                          |
+                                        |                          |
+                                        v                          |
+                            Note down these cpus in                | P
+                                frozen_cpus mask         ----------
+                                        |
+                                        v
+                           Disable regular cpu hotplug
+                        by setting cpu_hotplug_disabled=1
+                                        |
+                                        v
+                            Release cpu_add_remove_lock
+                                        |
+                                        v
+                       /* disable_nonboot_cpus() complete */
+                                        |
+                                        v
+                                   Do suspend
+
+
+
+Resuming back is likewise, with the counterparts being (in the order of
+execution during resume):
+* enable_nonboot_cpus() which involves:
+   |  Acquire cpu_add_remove_lock
+   |  Reset cpu_hotplug_disabled to 0, thereby enabling regular cpu hotplug
+   |  Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop]
+   |  Release cpu_add_remove_lock
+   v
+
+* thaw tasks
+* send PM_POST_SUSPEND notifications
+* Release pm_mutex lock.
+
+
+It is to be noted here that the pm_mutex lock is acquired at the very
+beginning, when we are just starting out to suspend, and then released only
+after the entire cycle is complete (i.e., suspend + resume).
+
+
+
+                          Regular CPU hotplug call path
+                          -----------------------------
+
+                                Write 0 (or 1) to
+                       /sys/devices/system/cpu/cpu*/online
+                                    sysfs file
+                                        |
+                                        |
+                                        v
+                                    cpu_down()
+                                        |
+                                        v
+                           Acquire cpu_add_remove_lock
+                                        |
+                                        v
+                          If cpu_hotplug_disabled is 1
+                                return gracefully
+                                        |
+                                        |
+                                        v
+             ======>                _cpu_down()
+            |              [This takes cpuhotplug.lock
+  Common    |               before taking down the CPU
+   code     |               and releases it when done]
+            |            While it is at it, notifications
+            |           are sent when notable events occur,
+             ======>    by running all registered callbacks.
+                                        |
+                                        |
+                                        v
+                          Release cpu_add_remove_lock
+                               [That's it!, for
+                              regular CPU hotplug]
+
+
+
+So, as can be seen from the two diagrams (the parts marked as "Common code"),
+regular CPU hotplug and the suspend code path converge at the _cpu_down() and
+_cpu_up() functions. They differ in the arguments passed to these functions,
+in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen'
+argument. But during suspend, since the tasks are already frozen by the time
+the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called
+with the 'tasks_frozen' argument set to 1.
+[See below for some known issues regarding this.]
+
+
+Important files and functions/entry points:
+------------------------------------------
+
+kernel/power/process.c : freeze_processes(), thaw_processes()
+kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish()
+kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus()
+
+
+
+II. What are the issues involved in CPU hotplug?
+    -------------------------------------------
+
+There are some interesting situations involving CPU hotplug and microcode
+update on the CPUs, as discussed below:
+
+[Please bear in mind that the kernel requests the microcode images from
+userspace, using the request_firmware() function defined in
+drivers/base/firmware_class.c]
+
+
+a. When all the CPUs are identical:
+
+   This is the most common situation and it is quite straightforward: we want
+   to apply the same microcode revision to each of the CPUs.
+   To give an example of x86, the collect_cpu_info() function defined in
+   arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU
+   and thereby in applying the correct microcode revision to it.
+   But note that the kernel does not maintain a common microcode image for the
+   all CPUs, in order to handle case 'b' described below.
+
+
+b. When some of the CPUs are different than the rest:
+
+   In this case since we probably need to apply different microcode revisions
+   to different CPUs, the kernel maintains a copy of the correct microcode
+   image for each CPU (after appropriate CPU type/model discovery using
+   functions such as collect_cpu_info()).
+
+
+c. When a CPU is physically hot-unplugged and a new (and possibly different
+   type of) CPU is hot-plugged into the system:
+
+   In the current design of the kernel, whenever a CPU is taken offline during
+   a regular CPU hotplug operation, upon receiving the CPU_DEAD notification
+   (which is sent by the CPU hotplug code), the microcode update driver's
+   callback for that event reacts by freeing the kernel's copy of the
+   microcode image for that CPU.
+
+   Hence, when a new CPU is brought online, since the kernel finds that it
+   doesn't have the microcode image, it does the CPU type/model discovery
+   afresh and then requests the userspace for the appropriate microcode image
+   for that CPU, which is subsequently applied.
+
+   For example, in x86, the mc_cpu_callback() function (which is the microcode
+   update driver's callback registered for CPU hotplug events) calls
+   microcode_update_cpu() which would call microcode_init_cpu() in this case,
+   instead of microcode_resume_cpu() when it finds that the kernel doesn't
+   have a valid microcode image. This ensures that the CPU type/model
+   discovery is performed and the right microcode is applied to the CPU after
+   getting it from userspace.
+
+
+d. Handling microcode update during suspend/hibernate:
+
+   Strictly speaking, during a CPU hotplug operation which does not involve
+   physically removing or inserting CPUs, the CPUs are not actually powered
+   off during a CPU offline. They are just put to the lowest C-states possible.
+   Hence, in such a case, it is not really necessary to re-apply microcode
+   when the CPUs are brought back online, since they wouldn't have lost the
+   image during the CPU offline operation.
+
+   This is the usual scenario encountered during a resume after a suspend.
+   However, in the case of hibernation, since all the CPUs are completely
+   powered off, during restore it becomes necessary to apply the microcode
+   images to all the CPUs.
+
+   [Note that we don't expect someone to physically pull out nodes and insert
+   nodes with a different type of CPUs in-between a suspend-resume or a
+   hibernate/restore cycle.]
+
+   In the current design of the kernel however, during a CPU offline operation
+   as part of the suspend/hibernate cycle (the CPU_DEAD_FROZEN notification),
+   the existing copy of microcode image in the kernel is not freed up.
+   And during the CPU online operations (during resume/restore), since the
+   kernel finds that it already has copies of the microcode images for all the
+   CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU
+   type/model and the need for validating whether the microcode revisions are
+   right for the CPUs or not (due to the above assumption that physical CPU
+   hotplug will not be done in-between suspend/resume or hibernate/restore
+   cycles).
+
+
+III. Are there any known problems when regular CPU hotplug and suspend race
+     with each other?
+
+Yes, they are listed below:
+
+1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to
+   the _cpu_down() and _cpu_up() functions is *always* 0.
+   This might not reflect the true current state of the system, since the
+   tasks could have been frozen by an out-of-band event such as a suspend
+   operation in progress. Hence, it will lead to wrong notifications being
+   sent during the cpu online/offline events (eg, CPU_ONLINE notification
+   instead of CPU_ONLINE_FROZEN) which in turn will lead to execution of
+   inappropriate code by the callbacks registered for such CPU hotplug events.
+
+2. If a regular CPU hotplug stress test happens to race with the freezer due
+   to a suspend operation in progress at the same time, then we could hit the
+   situation described below:
+
+    * A regular cpu online operation continues its journey from userspace
+      into the kernel, since the freezing has not yet begun.
+    * Then freezer gets to work and freezes userspace.
+    * If cpu online has not yet completed the microcode update stuff by now,
+      it will now start waiting on the frozen userspace in the
+      TASK_UNINTERRUPTIBLE state, in order to get the microcode image.
+    * Now the freezer continues and tries to freeze the remaining tasks. But
+      due to this wait mentioned above, the freezer won't be able to freeze
+      the cpu online hotplug task and hence freezing of tasks fails.
+
+   As a result of this task freezing failure, the suspend operation gets
+   aborted.
diff --git a/Documentation/power/swsusp.txt b/Documentation/power/swsusp.txt
index ea718891a66..f732a8321e8 100644
--- a/Documentation/power/swsusp.txt
+++ b/Documentation/power/swsusp.txt
@@ -33,6 +33,11 @@ echo shutdown > /sys/power/disk; echo disk > /sys/power/state
 
 echo platform > /sys/power/disk; echo disk > /sys/power/state
 
+. If you would like to write hibernation image to swap and then suspend
+to RAM (provided your platform supports it), you can try
+
+echo suspend > /sys/power/disk; echo disk > /sys/power/state
+
 . If you have SATA disks, you'll need recent kernels with SATA suspend
 support. For suspend and resume to work, make sure your disk drivers
 are built into kernel -- not modules. [There's way to make
@@ -45,10 +50,23 @@ echo N > /sys/power/image_size
 
 before suspend (it is limited to 500 MB by default).
 
+. The resume process checks for the presence of the resume device,
+if found, it then checks the contents for the hibernation image signature.
+If both are found, it resumes the hibernation image.
+
+. The resume process may be triggered in two ways:
+  1) During lateinit:  If resume=/dev/your_swap_partition is specified on
+     the kernel command line, lateinit runs the resume process.  If the
+     resume device has not been probed yet, the resume process fails and
+     bootup continues.
+  2) Manually from an initrd or initramfs:  May be run from
+     the init script by using the /sys/power/resume file.  It is vital
+     that this be done prior to remounting any filesystems (even as
+     read-only) otherwise data may be corrupted.
 
 Article about goals and implementation of Software Suspend for Linux
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Author: Gábor Kuti
+Author: Gábor Kuti
 Last revised: 2003-10-20 by Pavel Machek
 
 Idea and goals to achieve
@@ -192,7 +210,7 @@ Q: There don't seem to be any generally useful behavioral
 distinctions between SUSPEND and FREEZE.
 
 A: Doing SUSPEND when you are asked to do FREEZE is always correct,
-but it may be unneccessarily slow. If you want your driver to stay simple,
+but it may be unnecessarily slow. If you want your driver to stay simple,
 slowness may not matter to you. It can always be fixed later.
 
 For devices like disk it does matter, you do not want to spindown for
@@ -202,7 +220,10 @@ Q: After resuming, system is paging heavily, leading to very bad interactivity.
 
 A: Try running
 
-cat `cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u` > /dev/null
+cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u | while read file
+do
+  test -f "$file" && cat "$file" > /dev/null
+done
 
 after resume. swapoff -a; swapon -a may also be useful.
 
@@ -237,7 +258,7 @@ disk. Whole sequence goes like
 
       running system, user asks for suspend-to-disk
 
-      user processes are stopped (in common case there are none, but with resume-from-initrd, noone knows)
+      user processes are stopped (in common case there are none, but with resume-from-initrd, no one knows)
 
       read image from disk
 
@@ -321,7 +342,7 @@ Q: How can distributions ship a swsusp-supporting kernel with modular
 disk drivers (especially SATA)?
 
 A: Well, it can be done, load the drivers, then do echo into
-/sys/power/disk/resume file from initrd. Be sure not to mount
+/sys/power/resume file from initrd. Be sure not to mount
 anything, not even read-only mount, or you are going to lose your
 data.
 
diff --git a/Documentation/power/userland-swsusp.txt b/Documentation/power/userland-swsusp.txt
index 81680f9f590..0e870825c1b 100644
--- a/Documentation/power/userland-swsusp.txt
+++ b/Documentation/power/userland-swsusp.txt
@@ -77,7 +77,8 @@ SNAPSHOT_SET_SWAP_AREA - set the resume partition and the offset (in <PAGE_SIZE>
 	resume_swap_area, as defined in kernel/power/suspend_ioctls.h,
 	containing the resume device specification and the offset); for swap
 	partitions the offset is always 0, but it is different from zero for
-	swap files (see Documentation/swsusp-and-swap-files.txt for details).
+	swap files (see Documentation/power/swsusp-and-swap-files.txt for
+	details).
 
 SNAPSHOT_PLATFORM_SUPPORT - enable/disable the hibernation platform support,
 	depending on the argument value (enable, if the argument is nonzero)
@@ -98,7 +99,7 @@ SNAPSHOT_S2RAM - suspend to RAM; using this call causes the kernel to
 The device's read() operation can be used to transfer the snapshot image from
 the kernel.  It has the following limitations:
 - you cannot read() more than one virtual memory page at a time
-- read()s accross page boundaries are impossible (ie. if ypu read() 1/2 of
+- read()s across page boundaries are impossible (ie. if ypu read() 1/2 of
 	a page in the previous call, you will only be able to read()
 	_at_ _most_ 1/2 of the page in the next call)
 
@@ -137,7 +138,7 @@ mechanism and the userland utilities using the interface SHOULD use additional
 means, such as checksums, to ensure the integrity of the snapshot image.
 
 The suspending and resuming utilities MUST lock themselves in memory,
-preferrably using mlockall(), before calling SNAPSHOT_FREEZE.
+preferably using mlockall(), before calling SNAPSHOT_FREEZE.
 
 The suspending utility MUST check the value stored by SNAPSHOT_CREATE_IMAGE
 in the memory location pointed to by the last argument of ioctl() and proceed
@@ -147,7 +148,7 @@ in accordance with it:
 	(a)	The suspending utility MUST NOT close the snapshot device
 		_unless_ the whole suspend procedure is to be cancelled, in
 		which case, if the snapshot image has already been saved, the
-		suspending utility SHOULD destroy it, preferrably by zapping
+		suspending utility SHOULD destroy it, preferably by zapping
 		its header.  If the suspend is not to be cancelled, the
 		system MUST be powered off or rebooted after the snapshot
 		image has been saved.
diff --git a/Documentation/power/video_extension.txt b/Documentation/power/video_extension.txt
deleted file mode 100644
index b2f9b1598ac..00000000000
--- a/Documentation/power/video_extension.txt
+++ /dev/null
@@ -1,37 +0,0 @@
-ACPI video extensions
-~~~~~~~~~~~~~~~~~~~~~
-
-This driver implement the ACPI Extensions For Display Adapters for
-integrated graphics devices on motherboard, as specified in ACPI 2.0
-Specification, Appendix B, allowing to perform some basic control like
-defining the video POST device, retrieving EDID information or to
-setup a video output, etc.  Note that this is an ref. implementation
-only.  It may or may not work for your integrated video device.
-
-Interfaces exposed to userland through /proc/acpi/video:
-
-VGA/info : display the supported video bus device capability like Video ROM, CRT/LCD/TV.
-VGA/ROM :  Used to get a copy of the display devices' ROM data (up to 4k).
-VGA/POST_info : Used to determine what options are implemented.
-VGA/POST : Used to get/set POST device.
-VGA/DOS : Used to get/set ownership of output switching:
-	Please refer ACPI spec B.4.1 _DOS
-VGA/CRT : CRT output
-VGA/LCD : LCD output
-VGA/TVO : TV output
-VGA/*/brightness : Used to get/set brightness of output device
-
-Notify event through /proc/acpi/event:
-
-#define ACPI_VIDEO_NOTIFY_SWITCH        0x80
-#define ACPI_VIDEO_NOTIFY_PROBE         0x81
-#define ACPI_VIDEO_NOTIFY_CYCLE         0x82
-#define ACPI_VIDEO_NOTIFY_NEXT_OUTPUT   0x83
-#define ACPI_VIDEO_NOTIFY_PREV_OUTPUT   0x84
-
-#define ACPI_VIDEO_NOTIFY_CYCLE_BRIGHTNESS      0x82
-#define ACPI_VIDEO_NOTIFY_INC_BRIGHTNESS        0x83
-#define ACPI_VIDEO_NOTIFY_DEC_BRIGHTNESS        0x84
-#define ACPI_VIDEO_NOTIFY_ZERO_BRIGHTNESS       0x85
-#define ACPI_VIDEO_NOTIFY_DISPLAY_OFF           0x86
-