29 files changed, 4620 insertions, 1111 deletions
diff --git a/Documentation/power/00-INDEX b/Documentation/power/00-INDEX
index a55d7f1c836..ad04cc8097e 100644
--- a/Documentation/power/00-INDEX
+++ b/Documentation/power/00-INDEX
@@ -1,7 +1,11 @@
 00-INDEX
 	- This file
+apm-acpi.txt
+	- basic info about the APM and ACPI support.
 basic-pm-debugging.txt
 	- Debugging suspend and resume
+charger-manager.txt
+	- Battery charger management.
 devices.txt
 	- How drivers interact with system-wide power management
 drivers-testing.txt
@@ -12,18 +16,22 @@ interface.txt
 	- Power management user interface in /sys/power
 notifiers.txt
 	- Registering suspend notifiers in device drivers
+opp.txt
+	- Operating Performance Point library
 pci.txt
 	- How the PCI Subsystem Does Power Management
-pm.txt
-	- info on Linux power management support.
 pm_qos_interface.txt
 	- info on Linux PM Quality of Service interface
 power_supply_class.txt
 	- Tells userspace about battery, UPS, AC or DC power supply properties
+runtime_pm.txt
+	- Power management framework for I/O devices.
 s2ram.txt
 	- How to get suspend to ram working (and debug it when it isn't)
 states.txt
 	- System power management states
+suspend-and-cpuhotplug.txt
+	- Explains the interaction between Suspend-to-RAM (S3) and CPU hotplug
 swsusp-and-swap-files.txt
 	- Using swap files with software suspend (to disk)
 swsusp-dmcrypt.txt
@@ -34,7 +42,5 @@ tricks.txt
 	- How to trick software suspend (to disk) into working when it isn't
 userland-swsusp.txt
 	- Experimental implementation of software suspend in userspace
-video_extension.txt
-	- ACPI video extensions
 video.txt
 	- Video issues during resume from suspend
diff --git a/Documentation/power/apm-acpi.txt b/Documentation/power/apm-acpi.txt
new file mode 100644
index 00000000000..6cc423d3662
--- /dev/null
+++ b/Documentation/power/apm-acpi.txt
@@ -0,0 +1,32 @@
+APM or ACPI?
+------------
+If you have a relatively recent x86 mobile, desktop, or server system,
+odds are it supports either Advanced Power Management (APM) or
+Advanced Configuration and Power Interface (ACPI).  ACPI is the newer
+of the two technologies and puts power management in the hands of the
+operating system, allowing for more intelligent power management than
+is possible with BIOS controlled APM.
+
+The best way to determine which, if either, your system supports is to
+build a kernel with both ACPI and APM enabled (as of 2.3.x ACPI is
+enabled by default).  If a working ACPI implementation is found, the
+ACPI driver will override and disable APM, otherwise the APM driver
+will be used.
+
+No, sorry, you cannot have both ACPI and APM enabled and running at
+once.  Some people with broken ACPI or broken APM implementations
+would like to use both to get a full set of working features, but you
+simply cannot mix and match the two.  Only one power management
+interface can be in control of the machine at once.  Think about it..
+
+User-space Daemons
+------------------
+Both APM and ACPI rely on user-space daemons, apmd and acpid
+respectively, to be completely functional.  Obtain both of these
+daemons from your Linux distribution or from the Internet (see below)
+and be sure that they are started sometime in the system boot process.
+Go ahead and start both.  If ACPI or APM is not available on your
+system the associated daemon will exit gracefully.
+
+  apmd:   http://ftp.debian.org/pool/main/a/apmd/
+  acpid:  http://acpid.sf.net/
diff --git a/Documentation/power/basic-pm-debugging.txt b/Documentation/power/basic-pm-debugging.txt
index 1555001bc73..edeecd447d2 100644
--- a/Documentation/power/basic-pm-debugging.txt
+++ b/Documentation/power/basic-pm-debugging.txt
@@ -15,7 +15,7 @@ test at least a couple of times in a row for confidence.  [This is necessary,
 because some problems only show up on a second attempt at suspending and
 resuming the system.]  Moreover, hibernating in the "reboot" and "shutdown"
 modes causes the PM core to skip some platform-related callbacks which on ACPI
-systems might be necessary to make hibernation work.  Thus, if you machine fails
+systems might be necessary to make hibernation work.  Thus, if your machine fails
 to hibernate or resume in the "reboot" mode, you should try the "platform" mode:
 
 # echo platform > /sys/power/disk
@@ -172,15 +172,14 @@ you can boot the kernel with the 'no_console_suspend' parameter and try to log
 kernel messages using the serial console.  This may provide you with some
 information about the reasons of the suspend (resume) failure.  Alternatively,
 it may be possible to use a FireWire port for debugging with firescope
-(ftp://ftp.firstfloor.org/pub/ak/firescope/).  On x86 it is also possible to
-use the PM_TRACE mechanism documented in Documentation/s2ram.txt .
+(http://v3.sk/~lkundrak/firescope/).  On x86 it is also possible to
+use the PM_TRACE mechanism documented in Documentation/power/s2ram.txt .
 
 2. Testing suspend to RAM (STR)
 
 To verify that the STR works, it is generally more convenient to use the s2ram
 tool available from http://suspend.sf.net and documented at
-http://en.opensuse.org/s2ram .  However, before doing that it is recommended to
-carry out STR testing using the facility described in section 1.
+http://en.opensuse.org/SDB:Suspend_to_RAM (S2RAM_LINK).
 
 Namely, after writing "freezer", "devices", "platform", "processors", or "core"
 into /sys/power/pm_test (available if the kernel is compiled with
@@ -195,10 +194,34 @@ Among other things, the testing with the help of /sys/power/pm_test may allow
 you to identify drivers that fail to suspend or resume their devices.  They
 should be unloaded every time before an STR transition.
 
-Next, you can follow the instructions at http://en.opensuse.org/s2ram to test
-the system, but if it does not work "out of the box", you may need to boot it
-with "init=/bin/bash" and test s2ram in the minimal configuration.  In that
-case, you may be able to search for failing drivers by following the procedure
+Next, you can follow the instructions at S2RAM_LINK to test the system, but if
+it does not work "out of the box", you may need to boot it with
+"init=/bin/bash" and test s2ram in the minimal configuration.  In that case,
+you may be able to search for failing drivers by following the procedure
 analogous to the one described in section 1.  If you find some failing drivers,
 you will have to unload them every time before an STR transition (ie. before
 you run s2ram), and please report the problems with them.
+
+There is a debugfs entry which shows the suspend to RAM statistics. Here is an
+example of its output.
+	# mount -t debugfs none /sys/kernel/debug
+	# cat /sys/kernel/debug/suspend_stats
+	success: 20
+	fail: 5
+	failed_freeze: 0
+	failed_prepare: 0
+	failed_suspend: 5
+	failed_suspend_noirq: 0
+	failed_resume: 0
+	failed_resume_noirq: 0
+	failures:
+	  last_failed_dev:	alarm
+				adc
+	  last_failed_errno:	-16
+				-16
+	  last_failed_step:	suspend
+				suspend
+Field success means the success number of suspend to RAM, and field fail means
+the failure number. Others are the failure number of different steps of suspend
+to RAM. suspend_stats just lists the last 2 failed devices, error number and
+failed step of suspend.
diff --git a/Documentation/power/charger-manager.txt b/Documentation/power/charger-manager.txt
new file mode 100644
index 00000000000..b4f7f4b23f6
--- /dev/null
+++ b/Documentation/power/charger-manager.txt
@@ -0,0 +1,200 @@
+Charger Manager
+	(C) 2011 MyungJoo Ham <myungjoo.ham@samsung.com>, GPL
+
+Charger Manager provides in-kernel battery charger management that
+requires temperature monitoring during suspend-to-RAM state
+and where each battery may have multiple chargers attached and the userland
+wants to look at the aggregated information of the multiple chargers.
+
+Charger Manager is a platform_driver with power-supply-class entries.
+An instance of Charger Manager (a platform-device created with Charger-Manager)
+represents an independent battery with chargers. If there are multiple
+batteries with their own chargers acting independently in a system,
+the system may need multiple instances of Charger Manager.
+
+1. Introduction
+===============
+
+Charger Manager supports the following:
+
+* Support for multiple chargers (e.g., a device with USB, AC, and solar panels)
+	A system may have multiple chargers (or power sources) and some of
+	they may be activated at the same time. Each charger may have its
+	own power-supply-class and each power-supply-class can provide
+	different information about the battery status. This framework
+	aggregates charger-related information from multiple sources and
+	shows combined information as a single power-supply-class.
+
+* Support for in suspend-to-RAM polling (with suspend_again callback)
+	While the battery is being charged and the system is in suspend-to-RAM,
+	we may need to monitor the battery health by looking at the ambient or
+	battery temperature. We can accomplish this by waking up the system
+	periodically. However, such a method wakes up devices unncessary for
+	monitoring the battery health and tasks, and user processes that are
+	supposed to be kept suspended. That, in turn, incurs unnecessary power
+	consumption and slow down charging process. Or even, such peak power
+	consumption can stop chargers in the middle of charging
+	(external power input < device power consumption), which not
+	only affects the charging time, but the lifespan of the battery.
+
+	Charger Manager provides a function "cm_suspend_again" that can be
+	used as suspend_again callback of platform_suspend_ops. If the platform
+	requires tasks other than cm_suspend_again, it may implement its own
+	suspend_again callback that calls cm_suspend_again in the middle.
+	Normally, the platform will need to resume and suspend some devices
+	that are used by Charger Manager.
+
+* Support for premature full-battery event handling
+	If the battery voltage drops by "fullbatt_vchkdrop_uV" after
+	"fullbatt_vchkdrop_ms" from the full-battery event, the framework
+	restarts charging. This check is also performed while suspended by
+	setting wakeup time accordingly and using suspend_again.
+
+* Support for uevent-notify
+	With the charger-related events, the device sends
+	notification to users with UEVENT.
+
+2. Global Charger-Manager Data related with suspend_again
+========================================================
+In order to setup Charger Manager with suspend-again feature
+(in-suspend monitoring), the user should provide charger_global_desc
+with setup_charger_manager(struct charger_global_desc *).
+This charger_global_desc data for in-suspend monitoring is global
+as the name suggests. Thus, the user needs to provide only once even
+if there are multiple batteries. If there are multiple batteries, the
+multiple instances of Charger Manager share the same charger_global_desc
+and it will manage in-suspend monitoring for all instances of Charger Manager.
+
+The user needs to provide all the three entries properly in order to activate
+in-suspend monitoring:
+
+struct charger_global_desc {
+
+char *rtc_name;
+	: The name of rtc (e.g., "rtc0") used to wakeup the system from
+	suspend for Charger Manager. The alarm interrupt (AIE) of the rtc
+	should be able to wake up the system from suspend. Charger Manager
+	saves and restores the alarm value and use the previously-defined
+	alarm if it is going to go off earlier than Charger Manager so that
+	Charger Manager does not interfere with previously-defined alarms.
+
+bool (*rtc_only_wakeup)(void);
+	: This callback should let CM know whether
+	the wakeup-from-suspend is caused only by the alarm of "rtc" in the
+	same struct. If there is any other wakeup source triggered the
+	wakeup, it should return false. If the "rtc" is the only wakeup
+	reason, it should return true.
+
+bool assume_timer_stops_in_suspend;
+	: if true, Charger Manager assumes that
+	the timer (CM uses jiffies as timer) stops during suspend. Then, CM
+	assumes that the suspend-duration is same as the alarm length.
+};
+
+3. How to setup suspend_again
+=============================
+Charger Manager provides a function "extern bool cm_suspend_again(void)".
+When cm_suspend_again is called, it monitors every battery. The suspend_ops
+callback of the system's platform_suspend_ops can call cm_suspend_again
+function to know whether Charger Manager wants to suspend again or not.
+If there are no other devices or tasks that want to use suspend_again
+feature, the platform_suspend_ops may directly refer to cm_suspend_again
+for its suspend_again callback.
+
+The cm_suspend_again() returns true (meaning "I want to suspend again")
+if the system was woken up by Charger Manager and the polling
+(in-suspend monitoring) results in "normal".
+
+4. Charger-Manager Data (struct charger_desc)
+=============================================
+For each battery charged independently from other batteries (if a series of
+batteries are charged by a single charger, they are counted as one independent
+battery), an instance of Charger Manager is attached to it.
+
+struct charger_desc {
+
+char *psy_name;
+	: The power-supply-class name of the battery. Default is
+	"battery" if psy_name is NULL. Users can access the psy entries
+	at "/sys/class/power_supply/[psy_name]/".
+
+enum polling_modes polling_mode;
+	: CM_POLL_DISABLE: do not poll this battery.
+	  CM_POLL_ALWAYS: always poll this battery.
+	  CM_POLL_EXTERNAL_POWER_ONLY: poll this battery if and only if
+				       an external power source is attached.
+	  CM_POLL_CHARGING_ONLY: poll this battery if and only if the
+				 battery is being charged.
+
+unsigned int fullbatt_vchkdrop_ms;
+unsigned int fullbatt_vchkdrop_uV;
+	: If both have non-zero values, Charger Manager will check the
+	battery voltage drop fullbatt_vchkdrop_ms after the battery is fully
+	charged. If the voltage drop is over fullbatt_vchkdrop_uV, Charger
+	Manager will try to recharge the battery by disabling and enabling
+	chargers. Recharge with voltage drop condition only (without delay
+	condition) is needed to be implemented with hardware interrupts from
+	fuel gauges or charger devices/chips.
+
+unsigned int fullbatt_uV;
+	: If specified with a non-zero value, Charger Manager assumes
+	that the battery is full (capacity = 100) if the battery is not being
+	charged and the battery voltage is equal to or greater than
+	fullbatt_uV.
+
+unsigned int polling_interval_ms;
+	: Required polling interval in ms. Charger Manager will poll
+	this battery every polling_interval_ms or more frequently.
+
+enum data_source battery_present;
+	: CM_BATTERY_PRESENT: assume that the battery exists.
+	CM_NO_BATTERY: assume that the battery does not exists.
+	CM_FUEL_GAUGE: get battery presence information from fuel gauge.
+	CM_CHARGER_STAT: get battery presence from chargers.
+
+char **psy_charger_stat;
+	: An array ending with NULL that has power-supply-class names of
+	chargers. Each power-supply-class should provide "PRESENT" (if
+	battery_present is "CM_CHARGER_STAT"), "ONLINE" (shows whether an
+	external power source is attached or not), and "STATUS" (shows whether
+	the battery is {"FULL" or not FULL} or {"FULL", "Charging",
+	"Discharging", "NotCharging"}).
+
+int num_charger_regulators;
+struct regulator_bulk_data *charger_regulators;
+	: Regulators representing the chargers in the form for
+	regulator framework's bulk functions.
+
+char *psy_fuel_gauge;
+	: Power-supply-class name of the fuel gauge.
+
+int (*temperature_out_of_range)(int *mC);
+bool measure_battery_temp;
+	: This callback returns 0 if the temperature is safe for charging,
+	a positive number if it is too hot to charge, and a negative number
+	if it is too cold to charge. With the variable mC, the callback returns
+	the temperature in 1/1000 of centigrade.
+	The source of temperature can be battery or ambient one according to
+	the value of measure_battery_temp.
+};
+
+5. Notify Charger-Manager of charger events: cm_notify_event()
+=========================================================
+If there is an charger event is required to notify
+Charger Manager, a charger device driver that triggers the event can call
+cm_notify_event(psy, type, msg) to notify the corresponding Charger Manager.
+In the function, psy is the charger driver's power_supply pointer, which is
+associated with Charger-Manager. The parameter "type"
+is the same as irq's type (enum cm_event_types). The event message "msg" is
+optional and is effective only if the event type is "UNDESCRIBED" or "OTHERS".
+
+6. Other Considerations
+=======================
+
+At the charger/battery-related events such as battery-pulled-out,
+charger-pulled-out, charger-inserted, DCIN-over/under-voltage, charger-stopped,
+and others critical to chargers, the system should be configured to wake up.
+At least the following should wake up the system from a suspend:
+a) charger-on/off b) external-power-in/out c) battery-in/out (while charging)
+
+It is usually accomplished by configuring the PMIC as a wakeup source.
diff --git a/Documentation/power/devices.txt b/Documentation/power/devices.txt
index 461e4f1dbec..d172bce0fd4 100644
--- a/Documentation/power/devices.txt
+++ b/Documentation/power/devices.txt
@@ -1,7 +1,14 @@
+Device Power Management
+
+Copyright (c) 2010-2011 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+Copyright (c) 2010 Alan Stern <stern@rowland.harvard.edu>
+Copyright (c) 2014 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+
 Most of the code in Linux is device drivers, so most of the Linux power
-management code is also driver-specific.  Most drivers will do very little;
-others, especially for platforms with small batteries (like cell phones),
-will do a lot.
+management (PM) code is also driver-specific.  Most drivers will do very
+little; others, especially for platforms with small batteries (like cell
+phones), will do a lot.
 
 This writeup gives an overview of how drivers interact with system-wide
 power management goals, emphasizing the models and interfaces that are
@@ -15,9 +22,10 @@ Drivers will use one or both of these models to put devices into low-power
 states:
 
     System Sleep model:
-	Drivers can enter low power states as part of entering system-wide
-	low-power states like "suspend-to-ram", or (mostly for systems with
-	disks) "hibernate" (suspend-to-disk).
+	Drivers can enter low-power states as part of entering system-wide
+	low-power states like "suspend" (also known as "suspend-to-RAM"), or
+	(mostly for systems with disks) "hibernation" (also known as
+	"suspend-to-disk").
 
 	This is something that device, bus, and class drivers collaborate on
 	by implementing various role-specific suspend and resume methods to
@@ -25,33 +33,41 @@ states:
 	them without loss of data.
 
 	Some drivers can manage hardware wakeup events, which make the system
-	leave that low-power state.  This feature may be disabled using the
-	relevant /sys/devices/.../power/wakeup file; enabling it may cost some
-	power usage, but let the whole system enter low power states more often.
+	leave the low-power state.  This feature may be enabled or disabled
+	using the relevant /sys/devices/.../power/wakeup file (for Ethernet
+	drivers the ioctl interface used by ethtool may also be used for this
+	purpose); enabling it may cost some power usage, but let the whole
+	system enter low-power states more often.
 
     Runtime Power Management model:
-	Drivers may also enter low power states while the system is running,
-	independently of other power management activity.  Upstream drivers
-	will normally not know (or care) if the device is in some low power
-	state when issuing requests; the driver will auto-resume anything
-	that's needed when it gets a request.
-
-	This doesn't have, or need much infrastructure; it's just something you
-	should do when writing your drivers.  For example, clk_disable() unused
-	clocks as part of minimizing power drain for currently-unused hardware.
-	Of course, sometimes clusters of drivers will collaborate with each
-	other, which could involve task-specific power management.
-
-There's not a lot to be said about those low power states except that they
-are very system-specific, and often device-specific.  Also, that if enough
-drivers put themselves into low power states (at "runtime"), the effect may be
-the same as entering some system-wide low-power state (system sleep) ... and
-that synergies exist, so that several drivers using runtime pm might put the
-system into a state where even deeper power saving options are available.
-
-Most suspended devices will have quiesced all I/O:  no more DMA or irqs, no
-more data read or written, and requests from upstream drivers are no longer
-accepted.  A given bus or platform may have different requirements though.
+	Devices may also be put into low-power states while the system is
+	running, independently of other power management activity in principle.
+	However, devices are not generally independent of each other (for
+	example, a parent device cannot be suspended unless all of its child
+	devices have been suspended).  Moreover, depending on the bus type the
+	device is on, it may be necessary to carry out some bus-specific
+	operations on the device for this purpose.  Devices put into low power
+	states at run time may require special handling during system-wide power
+	transitions (suspend or hibernation).
+
+	For these reasons not only the device driver itself, but also the
+	appropriate subsystem (bus type, device type or device class) driver and
+	the PM core are involved in runtime power management.  As in the system
+	sleep power management case, they need to collaborate by implementing
+	various role-specific suspend and resume methods, so that the hardware
+	is cleanly powered down and reactivated without data or service loss.
+
+There's not a lot to be said about those low-power states except that they are
+very system-specific, and often device-specific.  Also, that if enough devices
+have been put into low-power states (at runtime), the effect may be very similar
+to entering some system-wide low-power state (system sleep) ... and that
+synergies exist, so that several drivers using runtime PM might put the system
+into a state where even deeper power saving options are available.
+
+Most suspended devices will have quiesced all I/O: no more DMA or IRQs (except
+for wakeup events), no more data read or written, and requests from upstream
+drivers are no longer accepted.  A given bus or platform may have different
+requirements though.
 
 Examples of hardware wakeup events include an alarm from a real time clock,
 network wake-on-LAN packets, keyboard or mouse activity, and media insertion
@@ -60,132 +76,179 @@ or removal (for PCMCIA, MMC/SD, USB, and so on).
 
 Interfaces for Entering System Sleep States
 ===========================================
-Most of the programming interfaces a device driver needs to know about
-relate to that first model:  entering a system-wide low power state,
-rather than just minimizing power consumption by one device.
-
+There are programming interfaces provided for subsystems (bus type, device type,
+device class) and device drivers to allow them to participate in the power
+management of devices they are concerned with.  These interfaces cover both
+system sleep and runtime power management.
+
+
+Device Power Management Operations
+----------------------------------
+Device power management operations, at the subsystem level as well as at the
+device driver level, are implemented by defining and populating objects of type
+struct dev_pm_ops:
+
+struct dev_pm_ops {
+	int (*prepare)(struct device *dev);
+	void (*complete)(struct device *dev);
+	int (*suspend)(struct device *dev);
+	int (*resume)(struct device *dev);
+	int (*freeze)(struct device *dev);
+	int (*thaw)(struct device *dev);
+	int (*poweroff)(struct device *dev);
+	int (*restore)(struct device *dev);
+	int (*suspend_late)(struct device *dev);
+	int (*resume_early)(struct device *dev);
+	int (*freeze_late)(struct device *dev);
+	int (*thaw_early)(struct device *dev);
+	int (*poweroff_late)(struct device *dev);
+	int (*restore_early)(struct device *dev);
+	int (*suspend_noirq)(struct device *dev);
+	int (*resume_noirq)(struct device *dev);
+	int (*freeze_noirq)(struct device *dev);
+	int (*thaw_noirq)(struct device *dev);
+	int (*poweroff_noirq)(struct device *dev);
+	int (*restore_noirq)(struct device *dev);
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	int (*runtime_idle)(struct device *dev);
+};
 
-Bus Driver Methods
-------------------
-The core methods to suspend and resume devices reside in struct bus_type.
-These are mostly of interest to people writing infrastructure for busses
-like PCI or USB, or because they define the primitives that device drivers
-may need to apply in domain-specific ways to their devices:
+This structure is defined in include/linux/pm.h and the methods included in it
+are also described in that file.  Their roles will be explained in what follows.
+For now, it should be sufficient to remember that the last three methods are
+specific to runtime power management while the remaining ones are used during
+system-wide power transitions.
 
-struct bus_type {
-	...
-	int  (*suspend)(struct device *dev, pm_message_t state);
-	int  (*suspend_late)(struct device *dev, pm_message_t state);
+There also is a deprecated "old" or "legacy" interface for power management
+operations available at least for some subsystems.  This approach does not use
+struct dev_pm_ops objects and it is suitable only for implementing system sleep
+power management methods.  Therefore it is not described in this document, so
+please refer directly to the source code for more information about it.
 
-	int  (*resume_early)(struct device *dev);
-	int  (*resume)(struct device *dev);
-};
 
-Bus drivers implement those methods as appropriate for the hardware and
-the drivers using it; PCI works differently from USB, and so on.  Not many
-people write bus drivers; most driver code is a "device driver" that
-builds on top of bus-specific framework code.
+Subsystem-Level Methods
+-----------------------
+The core methods to suspend and resume devices reside in struct dev_pm_ops
+pointed to by the ops member of struct dev_pm_domain, or by the pm member of
+struct bus_type, struct device_type and struct class.  They are mostly of
+interest to the people writing infrastructure for platforms and buses, like PCI
+or USB, or device type and device class drivers.  They also are relevant to the
+writers of device drivers whose subsystems (PM domains, device types, device
+classes and bus types) don't provide all power management methods.
+
+Bus drivers implement these methods as appropriate for the hardware and the
+drivers using it; PCI works differently from USB, and so on.  Not many people
+write subsystem-level drivers; most driver code is a "device driver" that builds
+on top of bus-specific framework code.
 
 For more information on these driver calls, see the description later;
 they are called in phases for every device, respecting the parent-child
-sequencing in the driver model tree.  Note that as this is being written,
-only the suspend() and resume() are widely available; not many bus drivers
-leverage all of those phases, or pass them down to lower driver levels.
+sequencing in the driver model tree.
 
 
 /sys/devices/.../power/wakeup files
 -----------------------------------
-All devices in the driver model have two flags to control handling of
-wakeup events, which are hardware signals that can force the device and/or
-system out of a low power state.  These are initialized by bus or device
-driver code using device_init_wakeup(dev,can_wakeup).
-
-The "can_wakeup" flag just records whether the device (and its driver) can
-physically support wakeup events.  When that flag is clear, the sysfs
-"wakeup" file is empty, and device_may_wakeup() returns false.
-
-For devices that can issue wakeup events, a separate flag controls whether
-that device should try to use its wakeup mechanism.  The initial value of
-device_may_wakeup() will be true, so that the device's "wakeup" file holds
-the value "enabled".  Userspace can change that to "disabled" so that
-device_may_wakeup() returns false; or change it back to "enabled" (so that
-it returns true again).
-
-
-EXAMPLE:  PCI Device Driver Methods
------------------------------------
-PCI framework software calls these methods when the PCI device driver bound
-to a device device has provided them:
-
-struct pci_driver {
-	...
-	int  (*suspend)(struct pci_device *pdev, pm_message_t state);
-	int  (*suspend_late)(struct pci_device *pdev, pm_message_t state);
-
-	int  (*resume_early)(struct pci_device *pdev);
-	int  (*resume)(struct pci_device *pdev);
-};
-
-Drivers will implement those methods, and call PCI-specific procedures
-like pci_set_power_state(), pci_enable_wake(), pci_save_state(), and
-pci_restore_state() to manage PCI-specific mechanisms.  (PCI config space
-could be saved during driver probe, if it weren't for the fact that some
-systems rely on userspace tweaking using setpci.)  Devices are suspended
-before their bridges enter low power states, and likewise bridges resume
-before their devices.
-
-
-Upper Layers of Driver Stacks
------------------------------
-Device drivers generally have at least two interfaces, and the methods
-sketched above are the ones which apply to the lower level (nearer PCI, USB,
-or other bus hardware).  The network and block layers are examples of upper
-level interfaces, as is a character device talking to userspace.
-
-Power management requests normally need to flow through those upper levels,
-which often use domain-oriented requests like "blank that screen".  In
-some cases those upper levels will have power management intelligence that
-relates to end-user activity, or other devices that work in cooperation.
-
-When those interfaces are structured using class interfaces, there is a
-standard way to have the upper layer stop issuing requests to a given
-class device (and restart later):
-
-struct class {
-	...
-	int  (*suspend)(struct device *dev, pm_message_t state);
-	int  (*resume)(struct device *dev);
-};
-
-Those calls are issued in specific phases of the process by which the
-system enters a low power "suspend" state, or resumes from it.
-
-
-Calling Drivers to Enter System Sleep States
-============================================
-When the system enters a low power state, each device's driver is asked
-to suspend the device by putting it into state compatible with the target
+All device objects in the driver model contain fields that control the handling
+of system wakeup events (hardware signals that can force the system out of a
+sleep state).  These fields are initialized by bus or device driver code using
+device_set_wakeup_capable() and device_set_wakeup_enable(), defined in
+include/linux/pm_wakeup.h.
+
+The "power.can_wakeup" flag just records whether the device (and its driver) can
+physically support wakeup events.  The device_set_wakeup_capable() routine
+affects this flag.  The "power.wakeup" field is a pointer to an object of type
+struct wakeup_source used for controlling whether or not the device should use
+its system wakeup mechanism and for notifying the PM core of system wakeup
+events signaled by the device.  This object is only present for wakeup-capable
+devices (i.e. devices whose "can_wakeup" flags are set) and is created (or
+removed) by device_set_wakeup_capable().
+
+Whether or not a device is capable of issuing wakeup events is a hardware
+matter, and the kernel is responsible for keeping track of it.  By contrast,
+whether or not a wakeup-capable device should issue wakeup events is a policy
+decision, and it is managed by user space through a sysfs attribute: the
+"power/wakeup" file.  User space can write the strings "enabled" or "disabled"
+to it to indicate whether or not, respectively, the device is supposed to signal
+system wakeup.  This file is only present if the "power.wakeup" object exists
+for the given device and is created (or removed) along with that object, by
+device_set_wakeup_capable().  Reads from the file will return the corresponding
+string.
+
+The "power/wakeup" file is supposed to contain the "disabled" string initially
+for the majority of devices; the major exceptions are power buttons, keyboards,
+and Ethernet adapters whose WoL (wake-on-LAN) feature has been set up with
+ethtool.  It should also default to "enabled" for devices that don't generate
+wakeup requests on their own but merely forward wakeup requests from one bus to
+another (like PCI Express ports).
+
+The device_may_wakeup() routine returns true only if the "power.wakeup" object
+exists and the corresponding "power/wakeup" file contains the string "enabled".
+This information is used by subsystems, like the PCI bus type code, to see
+whether or not to enable the devices' wakeup mechanisms.  If device wakeup
+mechanisms are enabled or disabled directly by drivers, they also should use
+device_may_wakeup() to decide what to do during a system sleep transition.
+Device drivers, however, are not supposed to call device_set_wakeup_enable()
+directly in any case.
+
+It ought to be noted that system wakeup is conceptually different from "remote
+wakeup" used by runtime power management, although it may be supported by the
+same physical mechanism.  Remote wakeup is a feature allowing devices in
+low-power states to trigger specific interrupts to signal conditions in which
+they should be put into the full-power state.  Those interrupts may or may not
+be used to signal system wakeup events, depending on the hardware design.  On
+some systems it is impossible to trigger them from system sleep states.  In any
+case, remote wakeup should always be enabled for runtime power management for
+all devices and drivers that support it.
+
+/sys/devices/.../power/control files
+------------------------------------
+Each device in the driver model has a flag to control whether it is subject to
+runtime power management.  This flag, called runtime_auto, is initialized by the
+bus type (or generally subsystem) code using pm_runtime_allow() or
+pm_runtime_forbid(); the default is to allow runtime power management.
+
+The setting can be adjusted by user space by writing either "on" or "auto" to
+the device's power/control sysfs file.  Writing "auto" calls pm_runtime_allow(),
+setting the flag and allowing the device to be runtime power-managed by its
+driver.  Writing "on" calls pm_runtime_forbid(), clearing the flag, returning
+the device to full power if it was in a low-power state, and preventing the
+device from being runtime power-managed.  User space can check the current value
+of the runtime_auto flag by reading the file.
+
+The device's runtime_auto flag has no effect on the handling of system-wide
+power transitions.  In particular, the device can (and in the majority of cases
+should and will) be put into a low-power state during a system-wide transition
+to a sleep state even though its runtime_auto flag is clear.
+
+For more information about the runtime power management framework, refer to
+Documentation/power/runtime_pm.txt.
+
+
+Calling Drivers to Enter and Leave System Sleep States
+======================================================
+When the system goes into a sleep state, each device's driver is asked to
+suspend the device by putting it into a state compatible with the target
 system state.  That's usually some version of "off", but the details are
 system-specific.  Also, wakeup-enabled devices will usually stay partly
 functional in order to wake the system.
 
-When the system leaves that low power state, the device's driver is asked
-to resume it.  The suspend and resume operations always go together, and
-both are multi-phase operations.
+When the system leaves that low-power state, the device's driver is asked to
+resume it by returning it to full power.  The suspend and resume operations
+always go together, and both are multi-phase operations.
 
-For simple drivers, suspend might quiesce the device using the class code
-and then turn its hardware as "off" as possible with late_suspend.  The
+For simple drivers, suspend might quiesce the device using class code
+and then turn its hardware as "off" as possible during suspend_noirq.  The
 matching resume calls would then completely reinitialize the hardware
 before reactivating its class I/O queues.
 
-More power-aware drivers drivers will use more than one device low power
-state, either at runtime or during system sleep states, and might trigger
-system wakeup events.
+More power-aware drivers might prepare the devices for triggering system wakeup
+events.
 
 
 Call Sequence Guarantees
 ------------------------
-To ensure that bridges and similar links needed to talk to a device are
+To ensure that bridges and similar links needing to talk to a device are
 available when the device is suspended or resumed, the device tree is
 walked in a bottom-up order to suspend devices.  A top-down order is
 used to resume those devices.
@@ -196,77 +259,374 @@ its parent; and can't be removed or suspended after that parent.
 
 The policy is that the device tree should match hardware bus topology.
 (Or at least the control bus, for devices which use multiple busses.)
+In particular, this means that a device registration may fail if the parent of
+the device is suspending (i.e. has been chosen by the PM core as the next
+device to suspend) or has already suspended, as well as after all of the other
+devices have been suspended.  Device drivers must be prepared to cope with such
+situations.
 
 
-Suspending Devices
-------------------
-Suspending a given device is done in several phases.  Suspending the
-system always includes every phase, executing calls for every device
-before the next phase begins.  Not all busses or classes support all
-these callbacks; and not all drivers use all the callbacks.
+System Power Management Phases
+------------------------------
+Suspending or resuming the system is done in several phases.  Different phases
+are used for freeze, standby, and memory sleep states ("suspend-to-RAM") and the
+hibernation state ("suspend-to-disk").  Each phase involves executing callbacks
+for every device before the next phase begins.  Not all busses or classes
+support all these callbacks and not all drivers use all the callbacks.  The
+various phases always run after tasks have been frozen and before they are
+unfrozen.  Furthermore, the *_noirq phases run at a time when IRQ handlers have
+been disabled (except for those marked with the IRQF_NO_SUSPEND flag).
 
-The phases are seen by driver notifications issued in this order:
+All phases use PM domain, bus, type, class or driver callbacks (that is, methods
+defined in dev->pm_domain->ops, dev->bus->pm, dev->type->pm, dev->class->pm or
+dev->driver->pm).  These callbacks are regarded by the PM core as mutually
+exclusive.  Moreover, PM domain callbacks always take precedence over all of the
+other callbacks and, for example, type callbacks take precedence over bus, class
+and driver callbacks.  To be precise, the following rules are used to determine
+which callback to execute in the given phase:
 
-   1	class.suspend(dev, message) is called after tasks are frozen, for
-	devices associated with a class that has such a method.  This
-	method may sleep.
+    1.	If dev->pm_domain is present, the PM core will choose the callback
+	included in dev->pm_domain->ops for execution
 
-	Since I/O activity usually comes from such higher layers, this is
-	a good place to quiesce all drivers of a given type (and keep such
-	code out of those drivers).
+    2.	Otherwise, if both dev->type and dev->type->pm are present, the callback
+	included in dev->type->pm will be chosen for execution.
 
-   2	bus.suspend(dev, message) is called next.  This method may sleep,
-	and is often morphed into a device driver call with bus-specific
-	parameters and/or rules.
+    3.	Otherwise, if both dev->class and dev->class->pm are present, the
+	callback included in dev->class->pm will be chosen for execution.
 
-	This call should handle parts of device suspend logic that require
-	sleeping.  It probably does work to quiesce the device which hasn't
-	been abstracted into class.suspend() or bus.suspend_late().
+    4.	Otherwise, if both dev->bus and dev->bus->pm are present, the callback
+	included in dev->bus->pm will be chosen for execution.
 
-   3	bus.suspend_late(dev, message) is called with IRQs disabled, and
-	with only one CPU active.  Until the bus.resume_early() phase
-	completes (see later), IRQs are not enabled again.  This method
-	won't be exposed by all busses; for message based busses like USB,
-	I2C, or SPI, device interactions normally require IRQs.  This bus
-	call may be morphed into a driver call with bus-specific parameters.
+This allows PM domains and device types to override callbacks provided by bus
+types or device classes if necessary.
 
-	This call might save low level hardware state that might otherwise
-	be lost in the upcoming low power state, and actually put the
-	device into a low power state ... so that in some cases the device
-	may stay partly usable until this late.  This "late" call may also
-	help when coping with hardware that behaves badly.
+The PM domain, type, class and bus callbacks may in turn invoke device- or
+driver-specific methods stored in dev->driver->pm, but they don't have to do
+that.
 
-The pm_message_t parameter is currently used to refine those semantics
-(described later).
+If the subsystem callback chosen for execution is not present, the PM core will
+execute the corresponding method from dev->driver->pm instead if there is one.
 
-At the end of those phases, drivers should normally have stopped all I/O
-transactions (DMA, IRQs), saved enough state that they can re-initialize
-or restore previous state (as needed by the hardware), and placed the
-device into a low-power state.  On many platforms they will also use
-clk_disable() to gate off one or more clock sources; sometimes they will
-also switch off power supplies, or reduce voltages.  Drivers which have
-runtime PM support may already have performed some or all of the steps
-needed to prepare for the upcoming system sleep state.
 
-When any driver sees that its device_can_wakeup(dev), it should make sure
-to use the relevant hardware signals to trigger a system wakeup event.
-For example, enable_irq_wake() might identify GPIO signals hooked up to
-a switch or other external hardware, and pci_enable_wake() does something
-similar for PCI's PME# signal.
+Entering System Suspend
+-----------------------
+When the system goes into the freeze, standby or memory sleep state,
+the phases are:
+
+		prepare, suspend, suspend_late, suspend_noirq.
+
+    1.	The prepare phase is meant to prevent races by preventing new devices
+	from being registered; the PM core would never know that all the
+	children of a device had been suspended if new children could be
+	registered at will.  (By contrast, devices may be unregistered at any
+	time.)  Unlike the other suspend-related phases, during the prepare
+	phase the device tree is traversed top-down.
+
+	After the prepare callback method returns, no new children may be
+	registered below the device.  The method may also prepare the device or
+	driver in some way for the upcoming system power transition, but it
+	should not put the device into a low-power state.
+
+	For devices supporting runtime power management, the return value of the
+	prepare callback can be used to indicate to the PM core that it may
+	safely leave the device in runtime suspend (if runtime-suspended
+	already), provided that all of the device's descendants are also left in
+	runtime suspend.  Namely, if the prepare callback returns a positive
+	number and that happens for all of the descendants of the device too,
+	and all of them (including the device itself) are runtime-suspended, the
+	PM core will skip the suspend, suspend_late and	suspend_noirq suspend
+	phases as well as the resume_noirq, resume_early and resume phases of
+	the following system resume for all of these devices.	In that case,
+	the complete callback will be called directly after the prepare callback
+	and is entirely responsible for bringing the device back to the
+	functional state as appropriate.
+
+    2.	The suspend methods should quiesce the device to stop it from performing
+	I/O.  They also may save the device registers and put it into the
+	appropriate low-power state, depending on the bus type the device is on,
+	and they may enable wakeup events.
+
+    3	For a number of devices it is convenient to split suspend into the
+	"quiesce device" and "save device state" phases, in which cases
+	suspend_late is meant to do the latter.  It is always executed after
+	runtime power management has been disabled for all devices.
+
+    4.	The suspend_noirq phase occurs after IRQ handlers have been disabled,
+	which means that the driver's interrupt handler will not be called while
+	the callback method is running.  The methods should save the values of
+	the device's registers that weren't saved previously and finally put the
+	device into the appropriate low-power state.
+
+	The majority of subsystems and device drivers need not implement this
+	callback.  However, bus types allowing devices to share interrupt
+	vectors, like PCI, generally need it; otherwise a driver might encounter
+	an error during the suspend phase by fielding a shared interrupt
+	generated by some other device after its own device had been set to low
+	power.
+
+At the end of these phases, drivers should have stopped all I/O transactions
+(DMA, IRQs), saved enough state that they can re-initialize or restore previous
+state (as needed by the hardware), and placed the device into a low-power state.
+On many platforms they will gate off one or more clock sources; sometimes they
+will also switch off power supplies or reduce voltages.  (Drivers supporting
+runtime PM may already have performed some or all of these steps.)
+
+If device_may_wakeup(dev) returns true, the device should be prepared for
+generating hardware wakeup signals to trigger a system wakeup event when the
+system is in the sleep state.  For example, enable_irq_wake() might identify
+GPIO signals hooked up to a switch or other external hardware, and
+pci_enable_wake() does something similar for the PCI PME signal.
+
+If any of these callbacks returns an error, the system won't enter the desired
+low-power state.  Instead the PM core will unwind its actions by resuming all
+the devices that were suspended.
+
+
+Leaving System Suspend
+----------------------
+When resuming from freeze, standby or memory sleep, the phases are:
+
+		resume_noirq, resume_early, resume, complete.
+
+    1.	The resume_noirq callback methods should perform any actions needed
+	before the driver's interrupt handlers are invoked.  This generally
+	means undoing the actions of the suspend_noirq phase.  If the bus type
+	permits devices to share interrupt vectors, like PCI, the method should
+	bring the device and its driver into a state in which the driver can
+	recognize if the device is the source of incoming interrupts, if any,
+	and handle them correctly.
+
+	For example, the PCI bus type's ->pm.resume_noirq() puts the device into
+	the full-power state (D0 in the PCI terminology) and restores the
+	standard configuration registers of the device.  Then it calls the
+	device driver's ->pm.resume_noirq() method to perform device-specific
+	actions.
+
+    2.	The resume_early methods should prepare devices for the execution of
+	the resume methods.  This generally involves undoing the actions of the
+	preceding suspend_late phase.
+
+    3	The resume methods should bring the device back to its operating
+	state, so that it can perform normal I/O.  This generally involves
+	undoing the actions of the suspend phase.
+
+    4.	The complete phase should undo the actions of the prepare phase.  Note,
+	however, that new children may be registered below the device as soon as
+	the resume callbacks occur; it's not necessary to wait until the
+	complete phase.
+
+	Moreover, if the preceding prepare callback returned a positive number,
+	the device may have been left in runtime suspend throughout the whole
+	system suspend and resume (the suspend, suspend_late, suspend_noirq
+	phases of system suspend and the resume_noirq, resume_early, resume
+	phases of system resume may have been skipped for it).  In that case,
+	the complete callback is entirely responsible for bringing the device
+	back to the functional state after system suspend if necessary.  [For
+	example, it may need to queue up a runtime resume request for the device
+	for this purpose.]  To check if that is the case, the complete callback
+	can consult the device's power.direct_complete flag.  Namely, if that
+	flag is set when the complete callback is being run, it has been called
+	directly after the preceding prepare and special action may be required
+	to make the device work correctly afterward.
+
+At the end of these phases, drivers should be as functional as they were before
+suspending: I/O can be performed using DMA and IRQs, and the relevant clocks are
+gated on.
 
-If a driver (or bus, or class) fails it suspend method, the system won't
-enter the desired low power state; it will resume all the devices it's
-suspended so far.
+However, the details here may again be platform-specific.  For example,
+some systems support multiple "run" states, and the mode in effect at
+the end of resume might not be the one which preceded suspension.
+That means availability of certain clocks or power supplies changed,
+which could easily affect how a driver works.
+
+Drivers need to be able to handle hardware which has been reset since the
+suspend methods were called, for example by complete reinitialization.
+This may be the hardest part, and the one most protected by NDA'd documents
+and chip errata.  It's simplest if the hardware state hasn't changed since
+the suspend was carried out, but that can't be guaranteed (in fact, it usually
+is not the case).
+
+Drivers must also be prepared to notice that the device has been removed
+while the system was powered down, whenever that's physically possible.
+PCMCIA, MMC, USB, Firewire, SCSI, and even IDE are common examples of busses
+where common Linux platforms will see such removal.  Details of how drivers
+will notice and handle such removals are currently bus-specific, and often
+involve a separate thread.
 
-Note that drivers may need to perform different actions based on the target
-system lowpower/sleep state.  At this writing, there are only platform
-specific APIs through which drivers could determine those target states.
+These callbacks may return an error value, but the PM core will ignore such
+errors since there's nothing it can do about them other than printing them in
+the system log.
+
+
+Entering Hibernation
+--------------------
+Hibernating the system is more complicated than putting it into the other
+sleep states, because it involves creating and saving a system image.
+Therefore there are more phases for hibernation, with a different set of
+callbacks.  These phases always run after tasks have been frozen and memory has
+been freed.
+
+The general procedure for hibernation is to quiesce all devices (freeze), create
+an image of the system memory while everything is stable, reactivate all
+devices (thaw), write the image to permanent storage, and finally shut down the
+system (poweroff).  The phases used to accomplish this are:
+
+	prepare, freeze, freeze_late, freeze_noirq, thaw_noirq, thaw_early,
+	thaw, complete, prepare, poweroff, poweroff_late, poweroff_noirq
+
+    1.	The prepare phase is discussed in the "Entering System Suspend" section
+	above.
+
+    2.	The freeze methods should quiesce the device so that it doesn't generate
+	IRQs or DMA, and they may need to save the values of device registers.
+	However the device does not have to be put in a low-power state, and to
+	save time it's best not to do so.  Also, the device should not be
+	prepared to generate wakeup events.
+
+    3.	The freeze_late phase is analogous to the suspend_late phase described
+	above, except that the device should not be put in a low-power state and
+	should not be allowed to generate wakeup events by it.
+
+    4.	The freeze_noirq phase is analogous to the suspend_noirq phase discussed
+	above, except again that the device should not be put in a low-power
+	state and should not be allowed to generate wakeup events.
+
+At this point the system image is created.  All devices should be inactive and
+the contents of memory should remain undisturbed while this happens, so that the
+image forms an atomic snapshot of the system state.
+
+    5.	The thaw_noirq phase is analogous to the resume_noirq phase discussed
+	above.  The main difference is that its methods can assume the device is
+	in the same state as at the end of the freeze_noirq phase.
+
+    6.	The thaw_early phase is analogous to the resume_early phase described
+	above.  Its methods should undo the actions of the preceding
+	freeze_late, if necessary.
+
+    7.	The thaw phase is analogous to the resume phase discussed above.  Its
+	methods should bring the device back to an operating state, so that it
+	can be used for saving the image if necessary.
+
+    8.	The complete phase is discussed in the "Leaving System Suspend" section
+	above.
+
+At this point the system image is saved, and the devices then need to be
+prepared for the upcoming system shutdown.  This is much like suspending them
+before putting the system into the freeze, standby or memory sleep state,
+and the phases are similar.
+
+    9.	The prepare phase is discussed above.
+
+    10.	The poweroff phase is analogous to the suspend phase.
+
+    11.	The poweroff_late phase is analogous to the suspend_late phase.
+
+    12.	The poweroff_noirq phase is analogous to the suspend_noirq phase.
+
+The poweroff, poweroff_late and poweroff_noirq callbacks should do essentially
+the same things as the suspend, suspend_late and suspend_noirq callbacks,
+respectively.  The only notable difference is that they need not store the
+device register values, because the registers should already have been stored
+during the freeze, freeze_late or freeze_noirq phases.
+
+
+Leaving Hibernation
+-------------------
+Resuming from hibernation is, again, more complicated than resuming from a sleep
+state in which the contents of main memory are preserved, because it requires
+a system image to be loaded into memory and the pre-hibernation memory contents
+to be restored before control can be passed back to the image kernel.
+
+Although in principle, the image might be loaded into memory and the
+pre-hibernation memory contents restored by the boot loader, in practice this
+can't be done because boot loaders aren't smart enough and there is no
+established protocol for passing the necessary information.  So instead, the
+boot loader loads a fresh instance of the kernel, called the boot kernel, into
+memory and passes control to it in the usual way.  Then the boot kernel reads
+the system image, restores the pre-hibernation memory contents, and passes
+control to the image kernel.  Thus two different kernels are involved in
+resuming from hibernation.  In fact, the boot kernel may be completely different
+from the image kernel: a different configuration and even a different version.
+This has important consequences for device drivers and their subsystems.
+
+To be able to load the system image into memory, the boot kernel needs to
+include at least a subset of device drivers allowing it to access the storage
+medium containing the image, although it doesn't need to include all of the
+drivers present in the image kernel.  After the image has been loaded, the
+devices managed by the boot kernel need to be prepared for passing control back
+to the image kernel.  This is very similar to the initial steps involved in
+creating a system image, and it is accomplished in the same way, using prepare,
+freeze, and freeze_noirq phases.  However the devices affected by these phases
+are only those having drivers in the boot kernel; other devices will still be in
+whatever state the boot loader left them.
+
+Should the restoration of the pre-hibernation memory contents fail, the boot
+kernel would go through the "thawing" procedure described above, using the
+thaw_noirq, thaw, and complete phases, and then continue running normally.  This
+happens only rarely.  Most often the pre-hibernation memory contents are
+restored successfully and control is passed to the image kernel, which then
+becomes responsible for bringing the system back to the working state.
+
+To achieve this, the image kernel must restore the devices' pre-hibernation
+functionality.  The operation is much like waking up from the memory sleep
+state, although it involves different phases:
+
+	restore_noirq, restore_early, restore, complete
+
+    1.	The restore_noirq phase is analogous to the resume_noirq phase.
+
+    2.	The restore_early phase is analogous to the resume_early phase.
+
+    3.	The restore phase is analogous to the resume phase.
+
+    4.	The complete phase is discussed above.
+
+The main difference from resume[_early|_noirq] is that restore[_early|_noirq]
+must assume the device has been accessed and reconfigured by the boot loader or
+the boot kernel.  Consequently the state of the device may be different from the
+state remembered from the freeze, freeze_late and freeze_noirq phases.  The
+device may even need to be reset and completely re-initialized.  In many cases
+this difference doesn't matter, so the resume[_early|_noirq] and
+restore[_early|_norq] method pointers can be set to the same routines.
+Nevertheless, different callback pointers are used in case there is a situation
+where it actually does matter.
+
+
+Device Power Management Domains
+-------------------------------
+Sometimes devices share reference clocks or other power resources.  In those
+cases it generally is not possible to put devices into low-power states
+individually.  Instead, a set of devices sharing a power resource can be put
+into a low-power state together at the same time by turning off the shared
+power resource.  Of course, they also need to be put into the full-power state
+together, by turning the shared power resource on.  A set of devices with this
+property is often referred to as a power domain.
+
+Support for power domains is provided through the pm_domain field of struct
+device.  This field is a pointer to an object of type struct dev_pm_domain,
+defined in include/linux/pm.h, providing a set of power management callbacks
+analogous to the subsystem-level and device driver callbacks that are executed
+for the given device during all power transitions, instead of the respective
+subsystem-level callbacks.  Specifically, if a device's pm_domain pointer is
+not NULL, the ->suspend() callback from the object pointed to by it will be
+executed instead of its subsystem's (e.g. bus type's) ->suspend() callback and
+analogously for all of the remaining callbacks.  In other words, power
+management domain callbacks, if defined for the given device, always take
+precedence over the callbacks provided by the device's subsystem (e.g. bus
+type).
+
+The support for device power management domains is only relevant to platforms
+needing to use the same device driver power management callbacks in many
+different power domain configurations and wanting to avoid incorporating the
+support for power domains into subsystem-level callbacks, for example by
+modifying the platform bus type.  Other platforms need not implement it or take
+it into account in any way.
 
 
 Device Low Power (suspend) States
 ---------------------------------
-Device low-power states aren't very standard.  One device might only handle
-"on" and "off, while another might support a dozen different versions of
+Device low-power states aren't standard.  One device might only handle
+"on" and "off", while another might support a dozen different versions of
 "on" (how many engines are active?), plus a state that gets back to "on"
 faster than from a full "off".
 
@@ -276,7 +636,7 @@ PCI device may not perform DMA or issue IRQs, and any wakeup events it
 issues would be issued through the PME# bus signal.  Plus, there are
 several PCI-standard device states, some of which are optional.
 
-In contrast, integrated system-on-chip processors often use irqs as the
+In contrast, integrated system-on-chip processors often use IRQs as the
 wakeup event sources (so drivers would call enable_irq_wake) and might
 be able to treat DMA completion as a wakeup event (sometimes DMA can stay
 active too, it'd only be the CPU and some peripherals that sleep).
@@ -295,128 +655,17 @@ ways; the aforementioned LCD might be active in one product's "standby",
 but a different product using the same SOC might work differently.
 
 
-Meaning of pm_message_t.event
------------------------------
-Parameters to suspend calls include the device affected and a message of
-type pm_message_t, which has one field:  the event.  If driver does not
-recognize the event code, suspend calls may abort the request and return
-a negative errno.  However, most drivers will be fine if they implement
-PM_EVENT_SUSPEND semantics for all messages.
-
-The event codes are used to refine the goal of suspending the device, and
-mostly matter when creating or resuming system memory image snapshots, as
-used with suspend-to-disk:
-
-    PM_EVENT_SUSPEND -- quiesce the driver and put hardware into a low-power
-	state.  When used with system sleep states like "suspend-to-RAM" or
-	"standby", the upcoming resume() call will often be able to rely on
-	state kept in hardware, or issue system wakeup events.
-
-    PM_EVENT_HIBERNATE -- Put hardware into a low-power state and enable wakeup
-	events as appropriate.  It is only used with hibernation
-	(suspend-to-disk) and few devices are able to wake up the system from
-	this state; most are completely powered off.
-
-    PM_EVENT_FREEZE -- quiesce the driver, but don't necessarily change into
-	any low power mode.  A system snapshot is about to be taken, often
-	followed by a call to the driver's resume() method.  Neither wakeup
-	events nor DMA are allowed.
+Power Management Notifiers
+--------------------------
+There are some operations that cannot be carried out by the power management
+callbacks discussed above, because the callbacks occur too late or too early.
+To handle these cases, subsystems and device drivers may register power
+management notifiers that are called before tasks are frozen and after they have
+been thawed.  Generally speaking, the PM notifiers are suitable for performing
+actions that either require user space to be available, or at least won't
+interfere with user space.
 
-    PM_EVENT_PRETHAW -- quiesce the driver, knowing that the upcoming resume()
-	will restore a suspend-to-disk snapshot from a different kernel image.
-	Drivers that are smart enough to look at their hardware state during
-	resume() processing need that state to be correct ... a PRETHAW could
-	be used to invalidate that state (by resetting the device), like a
-	shutdown() invocation would before a kexec() or system halt.  Other
-	drivers might handle this the same way as PM_EVENT_FREEZE.  Neither
-	wakeup events nor DMA are allowed.
-
-To enter "standby" (ACPI S1) or "Suspend to RAM" (STR, ACPI S3) states, or
-the similarly named APM states, only PM_EVENT_SUSPEND is used; the other event
-codes are used for hibernation ("Suspend to Disk", STD, ACPI S4).
-
-There's also PM_EVENT_ON, a value which never appears as a suspend event
-but is sometimes used to record the "not suspended" device state.
-
-
-Resuming Devices
-----------------
-Resuming is done in multiple phases, much like suspending, with all
-devices processing each phase's calls before the next phase begins.
-
-The phases are seen by driver notifications issued in this order:
-
-   1	bus.resume_early(dev) is called with IRQs disabled, and with
-   	only one CPU active.  As with bus.suspend_late(), this method
-	won't be supported on busses that require IRQs in order to
-	interact with devices.
-
-	This reverses the effects of bus.suspend_late().
-
-   2	bus.resume(dev) is called next.  This may be morphed into a device
-   	driver call with bus-specific parameters; implementations may sleep.
-
-	This reverses the effects of bus.suspend().
-
-   3	class.resume(dev) is called for devices associated with a class
-	that has such a method.  Implementations may sleep.
-
-	This reverses the effects of class.suspend(), and would usually
-	reactivate the device's I/O queue.
-
-At the end of those phases, drivers should normally be as functional as
-they were before suspending:  I/O can be performed using DMA and IRQs, and
-the relevant clocks are gated on.  The device need not be "fully on"; it
-might be in a runtime lowpower/suspend state that acts as if it were.
-
-However, the details here may again be platform-specific.  For example,
-some systems support multiple "run" states, and the mode in effect at
-the end of resume() might not be the one which preceded suspension.
-That means availability of certain clocks or power supplies changed,
-which could easily affect how a driver works.
-
-
-Drivers need to be able to handle hardware which has been reset since the
-suspend methods were called, for example by complete reinitialization.
-This may be the hardest part, and the one most protected by NDA'd documents
-and chip errata.  It's simplest if the hardware state hasn't changed since
-the suspend() was called, but that can't always be guaranteed.
-
-Drivers must also be prepared to notice that the device has been removed
-while the system was powered off, whenever that's physically possible.
-PCMCIA, MMC, USB, Firewire, SCSI, and even IDE are common examples of busses
-where common Linux platforms will see such removal.  Details of how drivers
-will notice and handle such removals are currently bus-specific, and often
-involve a separate thread.
-
-
-Note that the bus-specific runtime PM wakeup mechanism can exist, and might
-be defined to share some of the same driver code as for system wakeup.  For
-example, a bus-specific device driver's resume() method might be used there,
-so it wouldn't only be called from bus.resume() during system-wide wakeup.
-See bus-specific information about how runtime wakeup events are handled.
-
-
-System Devices
---------------
-System devices follow a slightly different API, which can be found in
-
-	include/linux/sysdev.h
-	drivers/base/sys.c
-
-System devices will only be suspended with interrupts disabled, and after
-all other devices have been suspended.  On resume, they will be resumed
-before any other devices, and also with interrupts disabled.
-
-That is, IRQs are disabled, the suspend_late() phase begins, then the
-sysdev_driver.suspend() phase, and the system enters a sleep state.  Then
-the sysdev_driver.resume() phase begins, followed by the resume_early()
-phase, after which IRQs are enabled.
-
-Code to actually enter and exit the system-wide low power state sometimes
-involves hardware details that are only known to the boot firmware, and
-may leave a CPU running software (from SRAM or flash memory) that monitors
-the system and manages its wakeup sequence.
+For details refer to Documentation/power/notifiers.txt.
 
 
 Runtime Power Management
@@ -426,82 +675,23 @@ running. This feature is useful for devices that are not being used, and
 can offer significant power savings on a running system.  These devices
 often support a range of runtime power states, which might use names such
 as "off", "sleep", "idle", "active", and so on.  Those states will in some
-cases (like PCI) be partially constrained by a bus the device uses, and will
+cases (like PCI) be partially constrained by the bus the device uses, and will
 usually include hardware states that are also used in system sleep states.
 
-However, note that if a driver puts a device into a runtime low power state
-and the system then goes into a system-wide sleep state, it normally ought
-to resume into that runtime low power state rather than "full on".  Such
-distinctions would be part of the driver-internal state machine for that
-hardware; the whole point of runtime power management is to be sure that
-drivers are decoupled in that way from the state machine governing phases
-of the system-wide power/sleep state transitions.
-
-
-Power Saving Techniques
------------------------
-Normally runtime power management is handled by the drivers without specific
-userspace or kernel intervention, by device-aware use of techniques like:
-
-    Using information provided by other system layers
-	- stay deeply "off" except between open() and close()
-	- if transceiver/PHY indicates "nobody connected", stay "off"
-	- application protocols may include power commands or hints
-
-    Using fewer CPU cycles
-	- using DMA instead of PIO
-	- removing timers, or making them lower frequency
-	- shortening "hot" code paths
-	- eliminating cache misses
-	- (sometimes) offloading work to device firmware
-
-    Reducing other resource costs
-	- gating off unused clocks in software (or hardware)
-	- switching off unused power supplies
-	- eliminating (or delaying/merging) IRQs
-	- tuning DMA to use word and/or burst modes
-
-    Using device-specific low power states
-	- using lower voltages
-	- avoiding needless DMA transfers
-
-Read your hardware documentation carefully to see the opportunities that
-may be available.  If you can, measure the actual power usage and check
-it against the budget established for your project.
-
-
-Examples:  USB hosts, system timer, system CPU
-----------------------------------------------
-USB host controllers make interesting, if complex, examples.  In many cases
-these have no work to do:  no USB devices are connected, or all of them are
-in the USB "suspend" state.  Linux host controller drivers can then disable
-periodic DMA transfers that would otherwise be a constant power drain on the
-memory subsystem, and enter a suspend state.  In power-aware controllers,
-entering that suspend state may disable the clock used with USB signaling,
-saving a certain amount of power.
-
-The controller will be woken from that state (with an IRQ) by changes to the
-signal state on the data lines of a given port, for example by an existing
-peripheral requesting "remote wakeup" or by plugging a new peripheral.  The
-same wakeup mechanism usually works from "standby" sleep states, and on some
-systems also from "suspend to RAM" (or even "suspend to disk") states.
-(Except that ACPI may be involved instead of normal IRQs, on some hardware.)
-
-System devices like timers and CPUs may have special roles in the platform
-power management scheme.  For example, system timers using a "dynamic tick"
-approach don't just save CPU cycles (by eliminating needless timer IRQs),
-but they may also open the door to using lower power CPU "idle" states that
-cost more than a jiffie to enter and exit.  On x86 systems these are states
-like "C3"; note that periodic DMA transfers from a USB host controller will
-also prevent entry to a C3 state, much like a periodic timer IRQ.
-
-That kind of runtime mechanism interaction is common.  "System On Chip" (SOC)
-processors often have low power idle modes that can't be entered unless
-certain medium-speed clocks (often 12 or 48 MHz) are gated off.  When the
-drivers gate those clocks effectively, then the system idle task may be able
-to use the lower power idle modes and thereby increase battery life.
-
-If the CPU can have a "cpufreq" driver, there also may be opportunities
-to shift to lower voltage settings and reduce the power cost of executing
-a given number of instructions.  (Without voltage adjustment, it's rare
-for cpufreq to save much power; the cost-per-instruction must go down.)
+A system-wide power transition can be started while some devices are in low
+power states due to runtime power management.  The system sleep PM callbacks
+should recognize such situations and react to them appropriately, but the
+necessary actions are subsystem-specific.
+
+In some cases the decision may be made at the subsystem level while in other
+cases the device driver may be left to decide.  In some cases it may be
+desirable to leave a suspended device in that state during a system-wide power
+transition, but in other cases the device must be put back into the full-power
+state temporarily, for example so that its system wakeup capability can be
+disabled.  This all depends on the hardware and the design of the subsystem and
+device driver in question.
+
+During system-wide resume from a sleep state it's easiest to put devices into
+the full-power state, as explained in Documentation/power/runtime_pm.txt.  Refer
+to that document for more information regarding this particular issue as well as
+for information on the device runtime power management framework in general.
diff --git a/Documentation/power/drivers-testing.txt b/Documentation/power/drivers-testing.txt
index 7f7a737f7f9..638afdf4d6b 100644
--- a/Documentation/power/drivers-testing.txt
+++ b/Documentation/power/drivers-testing.txt
@@ -23,10 +23,10 @@ Once you have resolved the suspend/resume-related problems with your test system
 without the new driver, you are ready to test it:
 
 a) Build the driver as a module, load it and try the test modes of hibernation
-   (see: Documents/power/basic-pm-debugging.txt, 1).
+   (see: Documentation/power/basic-pm-debugging.txt, 1).
 
 b) Load the driver and attempt to hibernate in the "reboot", "shutdown" and
-   "platform" modes (see: Documents/power/basic-pm-debugging.txt, 1).
+   "platform" modes (see: Documentation/power/basic-pm-debugging.txt, 1).
 
 c) Compile the driver directly into the kernel and try the test modes of
    hibernation.
@@ -34,12 +34,12 @@ c) Compile the driver directly into the kernel and try the test modes of
 d) Attempt to hibernate with the driver compiled directly into the kernel
    in the "reboot", "shutdown" and "platform" modes.
 
-e) Try the test modes of suspend (see: Documents/power/basic-pm-debugging.txt,
+e) Try the test modes of suspend (see: Documentation/power/basic-pm-debugging.txt,
    2).  [As far as the STR tests are concerned, it should not matter whether or
    not the driver is built as a module.]
 
 f) Attempt to suspend to RAM using the s2ram tool with the driver loaded
-   (see: Documents/power/basic-pm-debugging.txt, 2).
+   (see: Documentation/power/basic-pm-debugging.txt, 2).
 
 Each of the above tests should be repeated several times and the STD tests
 should be mixed with the STR tests.  If any of them fails, the driver cannot be
diff --git a/Documentation/power/freezing-of-tasks.txt b/Documentation/power/freezing-of-tasks.txt
index 38b57248fd6..85894d83b35 100644
--- a/Documentation/power/freezing-of-tasks.txt
+++ b/Documentation/power/freezing-of-tasks.txt
@@ -9,7 +9,7 @@ architectures).
 
 II. How does it work?
 
-There are four per-task flags used for that, PF_NOFREEZE, PF_FROZEN, TIF_FREEZE
+There are three per-task flags used for that, PF_NOFREEZE, PF_FROZEN
 and PF_FREEZER_SKIP (the last one is auxiliary).  The tasks that have
 PF_NOFREEZE unset (all user space processes and some kernel threads) are
 regarded as 'freezable' and treated in a special way before the system enters a
@@ -17,30 +17,31 @@ suspend state as well as before a hibernation image is created (in what follows
 we only consider hibernation, but the description also applies to suspend).
 
 Namely, as the first step of the hibernation procedure the function
-freeze_processes() (defined in kernel/power/process.c) is called.  It executes
-try_to_freeze_tasks() that sets TIF_FREEZE for all of the freezable tasks and
-either wakes them up, if they are kernel threads, or sends fake signals to them,
-if they are user space processes.  A task that has TIF_FREEZE set, should react
-to it by calling the function called refrigerator() (defined in
-kernel/power/process.c), which sets the task's PF_FROZEN flag, changes its state
-to TASK_UNINTERRUPTIBLE and makes it loop until PF_FROZEN is cleared for it.
-Then, we say that the task is 'frozen' and therefore the set of functions
-handling this mechanism is referred to as 'the freezer' (these functions are
-defined in kernel/power/process.c and include/linux/freezer.h).  User space
-processes are generally frozen before kernel threads.
-
-It is not recommended to call refrigerator() directly.  Instead, it is
-recommended to use the try_to_freeze() function (defined in
-include/linux/freezer.h), that checks the task's TIF_FREEZE flag and makes the
-task enter refrigerator() if the flag is set.
+freeze_processes() (defined in kernel/power/process.c) is called.  A system-wide
+variable system_freezing_cnt (as opposed to a per-task flag) is used to indicate
+whether the system is to undergo a freezing operation. And freeze_processes()
+sets this variable.  After this, it executes try_to_freeze_tasks() that sends a
+fake signal to all user space processes, and wakes up all the kernel threads.
+All freezable tasks must react to that by calling try_to_freeze(), which
+results in a call to __refrigerator() (defined in kernel/freezer.c), which sets
+the task's PF_FROZEN flag, changes its state to TASK_UNINTERRUPTIBLE and makes
+it loop until PF_FROZEN is cleared for it. Then, we say that the task is
+'frozen' and therefore the set of functions handling this mechanism is referred
+to as 'the freezer' (these functions are defined in kernel/power/process.c,
+kernel/freezer.c & include/linux/freezer.h). User space processes are generally
+frozen before kernel threads.
+
+__refrigerator() must not be called directly.  Instead, use the
+try_to_freeze() function (defined in include/linux/freezer.h), that checks
+if the task is to be frozen and makes the task enter __refrigerator().
 
 For user space processes try_to_freeze() is called automatically from the
 signal-handling code, but the freezable kernel threads need to call it
 explicitly in suitable places or use the wait_event_freezable() or
 wait_event_freezable_timeout() macros (defined in include/linux/freezer.h)
-that combine interruptible sleep with checking if TIF_FREEZE is set and calling
-try_to_freeze().  The main loop of a freezable kernel thread may look like the
-following one:
+that combine interruptible sleep with checking if the task is to be frozen and
+calling try_to_freeze().  The main loop of a freezable kernel thread may look
+like the following one:
 
 	set_freezable();
 	do {
@@ -53,7 +54,7 @@ following one:
 (from drivers/usb/core/hub.c::hub_thread()).
 
 If a freezable kernel thread fails to call try_to_freeze() after the freezer has
-set TIF_FREEZE for it, the freezing of tasks will fail and the entire
+initiated a freezing operation, the freezing of tasks will fail and the entire
 hibernation operation will be cancelled.  For this reason, freezable kernel
 threads must call try_to_freeze() somewhere or use one of the
 wait_event_freezable() and wait_event_freezable_timeout() macros.
@@ -61,13 +62,34 @@ wait_event_freezable() and wait_event_freezable_timeout() macros.
 After the system memory state has been restored from a hibernation image and
 devices have been reinitialized, the function thaw_processes() is called in
 order to clear the PF_FROZEN flag for each frozen task.  Then, the tasks that
-have been frozen leave refrigerator() and continue running.
+have been frozen leave __refrigerator() and continue running.
+
+
+Rationale behind the functions dealing with freezing and thawing of tasks:
+-------------------------------------------------------------------------
+
+freeze_processes():
+  - freezes only userspace tasks
+
+freeze_kernel_threads():
+  - freezes all tasks (including kernel threads) because we can't freeze
+    kernel threads without freezing userspace tasks
+
+thaw_kernel_threads():
+  - thaws only kernel threads; this is particularly useful if we need to do
+    anything special in between thawing of kernel threads and thawing of
+    userspace tasks, or if we want to postpone the thawing of userspace tasks
+
+thaw_processes():
+  - thaws all tasks (including kernel threads) because we can't thaw userspace
+    tasks without thawing kernel threads
+
 
 III. Which kernel threads are freezable?
 
 Kernel threads are not freezable by default.  However, a kernel thread may clear
 PF_NOFREEZE for itself by calling set_freezable() (the resetting of PF_NOFREEZE
-directly is strongly discouraged).  From this point it is regarded as freezable
+directly is not allowed).  From this point it is regarded as freezable
 and must call try_to_freeze() in a suitable place.
 
 IV. Why do we do that?
@@ -95,7 +117,7 @@ after the memory for the image has been freed, we don't want tasks to allocate
 additional memory and we prevent them from doing that by freezing them earlier.
 [Of course, this also means that device drivers should not allocate substantial
 amounts of memory from their .suspend() callbacks before hibernation, but this
-is e separate issue.]
+is a separate issue.]
 
 3. The third reason is to prevent user space processes and some kernel threads
 from interfering with the suspending and resuming of devices.  A user space
@@ -120,10 +142,10 @@ So in practice, the 'at all' may become a 'why freeze kernel threads?' and
 freezing user threads I don't find really objectionable."
 
 Still, there are kernel threads that may want to be freezable.  For example, if
-a kernel that belongs to a device driver accesses the device directly, it in
-principle needs to know when the device is suspended, so that it doesn't try to
-access it at that time.  However, if the kernel thread is freezable, it will be
-frozen before the driver's .suspend() callback is executed and it will be
+a kernel thread that belongs to a device driver accesses the device directly, it
+in principle needs to know when the device is suspended, so that it doesn't try
+to access it at that time.  However, if the kernel thread is freezable, it will
+be frozen before the driver's .suspend() callback is executed and it will be
 thawed after the driver's .resume() callback has run, so it won't be accessing
 the device while it's suspended.
 
@@ -176,3 +198,33 @@ tasks, since it generally exists anyway.
 A driver must have all firmwares it may need in RAM before suspend() is called.
 If keeping them is not practical, for example due to their size, they must be
 requested early enough using the suspend notifier API described in notifiers.txt.
+
+VI. Are there any precautions to be taken to prevent freezing failures?
+
+Yes, there are.
+
+First of all, grabbing the 'pm_mutex' lock to mutually exclude a piece of code
+from system-wide sleep such as suspend/hibernation is not encouraged.
+If possible, that piece of code must instead hook onto the suspend/hibernation
+notifiers to achieve mutual exclusion. Look at the CPU-Hotplug code
+(kernel/cpu.c) for an example.
+
+However, if that is not feasible, and grabbing 'pm_mutex' is deemed necessary,
+it is strongly discouraged to directly call mutex_[un]lock(&pm_mutex) since
+that could lead to freezing failures, because if the suspend/hibernate code
+successfully acquired the 'pm_mutex' lock, and hence that other entity failed
+to acquire the lock, then that task would get blocked in TASK_UNINTERRUPTIBLE
+state. As a consequence, the freezer would not be able to freeze that task,
+leading to freezing failure.
+
+However, the [un]lock_system_sleep() APIs are safe to use in this scenario,
+since they ask the freezer to skip freezing this task, since it is anyway
+"frozen enough" as it is blocked on 'pm_mutex', which will be released
+only after the entire suspend/hibernation sequence is complete.
+So, to summarize, use [un]lock_system_sleep() instead of directly using
+mutex_[un]lock(&pm_mutex). That would prevent freezing failures.
+
+V. Miscellaneous
+/sys/power/pm_freeze_timeout controls how long it will cost at most to freeze
+all user space processes or all freezable kernel threads, in unit of millisecond.
+The default value is 20000, with range of unsigned integer.
diff --git a/Documentation/power/interface.txt b/Documentation/power/interface.txt
index e67211fe0ee..f1f0f59a7c4 100644
--- a/Documentation/power/interface.txt
+++ b/Documentation/power/interface.txt
@@ -7,8 +7,8 @@ running. The interface exists in /sys/power/ directory (assuming sysfs
 is mounted at /sys). 
 
 /sys/power/state controls system power state. Reading from this file
-returns what states are supported, which is hard-coded to 'standby'
-(Power-On Suspend), 'mem' (Suspend-to-RAM), and 'disk'
+returns what states are supported, which is hard-coded to 'freeze',
+'standby' (Power-On Suspend), 'mem' (Suspend-to-RAM), and 'disk'
 (Suspend-to-Disk). 
 
 Writing to this file one of those strings causes the system to
@@ -57,7 +57,7 @@ smallest image possible.  In particular, if "0" is written to this file, the
 suspend image will be as small as possible.
 
 Reading from this file will display the current image size limit, which
-is set to 500 MB by default.
+is set to 2/5 of available RAM by default.
 
 /sys/power/pm_trace controls the code which saves the last PM event point in
 the RTC across reboots, so that you can debug a machine that just hangs
diff --git a/Documentation/power/notifiers.txt b/Documentation/power/notifiers.txt
index ae1b7ec0768..a81fa254303 100644
--- a/Documentation/power/notifiers.txt
+++ b/Documentation/power/notifiers.txt
@@ -1,46 +1,43 @@
 Suspend notifiers
-	(C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL
-
-There are some operations that device drivers may want to carry out in their
-.suspend() routines, but shouldn't, because they can cause the hibernation or
-suspend to fail. For example, a driver may want to allocate a substantial amount
-of memory (like 50 MB) in .suspend(), but that shouldn't be done after the
-swsusp's memory shrinker has run.
-
-Also, there may be some operations, that subsystems want to carry out before a
-hibernation/suspend or after a restore/resume, requiring the system to be fully
-functional, so the drivers' .suspend() and .resume() routines are not suitable
-for this purpose.  For example, device drivers may want to upload firmware to
-their devices after a restore from a hibernation image, but they cannot do it by
-calling request_firmware() from their .resume() routines (user land processes
-are frozen at this point).  The solution may be to load the firmware into
-memory before processes are frozen and upload it from there in the .resume()
-routine.  Of course, a hibernation notifier may be used for this purpose.
-
-The subsystems that have such needs can register suspend notifiers that will be
-called upon the following events by the suspend core:
-
-PM_HIBERNATION_PREPARE	The system is going to hibernate or suspend, tasks will
-			be frozen immediately.
+	(C) 2007-2011 Rafael J. Wysocki <rjw@sisk.pl>, GPL
+
+There are some operations that subsystems or drivers may want to carry out
+before hibernation/suspend or after restore/resume, but they require the system
+to be fully functional, so the drivers' and subsystems' .suspend() and .resume()
+or even .prepare() and .complete() callbacks are not suitable for this purpose.
+For example, device drivers may want to upload firmware to their devices after
+resume/restore, but they cannot do it by calling request_firmware() from their
+.resume() or .complete() routines (user land processes are frozen at these
+points).  The solution may be to load the firmware into memory before processes
+are frozen and upload it from there in the .resume() routine.
+A suspend/hibernation notifier may be used for this purpose.
+
+The subsystems or drivers having such needs can register suspend notifiers that
+will be called upon the following events by the PM core:
+
+PM_HIBERNATION_PREPARE	The system is going to hibernate, tasks will be frozen
+			immediately. This is different from PM_SUSPEND_PREPARE
+			below because here we do additional work between notifiers
+			and drivers freezing.
 
 PM_POST_HIBERNATION	The system memory state has been restored from a
-			hibernation image or an error occured during the
-			hibernation.  Device drivers' .resume() callbacks have
+			hibernation image or an error occurred during
+			hibernation.  Device drivers' restore callbacks have
 			been executed and tasks have been thawed.
 
 PM_RESTORE_PREPARE	The system is going to restore a hibernation image.
-			If all goes well the restored kernel will issue a
+			If all goes well, the restored kernel will issue a
 			PM_POST_HIBERNATION notification.
 
-PM_POST_RESTORE		An error occurred during the hibernation restore.
-			Device drivers' .resume() callbacks have been executed
+PM_POST_RESTORE		An error occurred during restore from hibernation.
+			Device drivers' restore callbacks have been executed
 			and tasks have been thawed.
 
-PM_SUSPEND_PREPARE	The system is preparing for a suspend.
+PM_SUSPEND_PREPARE	The system is preparing for suspend.
 
-PM_POST_SUSPEND		The system has just resumed or an error occured during
-			the suspend.	Device drivers' .resume() callbacks have
-			been executed and tasks have been thawed.
+PM_POST_SUSPEND		The system has just resumed or an error occurred during
+			suspend.  Device drivers' resume callbacks have been
+			executed and tasks have been thawed.
 
 It is generally assumed that whatever the notifiers do for
 PM_HIBERNATION_PREPARE, should be undone for PM_POST_HIBERNATION.  Analogously,
diff --git a/Documentation/power/opp.txt b/Documentation/power/opp.txt
new file mode 100644
index 00000000000..a9adad828cd
--- /dev/null
+++ b/Documentation/power/opp.txt
@@ -0,0 +1,365 @@
+Operating Performance Points (OPP) Library
+==========================================
+
+(C) 2009-2010 Nishanth Menon <nm@ti.com>, Texas Instruments Incorporated
+
+Contents
+--------
+1. Introduction
+2. Initial OPP List Registration
+3. OPP Search Functions
+4. OPP Availability Control Functions
+5. OPP Data Retrieval Functions
+6. Data Structures
+
+1. Introduction
+===============
+1.1 What is an Operating Performance Point (OPP)?
+
+Complex SoCs of today consists of a multiple sub-modules working in conjunction.
+In an operational system executing varied use cases, not all modules in the SoC
+need to function at their highest performing frequency all the time. To
+facilitate this, sub-modules in a SoC are grouped into domains, allowing some
+domains to run at lower voltage and frequency while other domains run at
+voltage/frequency pairs that are higher.
+
+The set of discrete tuples consisting of frequency and voltage pairs that
+the device will support per domain are called Operating Performance Points or
+OPPs.
+
+As an example:
+Let us consider an MPU device which supports the following:
+{300MHz at minimum voltage of 1V}, {800MHz at minimum voltage of 1.2V},
+{1GHz at minimum voltage of 1.3V}
+
+We can represent these as three OPPs as the following {Hz, uV} tuples:
+{300000000, 1000000}
+{800000000, 1200000}
+{1000000000, 1300000}
+
+1.2 Operating Performance Points Library
+
+OPP library provides a set of helper functions to organize and query the OPP
+information. The library is located in drivers/base/power/opp.c and the header
+is located in include/linux/pm_opp.h. OPP library can be enabled by enabling
+CONFIG_PM_OPP from power management menuconfig menu. OPP library depends on
+CONFIG_PM as certain SoCs such as Texas Instrument's OMAP framework allows to
+optionally boot at a certain OPP without needing cpufreq.
+
+Typical usage of the OPP library is as follows:
+(users)		-> registers a set of default OPPs		-> (library)
+SoC framework	-> modifies on required cases certain OPPs	-> OPP layer
+		-> queries to search/retrieve information	->
+
+Architectures that provide a SoC framework for OPP should select ARCH_HAS_OPP
+to make the OPP layer available.
+
+OPP layer expects each domain to be represented by a unique device pointer. SoC
+framework registers a set of initial OPPs per device with the OPP layer. This
+list is expected to be an optimally small number typically around 5 per device.
+This initial list contains a set of OPPs that the framework expects to be safely
+enabled by default in the system.
+
+Note on OPP Availability:
+------------------------
+As the system proceeds to operate, SoC framework may choose to make certain
+OPPs available or not available on each device based on various external
+factors. Example usage: Thermal management or other exceptional situations where
+SoC framework might choose to disable a higher frequency OPP to safely continue
+operations until that OPP could be re-enabled if possible.
+
+OPP library facilitates this concept in it's implementation. The following
+operational functions operate only on available opps:
+opp_find_freq_{ceil, floor}, dev_pm_opp_get_voltage, dev_pm_opp_get_freq, dev_pm_opp_get_opp_count
+
+dev_pm_opp_find_freq_exact is meant to be used to find the opp pointer which can then
+be used for dev_pm_opp_enable/disable functions to make an opp available as required.
+
+WARNING: Users of OPP library should refresh their availability count using
+get_opp_count if dev_pm_opp_enable/disable functions are invoked for a device, the
+exact mechanism to trigger these or the notification mechanism to other
+dependent subsystems such as cpufreq are left to the discretion of the SoC
+specific framework which uses the OPP library. Similar care needs to be taken
+care to refresh the cpufreq table in cases of these operations.
+
+WARNING on OPP List locking mechanism:
+-------------------------------------------------
+OPP library uses RCU for exclusivity. RCU allows the query functions to operate
+in multiple contexts and this synchronization mechanism is optimal for a read
+intensive operations on data structure as the OPP library caters to.
+
+To ensure that the data retrieved are sane, the users such as SoC framework
+should ensure that the section of code operating on OPP queries are locked
+using RCU read locks. The opp_find_freq_{exact,ceil,floor},
+opp_get_{voltage, freq, opp_count} fall into this category.
+
+opp_{add,enable,disable} are updaters which use mutex and implement it's own
+RCU locking mechanisms. These functions should *NOT* be called under RCU locks
+and other contexts that prevent blocking functions in RCU or mutex operations
+from working.
+
+2. Initial OPP List Registration
+================================
+The SoC implementation calls dev_pm_opp_add function iteratively to add OPPs per
+device. It is expected that the SoC framework will register the OPP entries
+optimally- typical numbers range to be less than 5. The list generated by
+registering the OPPs is maintained by OPP library throughout the device
+operation. The SoC framework can subsequently control the availability of the
+OPPs dynamically using the dev_pm_opp_enable / disable functions.
+
+dev_pm_opp_add - Add a new OPP for a specific domain represented by the device pointer.
+	The OPP is defined using the frequency and voltage. Once added, the OPP
+	is assumed to be available and control of it's availability can be done
+	with the dev_pm_opp_enable/disable functions. OPP library internally stores
+	and manages this information in the opp struct. This function may be
+	used by SoC framework to define a optimal list as per the demands of
+	SoC usage environment.
+
+	WARNING: Do not use this function in interrupt context.
+
+	Example:
+	 soc_pm_init()
+	 {
+		/* Do things */
+		r = dev_pm_opp_add(mpu_dev, 1000000, 900000);
+		if (!r) {
+			pr_err("%s: unable to register mpu opp(%d)\n", r);
+			goto no_cpufreq;
+		}
+		/* Do cpufreq things */
+	 no_cpufreq:
+		/* Do remaining things */
+	 }
+
+3. OPP Search Functions
+=======================
+High level framework such as cpufreq operates on frequencies. To map the
+frequency back to the corresponding OPP, OPP library provides handy functions
+to search the OPP list that OPP library internally manages. These search
+functions return the matching pointer representing the opp if a match is
+found, else returns error. These errors are expected to be handled by standard
+error checks such as IS_ERR() and appropriate actions taken by the caller.
+
+dev_pm_opp_find_freq_exact - Search for an OPP based on an *exact* frequency and
+	availability. This function is especially useful to enable an OPP which
+	is not available by default.
+	Example: In a case when SoC framework detects a situation where a
+	higher frequency could be made available, it can use this function to
+	find the OPP prior to call the dev_pm_opp_enable to actually make it available.
+	 rcu_read_lock();
+	 opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false);
+	 rcu_read_unlock();
+	 /* dont operate on the pointer.. just do a sanity check.. */
+	 if (IS_ERR(opp)) {
+		pr_err("frequency not disabled!\n");
+		/* trigger appropriate actions.. */
+	 } else {
+		dev_pm_opp_enable(dev,1000000000);
+	 }
+
+	NOTE: This is the only search function that operates on OPPs which are
+	not available.
+
+dev_pm_opp_find_freq_floor - Search for an available OPP which is *at most* the
+	provided frequency. This function is useful while searching for a lesser
+	match OR operating on OPP information in the order of decreasing
+	frequency.
+	Example: To find the highest opp for a device:
+	 freq = ULONG_MAX;
+	 rcu_read_lock();
+	 dev_pm_opp_find_freq_floor(dev, &freq);
+	 rcu_read_unlock();
+
+dev_pm_opp_find_freq_ceil - Search for an available OPP which is *at least* the
+	provided frequency. This function is useful while searching for a
+	higher match OR operating on OPP information in the order of increasing
+	frequency.
+	Example 1: To find the lowest opp for a device:
+	 freq = 0;
+	 rcu_read_lock();
+	 dev_pm_opp_find_freq_ceil(dev, &freq);
+	 rcu_read_unlock();
+	Example 2: A simplified implementation of a SoC cpufreq_driver->target:
+	 soc_cpufreq_target(..)
+	 {
+		/* Do stuff like policy checks etc. */
+		/* Find the best frequency match for the req */
+		rcu_read_lock();
+		opp = dev_pm_opp_find_freq_ceil(dev, &freq);
+		rcu_read_unlock();
+		if (!IS_ERR(opp))
+			soc_switch_to_freq_voltage(freq);
+		else
+			/* do something when we can't satisfy the req */
+		/* do other stuff */
+	 }
+
+4. OPP Availability Control Functions
+=====================================
+A default OPP list registered with the OPP library may not cater to all possible
+situation. The OPP library provides a set of functions to modify the
+availability of a OPP within the OPP list. This allows SoC frameworks to have
+fine grained dynamic control of which sets of OPPs are operationally available.
+These functions are intended to *temporarily* remove an OPP in conditions such
+as thermal considerations (e.g. don't use OPPx until the temperature drops).
+
+WARNING: Do not use these functions in interrupt context.
+
+dev_pm_opp_enable - Make a OPP available for operation.
+	Example: Lets say that 1GHz OPP is to be made available only if the
+	SoC temperature is lower than a certain threshold. The SoC framework
+	implementation might choose to do something as follows:
+	 if (cur_temp < temp_low_thresh) {
+		/* Enable 1GHz if it was disabled */
+		rcu_read_lock();
+		opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false);
+		rcu_read_unlock();
+		/* just error check */
+		if (!IS_ERR(opp))
+			ret = dev_pm_opp_enable(dev, 1000000000);
+		else
+			goto try_something_else;
+	 }
+
+dev_pm_opp_disable - Make an OPP to be not available for operation
+	Example: Lets say that 1GHz OPP is to be disabled if the temperature
+	exceeds a threshold value. The SoC framework implementation might
+	choose to do something as follows:
+	 if (cur_temp > temp_high_thresh) {
+		/* Disable 1GHz if it was enabled */
+		rcu_read_lock();
+		opp = dev_pm_opp_find_freq_exact(dev, 1000000000, true);
+		rcu_read_unlock();
+		/* just error check */
+		if (!IS_ERR(opp))
+			ret = dev_pm_opp_disable(dev, 1000000000);
+		else
+			goto try_something_else;
+	 }
+
+5. OPP Data Retrieval Functions
+===============================
+Since OPP library abstracts away the OPP information, a set of functions to pull
+information from the OPP structure is necessary. Once an OPP pointer is
+retrieved using the search functions, the following functions can be used by SoC
+framework to retrieve the information represented inside the OPP layer.
+
+dev_pm_opp_get_voltage - Retrieve the voltage represented by the opp pointer.
+	Example: At a cpufreq transition to a different frequency, SoC
+	framework requires to set the voltage represented by the OPP using
+	the regulator framework to the Power Management chip providing the
+	voltage.
+	 soc_switch_to_freq_voltage(freq)
+	 {
+		/* do things */
+		rcu_read_lock();
+		opp = dev_pm_opp_find_freq_ceil(dev, &freq);
+		v = dev_pm_opp_get_voltage(opp);
+		rcu_read_unlock();
+		if (v)
+			regulator_set_voltage(.., v);
+		/* do other things */
+	 }
+
+dev_pm_opp_get_freq - Retrieve the freq represented by the opp pointer.
+	Example: Lets say the SoC framework uses a couple of helper functions
+	we could pass opp pointers instead of doing additional parameters to
+	handle quiet a bit of data parameters.
+	 soc_cpufreq_target(..)
+	 {
+		/* do things.. */
+		 max_freq = ULONG_MAX;
+		 rcu_read_lock();
+		 max_opp = dev_pm_opp_find_freq_floor(dev,&max_freq);
+		 requested_opp = dev_pm_opp_find_freq_ceil(dev,&freq);
+		 if (!IS_ERR(max_opp) && !IS_ERR(requested_opp))
+			r = soc_test_validity(max_opp, requested_opp);
+		 rcu_read_unlock();
+		/* do other things */
+	 }
+	 soc_test_validity(..)
+	 {
+		 if(dev_pm_opp_get_voltage(max_opp) < dev_pm_opp_get_voltage(requested_opp))
+			 return -EINVAL;
+		 if(dev_pm_opp_get_freq(max_opp) < dev_pm_opp_get_freq(requested_opp))
+			 return -EINVAL;
+		/* do things.. */
+	 }
+
+dev_pm_opp_get_opp_count - Retrieve the number of available opps for a device
+	Example: Lets say a co-processor in the SoC needs to know the available
+	frequencies in a table, the main processor can notify as following:
+	 soc_notify_coproc_available_frequencies()
+	 {
+		/* Do things */
+		rcu_read_lock();
+		num_available = dev_pm_opp_get_opp_count(dev);
+		speeds = kzalloc(sizeof(u32) * num_available, GFP_KERNEL);
+		/* populate the table in increasing order */
+		freq = 0;
+		while (!IS_ERR(opp = dev_pm_opp_find_freq_ceil(dev, &freq))) {
+			speeds[i] = freq;
+			freq++;
+			i++;
+		}
+		rcu_read_unlock();
+
+		soc_notify_coproc(AVAILABLE_FREQs, speeds, num_available);
+		/* Do other things */
+	 }
+
+6. Data Structures
+==================
+Typically an SoC contains multiple voltage domains which are variable. Each
+domain is represented by a device pointer. The relationship to OPP can be
+represented as follows:
+SoC
+ |- device 1
+ |	|- opp 1 (availability, freq, voltage)
+ |	|- opp 2 ..
+ ...	...
+ |	`- opp n ..
+ |- device 2
+ ...
+ `- device m
+
+OPP library maintains a internal list that the SoC framework populates and
+accessed by various functions as described above. However, the structures
+representing the actual OPPs and domains are internal to the OPP library itself
+to allow for suitable abstraction reusable across systems.
+
+struct dev_pm_opp - The internal data structure of OPP library which is used to
+	represent an OPP. In addition to the freq, voltage, availability
+	information, it also contains internal book keeping information required
+	for the OPP library to operate on.  Pointer to this structure is
+	provided back to the users such as SoC framework to be used as a
+	identifier for OPP in the interactions with OPP layer.
+
+	WARNING: The struct dev_pm_opp pointer should not be parsed or modified by the
+	users. The defaults of for an instance is populated by dev_pm_opp_add, but the
+	availability of the OPP can be modified by dev_pm_opp_enable/disable functions.
+
+struct device - This is used to identify a domain to the OPP layer. The
+	nature of the device and it's implementation is left to the user of
+	OPP library such as the SoC framework.
+
+Overall, in a simplistic view, the data structure operations is represented as
+following:
+
+Initialization / modification:
+            +-----+        /- dev_pm_opp_enable
+dev_pm_opp_add --> | opp | <-------
+  |         +-----+        \- dev_pm_opp_disable
+  \-------> domain_info(device)
+
+Search functions:
+             /-- dev_pm_opp_find_freq_ceil  ---\   +-----+
+domain_info<---- dev_pm_opp_find_freq_exact -----> | opp |
+             \-- dev_pm_opp_find_freq_floor ---/   +-----+
+
+Retrieval functions:
++-----+     /- dev_pm_opp_get_voltage
+| opp | <---
++-----+     \- dev_pm_opp_get_freq
+
+domain_info <- dev_pm_opp_get_opp_count
diff --git a/Documentation/power/pci.txt b/Documentation/power/pci.txt
index dd8fe43888d..62328d76b55 100644
--- a/Documentation/power/pci.txt
+++ b/Documentation/power/pci.txt
@@ -1,299 +1,1025 @@
-
 PCI Power Management
-~~~~~~~~~~~~~~~~~~~~
 
-An overview of the concepts and the related functions in the Linux kernel
+Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+
+An overview of concepts and the Linux kernel's interfaces related to PCI power
+management.  Based on previous work by Patrick Mochel <mochel@transmeta.com>
+(and others).
 
-Patrick Mochel <mochel@transmeta.com>
-(and others)
+This document only covers the aspects of power management specific to PCI
+devices.  For general description of the kernel's interfaces related to device
+power management refer to Documentation/power/devices.txt and
+Documentation/power/runtime_pm.txt.
 
 ---------------------------------------------------------------------------
 
-1. Overview
-2. How the PCI Subsystem Does Power Management
-3. PCI Utility Functions
-4. PCI Device Drivers
-5. Resources
-
-1. Overview
-~~~~~~~~~~~
-
-The PCI Power Management Specification was introduced between the PCI 2.1 and
-PCI 2.2 Specifications. It a standard interface for controlling various 
-power management operations.
-
-Implementation of the PCI PM Spec is optional, as are several sub-components of
-it. If a device supports the PCI PM Spec, the device will have an 8 byte
-capability field in its PCI configuration space. This field is used to describe
-and control the standard PCI power management features.
-
-The PCI PM spec defines 4 operating states for devices (D0 - D3) and for buses
-(B0 - B3). The higher the number, the less power the device consumes. However,
-the higher the number, the longer the latency is for the device to return to 
-an operational state (D0).
-
-There are actually two D3 states.  When someone talks about D3, they usually
-mean D3hot, which corresponds to an ACPI D2 state (power is reduced, the
-device may lose some context).  But they may also mean D3cold, which is an
-ACPI D3 state (power is fully off, all state was discarded); or both.
-
-Bus power management is not covered in this version of this document.
-
-Note that all PCI devices support D0 and D3cold by default, regardless of
-whether or not they implement any of the PCI PM spec.
-
-The possible state transitions that a device can undergo are:
-
-+---------------------------+
-| Current State | New State |
-+---------------------------+
-| D0            | D1, D2, D3|
-+---------------------------+
-| D1            | D2, D3    |
-+---------------------------+
-| D2            | D3        |
-+---------------------------+
-| D1, D2, D3    | D0        |
-+---------------------------+
-
-Note that when the system is entering a global suspend state, all devices will
-be placed into D3 and when resuming, all devices will be placed into D0.
-However, when the system is running, other state transitions are possible.
-
-2. How The PCI Subsystem Handles Power Management
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The PCI suspend/resume functionality is accessed indirectly via the Power
-Management subsystem. At boot, the PCI driver registers a power management
-callback with that layer. Upon entering a suspend state, the PM layer iterates
-through all of its registered callbacks. This currently takes place only during
-APM state transitions.
-
-Upon going to sleep, the PCI subsystem walks its device tree twice. Both times,
-it does a depth first walk of the device tree. The first walk saves each of the
-device's state and checks for devices that will prevent the system from entering
-a global power state. The next walk then places the devices in a low power
+1. Hardware and Platform Support for PCI Power Management
+2. PCI Subsystem and Device Power Management
+3. PCI Device Drivers and Power Management
+4. Resources
+
+
+1. Hardware and Platform Support for PCI Power Management
+=========================================================
+
+1.1. Native and Platform-Based Power Management
+-----------------------------------------------
+In general, power management is a feature allowing one to save energy by putting
+devices into states in which they draw less power (low-power states) at the
+price of reduced functionality or performance.
+
+Usually, a device is put into a low-power state when it is underutilized or
+completely inactive.  However, when it is necessary to use the device once
+again, it has to be put back into the "fully functional" state (full-power
+state).  This may happen when there are some data for the device to handle or
+as a result of an external event requiring the device to be active, which may
+be signaled by the device itself.
+
+PCI devices may be put into low-power states in two ways, by using the device
+capabilities introduced by the PCI Bus Power Management Interface Specification,
+or with the help of platform firmware, such as an ACPI BIOS.  In the first
+approach, that is referred to as the native PCI power management (native PCI PM)
+in what follows, the device power state is changed as a result of writing a
+specific value into one of its standard configuration registers.  The second
+approach requires the platform firmware to provide special methods that may be
+used by the kernel to change the device's power state.
+
+Devices supporting the native PCI PM usually can generate wakeup signals called
+Power Management Events (PMEs) to let the kernel know about external events
+requiring the device to be active.  After receiving a PME the kernel is supposed
+to put the device that sent it into the full-power state.  However, the PCI Bus
+Power Management Interface Specification doesn't define any standard method of
+delivering the PME from the device to the CPU and the operating system kernel.
+It is assumed that the platform firmware will perform this task and therefore,
+even though a PCI device is set up to generate PMEs, it also may be necessary to
+prepare the platform firmware for notifying the CPU of the PMEs coming from the
+device (e.g. by generating interrupts).
+
+In turn, if the methods provided by the platform firmware are used for changing
+the power state of a device, usually the platform also provides a method for
+preparing the device to generate wakeup signals.  In that case, however, it
+often also is necessary to prepare the device for generating PMEs using the
+native PCI PM mechanism, because the method provided by the platform depends on
+that.
+
+Thus in many situations both the native and the platform-based power management
+mechanisms have to be used simultaneously to obtain the desired result.
+
+1.2. Native PCI Power Management
+--------------------------------
+The PCI Bus Power Management Interface Specification (PCI PM Spec) was
+introduced between the PCI 2.1 and PCI 2.2 Specifications.  It defined a
+standard interface for performing various operations related to power
+management.
+
+The implementation of the PCI PM Spec is optional for conventional PCI devices,
+but it is mandatory for PCI Express devices.  If a device supports the PCI PM
+Spec, it has an 8 byte power management capability field in its PCI
+configuration space.  This field is used to describe and control the standard
+features related to the native PCI power management.
+
+The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses
+(B0-B3).  The higher the number, the less power is drawn by the device or bus
+in that state.  However, the higher the number, the longer the latency for
+the device or bus to return to the full-power state (D0 or B0, respectively).
+
+There are two variants of the D3 state defined by the specification.  The first
+one is D3hot, referred to as the software accessible D3, because devices can be
+programmed to go into it.  The second one, D3cold, is the state that PCI devices
+are in when the supply voltage (Vcc) is removed from them.  It is not possible
+to program a PCI device to go into D3cold, although there may be a programmable
+interface for putting the bus the device is on into a state in which Vcc is
+removed from all devices on the bus.
+
+PCI bus power management, however, is not supported by the Linux kernel at the
+time of this writing and therefore it is not covered by this document.
+
+Note that every PCI device can be in the full-power state (D0) or in D3cold,
+regardless of whether or not it implements the PCI PM Spec.  In addition to
+that, if the PCI PM Spec is implemented by the device, it must support D3hot
+as well as D0.  The support for the D1 and D2 power states is optional.
+
+PCI devices supporting the PCI PM Spec can be programmed to go to any of the
+supported low-power states (except for D3cold).  While in D1-D3hot the
+standard configuration registers of the device must be accessible to software
+(i.e. the device is required to respond to PCI configuration accesses), although
+its I/O and memory spaces are then disabled.  This allows the device to be
+programmatically put into D0.  Thus the kernel can switch the device back and
+forth between D0 and the supported low-power states (except for D3cold) and the
+possible power state transitions the device can undergo are the following:
+
++----------------------------+
+| Current State | New State  |
++----------------------------+
+| D0            | D1, D2, D3 |
++----------------------------+
+| D1            | D2, D3     |
++----------------------------+
+| D2            | D3         |
++----------------------------+
+| D1, D2, D3    | D0         |
++----------------------------+
+
+The transition from D3cold to D0 occurs when the supply voltage is provided to
+the device (i.e. power is restored).  In that case the device returns to D0 with
+a full power-on reset sequence and the power-on defaults are restored to the
+device by hardware just as at initial power up.
+
+PCI devices supporting the PCI PM Spec can be programmed to generate PMEs
+while in a low-power state (D1-D3), but they are not required to be capable
+of generating PMEs from all supported low-power states.  In particular, the
+capability of generating PMEs from D3cold is optional and depends on the
+presence of additional voltage (3.3Vaux) allowing the device to remain
+sufficiently active to generate a wakeup signal.
+
+1.3. ACPI Device Power Management
+---------------------------------
+The platform firmware support for the power management of PCI devices is
+system-specific.  However, if the system in question is compliant with the
+Advanced Configuration and Power Interface (ACPI) Specification, like the
+majority of x86-based systems, it is supposed to implement device power
+management interfaces defined by the ACPI standard.
+
+For this purpose the ACPI BIOS provides special functions called "control
+methods" that may be executed by the kernel to perform specific tasks, such as
+putting a device into a low-power state.  These control methods are encoded
+using special byte-code language called the ACPI Machine Language (AML) and
+stored in the machine's BIOS.  The kernel loads them from the BIOS and executes
+them as needed using an AML interpreter that translates the AML byte code into
+computations and memory or I/O space accesses.  This way, in theory, a BIOS
+writer can provide the kernel with a means to perform actions depending
+on the system design in a system-specific fashion.
+
+ACPI control methods may be divided into global control methods, that are not
+associated with any particular devices, and device control methods, that have
+to be defined separately for each device supposed to be handled with the help of
+the platform.  This means, in particular, that ACPI device control methods can
+only be used to handle devices that the BIOS writer knew about in advance.  The
+ACPI methods used for device power management fall into that category.
+
+The ACPI specification assumes that devices can be in one of four power states
+labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM
+D0-D3 states (although the difference between D3hot and D3cold is not taken
+into account by ACPI).  Moreover, for each power state of a device there is a
+set of power resources that have to be enabled for the device to be put into
+that state.  These power resources are controlled (i.e. enabled or disabled)
+with the help of their own control methods, _ON and _OFF, that have to be
+defined individually for each of them.
+
+To put a device into the ACPI power state Dx (where x is a number between 0 and
+3 inclusive) the kernel is supposed to (1) enable the power resources required
+by the device in this state using their _ON control methods and (2) execute the
+_PSx control method defined for the device.  In addition to that, if the device
+is going to be put into a low-power state (D1-D3) and is supposed to generate
+wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI
+3.0) control method defined for it has to be executed before _PSx.  Power
+resources that are not required by the device in the target power state and are
+not required any more by any other device should be disabled (by executing their
+_OFF control methods).  If the current power state of the device is D3, it can
+only be put into D0 this way.
+
+However, quite often the power states of devices are changed during a
+system-wide transition into a sleep state or back into the working state.  ACPI
+defines four system sleep states, S1, S2, S3, and S4, and denotes the system
+working state as S0.  In general, the target system sleep (or working) state
+determines the highest power (lowest number) state the device can be put
+into and the kernel is supposed to obtain this information by executing the
+device's _SxD control method (where x is a number between 0 and 4 inclusive).
+If the device is required to wake up the system from the target sleep state, the
+lowest power (highest number) state it can be put into is also determined by the
+target state of the system.  The kernel is then supposed to use the device's
+_SxW control method to obtain the number of that state.  It also is supposed to
+use the device's _PRW control method to learn which power resources need to be
+enabled for the device to be able to generate wakeup signals.
+
+1.4. Wakeup Signaling
+---------------------
+Wakeup signals generated by PCI devices, either as native PCI PMEs, or as
+a result of the execution of the _DSW (or _PSW) ACPI control method before
+putting the device into a low-power state, have to be caught and handled as
+appropriate.  If they are sent while the system is in the working state
+(ACPI S0), they should be translated into interrupts so that the kernel can
+put the devices generating them into the full-power state and take care of the
+events that triggered them.  In turn, if they are sent while the system is
+sleeping, they should cause the system's core logic to trigger wakeup.
+
+On ACPI-based systems wakeup signals sent by conventional PCI devices are
+converted into ACPI General-Purpose Events (GPEs) which are hardware signals
+from the system core logic generated in response to various events that need to
+be acted upon.  Every GPE is associated with one or more sources of potentially
+interesting events.  In particular, a GPE may be associated with a PCI device
+capable of signaling wakeup.  The information on the connections between GPEs
+and event sources is recorded in the system's ACPI BIOS from where it can be
+read by the kernel.
+
+If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE
+associated with it (if there is one) is triggered.  The GPEs associated with PCI
+bridges may also be triggered in response to a wakeup signal from one of the
+devices below the bridge (this also is the case for root bridges) and, for
+example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be
+handled this way.
+
+A GPE may be triggered when the system is sleeping (i.e. when it is in one of
+the ACPI S1-S4 states), in which case system wakeup is started by its core logic
+(the device that was the source of the signal causing the system wakeup to occur
+may be identified later).  The GPEs used in such situations are referred to as
+wakeup GPEs.
+
+Usually, however, GPEs are also triggered when the system is in the working
+state (ACPI S0) and in that case the system's core logic generates a System
+Control Interrupt (SCI) to notify the kernel of the event.  Then, the SCI
+handler identifies the GPE that caused the interrupt to be generated which,
+in turn, allows the kernel to identify the source of the event (that may be
+a PCI device signaling wakeup).  The GPEs used for notifying the kernel of
+events occurring while the system is in the working state are referred to as
+runtime GPEs.
+
+Unfortunately, there is no standard way of handling wakeup signals sent by
+conventional PCI devices on systems that are not ACPI-based, but there is one
+for PCI Express devices.  Namely, the PCI Express Base Specification introduced
+a native mechanism for converting native PCI PMEs into interrupts generated by
+root ports.  For conventional PCI devices native PMEs are out-of-band, so they
+are routed separately and they need not pass through bridges (in principle they
+may be routed directly to the system's core logic), but for PCI Express devices
+they are in-band messages that have to pass through the PCI Express hierarchy,
+including the root port on the path from the device to the Root Complex.  Thus
+it was possible to introduce a mechanism by which a root port generates an
+interrupt whenever it receives a PME message from one of the devices below it.
+The PCI Express Requester ID of the device that sent the PME message is then
+recorded in one of the root port's configuration registers from where it may be
+read by the interrupt handler allowing the device to be identified.  [PME
+messages sent by PCI Express endpoints integrated with the Root Complex don't
+pass through root ports, but instead they cause a Root Complex Event Collector
+(if there is one) to generate interrupts.]
+
+In principle the native PCI Express PME signaling may also be used on ACPI-based
+systems along with the GPEs, but to use it the kernel has to ask the system's
+ACPI BIOS to release control of root port configuration registers.  The ACPI
+BIOS, however, is not required to allow the kernel to control these registers
+and if it doesn't do that, the kernel must not modify their contents.  Of course
+the native PCI Express PME signaling cannot be used by the kernel in that case.
+
+
+2. PCI Subsystem and Device Power Management
+============================================
+
+2.1. Device Power Management Callbacks
+--------------------------------------
+The PCI Subsystem participates in the power management of PCI devices in a
+number of ways.  First of all, it provides an intermediate code layer between
+the device power management core (PM core) and PCI device drivers.
+Specifically, the pm field of the PCI subsystem's struct bus_type object,
+pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing
+pointers to several device power management callbacks:
+
+const struct dev_pm_ops pci_dev_pm_ops = {
+	.prepare = pci_pm_prepare,
+	.complete = pci_pm_complete,
+	.suspend = pci_pm_suspend,
+	.resume = pci_pm_resume,
+	.freeze = pci_pm_freeze,
+	.thaw = pci_pm_thaw,
+	.poweroff = pci_pm_poweroff,
+	.restore = pci_pm_restore,
+	.suspend_noirq = pci_pm_suspend_noirq,
+	.resume_noirq = pci_pm_resume_noirq,
+	.freeze_noirq = pci_pm_freeze_noirq,
+	.thaw_noirq = pci_pm_thaw_noirq,
+	.poweroff_noirq = pci_pm_poweroff_noirq,
+	.restore_noirq = pci_pm_restore_noirq,
+	.runtime_suspend = pci_pm_runtime_suspend,
+	.runtime_resume = pci_pm_runtime_resume,
+	.runtime_idle = pci_pm_runtime_idle,
+};
+
+These callbacks are executed by the PM core in various situations related to
+device power management and they, in turn, execute power management callbacks
+provided by PCI device drivers.  They also perform power management operations
+involving some standard configuration registers of PCI devices that device
+drivers need not know or care about.
+
+The structure representing a PCI device, struct pci_dev, contains several fields
+that these callbacks operate on:
+
+struct pci_dev {
+	...
+	pci_power_t     current_state;  /* Current operating state. */
+	int		pm_cap;		/* PM capability offset in the
+					   configuration space */
+	unsigned int	pme_support:5;	/* Bitmask of states from which PME#
+					   can be generated */
+	unsigned int	pme_interrupt:1;/* Is native PCIe PME signaling used? */
+	unsigned int	d1_support:1;	/* Low power state D1 is supported */
+	unsigned int	d2_support:1;	/* Low power state D2 is supported */
+	unsigned int	no_d1d2:1;	/* D1 and D2 are forbidden */
+	unsigned int	wakeup_prepared:1;  /* Device prepared for wake up */
+	unsigned int	d3_delay;	/* D3->D0 transition time in ms */
+	...
+};
+
+They also indirectly use some fields of the struct device that is embedded in
+struct pci_dev.
+
+2.2. Device Initialization
+--------------------------
+The PCI subsystem's first task related to device power management is to
+prepare the device for power management and initialize the fields of struct
+pci_dev used for this purpose.  This happens in two functions defined in
+drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init().
+
+The first of these functions checks if the device supports native PCI PM
+and if that's the case the offset of its power management capability structure
+in the configuration space is stored in the pm_cap field of the device's struct
+pci_dev object.  Next, the function checks which PCI low-power states are
+supported by the device and from which low-power states the device can generate
+native PCI PMEs.  The power management fields of the device's struct pci_dev and
+the struct device embedded in it are updated accordingly and the generation of
+PMEs by the device is disabled.
+
+The second function checks if the device can be prepared to signal wakeup with
+the help of the platform firmware, such as the ACPI BIOS.  If that is the case,
+the function updates the wakeup fields in struct device embedded in the
+device's struct pci_dev and uses the firmware-provided method to prevent the
+device from signaling wakeup.
+
+At this point the device is ready for power management.  For driverless devices,
+however, this functionality is limited to a few basic operations carried out
+during system-wide transitions to a sleep state and back to the working state.
+
+2.3. Runtime Device Power Management
+------------------------------------
+The PCI subsystem plays a vital role in the runtime power management of PCI
+devices.  For this purpose it uses the general runtime power management
+(runtime PM) framework described in Documentation/power/runtime_pm.txt.
+Namely, it provides subsystem-level callbacks:
+
+	pci_pm_runtime_suspend()
+	pci_pm_runtime_resume()
+	pci_pm_runtime_idle()
+
+that are executed by the core runtime PM routines.  It also implements the
+entire mechanics necessary for handling runtime wakeup signals from PCI devices
+in low-power states, which at the time of this writing works for both the native
+PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in
+Section 1.
+
+First, a PCI device is put into a low-power state, or suspended, with the help
+of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call
+pci_pm_runtime_suspend() to do the actual job.  For this to work, the device's
+driver has to provide a pm->runtime_suspend() callback (see below), which is
+run by pci_pm_runtime_suspend() as the first action.  If the driver's callback
+returns successfully, the device's standard configuration registers are saved,
+the device is prepared to generate wakeup signals and, finally, it is put into
+the target low-power state.
+
+The low-power state to put the device into is the lowest-power (highest number)
+state from which it can signal wakeup.  The exact method of signaling wakeup is
+system-dependent and is determined by the PCI subsystem on the basis of the
+reported capabilities of the device and the platform firmware.  To prepare the
+device for signaling wakeup and put it into the selected low-power state, the
+PCI subsystem can use the platform firmware as well as the device's native PCI
+PM capabilities, if supported.
+
+It is expected that the device driver's pm->runtime_suspend() callback will
+not attempt to prepare the device for signaling wakeup or to put it into a
+low-power state.  The driver ought to leave these tasks to the PCI subsystem
+that has all of the information necessary to perform them.
+
+A suspended device is brought back into the "active" state, or resumed,
+with the help of pm_request_resume() or pm_runtime_resume() which both call
+pci_pm_runtime_resume() for PCI devices.  Again, this only works if the device's
+driver provides a pm->runtime_resume() callback (see below).  However, before
+the driver's callback is executed, pci_pm_runtime_resume() brings the device
+back into the full-power state, prevents it from signaling wakeup while in that
+state and restores its standard configuration registers.  Thus the driver's
+callback need not worry about the PCI-specific aspects of the device resume.
+
+Note that generally pci_pm_runtime_resume() may be called in two different
+situations.  First, it may be called at the request of the device's driver, for
+example if there are some data for it to process.  Second, it may be called
+as a result of a wakeup signal from the device itself (this sometimes is
+referred to as "remote wakeup").  Of course, for this purpose the wakeup signal
+is handled in one of the ways described in Section 1 and finally converted into
+a notification for the PCI subsystem after the source device has been
+identified.
+
+The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle()
+and pm_request_idle(), executes the device driver's pm->runtime_idle()
+callback, if defined, and if that callback doesn't return error code (or is not
+present at all), suspends the device with the help of pm_runtime_suspend().
+Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for
+example, it is called right after the device has just been resumed), in which
+cases it is expected to suspend the device if that makes sense.  Usually,
+however, the PCI subsystem doesn't really know if the device really can be
+suspended, so it lets the device's driver decide by running its
+pm->runtime_idle() callback.
+
+2.4. System-Wide Power Transitions
+----------------------------------
+There are a few different types of system-wide power transitions, described in
+Documentation/power/devices.txt.  Each of them requires devices to be handled
+in a specific way and the PM core executes subsystem-level power management
+callbacks for this purpose.  They are executed in phases such that each phase
+involves executing the same subsystem-level callback for every device belonging
+to the given subsystem before the next phase begins.  These phases always run
+after tasks have been frozen.
+
+2.4.1. System Suspend
+
+When the system is going into a sleep state in which the contents of memory will
+be preserved, such as one of the ACPI sleep states S1-S3, the phases are:
+
+	prepare, suspend, suspend_noirq.
+
+The following PCI bus type's callbacks, respectively, are used in these phases:
+
+	pci_pm_prepare()
+	pci_pm_suspend()
+	pci_pm_suspend_noirq()
+
+The pci_pm_prepare() routine first puts the device into the "fully functional"
+state with the help of pm_runtime_resume().  Then, it executes the device
+driver's pm->prepare() callback if defined (i.e. if the driver's struct
+dev_pm_ops object is present and the prepare pointer in that object is valid).
+
+The pci_pm_suspend() routine first checks if the device's driver implements
+legacy PCI suspend routines (see Section 3), in which case the driver's legacy
+suspend callback is executed, if present, and its result is returned.  Next, if
+the device's driver doesn't provide a struct dev_pm_ops object (containing
+pointers to the driver's callbacks), pci_pm_default_suspend() is called, which
+simply turns off the device's bus master capability and runs
+pcibios_disable_device() to disable it, unless the device is a bridge (PCI
+bridges are ignored by this routine).  Next, the device driver's pm->suspend()
+callback is executed, if defined, and its result is returned if it fails.
+Finally, pci_fixup_device() is called to apply hardware suspend quirks related
+to the device if necessary.
+
+Note that the suspend phase is carried out asynchronously for PCI devices, so
+the pci_pm_suspend() callback may be executed in parallel for any pair of PCI
+devices that don't depend on each other in a known way (i.e. none of the paths
+in the device tree from the root bridge to a leaf device contains both of them).
+
+The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has
+been called, which means that the device driver's interrupt handler won't be
+invoked while this routine is running.  It first checks if the device's driver
+implements legacy PCI suspends routines (Section 3), in which case the legacy
+late suspend routine is called and its result is returned (the standard
+configuration registers of the device are saved if the driver's callback hasn't
+done that).  Second, if the device driver's struct dev_pm_ops object is not
+present, the device's standard configuration registers are saved and the routine
+returns success.  Otherwise the device driver's pm->suspend_noirq() callback is
+executed, if present, and its result is returned if it fails.  Next, if the
+device's standard configuration registers haven't been saved yet (one of the
+device driver's callbacks executed before might do that), pci_pm_suspend_noirq()
+saves them, prepares the device to signal wakeup (if necessary) and puts it into
+a low-power state.
+
+The low-power state to put the device into is the lowest-power (highest number)
+state from which it can signal wakeup while the system is in the target sleep
+state.  Just like in the runtime PM case described above, the mechanism of
+signaling wakeup is system-dependent and determined by the PCI subsystem, which
+is also responsible for preparing the device to signal wakeup from the system's
+target sleep state as appropriate.
+
+PCI device drivers (that don't implement legacy power management callbacks) are
+generally not expected to prepare devices for signaling wakeup or to put them
+into low-power states.  However, if one of the driver's suspend callbacks
+(pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration
+registers, pci_pm_suspend_noirq() will assume that the device has been prepared
+to signal wakeup and put into a low-power state by the driver (the driver is
+then assumed to have used the helper functions provided by the PCI subsystem for
+this purpose).  PCI device drivers are not encouraged to do that, but in some
+rare cases doing that in the driver may be the optimum approach.
+
+2.4.2. System Resume
+
+When the system is undergoing a transition from a sleep state in which the
+contents of memory have been preserved, such as one of the ACPI sleep states
+S1-S3, into the working state (ACPI S0), the phases are:
+
+	resume_noirq, resume, complete.
+
+The following PCI bus type's callbacks, respectively, are executed in these
+phases:
+
+	pci_pm_resume_noirq()
+	pci_pm_resume()
+	pci_pm_complete()
+
+The pci_pm_resume_noirq() routine first puts the device into the full-power
+state, restores its standard configuration registers and applies early resume
+hardware quirks related to the device, if necessary.  This is done
+unconditionally, regardless of whether or not the device's driver implements
+legacy PCI power management callbacks (this way all PCI devices are in the
+full-power state and their standard configuration registers have been restored
+when their interrupt handlers are invoked for the first time during resume,
+which allows the kernel to avoid problems with the handling of shared interrupts
+by drivers whose devices are still suspended).  If legacy PCI power management
+callbacks (see Section 3) are implemented by the device's driver, the legacy
+early resume callback is executed and its result is returned.  Otherwise, the
+device driver's pm->resume_noirq() callback is executed, if defined, and its
+result is returned.
+
+The pci_pm_resume() routine first checks if the device's standard configuration
+registers have been restored and restores them if that's not the case (this
+only is necessary in the error path during a failing suspend).  Next, resume
+hardware quirks related to the device are applied, if necessary, and if the
+device's driver implements legacy PCI power management callbacks (see
+Section 3), the driver's legacy resume callback is executed and its result is
+returned.  Otherwise, the device's wakeup signaling mechanisms are blocked and
+its driver's pm->resume() callback is executed, if defined (the callback's
+result is then returned).
+
+The resume phase is carried out asynchronously for PCI devices, like the
+suspend phase described above, which means that if two PCI devices don't depend
+on each other in a known way, the pci_pm_resume() routine may be executed for
+the both of them in parallel.
+
+The pci_pm_complete() routine only executes the device driver's pm->complete()
+callback, if defined.
+
+2.4.3. System Hibernation
+
+System hibernation is more complicated than system suspend, because it requires
+a system image to be created and written into a persistent storage medium.  The
+image is created atomically and all devices are quiesced, or frozen, before that
+happens.
+
+The freezing of devices is carried out after enough memory has been freed (at
+the time of this writing the image creation requires at least 50% of system RAM
+to be free) in the following three phases:
+
+	prepare, freeze, freeze_noirq
+
+that correspond to the PCI bus type's callbacks:
+
+	pci_pm_prepare()
+	pci_pm_freeze()
+	pci_pm_freeze_noirq()
+
+This means that the prepare phase is exactly the same as for system suspend.
+The other two phases, however, are different.
+
+The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs
+the device driver's pm->freeze() callback, if defined, instead of pm->suspend(),
+and it doesn't apply the suspend-related hardware quirks.  It is executed
+asynchronously for different PCI devices that don't depend on each other in a
+known way.
+
+The pci_pm_freeze_noirq() routine, in turn, is similar to
+pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq()
+routine instead of pm->suspend_noirq().  It also doesn't attempt to prepare the
+device for signaling wakeup and put it into a low-power state.  Still, it saves
+the device's standard configuration registers if they haven't been saved by one
+of the driver's callbacks.
+
+Once the image has been created, it has to be saved.  However, at this point all
+devices are frozen and they cannot handle I/O, while their ability to handle
+I/O is obviously necessary for the image saving.  Thus they have to be brought
+back to the fully functional state and this is done in the following phases:
+
+	thaw_noirq, thaw, complete
+
+using the following PCI bus type's callbacks:
+
+	pci_pm_thaw_noirq()
+	pci_pm_thaw()
+	pci_pm_complete()
+
+respectively.
+
+The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq(),
+but it doesn't put the device into the full power state and doesn't attempt to
+restore its standard configuration registers.  It also executes the device
+driver's pm->thaw_noirq() callback, if defined, instead of pm->resume_noirq().
+
+The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device
+driver's pm->thaw() callback instead of pm->resume().  It is executed
+asynchronously for different PCI devices that don't depend on each other in a
+known way.
+
+The complete phase it the same as for system resume.
+
+After saving the image, devices need to be powered down before the system can
+enter the target sleep state (ACPI S4 for ACPI-based systems).  This is done in
+three phases:
+
+	prepare, poweroff, poweroff_noirq
+
+where the prepare phase is exactly the same as for system suspend.  The other
+two phases are analogous to the suspend and suspend_noirq phases, respectively.
+The PCI subsystem-level callbacks they correspond to
+
+	pci_pm_poweroff()
+	pci_pm_poweroff_noirq()
+
+work in analogy with pci_pm_suspend() and pci_pm_poweroff_noirq(), respectively,
+although they don't attempt to save the device's standard configuration
+registers.
+
+2.4.4. System Restore
+
+System restore requires a hibernation image to be loaded into memory and the
+pre-hibernation memory contents to be restored before the pre-hibernation system
+activity can be resumed.
+
+As described in Documentation/power/devices.txt, the hibernation image is loaded
+into memory by a fresh instance of the kernel, called the boot kernel, which in
+turn is loaded and run by a boot loader in the usual way.  After the boot kernel
+has loaded the image, it needs to replace its own code and data with the code
+and data of the "hibernated" kernel stored within the image, called the image
+kernel.  For this purpose all devices are frozen just like before creating
+the image during hibernation, in the
+
+	prepare, freeze, freeze_noirq
+
+phases described above.  However, the devices affected by these phases are only
+those having drivers in the boot kernel; other devices will still be in whatever
+state the boot loader left them.
+
+Should the restoration of the pre-hibernation memory contents fail, the boot
+kernel would go through the "thawing" procedure described above, using the
+thaw_noirq, thaw, and complete phases (that will only affect the devices having
+drivers in the boot kernel), and then continue running normally.
+
+If the pre-hibernation memory contents are restored successfully, which is the
+usual situation, control is passed to the image kernel, which then becomes
+responsible for bringing the system back to the working state.  To achieve this,
+it must restore the devices' pre-hibernation functionality, which is done much
+like waking up from the memory sleep state, although it involves different
+phases:
+
+	restore_noirq, restore, complete
+
+The first two of these are analogous to the resume_noirq and resume phases
+described above, respectively, and correspond to the following PCI subsystem
+callbacks:
+
+	pci_pm_restore_noirq()
+	pci_pm_restore()
+
+These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(),
+respectively, but they execute the device driver's pm->restore_noirq() and
+pm->restore() callbacks, if available.
+
+The complete phase is carried out in exactly the same way as during system
+resume.
+
+
+3. PCI Device Drivers and Power Management
+==========================================
+
+3.1. Power Management Callbacks
+-------------------------------
+PCI device drivers participate in power management by providing callbacks to be
+executed by the PCI subsystem's power management routines described above and by
+controlling the runtime power management of their devices.
+
+At the time of this writing there are two ways to define power management
+callbacks for a PCI device driver, the recommended one, based on using a
+dev_pm_ops structure described in Documentation/power/devices.txt, and the
+"legacy" one, in which the .suspend(), .suspend_late(), .resume_early(), and
+.resume() callbacks from struct pci_driver are used.  The legacy approach,
+however, doesn't allow one to define runtime power management callbacks and is
+not really suitable for any new drivers.  Therefore it is not covered by this
+document (refer to the source code to learn more about it).
+
+It is recommended that all PCI device drivers define a struct dev_pm_ops object
+containing pointers to power management (PM) callbacks that will be executed by
+the PCI subsystem's PM routines in various circumstances.  A pointer to the
+driver's struct dev_pm_ops object has to be assigned to the driver.pm field in
+its struct pci_driver object.  Once that has happened, the "legacy" PM callbacks
+in struct pci_driver are ignored (even if they are not NULL).
+
+The PM callbacks in struct dev_pm_ops are not mandatory and if they are not
+defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI
+subsystem will handle the device in a simplified default manner.  If they are
+defined, though, they are expected to behave as described in the following
+subsections.
+
+3.1.1. prepare()
+
+The prepare() callback is executed during system suspend, during hibernation
+(when a hibernation image is about to be created), during power-off after
+saving a hibernation image and during system restore, when a hibernation image
+has just been loaded into memory.
+
+This callback is only necessary if the driver's device has children that in
+general may be registered at any time.  In that case the role of the prepare()
+callback is to prevent new children of the device from being registered until
+one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run.
+
+In addition to that the prepare() callback may carry out some operations
+preparing the device to be suspended, although it should not allocate memory
+(if additional memory is required to suspend the device, it has to be
+preallocated earlier, for example in a suspend/hibernate notifier as described
+in Documentation/power/notifiers.txt).
+
+3.1.2. suspend()
+
+The suspend() callback is only executed during system suspend, after prepare()
+callbacks have been executed for all devices in the system.
+
+This callback is expected to quiesce the device and prepare it to be put into a
+low-power state by the PCI subsystem.  It is not required (in fact it even is
+not recommended) that a PCI driver's suspend() callback save the standard
+configuration registers of the device, prepare it for waking up the system, or
+put it into a low-power state.  All of these operations can very well be taken
+care of by the PCI subsystem, without the driver's participation.
+
+However, in some rare case it is convenient to carry out these operations in
+a PCI driver.  Then, pci_save_state(), pci_prepare_to_sleep(), and
+pci_set_power_state() should be used to save the device's standard configuration
+registers, to prepare it for system wakeup (if necessary), and to put it into a
+low-power state, respectively.  Moreover, if the driver calls pci_save_state(),
+the PCI subsystem will not execute either pci_prepare_to_sleep(), or
+pci_set_power_state() for its device, so the driver is then responsible for
+handling the device as appropriate.
+
+While the suspend() callback is being executed, the driver's interrupt handler
+can be invoked to handle an interrupt from the device, so all suspend-related
+operations relying on the driver's ability to handle interrupts should be
+carried out in this callback.
+
+3.1.3. suspend_noirq()
+
+The suspend_noirq() callback is only executed during system suspend, after
+suspend() callbacks have been executed for all devices in the system and
+after device interrupts have been disabled by the PM core.
+
+The difference between suspend_noirq() and suspend() is that the driver's
+interrupt handler will not be invoked while suspend_noirq() is running.  Thus
+suspend_noirq() can carry out operations that would cause race conditions to
+arise if they were performed in suspend().
+
+3.1.4. freeze()
+
+The freeze() callback is hibernation-specific and is executed in two situations,
+during hibernation, after prepare() callbacks have been executed for all devices
+in preparation for the creation of a system image, and during restore,
+after a system image has been loaded into memory from persistent storage and the
+prepare() callbacks have been executed for all devices.
+
+The role of this callback is analogous to the role of the suspend() callback
+described above.  In fact, they only need to be different in the rare cases when
+the driver takes the responsibility for putting the device into a low-power
 state.
 
-The first walk allows a graceful recovery in the event of a failure, since none
-of the devices have actually been powered down.
-
-In both walks, in particular the second, all children of a bridge are touched
-before the actual bridge itself. This allows the bridge to retain power while
-its children are being accessed.
-
-Upon resuming from sleep, just the opposite must be true: all bridges must be
-powered on and restored before their children are powered on. This is easily
-accomplished with a breadth-first walk of the PCI device tree.
-
-
-3. PCI Utility Functions
-~~~~~~~~~~~~~~~~~~~~~~~~
-
-These are helper functions designed to be called by individual device drivers.
-Assuming that a device behaves as advertised, these should be applicable in most
-cases. However, results may vary.
-
-Note that these functions are never implicitly called for the driver. The driver
-is always responsible for deciding when and if to call these.
-
-
-pci_save_state
---------------
-
-Usage:
-	pci_save_state(struct pci_dev *dev);
-
-Description:
-	Save first 64 bytes of PCI config space, along with any additional
-	PCI-Express or PCI-X information.
-
-
-pci_restore_state
------------------
-
-Usage:
-	pci_restore_state(struct pci_dev *dev);
-
-Description:
-	Restore previously saved config space.
-
-
-pci_set_power_state
--------------------
-
-Usage:
-	pci_set_power_state(struct pci_dev *dev, pci_power_t state);
-
-Description:
-	Transition device to low power state using PCI PM Capabilities
-	registers.
-
-	Will fail under one of the following conditions:
-	- If state is less than current state, but not D0 (illegal transition)
-	- Device doesn't support PM Capabilities
-	- Device does not support requested state
-
-
-pci_enable_wake
----------------
-
-Usage:
-	pci_enable_wake(struct pci_dev *dev, pci_power_t state, int enable);
-
-Description:
-	Enable device to generate PME# during low power state using PCI PM 
-	Capabilities.
-
-	Checks whether if device supports generating PME# from requested state
-	and fail if it does not, unless enable == 0 (request is to disable wake
-	events, which is implicit if it doesn't even support it in the first
-	place).
-
-	Note that the PMC Register in the device's PM Capabilities has a bitmask
-	of the states it supports generating PME# from. D3hot is bit 3 and
-	D3cold is bit 4. So, while a value of 4 as the state may not seem
-	semantically correct, it is. 
-
-
-4. PCI Device Drivers
-~~~~~~~~~~~~~~~~~~~~~
-
-These functions are intended for use by individual drivers, and are defined in 
-struct pci_driver:
-
-        int  (*suspend) (struct pci_dev *dev, pm_message_t state);
-        int  (*resume) (struct pci_dev *dev);
-
-
-suspend
--------
-
-Usage:
-
-if (dev->driver && dev->driver->suspend)
-	dev->driver->suspend(dev,state);
-
-A driver uses this function to actually transition the device into a low power
-state. This should include disabling I/O, IRQs, and bus-mastering, as well as
-physically transitioning the device to a lower power state; it may also include
-calls to pci_enable_wake().
-
-Bus mastering may be disabled by doing:
-
-pci_disable_device(dev);
-
-For devices that support the PCI PM Spec, this may be used to set the device's
-power state to match the suspend() parameter:
-
-pci_set_power_state(dev,state);
-
-The driver is also responsible for disabling any other device-specific features
-(e.g blanking screen, turning off on-card memory, etc).
-
-The driver should be sure to track the current state of the device, as it may
-obviate the need for some operations.
-
-The driver should update the current_state field in its pci_dev structure in
-this function, except for PM-capable devices when pci_set_power_state is used.
-
-resume
-------
-
-Usage:
-
-if (dev->driver && dev->driver->resume)
-	dev->driver->resume(dev)
+In that cases the freeze() callback should not prepare the device system wakeup
+or put it into a low-power state.  Still, either it or freeze_noirq() should
+save the device's standard configuration registers using pci_save_state().
 
-The resume callback may be called from any power state, and is always meant to
-transition the device to the D0 state. 
+3.1.5. freeze_noirq()
 
-The driver is responsible for reenabling any features of the device that had
-been disabled during previous suspend calls, such as IRQs and bus mastering,
-as well as calling pci_restore_state().
+The freeze_noirq() callback is hibernation-specific.  It is executed during
+hibernation, after prepare() and freeze() callbacks have been executed for all
+devices in preparation for the creation of a system image, and during restore,
+after a system image has been loaded into memory and after prepare() and
+freeze() callbacks have been executed for all devices.  It is always executed
+after device interrupts have been disabled by the PM core.
 
-If the device is currently in D3, it may need to be reinitialized in resume().
+The role of this callback is analogous to the role of the suspend_noirq()
+callback described above and it very rarely is necessary to define
+freeze_noirq().
 
-  * Some types of devices, like bus controllers, will preserve context in D3hot
-    (using Vcc power).  Their drivers will often want to avoid re-initializing
-    them after re-entering D0 (perhaps to avoid resetting downstream devices).
+The difference between freeze_noirq() and freeze() is analogous to the
+difference between suspend_noirq() and suspend().
 
-  * Other kinds of devices in D3hot will discard device context as part of a
-    soft reset when re-entering the D0 state.
-    
-  * Devices resuming from D3cold always go through a power-on reset.  Some
-    device context can also be preserved using Vaux power.
+3.1.6. poweroff()
 
-  * Some systems hide D3cold resume paths from drivers.  For example, on PCs
-    the resume path for suspend-to-disk often runs BIOS powerup code, which
-    will sometimes re-initialize the device.
+The poweroff() callback is hibernation-specific.  It is executed when the system
+is about to be powered off after saving a hibernation image to a persistent
+storage.  prepare() callbacks are executed for all devices before poweroff() is
+called.
 
-To handle resets during D3 to D0 transitions, it may be convenient to share
-device initialization code between probe() and resume().  Device parameters
-can also be saved before the driver suspends into D3, avoiding re-probe.
+The role of this callback is analogous to the role of the suspend() and freeze()
+callbacks described above, although it does not need to save the contents of
+the device's registers.  In particular, if the driver wants to put the device
+into a low-power state itself instead of allowing the PCI subsystem to do that,
+the poweroff() callback should use pci_prepare_to_sleep() and
+pci_set_power_state() to prepare the device for system wakeup and to put it
+into a low-power state, respectively, but it need not save the device's standard
+configuration registers.
 
-If the device supports the PCI PM Spec, it can use this to physically transition
-the device to D0:
+3.1.7. poweroff_noirq()
 
-pci_set_power_state(dev,0);
+The poweroff_noirq() callback is hibernation-specific.  It is executed after
+poweroff() callbacks have been executed for all devices in the system.
 
-Note that if the entire system is transitioning out of a global sleep state, all
-devices will be placed in the D0 state, so this is not necessary. However, in
-the event that the device is placed in the D3 state during normal operation,
-this call is necessary. It is impossible to determine which of the two events is
-taking place in the driver, so it is always a good idea to make that call.
+The role of this callback is analogous to the role of the suspend_noirq() and
+freeze_noirq() callbacks described above, but it does not need to save the
+contents of the device's registers.
 
-The driver should take note of the state that it is resuming from in order to
-ensure correct (and speedy) operation.
+The difference between poweroff_noirq() and poweroff() is analogous to the
+difference between suspend_noirq() and suspend().
 
-The driver should update the current_state field in its pci_dev structure in
-this function, except for PM-capable devices when pci_set_power_state is used.
+3.1.8. resume_noirq()
 
+The resume_noirq() callback is only executed during system resume, after the
+PM core has enabled the non-boot CPUs.  The driver's interrupt handler will not
+be invoked while resume_noirq() is running, so this callback can carry out
+operations that might race with the interrupt handler.
 
+Since the PCI subsystem unconditionally puts all devices into the full power
+state in the resume_noirq phase of system resume and restores their standard
+configuration registers, resume_noirq() is usually not necessary.  In general
+it should only be used for performing operations that would lead to race
+conditions if carried out by resume().
 
-A reference implementation
--------------------------
-.suspend()
-{
-	/* driver specific operations */
+3.1.9. resume()
 
-	/* Disable IRQ */
-	free_irq();
-	/* If using MSI */
-	pci_disable_msi();
+The resume() callback is only executed during system resume, after
+resume_noirq() callbacks have been executed for all devices in the system and
+device interrupts have been enabled by the PM core.
 
-	pci_save_state();
-	pci_enable_wake();
-	/* Disable IO/bus master/irq router */
-	pci_disable_device();
-	pci_set_power_state(pci_choose_state());
-}
+This callback is responsible for restoring the pre-suspend configuration of the
+device and bringing it back to the fully functional state.  The device should be
+able to process I/O in a usual way after resume() has returned.
 
-.resume()
-{
-	pci_set_power_state(PCI_D0);
-	pci_restore_state();
-	/* device's irq possibly is changed, driver should take care */
-	pci_enable_device();
-	pci_set_master();
+3.1.10. thaw_noirq()
 
-	/* if using MSI, device's vector possibly is changed */
-	pci_enable_msi();
+The thaw_noirq() callback is hibernation-specific.  It is executed after a
+system image has been created and the non-boot CPUs have been enabled by the PM
+core, in the thaw_noirq phase of hibernation.  It also may be executed if the
+loading of a hibernation image fails during system restore (it is then executed
+after enabling the non-boot CPUs).  The driver's interrupt handler will not be
+invoked while thaw_noirq() is running.
 
-	request_irq();
-	/* driver specific operations; */
-}
+The role of this callback is analogous to the role of resume_noirq().  The
+difference between these two callbacks is that thaw_noirq() is executed after
+freeze() and freeze_noirq(), so in general it does not need to modify the
+contents of the device's registers.
 
-This is a typical implementation. Drivers can slightly change the order
-of the operations in the implementation, ignore some operations or add
-more driver specific operations in it, but drivers should do something like
-this on the whole.
+3.1.11. thaw()
 
-5. Resources
-~~~~~~~~~~~~
+The thaw() callback is hibernation-specific.  It is executed after thaw_noirq()
+callbacks have been executed for all devices in the system and after device
+interrupts have been enabled by the PM core.
 
-PCI Local Bus Specification 
-PCI Bus Power Management Interface Specification
+This callback is responsible for restoring the pre-freeze configuration of
+the device, so that it will work in a usual way after thaw() has returned.
 
-  http://www.pcisig.com
+3.1.12. restore_noirq()
 
+The restore_noirq() callback is hibernation-specific.  It is executed in the
+restore_noirq phase of hibernation, when the boot kernel has passed control to
+the image kernel and the non-boot CPUs have been enabled by the image kernel's
+PM core.
+
+This callback is analogous to resume_noirq() with the exception that it cannot
+make any assumption on the previous state of the device, even if the BIOS (or
+generally the platform firmware) is known to preserve that state over a
+suspend-resume cycle.
+
+For the vast majority of PCI device drivers there is no difference between
+resume_noirq() and restore_noirq().
+
+3.1.13. restore()
+
+The restore() callback is hibernation-specific.  It is executed after
+restore_noirq() callbacks have been executed for all devices in the system and
+after the PM core has enabled device drivers' interrupt handlers to be invoked.
+
+This callback is analogous to resume(), just like restore_noirq() is analogous
+to resume_noirq().  Consequently, the difference between restore_noirq() and
+restore() is analogous to the difference between resume_noirq() and resume().
+
+For the vast majority of PCI device drivers there is no difference between
+resume() and restore().
+
+3.1.14. complete()
+
+The complete() callback is executed in the following situations:
+  - during system resume, after resume() callbacks have been executed for all
+    devices,
+  - during hibernation, before saving the system image, after thaw() callbacks
+    have been executed for all devices,
+  - during system restore, when the system is going back to its pre-hibernation
+    state, after restore() callbacks have been executed for all devices.
+It also may be executed if the loading of a hibernation image into memory fails
+(in that case it is run after thaw() callbacks have been executed for all
+devices that have drivers in the boot kernel).
+
+This callback is entirely optional, although it may be necessary if the
+prepare() callback performs operations that need to be reversed.
+
+3.1.15. runtime_suspend()
+
+The runtime_suspend() callback is specific to device runtime power management
+(runtime PM).  It is executed by the PM core's runtime PM framework when the
+device is about to be suspended (i.e. quiesced and put into a low-power state)
+at run time.
+
+This callback is responsible for freezing the device and preparing it to be
+put into a low-power state, but it must allow the PCI subsystem to perform all
+of the PCI-specific actions necessary for suspending the device.
+
+3.1.16. runtime_resume()
+
+The runtime_resume() callback is specific to device runtime PM.  It is executed
+by the PM core's runtime PM framework when the device is about to be resumed
+(i.e. put into the full-power state and programmed to process I/O normally) at
+run time.
+
+This callback is responsible for restoring the normal functionality of the
+device after it has been put into the full-power state by the PCI subsystem.
+The device is expected to be able to process I/O in the usual way after
+runtime_resume() has returned.
+
+3.1.17. runtime_idle()
+
+The runtime_idle() callback is specific to device runtime PM.  It is executed
+by the PM core's runtime PM framework whenever it may be desirable to suspend
+the device according to the PM core's information.  In particular, it is
+automatically executed right after runtime_resume() has returned in case the
+resume of the device has happened as a result of a spurious event.
+
+This callback is optional, but if it is not implemented or if it returns 0, the
+PCI subsystem will call pm_runtime_suspend() for the device, which in turn will
+cause the driver's runtime_suspend() callback to be executed.
+
+3.1.18. Pointing Multiple Callback Pointers to One Routine
+
+Although in principle each of the callbacks described in the previous
+subsections can be defined as a separate function, it often is convenient to
+point two or more members of struct dev_pm_ops to the same routine.  There are
+a few convenience macros that can be used for this purpose.
+
+The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one
+suspend routine pointed to by the .suspend(), .freeze(), and .poweroff()
+members and one resume routine pointed to by the .resume(), .thaw(), and
+.restore() members.  The other function pointers in this struct dev_pm_ops are
+unset.
+
+The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it
+additionally sets the .runtime_resume() pointer to the same value as
+.resume() (and .thaw(), and .restore()) and the .runtime_suspend() pointer to
+the same value as .suspend() (and .freeze() and .poweroff()).
+
+The SET_SYSTEM_SLEEP_PM_OPS can be used inside of a declaration of struct
+dev_pm_ops to indicate that one suspend routine is to be pointed to by the
+.suspend(), .freeze(), and .poweroff() members and one resume routine is to
+be pointed to by the .resume(), .thaw(), and .restore() members.
+
+3.2. Device Runtime Power Management
+------------------------------------
+In addition to providing device power management callbacks PCI device drivers
+are responsible for controlling the runtime power management (runtime PM) of
+their devices.
+
+The PCI device runtime PM is optional, but it is recommended that PCI device
+drivers implement it at least in the cases where there is a reliable way of
+verifying that the device is not used (like when the network cable is detached
+from an Ethernet adapter or there are no devices attached to a USB controller).
+
+To support the PCI runtime PM the driver first needs to implement the
+runtime_suspend() and runtime_resume() callbacks.  It also may need to implement
+the runtime_idle() callback to prevent the device from being suspended again
+every time right after the runtime_resume() callback has returned
+(alternatively, the runtime_suspend() callback will have to check if the
+device should really be suspended and return -EAGAIN if that is not the case).
+
+The runtime PM of PCI devices is disabled by default.  It is also blocked by
+pci_pm_init() that runs the pm_runtime_forbid() helper function.  If a PCI
+driver implements the runtime PM callbacks and intends to use the runtime PM
+framework provided by the PM core and the PCI subsystem, it should enable this
+feature by executing the pm_runtime_enable() helper function.  However, the
+driver should not call the pm_runtime_allow() helper function unblocking
+the runtime PM of the device.  Instead, it should allow user space or some
+platform-specific code to do that (user space can do it via sysfs), although
+once it has called pm_runtime_enable(), it must be prepared to handle the
+runtime PM of the device correctly as soon as pm_runtime_allow() is called
+(which may happen at any time).  [It also is possible that user space causes
+pm_runtime_allow() to be called via sysfs before the driver is loaded, so in
+fact the driver has to be prepared to handle the runtime PM of the device as
+soon as it calls pm_runtime_enable().]
+
+The runtime PM framework works by processing requests to suspend or resume
+devices, or to check if they are idle (in which cases it is reasonable to
+subsequently request that they be suspended).  These requests are represented
+by work items put into the power management workqueue, pm_wq.  Although there
+are a few situations in which power management requests are automatically
+queued by the PM core (for example, after processing a request to resume a
+device the PM core automatically queues a request to check if the device is
+idle), device drivers are generally responsible for queuing power management
+requests for their devices.  For this purpose they should use the runtime PM
+helper functions provided by the PM core, discussed in
+Documentation/power/runtime_pm.txt.
+
+Devices can also be suspended and resumed synchronously, without placing a
+request into pm_wq.  In the majority of cases this also is done by their
+drivers that use helper functions provided by the PM core for this purpose.
+
+For more information on the runtime PM of devices refer to
+Documentation/power/runtime_pm.txt.
+
+
+4. Resources
+============
+
+PCI Local Bus Specification, Rev. 3.0
+PCI Bus Power Management Interface Specification, Rev. 1.2
+Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b
+PCI Express Base Specification, Rev. 2.0
+Documentation/power/devices.txt
+Documentation/power/runtime_pm.txt
diff --git a/Documentation/power/pm.txt b/Documentation/power/pm.txt
deleted file mode 100644
index be841507e43..00000000000
--- a/Documentation/power/pm.txt
+++ /dev/null
@@ -1,257 +0,0 @@
-               Linux Power Management Support
-
-This document briefly describes how to use power management with your
-Linux system and how to add power management support to Linux drivers.
-
-APM or ACPI?
-------------
-If you have a relatively recent x86 mobile, desktop, or server system,
-odds are it supports either Advanced Power Management (APM) or
-Advanced Configuration and Power Interface (ACPI).  ACPI is the newer
-of the two technologies and puts power management in the hands of the
-operating system, allowing for more intelligent power management than
-is possible with BIOS controlled APM.
-
-The best way to determine which, if either, your system supports is to
-build a kernel with both ACPI and APM enabled (as of 2.3.x ACPI is
-enabled by default).  If a working ACPI implementation is found, the
-ACPI driver will override and disable APM, otherwise the APM driver
-will be used.
-
-No, sorry, you cannot have both ACPI and APM enabled and running at
-once.  Some people with broken ACPI or broken APM implementations
-would like to use both to get a full set of working features, but you
-simply cannot mix and match the two.  Only one power management
-interface can be in control of the machine at once.  Think about it..
-
-User-space Daemons
-------------------
-Both APM and ACPI rely on user-space daemons, apmd and acpid
-respectively, to be completely functional.  Obtain both of these
-daemons from your Linux distribution or from the Internet (see below)
-and be sure that they are started sometime in the system boot process.
-Go ahead and start both.  If ACPI or APM is not available on your
-system the associated daemon will exit gracefully.
-
-  apmd:   http://worldvisions.ca/~apenwarr/apmd/
-  acpid:  http://acpid.sf.net/
-
-Driver Interface -- OBSOLETE, DO NOT USE!
-----------------*************************
-
-Note: pm_register(), pm_access(), pm_dev_idle() and friends are
-obsolete. Please do not use them. Instead you should properly hook
-your driver into the driver model, and use its suspend()/resume()
-callbacks to do this kind of stuff.
-
-If you are writing a new driver or maintaining an old driver, it
-should include power management support.  Without power management
-support, a single driver may prevent a system with power management
-capabilities from ever being able to suspend (safely).
-
-Overview:
-1) Register each instance of a device with "pm_register"
-2) Call "pm_access" before accessing the hardware.
-   (this will ensure that the hardware is awake and ready)
-3) Your "pm_callback" is called before going into a
-   suspend state (ACPI D1-D3) or after resuming (ACPI D0)
-   from a suspend.
-4) Call "pm_dev_idle" when the device is not being used
-   (optional but will improve device idle detection)
-5) When unloaded, unregister the device with "pm_unregister"
-
-/*
- * Description: Register a device with the power-management subsystem
- *
- * Parameters:
- *   type - device type (PCI device, system device, ...)
- *   id - instance number or unique identifier
- *   cback - request handler callback (suspend, resume, ...)
- *
- * Returns: Registered PM device or NULL on error
- *
- * Examples:
- *   dev = pm_register(PM_SYS_DEV, PM_SYS_VGA, vga_callback);
- *
- *   struct pci_dev *pci_dev = pci_find_dev(...);
- *   dev = pm_register(PM_PCI_DEV, PM_PCI_ID(pci_dev), callback);
- */
-struct pm_dev *pm_register(pm_dev_t type, unsigned long id, pm_callback cback);
-
-/*
- * Description: Unregister a device with the power management subsystem
- *
- * Parameters:
- *   dev - PM device previously returned from pm_register
- */
-void pm_unregister(struct pm_dev *dev);
-
-/*
- * Description: Unregister all devices with a matching callback function
- *
- * Parameters:
- *   cback - previously registered request callback
- *
- * Notes: Provided for easier porting from old APM interface
- */
-void pm_unregister_all(pm_callback cback);
-
-/*
- * Power management request callback
- *
- * Parameters:
- *   dev - PM device previously returned from pm_register
- *   rqst - request type
- *   data - data, if any, associated with the request
- *
- * Returns: 0 if the request is successful
- *          EINVAL if the request is not supported
- *          EBUSY if the device is now busy and cannot handle the request
- *          ENOMEM if the device was unable to handle the request due to memory
- *
- * Details: The device request callback will be called before the
- *          device/system enters a suspend state (ACPI D1-D3) or
- *          or after the device/system resumes from suspend (ACPI D0).
- *          For PM_SUSPEND, the ACPI D-state being entered is passed
- *          as the "data" argument to the callback.  The device
- *          driver should save (PM_SUSPEND) or restore (PM_RESUME)
- *          device context when the request callback is called.
- *
- *          Once a driver returns 0 (success) from a suspend
- *          request, it should not process any further requests or
- *          access the device hardware until a call to "pm_access" is made.
- */
-typedef int (*pm_callback)(struct pm_dev *dev, pm_request_t rqst, void *data);
-
-Driver Details
---------------
-This is just a quick Q&A as a stopgap until a real driver writers'
-power management guide is available.
-
-Q: When is a device suspended?
-
-Devices can be suspended based on direct user request (eg. laptop lid
-closes), system power policy (eg.  sleep after 30 minutes of console
-inactivity), or device power policy (eg. power down device after 5
-minutes of inactivity)
-
-Q: Must a driver honor a suspend request?
-
-No, a driver can return -EBUSY from a suspend request and this
-will stop the system from suspending.  When a suspend request
-fails, all suspended devices are resumed and the system continues
-to run.  Suspend can be retried at a later time.
-
-Q: Can the driver block suspend/resume requests?
-
-Yes, a driver can delay its return from a suspend or resume
-request until the device is ready to handle requests.  It
-is advantageous to return as quickly as possible from a
-request as suspend/resume are done serially.
-
-Q: What context is a suspend/resume initiated from?
-
-A suspend or resume is initiated from a kernel thread context.
-It is safe to block, allocate memory, initiate requests
-or anything else you can do within the kernel.
-
-Q: Will requests continue to arrive after a suspend?
-
-Possibly.  It is the driver's responsibility to queue(*),
-fail, or drop any requests that arrive after returning
-success to a suspend request.  It is important that the
-driver not access its device until after it receives
-a resume request as the device's bus may no longer
-be active.
-
-(*) If a driver queues requests for processing after
-    resume be aware that the device, network, etc.
-    might be in a different state than at suspend time.
-    It's probably better to drop requests unless
-    the driver is a storage device.
-
-Q: Do I have to manage bus-specific power management registers
-
-No.  It is the responsibility of the bus driver to manage
-PCI, USB, etc. power management registers.  The bus driver
-or the power management subsystem will also enable any
-wake-on functionality that the device has.
-
-Q: So, really, what do I need to do to support suspend/resume?
-
-You need to save any device context that would
-be lost if the device was powered off and then restore
-it at resume time.  When ACPI is active, there are
-three levels of device suspend states; D1, D2, and D3.
-(The suspend state is passed as the "data" argument
-to the device callback.)  With D3, the device is powered
-off and loses all context, D1 and D2 are shallower power
-states and require less device context to be saved.  To
-play it safe, just save everything at suspend and restore
-everything at resume.
-
-Q: Where do I store device context for suspend?
-
-Anywhere in memory, kmalloc a buffer or store it
-in the device descriptor.  You are guaranteed that the
-contents of memory will be restored and accessible
-before resume, even when the system suspends to disk.
-
-Q: What do I need to do for ACPI vs. APM vs. etc?
-
-Drivers need not be aware of the specific power management
-technology that is active.  They just need to be aware
-of when the overlying power management system requests
-that they suspend or resume.
-
-Q: What about device dependencies?
-
-When a driver registers a device, the power management
-subsystem uses the information provided to build a
-tree of device dependencies (eg. USB device X is on
-USB controller Y which is on PCI bus Z)  When power
-management wants to suspend a device, it first sends
-a suspend request to its driver, then the bus driver,
-and so on up to the system bus.  Device resumes
-proceed in the opposite direction.
-
-Q: Who do I contact for additional information about
-   enabling power management for my specific driver/device?
-
-ACPI Development mailing list: linux-acpi@vger.kernel.org
-
-System Interface -- OBSOLETE, DO NOT USE!
-----------------*************************
-If you are providing new power management support to Linux (ie.
-adding support for something like APM or ACPI), you should
-communicate with drivers through the existing generic power
-management interface.
-
-/*
- * Send a request to all devices
- *
- * Parameters:
- *   rqst - request type
- *   data - data, if any, associated with the request
- *
- * Returns: 0 if the request is successful
- *          See "pm_callback" return for errors
- *
- * Details: Walk list of registered devices and call pm_send
- *          for each until complete or an error is encountered.
- *          If an error is encountered for a suspend request,
- *          return all devices to the state they were in before
- *          the suspend request.
- */
-int pm_send_all(pm_request_t rqst, void *data);
-
-/*
- * Find a matching device
- *
- * Parameters:
- *   type - device type (PCI device, system device, or 0 to match all devices)
- *   from - previous match or NULL to start from the beginning
- *
- * Returns: Matching device or NULL if none found
- */
-struct pm_dev *pm_find(pm_dev_t type, struct pm_dev *from);
diff --git a/Documentation/power/pm_qos_interface.txt b/Documentation/power/pm_qos_interface.txt
index 49adb1a3351..a5da5c7e712 100644
--- a/Documentation/power/pm_qos_interface.txt
+++ b/Documentation/power/pm_qos_interface.txt
@@ -1,11 +1,21 @@
-PM quality of Service interface.
+PM Quality Of Service Interface.
 
 This interface provides a kernel and user mode interface for registering
 performance expectations by drivers, subsystems and user space applications on
 one of the parameters.
 
-Currently we have {cpu_dma_latency, network_latency, network_throughput} as the
-initial set of pm_qos parameters.
+Two different PM QoS frameworks are available:
+1. PM QoS classes for cpu_dma_latency, network_latency, network_throughput.
+2. the per-device PM QoS framework provides the API to manage the per-device latency
+constraints and PM QoS flags.
+
+Each parameters have defined units:
+ * latency: usec
+ * timeout: usec
+ * throughput: kbs (kilo bit / sec)
+
+
+1. PM QoS framework
 
 The infrastructure exposes multiple misc device nodes one per implemented
 parameter.  The set of parameters implement is defined by pm_qos_power_init()
@@ -13,47 +23,200 @@ and pm_qos_params.h.  This is done because having the available parameters
 being runtime configurable or changeable from a driver was seen as too easy to
 abuse.
 
-For each parameter a list of performance requirements is maintained along with
+For each parameter a list of performance requests is maintained along with
 an aggregated target value.  The aggregated target value is updated with
-changes to the requirement list or elements of the list.  Typically the
-aggregated target value is simply the max or min of the requirement values held
+changes to the request list or elements of the list.  Typically the
+aggregated target value is simply the max or min of the request values held
 in the parameter list elements.
+Note: the aggregated target value is implemented as an atomic variable so that
+reading the aggregated value does not require any locking mechanism.
+
 
 From kernel mode the use of this interface is simple:
-pm_qos_add_requirement(param_id, name, target_value):
-Will insert a named element in the list for that identified PM_QOS parameter
-with the target value.  Upon change to this list the new target is recomputed
-and any registered notifiers are called only if the target value is now
-different.
 
-pm_qos_update_requirement(param_id, name, new_target_value):
-Will search the list identified by the param_id for the named list element and
-then update its target value, calling the notification tree if the aggregated
-target is changed.  with that name is already registered.
+void pm_qos_add_request(handle, param_class, target_value):
+Will insert an element into the list for that identified PM QoS class with the
+target value.  Upon change to this list the new target is recomputed and any
+registered notifiers are called only if the target value is now different.
+Clients of pm_qos need to save the returned handle for future use in other
+pm_qos API functions.
+
+void pm_qos_update_request(handle, new_target_value):
+Will update the list element pointed to by the handle with the new target value
+and recompute the new aggregated target, calling the notification tree if the
+target is changed.
 
-pm_qos_remove_requirement(param_id, name):
-Will search the identified list for the named element and remove it, after
-removal it will update the aggregate target and call the notification tree if
-the target was changed as a result of removing the named requirement.
+void pm_qos_remove_request(handle):
+Will remove the element.  After removal it will update the aggregate target and
+call the notification tree if the target was changed as a result of removing
+the request.
+
+int pm_qos_request(param_class):
+Returns the aggregated value for a given PM QoS class.
+
+int pm_qos_request_active(handle):
+Returns if the request is still active, i.e. it has not been removed from a
+PM QoS class constraints list.
+
+int pm_qos_add_notifier(param_class, notifier):
+Adds a notification callback function to the PM QoS class. The callback is
+called when the aggregated value for the PM QoS class is changed.
+
+int pm_qos_remove_notifier(int param_class, notifier):
+Removes the notification callback function for the PM QoS class.
 
 
 From user mode:
-Only processes can register a pm_qos requirement.  To provide for automatic
-cleanup for process the interface requires the process to register its
-parameter requirements in the following way:
+Only processes can register a pm_qos request.  To provide for automatic
+cleanup of a process, the interface requires the process to register its
+parameter requests in the following way:
 
 To register the default pm_qos target for the specific parameter, the process
 must open one of /dev/[cpu_dma_latency, network_latency, network_throughput]
 
 As long as the device node is held open that process has a registered
-requirement on the parameter.  The name of the requirement is "process_<PID>"
-derived from the current->pid from within the open system call.
+request on the parameter.
 
-To change the requested target value the process needs to write a s32 value to
-the open device node.  This translates to a pm_qos_update_requirement call.
+To change the requested target value the process needs to write an s32 value to
+the open device node.  Alternatively the user mode program could write a hex
+string for the value using 10 char long format e.g. "0x12345678".  This
+translates to a pm_qos_update_request call.
 
 To remove the user mode request for a target value simply close the device
 node.
 
 
-
+2. PM QoS per-device latency and flags framework
+
+For each device, there are three lists of PM QoS requests. Two of them are
+maintained along with the aggregated targets of resume latency and active
+state latency tolerance (in microseconds) and the third one is for PM QoS flags.
+Values are updated in response to changes of the request list.
+
+The target values of resume latency and active state latency tolerance are
+simply the minimum of the request values held in the parameter list elements.
+The PM QoS flags aggregate value is a gather (bitwise OR) of all list elements'
+values.  Two device PM QoS flags are defined currently: PM_QOS_FLAG_NO_POWER_OFF
+and PM_QOS_FLAG_REMOTE_WAKEUP.
+
+Note: The aggregated target values are implemented in such a way that reading
+the aggregated value does not require any locking mechanism.
+
+
+From kernel mode the use of this interface is the following:
+
+int dev_pm_qos_add_request(device, handle, type, value):
+Will insert an element into the list for that identified device with the
+target value.  Upon change to this list the new target is recomputed and any
+registered notifiers are called only if the target value is now different.
+Clients of dev_pm_qos need to save the handle for future use in other
+dev_pm_qos API functions.
+
+int dev_pm_qos_update_request(handle, new_value):
+Will update the list element pointed to by the handle with the new target value
+and recompute the new aggregated target, calling the notification trees if the
+target is changed.
+
+int dev_pm_qos_remove_request(handle):
+Will remove the element.  After removal it will update the aggregate target and
+call the notification trees if the target was changed as a result of removing
+the request.
+
+s32 dev_pm_qos_read_value(device):
+Returns the aggregated value for a given device's constraints list.
+
+enum pm_qos_flags_status dev_pm_qos_flags(device, mask)
+Check PM QoS flags of the given device against the given mask of flags.
+The meaning of the return values is as follows:
+	PM_QOS_FLAGS_ALL: All flags from the mask are set
+	PM_QOS_FLAGS_SOME: Some flags from the mask are set
+	PM_QOS_FLAGS_NONE: No flags from the mask are set
+	PM_QOS_FLAGS_UNDEFINED: The device's PM QoS structure has not been
+			initialized or the list of requests is empty.
+
+int dev_pm_qos_add_ancestor_request(dev, handle, type, value)
+Add a PM QoS request for the first direct ancestor of the given device whose
+power.ignore_children flag is unset (for DEV_PM_QOS_RESUME_LATENCY requests)
+or whose power.set_latency_tolerance callback pointer is not NULL (for
+DEV_PM_QOS_LATENCY_TOLERANCE requests).
+
+int dev_pm_qos_expose_latency_limit(device, value)
+Add a request to the device's PM QoS list of resume latency constraints and
+create a sysfs attribute pm_qos_resume_latency_us under the device's power
+directory allowing user space to manipulate that request.
+
+void dev_pm_qos_hide_latency_limit(device)
+Drop the request added by dev_pm_qos_expose_latency_limit() from the device's
+PM QoS list of resume latency constraints and remove sysfs attribute
+pm_qos_resume_latency_us from the device's power directory.
+
+int dev_pm_qos_expose_flags(device, value)
+Add a request to the device's PM QoS list of flags and create sysfs attributes
+pm_qos_no_power_off and pm_qos_remote_wakeup under the device's power directory
+allowing user space to change these flags' value.
+
+void dev_pm_qos_hide_flags(device)
+Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS list
+of flags and remove sysfs attributes pm_qos_no_power_off and pm_qos_remote_wakeup
+under the device's power directory.
+
+Notification mechanisms:
+The per-device PM QoS framework has 2 different and distinct notification trees:
+a per-device notification tree and a global notification tree.
+
+int dev_pm_qos_add_notifier(device, notifier):
+Adds a notification callback function for the device.
+The callback is called when the aggregated value of the device constraints list
+is changed (for resume latency device PM QoS only).
+
+int dev_pm_qos_remove_notifier(device, notifier):
+Removes the notification callback function for the device.
+
+int dev_pm_qos_add_global_notifier(notifier):
+Adds a notification callback function in the global notification tree of the
+framework.
+The callback is called when the aggregated value for any device is changed
+(for resume latency device PM QoS only).
+
+int dev_pm_qos_remove_global_notifier(notifier):
+Removes the notification callback function from the global notification tree
+of the framework.
+
+
+Active state latency tolerance
+
+This device PM QoS type is used to support systems in which hardware may switch
+to energy-saving operation modes on the fly.  In those systems, if the operation
+mode chosen by the hardware attempts to save energy in an overly aggressive way,
+it may cause excess latencies to be visible to software, causing it to miss
+certain protocol requirements or target frame or sample rates etc.
+
+If there is a latency tolerance control mechanism for a given device available
+to software, the .set_latency_tolerance callback in that device's dev_pm_info
+structure should be populated.  The routine pointed to by it is should implement
+whatever is necessary to transfer the effective requirement value to the
+hardware.
+
+Whenever the effective latency tolerance changes for the device, its
+.set_latency_tolerance() callback will be executed and the effective value will
+be passed to it.  If that value is negative, which means that the list of
+latency tolerance requirements for the device is empty, the callback is expected
+to switch the underlying hardware latency tolerance control mechanism to an
+autonomous mode if available.  If that value is PM_QOS_LATENCY_ANY, in turn, and
+the hardware supports a special "no requirement" setting, the callback is
+expected to use it.  That allows software to prevent the hardware from
+automatically updating the device's latency tolerance in response to its power
+state changes (e.g. during transitions from D3cold to D0), which generally may
+be done in the autonomous latency tolerance control mode.
+
+If .set_latency_tolerance() is present for the device, sysfs attribute
+pm_qos_latency_tolerance_us will be present in the devivce's power directory.
+Then, user space can use that attribute to specify its latency tolerance
+requirement for the device, if any.  Writing "any" to it means "no requirement,
+but do not let the hardware control latency tolerance" and writing "auto" to it
+allows the hardware to be switched to the autonomous mode if there are no other
+requirements from the kernel side in the device's list.
+
+Kernel code can use the functions described above along with the
+DEV_PM_QOS_LATENCY_TOLERANCE device PM QoS type to add, remove and update
+latency tolerance requirements for devices.
diff --git a/Documentation/power/power_supply_class.txt b/Documentation/power/power_supply_class.txt
index a8686e5a685..89a8816990f 100644
--- a/Documentation/power/power_supply_class.txt
+++ b/Documentation/power/power_supply_class.txt
@@ -76,9 +76,19 @@ STATUS - this attribute represents operating status (charging, full,
 discharging (i.e. powering a load), etc.). This corresponds to
 BATTERY_STATUS_* values, as defined in battery.h.
 
+CHARGE_TYPE - batteries can typically charge at different rates.
+This defines trickle and fast charges.  For batteries that
+are already charged or discharging, 'n/a' can be displayed (or
+'unknown', if the status is not known).
+
+AUTHENTIC - indicates the power supply (battery or charger) connected
+to the platform is authentic(1) or non authentic(0).
+
 HEALTH - represents health of the battery, values corresponds to
 POWER_SUPPLY_HEALTH_*, defined in battery.h.
 
+VOLTAGE_OCV - open circuit voltage of the battery.
+
 VOLTAGE_MAX_DESIGN, VOLTAGE_MIN_DESIGN - design values for maximal and
 minimal power supply voltages. Maximal/minimal means values of voltages
 when battery considered "full"/"empty" at normal conditions. Yes, there is
@@ -101,12 +111,35 @@ of charge when battery became full/empty". It also could mean "value of
 charge when battery considered full/empty at given conditions (temperature,
 age)". I.e. these attributes represents real thresholds, not design values.
 
+CHARGE_COUNTER - the current charge counter (in µAh).  This could easily
+be negative; there is no empty or full value.  It is only useful for
+relative, time-based measurements.
+
+CONSTANT_CHARGE_CURRENT - constant charge current programmed by charger.
+CONSTANT_CHARGE_CURRENT_MAX - maximum charge current supported by the
+power supply object.
+
+CONSTANT_CHARGE_VOLTAGE - constant charge voltage programmed by charger.
+CONSTANT_CHARGE_VOLTAGE_MAX - maximum charge voltage supported by the
+power supply object.
+
+CHARGE_CONTROL_LIMIT - current charge control limit setting
+CHARGE_CONTROL_LIMIT_MAX - maximum charge control limit setting
+
 ENERGY_FULL, ENERGY_EMPTY - same as above but for energy.
 
 CAPACITY - capacity in percents.
+CAPACITY_ALERT_MIN - minimum capacity alert value in percents.
+CAPACITY_ALERT_MAX - maximum capacity alert value in percents.
+CAPACITY_LEVEL - capacity level. This corresponds to
+POWER_SUPPLY_CAPACITY_LEVEL_*.
 
 TEMP - temperature of the power supply.
+TEMP_ALERT_MIN - minimum battery temperature alert.
+TEMP_ALERT_MAX - maximum battery temperature alert.
 TEMP_AMBIENT - ambient temperature.
+TEMP_AMBIENT_ALERT_MIN - minimum ambient temperature alert.
+TEMP_AMBIENT_ALERT_MAX - maximum ambient temperature alert.
 
 TIME_TO_EMPTY - seconds left for battery to be considered empty (i.e.
 while battery powers a load)
diff --git a/Documentation/power/powercap/powercap.txt b/Documentation/power/powercap/powercap.txt
new file mode 100644
index 00000000000..1e6ef164e07
--- /dev/null
+++ b/Documentation/power/powercap/powercap.txt
@@ -0,0 +1,236 @@
+Power Capping Framework
+==================================
+
+The power capping framework provides a consistent interface between the kernel
+and the user space that allows power capping drivers to expose the settings to
+user space in a uniform way.
+
+Terminology
+=========================
+The framework exposes power capping devices to user space via sysfs in the
+form of a tree of objects. The objects at the root level of the tree represent
+'control types', which correspond to different methods of power capping.  For
+example, the intel-rapl control type represents the Intel "Running Average
+Power Limit" (RAPL) technology, whereas the 'idle-injection' control type
+corresponds to the use of idle injection for controlling power.
+
+Power zones represent different parts of the system, which can be controlled and
+monitored using the power capping method determined by the control type the
+given zone belongs to. They each contain attributes for monitoring power, as
+well as controls represented in the form of power constraints.  If the parts of
+the system represented by different power zones are hierarchical (that is, one
+bigger part consists of multiple smaller parts that each have their own power
+controls), those power zones may also be organized in a hierarchy with one
+parent power zone containing multiple subzones and so on to reflect the power
+control topology of the system.  In that case, it is possible to apply power
+capping to a set of devices together using the parent power zone and if more
+fine grained control is required, it can be applied through the subzones.
+
+
+Example sysfs interface tree:
+
+/sys/devices/virtual/powercap
+??? intel-rapl
+    ??? intel-rapl:0
+    ?   ??? constraint_0_name
+    ?   ??? constraint_0_power_limit_uw
+    ?   ??? constraint_0_time_window_us
+    ?   ??? constraint_1_name
+    ?   ??? constraint_1_power_limit_uw
+    ?   ??? constraint_1_time_window_us
+    ?   ??? device -> ../../intel-rapl
+    ?   ??? energy_uj
+    ?   ??? intel-rapl:0:0
+    ?   ?   ??? constraint_0_name
+    ?   ?   ??? constraint_0_power_limit_uw
+    ?   ?   ??? constraint_0_time_window_us
+    ?   ?   ??? constraint_1_name
+    ?   ?   ??? constraint_1_power_limit_uw
+    ?   ?   ??? constraint_1_time_window_us
+    ?   ?   ??? device -> ../../intel-rapl:0
+    ?   ?   ??? energy_uj
+    ?   ?   ??? max_energy_range_uj
+    ?   ?   ??? name
+    ?   ?   ??? enabled
+    ?   ?   ??? power
+    ?   ?   ?   ??? async
+    ?   ?   ?   []
+    ?   ?   ??? subsystem -> ../../../../../../class/power_cap
+    ?   ?   ??? uevent
+    ?   ??? intel-rapl:0:1
+    ?   ?   ??? constraint_0_name
+    ?   ?   ??? constraint_0_power_limit_uw
+    ?   ?   ??? constraint_0_time_window_us
+    ?   ?   ??? constraint_1_name
+    ?   ?   ??? constraint_1_power_limit_uw
+    ?   ?   ??? constraint_1_time_window_us
+    ?   ?   ??? device -> ../../intel-rapl:0
+    ?   ?   ??? energy_uj
+    ?   ?   ??? max_energy_range_uj
+    ?   ?   ??? name
+    ?   ?   ??? enabled
+    ?   ?   ??? power
+    ?   ?   ?   ??? async
+    ?   ?   ?   []
+    ?   ?   ??? subsystem -> ../../../../../../class/power_cap
+    ?   ?   ??? uevent
+    ?   ??? max_energy_range_uj
+    ?   ??? max_power_range_uw
+    ?   ??? name
+    ?   ??? enabled
+    ?   ??? power
+    ?   ?   ??? async
+    ?   ?   []
+    ?   ??? subsystem -> ../../../../../class/power_cap
+    ?   ??? enabled
+    ?   ??? uevent
+    ??? intel-rapl:1
+    ?   ??? constraint_0_name
+    ?   ??? constraint_0_power_limit_uw
+    ?   ??? constraint_0_time_window_us
+    ?   ??? constraint_1_name
+    ?   ??? constraint_1_power_limit_uw
+    ?   ??? constraint_1_time_window_us
+    ?   ??? device -> ../../intel-rapl
+    ?   ??? energy_uj
+    ?   ??? intel-rapl:1:0
+    ?   ?   ??? constraint_0_name
+    ?   ?   ??? constraint_0_power_limit_uw
+    ?   ?   ??? constraint_0_time_window_us
+    ?   ?   ??? constraint_1_name
+    ?   ?   ??? constraint_1_power_limit_uw
+    ?   ?   ??? constraint_1_time_window_us
+    ?   ?   ??? device -> ../../intel-rapl:1
+    ?   ?   ??? energy_uj
+    ?   ?   ??? max_energy_range_uj
+    ?   ?   ??? name
+    ?   ?   ??? enabled
+    ?   ?   ??? power
+    ?   ?   ?   ??? async
+    ?   ?   ?   []
+    ?   ?   ??? subsystem -> ../../../../../../class/power_cap
+    ?   ?   ??? uevent
+    ?   ??? intel-rapl:1:1
+    ?   ?   ??? constraint_0_name
+    ?   ?   ??? constraint_0_power_limit_uw
+    ?   ?   ??? constraint_0_time_window_us
+    ?   ?   ??? constraint_1_name
+    ?   ?   ??? constraint_1_power_limit_uw
+    ?   ?   ??? constraint_1_time_window_us
+    ?   ?   ??? device -> ../../intel-rapl:1
+    ?   ?   ??? energy_uj
+    ?   ?   ??? max_energy_range_uj
+    ?   ?   ??? name
+    ?   ?   ??? enabled
+    ?   ?   ??? power
+    ?   ?   ?   ??? async
+    ?   ?   ?   []
+    ?   ?   ??? subsystem -> ../../../../../../class/power_cap
+    ?   ?   ??? uevent
+    ?   ??? max_energy_range_uj
+    ?   ??? max_power_range_uw
+    ?   ??? name
+    ?   ??? enabled
+    ?   ??? power
+    ?   ?   ??? async
+    ?   ?   []
+    ?   ??? subsystem -> ../../../../../class/power_cap
+    ?   ??? uevent
+    ??? power
+    ?   ??? async
+    ?   []
+    ??? subsystem -> ../../../../class/power_cap
+    ??? enabled
+    ??? uevent
+
+The above example illustrates a case in which the Intel RAPL technology,
+available in Intel® IA-64 and IA-32 Processor Architectures, is used. There is one
+control type called intel-rapl which contains two power zones, intel-rapl:0 and
+intel-rapl:1, representing CPU packages.  Each of these power zones contains
+two subzones, intel-rapl:j:0 and intel-rapl:j:1 (j = 0, 1), representing the
+"core" and the "uncore" parts of the given CPU package, respectively.  All of
+the zones and subzones contain energy monitoring attributes (energy_uj,
+max_energy_range_uj) and constraint attributes (constraint_*) allowing controls
+to be applied (the constraints in the 'package' power zones apply to the whole
+CPU packages and the subzone constraints only apply to the respective parts of
+the given package individually). Since Intel RAPL doesn't provide instantaneous
+power value, there is no power_uw attribute.
+
+In addition to that, each power zone contains a name attribute, allowing the
+part of the system represented by that zone to be identified.
+For example:
+
+cat /sys/class/power_cap/intel-rapl/intel-rapl:0/name
+package-0
+
+The Intel RAPL technology allows two constraints, short term and long term,
+with two different time windows to be applied to each power zone.  Thus for
+each zone there are 2 attributes representing the constraint names, 2 power
+limits and 2 attributes representing the sizes of the time windows. Such that,
+constraint_j_* attributes correspond to the jth constraint (j = 0,1).
+
+For example:
+	constraint_0_name
+	constraint_0_power_limit_uw
+	constraint_0_time_window_us
+	constraint_1_name
+	constraint_1_power_limit_uw
+	constraint_1_time_window_us
+
+Power Zone Attributes
+=================================
+Monitoring attributes
+----------------------
+
+energy_uj (rw): Current energy counter in micro joules. Write "0" to reset.
+If the counter can not be reset, then this attribute is read only.
+
+max_energy_range_uj (ro): Range of the above energy counter in micro-joules.
+
+power_uw (ro): Current power in micro watts.
+
+max_power_range_uw (ro): Range of the above power value in micro-watts.
+
+name (ro): Name of this power zone.
+
+It is possible that some domains have both power ranges and energy counter ranges;
+however, only one is mandatory.
+
+Constraints
+----------------
+constraint_X_power_limit_uw (rw): Power limit in micro watts, which should be
+applicable for the time window specified by "constraint_X_time_window_us".
+
+constraint_X_time_window_us (rw): Time window in micro seconds.
+
+constraint_X_name (ro): An optional name of the constraint
+
+constraint_X_max_power_uw(ro): Maximum allowed power in micro watts.
+
+constraint_X_min_power_uw(ro): Minimum allowed power in micro watts.
+
+constraint_X_max_time_window_us(ro): Maximum allowed time window in micro seconds.
+
+constraint_X_min_time_window_us(ro): Minimum allowed time window in micro seconds.
+
+Except power_limit_uw and time_window_us other fields are optional.
+
+Common zone and control type attributes
+----------------------------------------
+enabled (rw): Enable/Disable controls at zone level or for all zones using
+a control type.
+
+Power Cap Client Driver Interface
+==================================
+The API summary:
+
+Call powercap_register_control_type() to register control type object.
+Call powercap_register_zone() to register a power zone (under a given
+control type), either as a top-level power zone or as a subzone of another
+power zone registered earlier.
+The number of constraints in a power zone and the corresponding callbacks have
+to be defined prior to calling powercap_register_zone() to register that zone.
+
+To Free a power zone call powercap_unregister_zone().
+To free a control type object call powercap_unregister_control_type().
+Detailed API can be generated using kernel-doc on include/linux/powercap.h.
diff --git a/Documentation/power/regulator/consumer.txt b/Documentation/power/regulator/consumer.txt
new file mode 100644
index 00000000000..55c4175d809
--- /dev/null
+++ b/Documentation/power/regulator/consumer.txt
@@ -0,0 +1,182 @@
+Regulator Consumer Driver Interface
+===================================
+
+This text describes the regulator interface for consumer device drivers.
+Please see overview.txt for a description of the terms used in this text.
+
+
+1. Consumer Regulator Access (static & dynamic drivers)
+=======================================================
+
+A consumer driver can get access to its supply regulator by calling :-
+
+regulator = regulator_get(dev, "Vcc");
+
+The consumer passes in its struct device pointer and power supply ID. The core
+then finds the correct regulator by consulting a machine specific lookup table.
+If the lookup is successful then this call will return a pointer to the struct
+regulator that supplies this consumer.
+
+To release the regulator the consumer driver should call :-
+
+regulator_put(regulator);
+
+Consumers can be supplied by more than one regulator e.g. codec consumer with
+analog and digital supplies :-
+
+digital = regulator_get(dev, "Vcc");  /* digital core */
+analog = regulator_get(dev, "Avdd");  /* analog */
+
+The regulator access functions regulator_get() and regulator_put() will
+usually be called in your device drivers probe() and remove() respectively.
+
+
+2. Regulator Output Enable & Disable (static & dynamic drivers)
+====================================================================
+
+A consumer can enable its power supply by calling:-
+
+int regulator_enable(regulator);
+
+NOTE: The supply may already be enabled before regulator_enabled() is called.
+This may happen if the consumer shares the regulator or the regulator has been
+previously enabled by bootloader or kernel board initialization code.
+
+A consumer can determine if a regulator is enabled by calling :-
+
+int regulator_is_enabled(regulator);
+
+This will return > zero when the regulator is enabled.
+
+
+A consumer can disable its supply when no longer needed by calling :-
+
+int regulator_disable(regulator);
+
+NOTE: This may not disable the supply if it's shared with other consumers. The
+regulator will only be disabled when the enabled reference count is zero.
+
+Finally, a regulator can be forcefully disabled in the case of an emergency :-
+
+int regulator_force_disable(regulator);
+
+NOTE: this will immediately and forcefully shutdown the regulator output. All
+consumers will be powered off.
+
+
+3. Regulator Voltage Control & Status (dynamic drivers)
+======================================================
+
+Some consumer drivers need to be able to dynamically change their supply
+voltage to match system operating points. e.g. CPUfreq drivers can scale
+voltage along with frequency to save power, SD drivers may need to select the
+correct card voltage, etc.
+
+Consumers can control their supply voltage by calling :-
+
+int regulator_set_voltage(regulator, min_uV, max_uV);
+
+Where min_uV and max_uV are the minimum and maximum acceptable voltages in
+microvolts.
+
+NOTE: this can be called when the regulator is enabled or disabled. If called
+when enabled, then the voltage changes instantly, otherwise the voltage
+configuration changes and the voltage is physically set when the regulator is
+next enabled.
+
+The regulators configured voltage output can be found by calling :-
+
+int regulator_get_voltage(regulator);
+
+NOTE: get_voltage() will return the configured output voltage whether the
+regulator is enabled or disabled and should NOT be used to determine regulator
+output state. However this can be used in conjunction with is_enabled() to
+determine the regulator physical output voltage.
+
+
+4. Regulator Current Limit Control & Status (dynamic drivers)
+===========================================================
+
+Some consumer drivers need to be able to dynamically change their supply
+current limit to match system operating points. e.g. LCD backlight driver can
+change the current limit to vary the backlight brightness, USB drivers may want
+to set the limit to 500mA when supplying power.
+
+Consumers can control their supply current limit by calling :-
+
+int regulator_set_current_limit(regulator, min_uA, max_uA);
+
+Where min_uA and max_uA are the minimum and maximum acceptable current limit in
+microamps.
+
+NOTE: this can be called when the regulator is enabled or disabled. If called
+when enabled, then the current limit changes instantly, otherwise the current
+limit configuration changes and the current limit is physically set when the
+regulator is next enabled.
+
+A regulators current limit can be found by calling :-
+
+int regulator_get_current_limit(regulator);
+
+NOTE: get_current_limit() will return the current limit whether the regulator
+is enabled or disabled and should not be used to determine regulator current
+load.
+
+
+5. Regulator Operating Mode Control & Status (dynamic drivers)
+=============================================================
+
+Some consumers can further save system power by changing the operating mode of
+their supply regulator to be more efficient when the consumers operating state
+changes. e.g. consumer driver is idle and subsequently draws less current
+
+Regulator operating mode can be changed indirectly or directly.
+
+Indirect operating mode control.
+--------------------------------
+Consumer drivers can request a change in their supply regulator operating mode
+by calling :-
+
+int regulator_set_optimum_mode(struct regulator *regulator, int load_uA);
+
+This will cause the core to recalculate the total load on the regulator (based
+on all its consumers) and change operating mode (if necessary and permitted)
+to best match the current operating load.
+
+The load_uA value can be determined from the consumers datasheet. e.g.most
+datasheets have tables showing the max current consumed in certain situations.
+
+Most consumers will use indirect operating mode control since they have no
+knowledge of the regulator or whether the regulator is shared with other
+consumers.
+
+Direct operating mode control.
+------------------------------
+Bespoke or tightly coupled drivers may want to directly control regulator
+operating mode depending on their operating point. This can be achieved by
+calling :-
+
+int regulator_set_mode(struct regulator *regulator, unsigned int mode);
+unsigned int regulator_get_mode(struct regulator *regulator);
+
+Direct mode will only be used by consumers that *know* about the regulator and
+are not sharing the regulator with other consumers.
+
+
+6. Regulator Events
+===================
+Regulators can notify consumers of external events. Events could be received by
+consumers under regulator stress or failure conditions.
+
+Consumers can register interest in regulator events by calling :-
+
+int regulator_register_notifier(struct regulator *regulator,
+			      struct notifier_block *nb);
+
+Consumers can uregister interest by calling :-
+
+int regulator_unregister_notifier(struct regulator *regulator,
+				struct notifier_block *nb);
+
+Regulators use the kernel notifier framework to send event to their interested
+consumers.
diff --git a/Documentation/power/regulator/design.txt b/Documentation/power/regulator/design.txt
new file mode 100644
index 00000000000..f9b56b72b78
--- /dev/null
+++ b/Documentation/power/regulator/design.txt
@@ -0,0 +1,33 @@
+Regulator API design notes
+==========================
+
+This document provides a brief, partially structured, overview of some
+of the design considerations which impact the regulator API design.
+
+Safety
+------
+
+ - Errors in regulator configuration can have very serious consequences
+   for the system, potentially including lasting hardware damage.
+ - It is not possible to automatically determine the power confugration
+   of the system - software-equivalent variants of the same chip may
+   have different power requirments, and not all components with power
+   requirements are visible to software.
+
+  => The API should make no changes to the hardware state unless it has
+     specific knowledge that these changes are safe to do perform on
+     this particular system.
+
+Consumer use cases
+------------------
+
+ - The overwhelming majority of devices in a system will have no
+   requirement to do any runtime configuration of their power beyond
+   being able to turn it on or off.
+
+ - Many of the power supplies in the system will be shared between many
+   different consumers.
+
+  => The consumer API should be structured so that these use cases are
+     very easy to handle and so that consumers will work with shared
+     supplies without any additional effort.
diff --git a/Documentation/power/regulator/machine.txt b/Documentation/power/regulator/machine.txt
new file mode 100644
index 00000000000..ce63af0a8e3
--- /dev/null
+++ b/Documentation/power/regulator/machine.txt
@@ -0,0 +1,100 @@
+Regulator Machine Driver Interface
+===================================
+
+The regulator machine driver interface is intended for board/machine specific
+initialisation code to configure the regulator subsystem.
+
+Consider the following machine :-
+
+  Regulator-1 -+-> Regulator-2 --> [Consumer A @ 1.8 - 2.0V]
+               |
+               +-> [Consumer B @ 3.3V]
+
+The drivers for consumers A & B must be mapped to the correct regulator in
+order to control their power supply. This mapping can be achieved in machine
+initialisation code by creating a struct regulator_consumer_supply for
+each regulator.
+
+struct regulator_consumer_supply {
+	const char *dev_name;	/* consumer dev_name() */
+	const char *supply;	/* consumer supply - e.g. "vcc" */
+};
+
+e.g. for the machine above
+
+static struct regulator_consumer_supply regulator1_consumers[] = {
+{
+	.dev_name	= "dev_name(consumer B)",
+	.supply		= "Vcc",
+},};
+
+static struct regulator_consumer_supply regulator2_consumers[] = {
+{
+	.dev	= "dev_name(consumer A"),
+	.supply	= "Vcc",
+},};
+
+This maps Regulator-1 to the 'Vcc' supply for Consumer B and maps Regulator-2
+to the 'Vcc' supply for Consumer A.
+
+Constraints can now be registered by defining a struct regulator_init_data
+for each regulator power domain. This structure also maps the consumers
+to their supply regulator :-
+
+static struct regulator_init_data regulator1_data = {
+	.constraints = {
+		.name = "Regulator-1",
+		.min_uV = 3300000,
+		.max_uV = 3300000,
+		.valid_modes_mask = REGULATOR_MODE_NORMAL,
+	},
+	.num_consumer_supplies = ARRAY_SIZE(regulator1_consumers),
+	.consumer_supplies = regulator1_consumers,
+};
+
+The name field should be set to something that is usefully descriptive
+for the board for configuration of supplies for other regulators and
+for use in logging and other diagnostic output.  Normally the name
+used for the supply rail in the schematic is a good choice.  If no
+name is provided then the subsystem will choose one.
+
+Regulator-1 supplies power to Regulator-2. This relationship must be registered
+with the core so that Regulator-1 is also enabled when Consumer A enables its
+supply (Regulator-2). The supply regulator is set by the supply_regulator
+field below and co:-
+
+static struct regulator_init_data regulator2_data = {
+	.supply_regulator = "Regulator-1",
+	.constraints = {
+		.min_uV = 1800000,
+		.max_uV = 2000000,
+		.valid_ops_mask = REGULATOR_CHANGE_VOLTAGE,
+		.valid_modes_mask = REGULATOR_MODE_NORMAL,
+	},
+	.num_consumer_supplies = ARRAY_SIZE(regulator2_consumers),
+	.consumer_supplies = regulator2_consumers,
+};
+
+Finally the regulator devices must be registered in the usual manner.
+
+static struct platform_device regulator_devices[] = {
+{
+	.name = "regulator",
+	.id = DCDC_1,
+	.dev = {
+		.platform_data = &regulator1_data,
+	},
+},
+{
+	.name = "regulator",
+	.id = DCDC_2,
+	.dev = {
+		.platform_data = &regulator2_data,
+	},
+},
+};
+/* register regulator 1 device */
+platform_device_register(&regulator_devices[0]);
+
+/* register regulator 2 device */
+platform_device_register(&regulator_devices[1]);
diff --git a/Documentation/power/regulator/overview.txt b/Documentation/power/regulator/overview.txt
new file mode 100644
index 00000000000..8ed17587a74
--- /dev/null
+++ b/Documentation/power/regulator/overview.txt
@@ -0,0 +1,171 @@
+Linux voltage and current regulator framework
+=============================================
+
+About
+=====
+
+This framework is designed to provide a standard kernel interface to control
+voltage and current regulators.
+
+The intention is to allow systems to dynamically control regulator power output
+in order to save power and prolong battery life. This applies to both voltage
+regulators (where voltage output is controllable) and current sinks (where
+current limit is controllable).
+
+(C) 2008  Wolfson Microelectronics PLC.
+Author: Liam Girdwood <lrg@slimlogic.co.uk>
+
+
+Nomenclature
+============
+
+Some terms used in this document:-
+
+  o Regulator    - Electronic device that supplies power to other devices.
+                   Most regulators can enable and disable their output whilst
+                   some can control their output voltage and or current.
+
+                   Input Voltage -> Regulator -> Output Voltage
+
+
+  o PMIC         - Power Management IC. An IC that contains numerous regulators
+                   and often contains other subsystems.
+
+
+  o Consumer     - Electronic device that is supplied power by a regulator.
+                   Consumers can be classified into two types:-
+
+                   Static: consumer does not change its supply voltage or
+                   current limit. It only needs to enable or disable it's
+                   power supply. Its supply voltage is set by the hardware,
+                   bootloader, firmware or kernel board initialisation code.
+
+                   Dynamic: consumer needs to change it's supply voltage or
+                   current limit to meet operation demands.
+
+
+  o Power Domain - Electronic circuit that is supplied its input power by the
+                   output power of a regulator, switch or by another power
+                   domain.
+
+                   The supply regulator may be behind a switch(s). i.e.
+
+                   Regulator -+-> Switch-1 -+-> Switch-2 --> [Consumer A]
+                              |             |
+                              |             +-> [Consumer B], [Consumer C]
+                              |
+                              +-> [Consumer D], [Consumer E]
+
+                   That is one regulator and three power domains:
+
+                   Domain 1: Switch-1, Consumers D & E.
+                   Domain 2: Switch-2, Consumers B & C.
+                   Domain 3: Consumer A.
+
+                   and this represents a "supplies" relationship:
+
+                   Domain-1 --> Domain-2 --> Domain-3.
+
+                   A power domain may have regulators that are supplied power
+                   by other regulators. i.e.
+
+                   Regulator-1 -+-> Regulator-2 -+-> [Consumer A]
+                                |
+                                +-> [Consumer B]
+
+                   This gives us two regulators and two power domains:
+
+                   Domain 1: Regulator-2, Consumer B.
+                   Domain 2: Consumer A.
+
+                   and a "supplies" relationship:
+
+                   Domain-1 --> Domain-2
+
+
+  o Constraints  - Constraints are used to define power levels for performance
+                   and hardware protection. Constraints exist at three levels:
+
+                   Regulator Level: This is defined by the regulator hardware
+                   operating parameters and is specified in the regulator
+                   datasheet. i.e.
+
+                     - voltage output is in the range 800mV -> 3500mV.
+                     - regulator current output limit is 20mA @ 5V but is
+                       10mA @ 10V.
+
+                   Power Domain Level: This is defined in software by kernel
+                   level board initialisation code. It is used to constrain a
+                   power domain to a particular power range. i.e.
+
+                     - Domain-1 voltage is 3300mV
+                     - Domain-2 voltage is 1400mV -> 1600mV
+                     - Domain-3 current limit is 0mA -> 20mA.
+
+                   Consumer Level: This is defined by consumer drivers
+                   dynamically setting voltage or current limit levels.
+
+                   e.g. a consumer backlight driver asks for a current increase
+                   from 5mA to 10mA to increase LCD illumination. This passes
+                   to through the levels as follows :-
+
+                   Consumer: need to increase LCD brightness. Lookup and
+                   request next current mA value in brightness table (the
+                   consumer driver could be used on several different
+                   personalities based upon the same reference device).
+
+                   Power Domain: is the new current limit within the domain
+                   operating limits for this domain and system state (e.g.
+                   battery power, USB power)
+
+                   Regulator Domains: is the new current limit within the
+                   regulator operating parameters for input/output voltage.
+
+                   If the regulator request passes all the constraint tests
+                   then the new regulator value is applied.
+
+
+Design
+======
+
+The framework is designed and targeted at SoC based devices but may also be
+relevant to non SoC devices and is split into the following four interfaces:-
+
+
+   1. Consumer driver interface.
+
+      This uses a similar API to the kernel clock interface in that consumer
+      drivers can get and put a regulator (like they can with clocks atm) and
+      get/set voltage, current limit, mode, enable and disable. This should
+      allow consumers complete control over their supply voltage and current
+      limit. This also compiles out if not in use so drivers can be reused in
+      systems with no regulator based power control.
+
+        See Documentation/power/regulator/consumer.txt
+
+   2. Regulator driver interface.
+
+      This allows regulator drivers to register their regulators and provide
+      operations to the core. It also has a notifier call chain for propagating
+      regulator events to clients.
+
+        See Documentation/power/regulator/regulator.txt
+
+   3. Machine interface.
+
+      This interface is for machine specific code and allows the creation of
+      voltage/current domains (with constraints) for each regulator. It can
+      provide regulator constraints that will prevent device damage through
+      overvoltage or over current caused by buggy client drivers. It also
+      allows the creation of a regulator tree whereby some regulators are
+      supplied by others (similar to a clock tree).
+
+        See Documentation/power/regulator/machine.txt
+
+   4. Userspace ABI.
+
+      The framework also exports a lot of useful voltage/current/opmode data to
+      userspace via sysfs. This could be used to help monitor device power
+      consumption and status.
+
+        See Documentation/ABI/testing/sysfs-class-regulator
diff --git a/Documentation/power/regulator/regulator.txt b/Documentation/power/regulator/regulator.txt
new file mode 100644
index 00000000000..13902778ae4
--- /dev/null
+++ b/Documentation/power/regulator/regulator.txt
@@ -0,0 +1,30 @@
+Regulator Driver Interface
+==========================
+
+The regulator driver interface is relatively simple and designed to allow
+regulator drivers to register their services with the core framework.
+
+
+Registration
+============
+
+Drivers can register a regulator by calling :-
+
+struct regulator_dev *regulator_register(struct regulator_desc *regulator_desc,
+					 const struct regulator_config *config);
+
+This will register the regulators capabilities and operations to the regulator
+core.
+
+Regulators can be unregistered by calling :-
+
+void regulator_unregister(struct regulator_dev *rdev);
+
+
+Regulator Events
+================
+Regulators can send events (e.g. over temp, under voltage, etc) to consumer
+drivers by calling :-
+
+int regulator_notifier_call_chain(struct regulator_dev *rdev,
+				  unsigned long event, void *data);
diff --git a/Documentation/power/runtime_pm.txt b/Documentation/power/runtime_pm.txt
new file mode 100644
index 00000000000..f32ce541957
--- /dev/null
+++ b/Documentation/power/runtime_pm.txt
@@ -0,0 +1,908 @@
+Runtime Power Management Framework for I/O Devices
+
+(C) 2009-2011 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
+(C) 2010 Alan Stern <stern@rowland.harvard.edu>
+(C) 2014 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+1. Introduction
+
+Support for runtime power management (runtime PM) of I/O devices is provided
+at the power management core (PM core) level by means of:
+
+* The power management workqueue pm_wq in which bus types and device drivers can
+  put their PM-related work items.  It is strongly recommended that pm_wq be
+  used for queuing all work items related to runtime PM, because this allows
+  them to be synchronized with system-wide power transitions (suspend to RAM,
+  hibernation and resume from system sleep states).  pm_wq is declared in
+  include/linux/pm_runtime.h and defined in kernel/power/main.c.
+
+* A number of runtime PM fields in the 'power' member of 'struct device' (which
+  is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can
+  be used for synchronizing runtime PM operations with one another.
+
+* Three device runtime PM callbacks in 'struct dev_pm_ops' (defined in
+  include/linux/pm.h).
+
+* A set of helper functions defined in drivers/base/power/runtime.c that can be
+  used for carrying out runtime PM operations in such a way that the
+  synchronization between them is taken care of by the PM core.  Bus types and
+  device drivers are encouraged to use these functions.
+
+The runtime PM callbacks present in 'struct dev_pm_ops', the device runtime PM
+fields of 'struct dev_pm_info' and the core helper functions provided for
+runtime PM are described below.
+
+2. Device Runtime PM Callbacks
+
+There are three device runtime PM callbacks defined in 'struct dev_pm_ops':
+
+struct dev_pm_ops {
+	...
+	int (*runtime_suspend)(struct device *dev);
+	int (*runtime_resume)(struct device *dev);
+	int (*runtime_idle)(struct device *dev);
+	...
+};
+
+The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks
+are executed by the PM core for the device's subsystem that may be either of
+the following:
+
+  1. PM domain of the device, if the device's PM domain object, dev->pm_domain,
+     is present.
+
+  2. Device type of the device, if both dev->type and dev->type->pm are present.
+
+  3. Device class of the device, if both dev->class and dev->class->pm are
+     present.
+
+  4. Bus type of the device, if both dev->bus and dev->bus->pm are present.
+
+If the subsystem chosen by applying the above rules doesn't provide the relevant
+callback, the PM core will invoke the corresponding driver callback stored in
+dev->driver->pm directly (if present).
+
+The PM core always checks which callback to use in the order given above, so the
+priority order of callbacks from high to low is: PM domain, device type, class
+and bus type.  Moreover, the high-priority one will always take precedence over
+a low-priority one.  The PM domain, bus type, device type and class callbacks
+are referred to as subsystem-level callbacks in what follows.
+
+By default, the callbacks are always invoked in process context with interrupts
+enabled.  However, the pm_runtime_irq_safe() helper function can be used to tell
+the PM core that it is safe to run the ->runtime_suspend(), ->runtime_resume()
+and ->runtime_idle() callbacks for the given device in atomic context with
+interrupts disabled.  This implies that the callback routines in question must
+not block or sleep, but it also means that the synchronous helper functions
+listed at the end of Section 4 may be used for that device within an interrupt
+handler or generally in an atomic context.
+
+The subsystem-level suspend callback, if present, is _entirely_ _responsible_
+for handling the suspend of the device as appropriate, which may, but need not
+include executing the device driver's own ->runtime_suspend() callback (from the
+PM core's point of view it is not necessary to implement a ->runtime_suspend()
+callback in a device driver as long as the subsystem-level suspend callback
+knows what to do to handle the device).
+
+  * Once the subsystem-level suspend callback (or the driver suspend callback,
+    if invoked directly) has completed successfully for the given device, the PM
+    core regards the device as suspended, which need not mean that it has been
+    put into a low power state.  It is supposed to mean, however, that the
+    device will not process data and will not communicate with the CPU(s) and
+    RAM until the appropriate resume callback is executed for it.  The runtime
+    PM status of a device after successful execution of the suspend callback is
+    'suspended'.
+
+  * If the suspend callback returns -EBUSY or -EAGAIN, the device's runtime PM
+    status remains 'active', which means that the device _must_ be fully
+    operational afterwards.
+
+  * If the suspend callback returns an error code different from -EBUSY and
+    -EAGAIN, the PM core regards this as a fatal error and will refuse to run
+    the helper functions described in Section 4 for the device until its status
+    is directly set to  either'active', or 'suspended' (the PM core provides
+    special helper functions for this purpose).
+
+In particular, if the driver requires remote wakeup capability (i.e. hardware
+mechanism allowing the device to request a change of its power state, such as
+PCI PME) for proper functioning and device_run_wake() returns 'false' for the
+device, then ->runtime_suspend() should return -EBUSY.  On the other hand, if
+device_run_wake() returns 'true' for the device and the device is put into a
+low-power state during the execution of the suspend callback, it is expected
+that remote wakeup will be enabled for the device.  Generally, remote wakeup
+should be enabled for all input devices put into low-power states at run time.
+
+The subsystem-level resume callback, if present, is _entirely_ _responsible_ for
+handling the resume of the device as appropriate, which may, but need not
+include executing the device driver's own ->runtime_resume() callback (from the
+PM core's point of view it is not necessary to implement a ->runtime_resume()
+callback in a device driver as long as the subsystem-level resume callback knows
+what to do to handle the device).
+
+  * Once the subsystem-level resume callback (or the driver resume callback, if
+    invoked directly) has completed successfully, the PM core regards the device
+    as fully operational, which means that the device _must_ be able to complete
+    I/O operations as needed.  The runtime PM status of the device is then
+    'active'.
+
+  * If the resume callback returns an error code, the PM core regards this as a
+    fatal error and will refuse to run the helper functions described in Section
+    4 for the device, until its status is directly set to either 'active', or
+    'suspended' (by means of special helper functions provided by the PM core
+    for this purpose).
+
+The idle callback (a subsystem-level one, if present, or the driver one) is
+executed by the PM core whenever the device appears to be idle, which is
+indicated to the PM core by two counters, the device's usage counter and the
+counter of 'active' children of the device.
+
+  * If any of these counters is decreased using a helper function provided by
+    the PM core and it turns out to be equal to zero, the other counter is
+    checked.  If that counter also is equal to zero, the PM core executes the
+    idle callback with the device as its argument.
+
+The action performed by the idle callback is totally dependent on the subsystem
+(or driver) in question, but the expected and recommended action is to check
+if the device can be suspended (i.e. if all of the conditions necessary for
+suspending the device are satisfied) and to queue up a suspend request for the
+device in that case.  If there is no idle callback, or if the callback returns
+0, then the PM core will attempt to carry out a runtime suspend of the device,
+also respecting devices configured for autosuspend.  In essence this means a
+call to pm_runtime_autosuspend() (do note that drivers needs to update the
+device last busy mark, pm_runtime_mark_last_busy(), to control the delay under
+this circumstance).  To prevent this (for example, if the callback routine has
+started a delayed suspend), the routine must return a non-zero value.  Negative
+error return codes are ignored by the PM core.
+
+The helper functions provided by the PM core, described in Section 4, guarantee
+that the following constraints are met with respect to runtime PM callbacks for
+one device:
+
+(1) The callbacks are mutually exclusive (e.g. it is forbidden to execute
+    ->runtime_suspend() in parallel with ->runtime_resume() or with another
+    instance of ->runtime_suspend() for the same device) with the exception that
+    ->runtime_suspend() or ->runtime_resume() can be executed in parallel with
+    ->runtime_idle() (although ->runtime_idle() will not be started while any
+    of the other callbacks is being executed for the same device).
+
+(2) ->runtime_idle() and ->runtime_suspend() can only be executed for 'active'
+    devices (i.e. the PM core will only execute ->runtime_idle() or
+    ->runtime_suspend() for the devices the runtime PM status of which is
+    'active').
+
+(3) ->runtime_idle() and ->runtime_suspend() can only be executed for a device
+    the usage counter of which is equal to zero _and_ either the counter of
+    'active' children of which is equal to zero, or the 'power.ignore_children'
+    flag of which is set.
+
+(4) ->runtime_resume() can only be executed for 'suspended' devices  (i.e. the
+    PM core will only execute ->runtime_resume() for the devices the runtime
+    PM status of which is 'suspended').
+
+Additionally, the helper functions provided by the PM core obey the following
+rules:
+
+  * If ->runtime_suspend() is about to be executed or there's a pending request
+    to execute it, ->runtime_idle() will not be executed for the same device.
+
+  * A request to execute or to schedule the execution of ->runtime_suspend()
+    will cancel any pending requests to execute ->runtime_idle() for the same
+    device.
+
+  * If ->runtime_resume() is about to be executed or there's a pending request
+    to execute it, the other callbacks will not be executed for the same device.
+
+  * A request to execute ->runtime_resume() will cancel any pending or
+    scheduled requests to execute the other callbacks for the same device,
+    except for scheduled autosuspends.
+
+3. Runtime PM Device Fields
+
+The following device runtime PM fields are present in 'struct dev_pm_info', as
+defined in include/linux/pm.h:
+
+  struct timer_list suspend_timer;
+    - timer used for scheduling (delayed) suspend and autosuspend requests
+
+  unsigned long timer_expires;
+    - timer expiration time, in jiffies (if this is different from zero, the
+      timer is running and will expire at that time, otherwise the timer is not
+      running)
+
+  struct work_struct work;
+    - work structure used for queuing up requests (i.e. work items in pm_wq)
+
+  wait_queue_head_t wait_queue;
+    - wait queue used if any of the helper functions needs to wait for another
+      one to complete
+
+  spinlock_t lock;
+    - lock used for synchronisation
+
+  atomic_t usage_count;
+    - the usage counter of the device
+
+  atomic_t child_count;
+    - the count of 'active' children of the device
+
+  unsigned int ignore_children;
+    - if set, the value of child_count is ignored (but still updated)
+
+  unsigned int disable_depth;
+    - used for disabling the helper funcions (they work normally if this is
+      equal to zero); the initial value of it is 1 (i.e. runtime PM is
+      initially disabled for all devices)
+
+  int runtime_error;
+    - if set, there was a fatal error (one of the callbacks returned error code
+      as described in Section 2), so the helper funtions will not work until
+      this flag is cleared; this is the error code returned by the failing
+      callback
+
+  unsigned int idle_notification;
+    - if set, ->runtime_idle() is being executed
+
+  unsigned int request_pending;
+    - if set, there's a pending request (i.e. a work item queued up into pm_wq)
+
+  enum rpm_request request;
+    - type of request that's pending (valid if request_pending is set)
+
+  unsigned int deferred_resume;
+    - set if ->runtime_resume() is about to be run while ->runtime_suspend() is
+      being executed for that device and it is not practical to wait for the
+      suspend to complete; means "start a resume as soon as you've suspended"
+
+  unsigned int run_wake;
+    - set if the device is capable of generating runtime wake-up events
+
+  enum rpm_status runtime_status;
+    - the runtime PM status of the device; this field's initial value is
+      RPM_SUSPENDED, which means that each device is initially regarded by the
+      PM core as 'suspended', regardless of its real hardware status
+
+  unsigned int runtime_auto;
+    - if set, indicates that the user space has allowed the device driver to
+      power manage the device at run time via the /sys/devices/.../power/control
+      interface; it may only be modified with the help of the pm_runtime_allow()
+      and pm_runtime_forbid() helper functions
+
+  unsigned int no_callbacks;
+    - indicates that the device does not use the runtime PM callbacks (see
+      Section 8); it may be modified only by the pm_runtime_no_callbacks()
+      helper function
+
+  unsigned int irq_safe;
+    - indicates that the ->runtime_suspend() and ->runtime_resume() callbacks
+      will be invoked with the spinlock held and interrupts disabled
+
+  unsigned int use_autosuspend;
+    - indicates that the device's driver supports delayed autosuspend (see
+      Section 9); it may be modified only by the
+      pm_runtime{_dont}_use_autosuspend() helper functions
+
+  unsigned int timer_autosuspends;
+    - indicates that the PM core should attempt to carry out an autosuspend
+      when the timer expires rather than a normal suspend
+
+  int autosuspend_delay;
+    - the delay time (in milliseconds) to be used for autosuspend
+
+  unsigned long last_busy;
+    - the time (in jiffies) when the pm_runtime_mark_last_busy() helper
+      function was last called for this device; used in calculating inactivity
+      periods for autosuspend
+
+All of the above fields are members of the 'power' member of 'struct device'.
+
+4. Runtime PM Device Helper Functions
+
+The following runtime PM helper functions are defined in
+drivers/base/power/runtime.c and include/linux/pm_runtime.h:
+
+  void pm_runtime_init(struct device *dev);
+    - initialize the device runtime PM fields in 'struct dev_pm_info'
+
+  void pm_runtime_remove(struct device *dev);
+    - make sure that the runtime PM of the device will be disabled after
+      removing the device from device hierarchy
+
+  int pm_runtime_idle(struct device *dev);
+    - execute the subsystem-level idle callback for the device; returns an
+      error code on failure, where -EINPROGRESS means that ->runtime_idle() is
+      already being executed; if there is no callback or the callback returns 0
+      then run pm_runtime_autosuspend(dev) and return its result
+
+  int pm_runtime_suspend(struct device *dev);
+    - execute the subsystem-level suspend callback for the device; returns 0 on
+      success, 1 if the device's runtime PM status was already 'suspended', or
+      error code on failure, where -EAGAIN or -EBUSY means it is safe to attempt
+      to suspend the device again in future and -EACCES means that
+      'power.disable_depth' is different from 0
+
+  int pm_runtime_autosuspend(struct device *dev);
+    - same as pm_runtime_suspend() except that the autosuspend delay is taken
+      into account; if pm_runtime_autosuspend_expiration() says the delay has
+      not yet expired then an autosuspend is scheduled for the appropriate time
+      and 0 is returned
+
+  int pm_runtime_resume(struct device *dev);
+    - execute the subsystem-level resume callback for the device; returns 0 on
+      success, 1 if the device's runtime PM status was already 'active' or
+      error code on failure, where -EAGAIN means it may be safe to attempt to
+      resume the device again in future, but 'power.runtime_error' should be
+      checked additionally, and -EACCES means that 'power.disable_depth' is
+      different from 0
+
+  int pm_request_idle(struct device *dev);
+    - submit a request to execute the subsystem-level idle callback for the
+      device (the request is represented by a work item in pm_wq); returns 0 on
+      success or error code if the request has not been queued up
+
+  int pm_request_autosuspend(struct device *dev);
+    - schedule the execution of the subsystem-level suspend callback for the
+      device when the autosuspend delay has expired; if the delay has already
+      expired then the work item is queued up immediately
+
+  int pm_schedule_suspend(struct device *dev, unsigned int delay);
+    - schedule the execution of the subsystem-level suspend callback for the
+      device in future, where 'delay' is the time to wait before queuing up a
+      suspend work item in pm_wq, in milliseconds (if 'delay' is zero, the work
+      item is queued up immediately); returns 0 on success, 1 if the device's PM
+      runtime status was already 'suspended', or error code if the request
+      hasn't been scheduled (or queued up if 'delay' is 0); if the execution of
+      ->runtime_suspend() is already scheduled and not yet expired, the new
+      value of 'delay' will be used as the time to wait
+
+  int pm_request_resume(struct device *dev);
+    - submit a request to execute the subsystem-level resume callback for the
+      device (the request is represented by a work item in pm_wq); returns 0 on
+      success, 1 if the device's runtime PM status was already 'active', or
+      error code if the request hasn't been queued up
+
+  void pm_runtime_get_noresume(struct device *dev);
+    - increment the device's usage counter
+
+  int pm_runtime_get(struct device *dev);
+    - increment the device's usage counter, run pm_request_resume(dev) and
+      return its result
+
+  int pm_runtime_get_sync(struct device *dev);
+    - increment the device's usage counter, run pm_runtime_resume(dev) and
+      return its result
+
+  void pm_runtime_put_noidle(struct device *dev);
+    - decrement the device's usage counter
+
+  int pm_runtime_put(struct device *dev);
+    - decrement the device's usage counter; if the result is 0 then run
+      pm_request_idle(dev) and return its result
+
+  int pm_runtime_put_autosuspend(struct device *dev);
+    - decrement the device's usage counter; if the result is 0 then run
+      pm_request_autosuspend(dev) and return its result
+
+  int pm_runtime_put_sync(struct device *dev);
+    - decrement the device's usage counter; if the result is 0 then run
+      pm_runtime_idle(dev) and return its result
+
+  int pm_runtime_put_sync_suspend(struct device *dev);
+    - decrement the device's usage counter; if the result is 0 then run
+      pm_runtime_suspend(dev) and return its result
+
+  int pm_runtime_put_sync_autosuspend(struct device *dev);
+    - decrement the device's usage counter; if the result is 0 then run
+      pm_runtime_autosuspend(dev) and return its result
+
+  void pm_runtime_enable(struct device *dev);
+    - decrement the device's 'power.disable_depth' field; if that field is equal
+      to zero, the runtime PM helper functions can execute subsystem-level
+      callbacks described in Section 2 for the device
+
+  int pm_runtime_disable(struct device *dev);
+    - increment the device's 'power.disable_depth' field (if the value of that
+      field was previously zero, this prevents subsystem-level runtime PM
+      callbacks from being run for the device), make sure that all of the
+      pending runtime PM operations on the device are either completed or
+      canceled; returns 1 if there was a resume request pending and it was
+      necessary to execute the subsystem-level resume callback for the device
+      to satisfy that request, otherwise 0 is returned
+
+  int pm_runtime_barrier(struct device *dev);
+    - check if there's a resume request pending for the device and resume it
+      (synchronously) in that case, cancel any other pending runtime PM requests
+      regarding it and wait for all runtime PM operations on it in progress to
+      complete; returns 1 if there was a resume request pending and it was
+      necessary to execute the subsystem-level resume callback for the device to
+      satisfy that request, otherwise 0 is returned
+
+  void pm_suspend_ignore_children(struct device *dev, bool enable);
+    - set/unset the power.ignore_children flag of the device
+
+  int pm_runtime_set_active(struct device *dev);
+    - clear the device's 'power.runtime_error' flag, set the device's runtime
+      PM status to 'active' and update its parent's counter of 'active'
+      children as appropriate (it is only valid to use this function if
+      'power.runtime_error' is set or 'power.disable_depth' is greater than
+      zero); it will fail and return error code if the device has a parent
+      which is not active and the 'power.ignore_children' flag of which is unset
+
+  void pm_runtime_set_suspended(struct device *dev);
+    - clear the device's 'power.runtime_error' flag, set the device's runtime
+      PM status to 'suspended' and update its parent's counter of 'active'
+      children as appropriate (it is only valid to use this function if
+      'power.runtime_error' is set or 'power.disable_depth' is greater than
+      zero)
+
+  bool pm_runtime_active(struct device *dev);
+    - return true if the device's runtime PM status is 'active' or its
+      'power.disable_depth' field is not equal to zero, or false otherwise
+
+  bool pm_runtime_suspended(struct device *dev);
+    - return true if the device's runtime PM status is 'suspended' and its
+      'power.disable_depth' field is equal to zero, or false otherwise
+
+  bool pm_runtime_status_suspended(struct device *dev);
+    - return true if the device's runtime PM status is 'suspended'
+
+  bool pm_runtime_suspended_if_enabled(struct device *dev);
+    - return true if the device's runtime PM status is 'suspended' and its
+      'power.disable_depth' field is equal to 1
+
+  void pm_runtime_allow(struct device *dev);
+    - set the power.runtime_auto flag for the device and decrease its usage
+      counter (used by the /sys/devices/.../power/control interface to
+      effectively allow the device to be power managed at run time)
+
+  void pm_runtime_forbid(struct device *dev);
+    - unset the power.runtime_auto flag for the device and increase its usage
+      counter (used by the /sys/devices/.../power/control interface to
+      effectively prevent the device from being power managed at run time)
+
+  void pm_runtime_no_callbacks(struct device *dev);
+    - set the power.no_callbacks flag for the device and remove the runtime
+      PM attributes from /sys/devices/.../power (or prevent them from being
+      added when the device is registered)
+
+  void pm_runtime_irq_safe(struct device *dev);
+    - set the power.irq_safe flag for the device, causing the runtime-PM
+      callbacks to be invoked with interrupts off
+
+  void pm_runtime_mark_last_busy(struct device *dev);
+    - set the power.last_busy field to the current time
+
+  void pm_runtime_use_autosuspend(struct device *dev);
+    - set the power.use_autosuspend flag, enabling autosuspend delays
+
+  void pm_runtime_dont_use_autosuspend(struct device *dev);
+    - clear the power.use_autosuspend flag, disabling autosuspend delays
+
+  void pm_runtime_set_autosuspend_delay(struct device *dev, int delay);
+    - set the power.autosuspend_delay value to 'delay' (expressed in
+      milliseconds); if 'delay' is negative then runtime suspends are
+      prevented
+
+  unsigned long pm_runtime_autosuspend_expiration(struct device *dev);
+    - calculate the time when the current autosuspend delay period will expire,
+      based on power.last_busy and power.autosuspend_delay; if the delay time
+      is 1000 ms or larger then the expiration time is rounded up to the
+      nearest second; returns 0 if the delay period has already expired or
+      power.use_autosuspend isn't set, otherwise returns the expiration time
+      in jiffies
+
+It is safe to execute the following helper functions from interrupt context:
+
+pm_request_idle()
+pm_request_autosuspend()
+pm_schedule_suspend()
+pm_request_resume()
+pm_runtime_get_noresume()
+pm_runtime_get()
+pm_runtime_put_noidle()
+pm_runtime_put()
+pm_runtime_put_autosuspend()
+pm_runtime_enable()
+pm_suspend_ignore_children()
+pm_runtime_set_active()
+pm_runtime_set_suspended()
+pm_runtime_suspended()
+pm_runtime_mark_last_busy()
+pm_runtime_autosuspend_expiration()
+
+If pm_runtime_irq_safe() has been called for a device then the following helper
+functions may also be used in interrupt context:
+
+pm_runtime_idle()
+pm_runtime_suspend()
+pm_runtime_autosuspend()
+pm_runtime_resume()
+pm_runtime_get_sync()
+pm_runtime_put_sync()
+pm_runtime_put_sync_suspend()
+pm_runtime_put_sync_autosuspend()
+
+5. Runtime PM Initialization, Device Probing and Removal
+
+Initially, the runtime PM is disabled for all devices, which means that the
+majority of the runtime PM helper funtions described in Section 4 will return
+-EAGAIN until pm_runtime_enable() is called for the device.
+
+In addition to that, the initial runtime PM status of all devices is
+'suspended', but it need not reflect the actual physical state of the device.
+Thus, if the device is initially active (i.e. it is able to process I/O), its
+runtime PM status must be changed to 'active', with the help of
+pm_runtime_set_active(), before pm_runtime_enable() is called for the device.
+
+However, if the device has a parent and the parent's runtime PM is enabled,
+calling pm_runtime_set_active() for the device will affect the parent, unless
+the parent's 'power.ignore_children' flag is set.  Namely, in that case the
+parent won't be able to suspend at run time, using the PM core's helper
+functions, as long as the child's status is 'active', even if the child's
+runtime PM is still disabled (i.e. pm_runtime_enable() hasn't been called for
+the child yet or pm_runtime_disable() has been called for it).  For this reason,
+once pm_runtime_set_active() has been called for the device, pm_runtime_enable()
+should be called for it too as soon as reasonably possible or its runtime PM
+status should be changed back to 'suspended' with the help of
+pm_runtime_set_suspended().
+
+If the default initial runtime PM status of the device (i.e. 'suspended')
+reflects the actual state of the device, its bus type's or its driver's
+->probe() callback will likely need to wake it up using one of the PM core's
+helper functions described in Section 4.  In that case, pm_runtime_resume()
+should be used.  Of course, for this purpose the device's runtime PM has to be
+enabled earlier by calling pm_runtime_enable().
+
+It may be desirable to suspend the device once ->probe() has finished.
+Therefore the driver core uses the asyncronous pm_request_idle() to submit a
+request to execute the subsystem-level idle callback for the device at that
+time.  A driver that makes use of the runtime autosuspend feature, may want to
+update the last busy mark before returning from ->probe().
+
+Moreover, the driver core prevents runtime PM callbacks from racing with the bus
+notifier callback in __device_release_driver(), which is necessary, because the
+notifier is used by some subsystems to carry out operations affecting the
+runtime PM functionality.  It does so by calling pm_runtime_get_sync() before
+driver_sysfs_remove() and the BUS_NOTIFY_UNBIND_DRIVER notifications.  This
+resumes the device if it's in the suspended state and prevents it from
+being suspended again while those routines are being executed.
+
+To allow bus types and drivers to put devices into the suspended state by
+calling pm_runtime_suspend() from their ->remove() routines, the driver core
+executes pm_runtime_put_sync() after running the BUS_NOTIFY_UNBIND_DRIVER
+notifications in __device_release_driver().  This requires bus types and
+drivers to make their ->remove() callbacks avoid races with runtime PM directly,
+but also it allows of more flexibility in the handling of devices during the
+removal of their drivers.
+
+The user space can effectively disallow the driver of the device to power manage
+it at run time by changing the value of its /sys/devices/.../power/control
+attribute to "on", which causes pm_runtime_forbid() to be called.  In principle,
+this mechanism may also be used by the driver to effectively turn off the
+runtime power management of the device until the user space turns it on.
+Namely, during the initialization the driver can make sure that the runtime PM
+status of the device is 'active' and call pm_runtime_forbid().  It should be
+noted, however, that if the user space has already intentionally changed the
+value of /sys/devices/.../power/control to "auto" to allow the driver to power
+manage the device at run time, the driver may confuse it by using
+pm_runtime_forbid() this way.
+
+6. Runtime PM and System Sleep
+
+Runtime PM and system sleep (i.e., system suspend and hibernation, also known
+as suspend-to-RAM and suspend-to-disk) interact with each other in a couple of
+ways.  If a device is active when a system sleep starts, everything is
+straightforward.  But what should happen if the device is already suspended?
+
+The device may have different wake-up settings for runtime PM and system sleep.
+For example, remote wake-up may be enabled for runtime suspend but disallowed
+for system sleep (device_may_wakeup(dev) returns 'false').  When this happens,
+the subsystem-level system suspend callback is responsible for changing the
+device's wake-up setting (it may leave that to the device driver's system
+suspend routine).  It may be necessary to resume the device and suspend it again
+in order to do so.  The same is true if the driver uses different power levels
+or other settings for runtime suspend and system sleep.
+
+During system resume, the simplest approach is to bring all devices back to full
+power, even if they had been suspended before the system suspend began.  There
+are several reasons for this, including:
+
+  * The device might need to switch power levels, wake-up settings, etc.
+
+  * Remote wake-up events might have been lost by the firmware.
+
+  * The device's children may need the device to be at full power in order
+    to resume themselves.
+
+  * The driver's idea of the device state may not agree with the device's
+    physical state.  This can happen during resume from hibernation.
+
+  * The device might need to be reset.
+
+  * Even though the device was suspended, if its usage counter was > 0 then most
+    likely it would need a runtime resume in the near future anyway.
+
+If the device had been suspended before the system suspend began and it's
+brought back to full power during resume, then its runtime PM status will have
+to be updated to reflect the actual post-system sleep status.  The way to do
+this is:
+
+	pm_runtime_disable(dev);
+	pm_runtime_set_active(dev);
+	pm_runtime_enable(dev);
+
+The PM core always increments the runtime usage counter before calling the
+->suspend() callback and decrements it after calling the ->resume() callback.
+Hence disabling runtime PM temporarily like this will not cause any runtime
+suspend attempts to be permanently lost.  If the usage count goes to zero
+following the return of the ->resume() callback, the ->runtime_idle() callback
+will be invoked as usual.
+
+On some systems, however, system sleep is not entered through a global firmware
+or hardware operation.  Instead, all hardware components are put into low-power
+states directly by the kernel in a coordinated way.  Then, the system sleep
+state effectively follows from the states the hardware components end up in
+and the system is woken up from that state by a hardware interrupt or a similar
+mechanism entirely under the kernel's control.  As a result, the kernel never
+gives control away and the states of all devices during resume are precisely
+known to it.  If that is the case and none of the situations listed above takes
+place (in particular, if the system is not waking up from hibernation), it may
+be more efficient to leave the devices that had been suspended before the system
+suspend began in the suspended state.
+
+To this end, the PM core provides a mechanism allowing some coordination between
+different levels of device hierarchy.  Namely, if a system suspend .prepare()
+callback returns a positive number for a device, that indicates to the PM core
+that the device appears to be runtime-suspended and its state is fine, so it
+may be left in runtime suspend provided that all of its descendants are also
+left in runtime suspend.  If that happens, the PM core will not execute any
+system suspend and resume callbacks for all of those devices, except for the
+complete callback, which is then entirely responsible for handling the device
+as appropriate.  This only applies to system suspend transitions that are not
+related to hibernation (see Documentation/power/devices.txt for more
+information).
+
+The PM core does its best to reduce the probability of race conditions between
+the runtime PM and system suspend/resume (and hibernation) callbacks by carrying
+out the following operations:
+
+  * During system suspend pm_runtime_get_noresume() is called for every device
+    right before executing the subsystem-level .prepare() callback for it and
+    pm_runtime_barrier() is called for every device right before executing the
+    subsystem-level .suspend() callback for it.  In addition to that the PM core
+    calls  __pm_runtime_disable() with 'false' as the second argument for every
+    device right before executing the subsystem-level .suspend_late() callback
+    for it.
+
+  * During system resume pm_runtime_enable() and pm_runtime_put() are called for
+    every device right after executing the subsystem-level .resume_early()
+    callback and right after executing the subsystem-level .complete() callback
+    for it, respectively.
+
+7. Generic subsystem callbacks
+
+Subsystems may wish to conserve code space by using the set of generic power
+management callbacks provided by the PM core, defined in
+driver/base/power/generic_ops.c:
+
+  int pm_generic_runtime_suspend(struct device *dev);
+    - invoke the ->runtime_suspend() callback provided by the driver of this
+      device and return its result, or return 0 if not defined
+
+  int pm_generic_runtime_resume(struct device *dev);
+    - invoke the ->runtime_resume() callback provided by the driver of this
+      device and return its result, or return 0 if not defined
+
+  int pm_generic_suspend(struct device *dev);
+    - if the device has not been suspended at run time, invoke the ->suspend()
+      callback provided by its driver and return its result, or return 0 if not
+      defined
+
+  int pm_generic_suspend_noirq(struct device *dev);
+    - if pm_runtime_suspended(dev) returns "false", invoke the ->suspend_noirq()
+      callback provided by the device's driver and return its result, or return
+      0 if not defined
+
+  int pm_generic_resume(struct device *dev);
+    - invoke the ->resume() callback provided by the driver of this device and,
+      if successful, change the device's runtime PM status to 'active'
+
+  int pm_generic_resume_noirq(struct device *dev);
+    - invoke the ->resume_noirq() callback provided by the driver of this device
+
+  int pm_generic_freeze(struct device *dev);
+    - if the device has not been suspended at run time, invoke the ->freeze()
+      callback provided by its driver and return its result, or return 0 if not
+      defined
+
+  int pm_generic_freeze_noirq(struct device *dev);
+    - if pm_runtime_suspended(dev) returns "false", invoke the ->freeze_noirq()
+      callback provided by the device's driver and return its result, or return
+      0 if not defined
+
+  int pm_generic_thaw(struct device *dev);
+    - if the device has not been suspended at run time, invoke the ->thaw()
+      callback provided by its driver and return its result, or return 0 if not
+      defined
+
+  int pm_generic_thaw_noirq(struct device *dev);
+    - if pm_runtime_suspended(dev) returns "false", invoke the ->thaw_noirq()
+      callback provided by the device's driver and return its result, or return
+      0 if not defined
+
+  int pm_generic_poweroff(struct device *dev);
+    - if the device has not been suspended at run time, invoke the ->poweroff()
+      callback provided by its driver and return its result, or return 0 if not
+      defined
+
+  int pm_generic_poweroff_noirq(struct device *dev);
+    - if pm_runtime_suspended(dev) returns "false", run the ->poweroff_noirq()
+      callback provided by the device's driver and return its result, or return
+      0 if not defined
+
+  int pm_generic_restore(struct device *dev);
+    - invoke the ->restore() callback provided by the driver of this device and,
+      if successful, change the device's runtime PM status to 'active'
+
+  int pm_generic_restore_noirq(struct device *dev);
+    - invoke the ->restore_noirq() callback provided by the device's driver
+
+These functions are the defaults used by the PM core, if a subsystem doesn't
+provide its own callbacks for ->runtime_idle(), ->runtime_suspend(),
+->runtime_resume(), ->suspend(), ->suspend_noirq(), ->resume(),
+->resume_noirq(), ->freeze(), ->freeze_noirq(), ->thaw(), ->thaw_noirq(),
+->poweroff(), ->poweroff_noirq(), ->restore(), ->restore_noirq() in the
+subsystem-level dev_pm_ops structure.
+
+Device drivers that wish to use the same function as a system suspend, freeze,
+poweroff and runtime suspend callback, and similarly for system resume, thaw,
+restore, and runtime resume, can achieve this with the help of the
+UNIVERSAL_DEV_PM_OPS macro defined in include/linux/pm.h (possibly setting its
+last argument to NULL).
+
+8. "No-Callback" Devices
+
+Some "devices" are only logical sub-devices of their parent and cannot be
+power-managed on their own.  (The prototype example is a USB interface.  Entire
+USB devices can go into low-power mode or send wake-up requests, but neither is
+possible for individual interfaces.)  The drivers for these devices have no
+need of runtime PM callbacks; if the callbacks did exist, ->runtime_suspend()
+and ->runtime_resume() would always return 0 without doing anything else and
+->runtime_idle() would always call pm_runtime_suspend().
+
+Subsystems can tell the PM core about these devices by calling
+pm_runtime_no_callbacks().  This should be done after the device structure is
+initialized and before it is registered (although after device registration is
+also okay).  The routine will set the device's power.no_callbacks flag and
+prevent the non-debugging runtime PM sysfs attributes from being created.
+
+When power.no_callbacks is set, the PM core will not invoke the
+->runtime_idle(), ->runtime_suspend(), or ->runtime_resume() callbacks.
+Instead it will assume that suspends and resumes always succeed and that idle
+devices should be suspended.
+
+As a consequence, the PM core will never directly inform the device's subsystem
+or driver about runtime power changes.  Instead, the driver for the device's
+parent must take responsibility for telling the device's driver when the
+parent's power state changes.
+
+9. Autosuspend, or automatically-delayed suspends
+
+Changing a device's power state isn't free; it requires both time and energy.
+A device should be put in a low-power state only when there's some reason to
+think it will remain in that state for a substantial time.  A common heuristic
+says that a device which hasn't been used for a while is liable to remain
+unused; following this advice, drivers should not allow devices to be suspended
+at runtime until they have been inactive for some minimum period.  Even when
+the heuristic ends up being non-optimal, it will still prevent devices from
+"bouncing" too rapidly between low-power and full-power states.
+
+The term "autosuspend" is an historical remnant.  It doesn't mean that the
+device is automatically suspended (the subsystem or driver still has to call
+the appropriate PM routines); rather it means that runtime suspends will
+automatically be delayed until the desired period of inactivity has elapsed.
+
+Inactivity is determined based on the power.last_busy field.  Drivers should
+call pm_runtime_mark_last_busy() to update this field after carrying out I/O,
+typically just before calling pm_runtime_put_autosuspend().  The desired length
+of the inactivity period is a matter of policy.  Subsystems can set this length
+initially by calling pm_runtime_set_autosuspend_delay(), but after device
+registration the length should be controlled by user space, using the
+/sys/devices/.../power/autosuspend_delay_ms attribute.
+
+In order to use autosuspend, subsystems or drivers must call
+pm_runtime_use_autosuspend() (preferably before registering the device), and
+thereafter they should use the various *_autosuspend() helper functions instead
+of the non-autosuspend counterparts:
+
+	Instead of: pm_runtime_suspend    use: pm_runtime_autosuspend;
+	Instead of: pm_schedule_suspend   use: pm_request_autosuspend;
+	Instead of: pm_runtime_put        use: pm_runtime_put_autosuspend;
+	Instead of: pm_runtime_put_sync   use: pm_runtime_put_sync_autosuspend.
+
+Drivers may also continue to use the non-autosuspend helper functions; they
+will behave normally, not taking the autosuspend delay into account.
+Similarly, if the power.use_autosuspend field isn't set then the autosuspend
+helper functions will behave just like the non-autosuspend counterparts.
+
+Under some circumstances a driver or subsystem may want to prevent a device
+from autosuspending immediately, even though the usage counter is zero and the
+autosuspend delay time has expired.  If the ->runtime_suspend() callback
+returns -EAGAIN or -EBUSY, and if the next autosuspend delay expiration time is
+in the future (as it normally would be if the callback invoked
+pm_runtime_mark_last_busy()), the PM core will automatically reschedule the
+autosuspend.  The ->runtime_suspend() callback can't do this rescheduling
+itself because no suspend requests of any kind are accepted while the device is
+suspending (i.e., while the callback is running).
+
+The implementation is well suited for asynchronous use in interrupt contexts.
+However such use inevitably involves races, because the PM core can't
+synchronize ->runtime_suspend() callbacks with the arrival of I/O requests.
+This synchronization must be handled by the driver, using its private lock.
+Here is a schematic pseudo-code example:
+
+	foo_read_or_write(struct foo_priv *foo, void *data)
+	{
+		lock(&foo->private_lock);
+		add_request_to_io_queue(foo, data);
+		if (foo->num_pending_requests++ == 0)
+			pm_runtime_get(&foo->dev);
+		if (!foo->is_suspended)
+			foo_process_next_request(foo);
+		unlock(&foo->private_lock);
+	}
+
+	foo_io_completion(struct foo_priv *foo, void *req)
+	{
+		lock(&foo->private_lock);
+		if (--foo->num_pending_requests == 0) {
+			pm_runtime_mark_last_busy(&foo->dev);
+			pm_runtime_put_autosuspend(&foo->dev);
+		} else {
+			foo_process_next_request(foo);
+		}
+		unlock(&foo->private_lock);
+		/* Send req result back to the user ... */
+	}
+
+	int foo_runtime_suspend(struct device *dev)
+	{
+		struct foo_priv foo = container_of(dev, ...);
+		int ret = 0;
+
+		lock(&foo->private_lock);
+		if (foo->num_pending_requests > 0) {
+			ret = -EBUSY;
+		} else {
+			/* ... suspend the device ... */
+			foo->is_suspended = 1;
+		}
+		unlock(&foo->private_lock);
+		return ret;
+	}
+
+	int foo_runtime_resume(struct device *dev)
+	{
+		struct foo_priv foo = container_of(dev, ...);
+
+		lock(&foo->private_lock);
+		/* ... resume the device ... */
+		foo->is_suspended = 0;
+		pm_runtime_mark_last_busy(&foo->dev);
+		if (foo->num_pending_requests > 0)
+			foo_process_next_request(foo);
+		unlock(&foo->private_lock);
+		return 0;
+	}
+
+The important point is that after foo_io_completion() asks for an autosuspend,
+the foo_runtime_suspend() callback may race with foo_read_or_write().
+Therefore foo_runtime_suspend() has to check whether there are any pending I/O
+requests (while holding the private lock) before allowing the suspend to
+proceed.
+
+In addition, the power.autosuspend_delay field can be changed by user space at
+any time.  If a driver cares about this, it can call
+pm_runtime_autosuspend_expiration() from within the ->runtime_suspend()
+callback while holding its private lock.  If the function returns a nonzero
+value then the delay has not yet expired and the callback should return
+-EAGAIN.
diff --git a/Documentation/power/s2ram.txt b/Documentation/power/s2ram.txt
index b05f512130e..1bdfa044377 100644
--- a/Documentation/power/s2ram.txt
+++ b/Documentation/power/s2ram.txt
@@ -49,8 +49,33 @@ machine that doesn't boot) is:
    device (lspci and /sys/devices/pci* is your friend), and see if you can
    fix it, disable it, or trace into its resume function.
 
+   If no device matches the hash (or any matches appear to be false positives),
+   the culprit may be a device from a loadable kernel module that is not loaded
+   until after the hash is checked. You can check the hash against the current
+   devices again after more modules are loaded using sysfs:
+
+	cat /sys/power/pm_trace_dev_match
+
 For example, the above happens to be the VGA device on my EVO, which I
 used to run with "radeonfb" (it's an ATI Radeon mobility). It turns out
 that "radeonfb" simply cannot resume that device - it tries to set the
 PLL's, and it just _hangs_. Using the regular VGA console and letting X
 resume it instead works fine.
+
+NOTE
+====
+pm_trace uses the system's Real Time Clock (RTC) to save the magic number.
+Reason for this is that the RTC is the only reliably available piece of
+hardware during resume operations where a value can be set that will
+survive a reboot.
+
+Consequence is that after a resume (even if it is successful) your system
+clock will have a value corresponding to the magic number instead of the
+correct date/time! It is therefore advisable to use a program like ntp-date
+or rdate to reset the correct date/time from an external time source when
+using this trace option.
+
+As the clock keeps ticking it is also essential that the reboot is done
+quickly after the resume failure. The trace option does not use the seconds
+or the low order bits of the minutes of the RTC, but a too long delay will
+corrupt the magic value.
diff --git a/Documentation/power/states.txt b/Documentation/power/states.txt
index 34800cc521b..50f3ef9177c 100644
--- a/Documentation/power/states.txt
+++ b/Documentation/power/states.txt
@@ -1,54 +1,87 @@
+System Power Management Sleep States
 
-System Power Management States
+(C) 2014 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
 
+The kernel supports up to four system sleep states generically, although three
+of them depend on the platform support code to implement the low-level details
+for each state.
 
-The kernel supports three power management states generically, though
-each is dependent on platform support code to implement the low-level
-details for each state. This file describes each state, what they are
-commonly called, what ACPI state they map to, and what string to write
-to /sys/power/state to enter that state
+The states are represented by strings that can be read or written to the
+/sys/power/state file.  Those strings may be "mem", "standby", "freeze" and
+"disk", where the last one always represents hibernation (Suspend-To-Disk) and
+the meaning of the remaining ones depends on the relative_sleep_states command
+line argument.
+
+For relative_sleep_states=1, the strings "mem", "standby" and "freeze" label the
+available non-hibernation sleep states from the deepest to the shallowest,
+respectively.  In that case, "mem" is always present in /sys/power/state,
+because there is at least one non-hibernation sleep state in every system.  If
+the given system supports two non-hibernation sleep states, "standby" is present
+in /sys/power/state in addition to "mem".  If the system supports three
+non-hibernation sleep states, "freeze" will be present in /sys/power/state in
+addition to "mem" and "standby".
+
+For relative_sleep_states=0, which is the default, the following descriptions
+apply.
+
+state:		Suspend-To-Idle
+ACPI state:	S0
+Label:		"freeze"
+
+This state is a generic, pure software, light-weight, system sleep state.
+It allows more energy to be saved relative to runtime idle by freezing user
+space and putting all I/O devices into low-power states (possibly
+lower-power than available at run time), such that the processors can
+spend more time in their idle states.
+
+This state can be used for platforms without Power-On Suspend/Suspend-to-RAM
+support, or it can be used in addition to Suspend-to-RAM (memory sleep)
+to provide reduced resume latency.  It is always supported.
 
 
 State:		Standby / Power-On Suspend
 ACPI State:	S1
-String:		"standby"
+Label:		"standby"
 
-This state offers minimal, though real, power savings, while providing
-a very low-latency transition back to a working system. No operating
-state is lost (the CPU retains power), so the system easily starts up
+This state, if supported, offers moderate, though real, power savings, while
+providing a relatively low-latency transition back to a working system.  No
+operating state is lost (the CPU retains power), so the system easily starts up
 again where it left off. 
 
-We try to put devices in a low-power state equivalent to D1, which
-also offers low power savings, but low resume latency. Not all devices
-support D1, and those that don't are left on. 
-
-A transition from Standby to the On state should take about 1-2
-seconds. 
+In addition to freezing user space and putting all I/O devices into low-power
+states, which is done for Suspend-To-Idle too, nonboot CPUs are taken offline
+and all low-level system functions are suspended during transitions into this
+state.  For this reason, it should allow more energy to be saved relative to
+Suspend-To-Idle, but the resume latency will generally be greater than for that
+state.
 
 
 State:		Suspend-to-RAM
 ACPI State:	S3
-String:		"mem"
+Label:		"mem"
 
-This state offers significant power savings as everything in the
-system is put into a low-power state, except for memory, which is
-placed in self-refresh mode to retain its contents. 
+This state, if supported, offers significant power savings as everything in the
+system is put into a low-power state, except for memory, which should be placed
+into the self-refresh mode to retain its contents.  All of the steps carried out
+when entering Power-On Suspend are also carried out during transitions to STR.
+Additional operations may take place depending on the platform capabilities.  In
+particular, on ACPI systems the kernel passes control to the BIOS (platform
+firmware) as the last step during STR transitions and that usually results in
+powering down some more low-level components that aren't directly controlled by
+the kernel.
 
-System and device state is saved and kept in memory. All devices are
-suspended and put into D3. In many cases, all peripheral buses lose
-power when entering STR, so devices must be able to handle the
-transition back to the On state. 
+System and device state is saved and kept in memory.  All devices are suspended
+and put into low-power states.  In many cases, all peripheral buses lose power
+when entering STR, so devices must be able to handle the transition back to the
+"on" state.
 
-For at least ACPI, STR requires some minimal boot-strapping code to
-resume the system from STR. This may be true on other platforms. 
-
-A transition from Suspend-to-RAM to the On state should take about
-3-5 seconds. 
+For at least ACPI, STR requires some minimal boot-strapping code to resume the
+system from it.  This may be the case on other platforms too.
 
 
 State:		Suspend-to-disk
 ACPI State:	S4
-String:		"disk"
+Label:		"disk"
 
 This state offers the greatest power savings, and can be used even in
 the absence of low-level platform support for power management. This
@@ -62,19 +95,15 @@ setup via another operating system for it to use. Despite the
 inconvenience, this method requires minimal work by the kernel, since
 the firmware will also handle restoring memory contents on resume. 
 
-For suspend-to-disk, a mechanism called swsusp called 'swsusp' (Swap
-Suspend) is used to write memory contents to free swap space.
-swsusp has some restrictive requirements, but should work in most
-cases. Some, albeit outdated, documentation can be found in
-Documentation/power/swsusp.txt. Alternatively, userspace can do most
-of the actual suspend to disk work, see userland-swsusp.txt.
+For suspend-to-disk, a mechanism called 'swsusp' (Swap Suspend) is used
+to write memory contents to free swap space. swsusp has some restrictive
+requirements, but should work in most cases. Some, albeit outdated,
+documentation can be found in Documentation/power/swsusp.txt.
+Alternatively, userspace can do most of the actual suspend to disk work,
+see userland-swsusp.txt.
 
 Once memory state is written to disk, the system may either enter a
 low-power state (like ACPI S4), or it may simply power down. Powering
 down offers greater savings, and allows this mechanism to work on any
 system. However, entering a real low-power state allows the user to
 trigger wake up events (e.g. pressing a key or opening a laptop lid).
-
-A transition from Suspend-to-Disk to the On state should take about 30
-seconds, though it's typically a bit more with the current
-implementation. 
diff --git a/Documentation/power/suspend-and-cpuhotplug.txt b/Documentation/power/suspend-and-cpuhotplug.txt
new file mode 100644
index 00000000000..2850df3bf95
--- /dev/null
+++ b/Documentation/power/suspend-and-cpuhotplug.txt
@@ -0,0 +1,275 @@
+Interaction of Suspend code (S3) with the CPU hotplug infrastructure
+
+     (C) 2011 - 2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
+
+
+I. How does the regular CPU hotplug code differ from how the Suspend-to-RAM
+   infrastructure uses it internally? And where do they share common code?
+
+Well, a picture is worth a thousand words... So ASCII art follows :-)
+
+[This depicts the current design in the kernel, and focusses only on the
+interactions involving the freezer and CPU hotplug and also tries to explain
+the locking involved. It outlines the notifications involved as well.
+But please note that here, only the call paths are illustrated, with the aim
+of describing where they take different paths and where they share code.
+What happens when regular CPU hotplug and Suspend-to-RAM race with each other
+is not depicted here.]
+
+On a high level, the suspend-resume cycle goes like this:
+
+|Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw |
+|tasks |    |     cpus      |    |          |    |     cpus     |    |tasks|
+
+
+More details follow:
+
+                                Suspend call path
+                                -----------------
+
+                                  Write 'mem' to
+                                /sys/power/state
+                                    sysfs file
+                                        |
+                                        v
+                               Acquire pm_mutex lock
+                                        |
+                                        v
+                             Send PM_SUSPEND_PREPARE
+                                   notifications
+                                        |
+                                        v
+                                   Freeze tasks
+                                        |
+                                        |
+                                        v
+                              disable_nonboot_cpus()
+                                   /* start */
+                                        |
+                                        v
+                            Acquire cpu_add_remove_lock
+                                        |
+                                        v
+                             Iterate over CURRENTLY
+                                   online CPUs
+                                        |
+                                        |
+                                        |                ----------
+                                        v                          | L
+             ======>               _cpu_down()                     |
+            |              [This takes cpuhotplug.lock             |
+  Common    |               before taking down the CPU             |
+   code     |               and releases it when done]             | O
+            |            While it is at it, notifications          |
+            |            are sent when notable events occur,       |
+             ======>     by running all registered callbacks.      |
+                                        |                          | O
+                                        |                          |
+                                        |                          |
+                                        v                          |
+                            Note down these cpus in                | P
+                                frozen_cpus mask         ----------
+                                        |
+                                        v
+                           Disable regular cpu hotplug
+                        by setting cpu_hotplug_disabled=1
+                                        |
+                                        v
+                            Release cpu_add_remove_lock
+                                        |
+                                        v
+                       /* disable_nonboot_cpus() complete */
+                                        |
+                                        v
+                                   Do suspend
+
+
+
+Resuming back is likewise, with the counterparts being (in the order of
+execution during resume):
+* enable_nonboot_cpus() which involves:
+   |  Acquire cpu_add_remove_lock
+   |  Reset cpu_hotplug_disabled to 0, thereby enabling regular cpu hotplug
+   |  Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop]
+   |  Release cpu_add_remove_lock
+   v
+
+* thaw tasks
+* send PM_POST_SUSPEND notifications
+* Release pm_mutex lock.
+
+
+It is to be noted here that the pm_mutex lock is acquired at the very
+beginning, when we are just starting out to suspend, and then released only
+after the entire cycle is complete (i.e., suspend + resume).
+
+
+
+                          Regular CPU hotplug call path
+                          -----------------------------
+
+                                Write 0 (or 1) to
+                       /sys/devices/system/cpu/cpu*/online
+                                    sysfs file
+                                        |
+                                        |
+                                        v
+                                    cpu_down()
+                                        |
+                                        v
+                           Acquire cpu_add_remove_lock
+                                        |
+                                        v
+                          If cpu_hotplug_disabled is 1
+                                return gracefully
+                                        |
+                                        |
+                                        v
+             ======>                _cpu_down()
+            |              [This takes cpuhotplug.lock
+  Common    |               before taking down the CPU
+   code     |               and releases it when done]
+            |            While it is at it, notifications
+            |           are sent when notable events occur,
+             ======>    by running all registered callbacks.
+                                        |
+                                        |
+                                        v
+                          Release cpu_add_remove_lock
+                               [That's it!, for
+                              regular CPU hotplug]
+
+
+
+So, as can be seen from the two diagrams (the parts marked as "Common code"),
+regular CPU hotplug and the suspend code path converge at the _cpu_down() and
+_cpu_up() functions. They differ in the arguments passed to these functions,
+in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen'
+argument. But during suspend, since the tasks are already frozen by the time
+the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called
+with the 'tasks_frozen' argument set to 1.
+[See below for some known issues regarding this.]
+
+
+Important files and functions/entry points:
+------------------------------------------
+
+kernel/power/process.c : freeze_processes(), thaw_processes()
+kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish()
+kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus()
+
+
+
+II. What are the issues involved in CPU hotplug?
+    -------------------------------------------
+
+There are some interesting situations involving CPU hotplug and microcode
+update on the CPUs, as discussed below:
+
+[Please bear in mind that the kernel requests the microcode images from
+userspace, using the request_firmware() function defined in
+drivers/base/firmware_class.c]
+
+
+a. When all the CPUs are identical:
+
+   This is the most common situation and it is quite straightforward: we want
+   to apply the same microcode revision to each of the CPUs.
+   To give an example of x86, the collect_cpu_info() function defined in
+   arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU
+   and thereby in applying the correct microcode revision to it.
+   But note that the kernel does not maintain a common microcode image for the
+   all CPUs, in order to handle case 'b' described below.
+
+
+b. When some of the CPUs are different than the rest:
+
+   In this case since we probably need to apply different microcode revisions
+   to different CPUs, the kernel maintains a copy of the correct microcode
+   image for each CPU (after appropriate CPU type/model discovery using
+   functions such as collect_cpu_info()).
+
+
+c. When a CPU is physically hot-unplugged and a new (and possibly different
+   type of) CPU is hot-plugged into the system:
+
+   In the current design of the kernel, whenever a CPU is taken offline during
+   a regular CPU hotplug operation, upon receiving the CPU_DEAD notification
+   (which is sent by the CPU hotplug code), the microcode update driver's
+   callback for that event reacts by freeing the kernel's copy of the
+   microcode image for that CPU.
+
+   Hence, when a new CPU is brought online, since the kernel finds that it
+   doesn't have the microcode image, it does the CPU type/model discovery
+   afresh and then requests the userspace for the appropriate microcode image
+   for that CPU, which is subsequently applied.
+
+   For example, in x86, the mc_cpu_callback() function (which is the microcode
+   update driver's callback registered for CPU hotplug events) calls
+   microcode_update_cpu() which would call microcode_init_cpu() in this case,
+   instead of microcode_resume_cpu() when it finds that the kernel doesn't
+   have a valid microcode image. This ensures that the CPU type/model
+   discovery is performed and the right microcode is applied to the CPU after
+   getting it from userspace.
+
+
+d. Handling microcode update during suspend/hibernate:
+
+   Strictly speaking, during a CPU hotplug operation which does not involve
+   physically removing or inserting CPUs, the CPUs are not actually powered
+   off during a CPU offline. They are just put to the lowest C-states possible.
+   Hence, in such a case, it is not really necessary to re-apply microcode
+   when the CPUs are brought back online, since they wouldn't have lost the
+   image during the CPU offline operation.
+
+   This is the usual scenario encountered during a resume after a suspend.
+   However, in the case of hibernation, since all the CPUs are completely
+   powered off, during restore it becomes necessary to apply the microcode
+   images to all the CPUs.
+
+   [Note that we don't expect someone to physically pull out nodes and insert
+   nodes with a different type of CPUs in-between a suspend-resume or a
+   hibernate/restore cycle.]
+
+   In the current design of the kernel however, during a CPU offline operation
+   as part of the suspend/hibernate cycle (the CPU_DEAD_FROZEN notification),
+   the existing copy of microcode image in the kernel is not freed up.
+   And during the CPU online operations (during resume/restore), since the
+   kernel finds that it already has copies of the microcode images for all the
+   CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU
+   type/model and the need for validating whether the microcode revisions are
+   right for the CPUs or not (due to the above assumption that physical CPU
+   hotplug will not be done in-between suspend/resume or hibernate/restore
+   cycles).
+
+
+III. Are there any known problems when regular CPU hotplug and suspend race
+     with each other?
+
+Yes, they are listed below:
+
+1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to
+   the _cpu_down() and _cpu_up() functions is *always* 0.
+   This might not reflect the true current state of the system, since the
+   tasks could have been frozen by an out-of-band event such as a suspend
+   operation in progress. Hence, it will lead to wrong notifications being
+   sent during the cpu online/offline events (eg, CPU_ONLINE notification
+   instead of CPU_ONLINE_FROZEN) which in turn will lead to execution of
+   inappropriate code by the callbacks registered for such CPU hotplug events.
+
+2. If a regular CPU hotplug stress test happens to race with the freezer due
+   to a suspend operation in progress at the same time, then we could hit the
+   situation described below:
+
+    * A regular cpu online operation continues its journey from userspace
+      into the kernel, since the freezing has not yet begun.
+    * Then freezer gets to work and freezes userspace.
+    * If cpu online has not yet completed the microcode update stuff by now,
+      it will now start waiting on the frozen userspace in the
+      TASK_UNINTERRUPTIBLE state, in order to get the microcode image.
+    * Now the freezer continues and tries to freeze the remaining tasks. But
+      due to this wait mentioned above, the freezer won't be able to freeze
+      the cpu online hotplug task and hence freezing of tasks fails.
+
+   As a result of this task freezing failure, the suspend operation gets
+   aborted.
diff --git a/Documentation/power/swsusp.txt b/Documentation/power/swsusp.txt
index 9d60ab717a7..f732a8321e8 100644
--- a/Documentation/power/swsusp.txt
+++ b/Documentation/power/swsusp.txt
@@ -33,6 +33,11 @@ echo shutdown > /sys/power/disk; echo disk > /sys/power/state
 
 echo platform > /sys/power/disk; echo disk > /sys/power/state
 
+. If you would like to write hibernation image to swap and then suspend
+to RAM (provided your platform supports it), you can try
+
+echo suspend > /sys/power/disk; echo disk > /sys/power/state
+
 . If you have SATA disks, you'll need recent kernels with SATA suspend
 support. For suspend and resume to work, make sure your disk drivers
 are built into kernel -- not modules. [There's way to make
@@ -45,10 +50,23 @@ echo N > /sys/power/image_size
 
 before suspend (it is limited to 500 MB by default).
 
+. The resume process checks for the presence of the resume device,
+if found, it then checks the contents for the hibernation image signature.
+If both are found, it resumes the hibernation image.
+
+. The resume process may be triggered in two ways:
+  1) During lateinit:  If resume=/dev/your_swap_partition is specified on
+     the kernel command line, lateinit runs the resume process.  If the
+     resume device has not been probed yet, the resume process fails and
+     bootup continues.
+  2) Manually from an initrd or initramfs:  May be run from
+     the init script by using the /sys/power/resume file.  It is vital
+     that this be done prior to remounting any filesystems (even as
+     read-only) otherwise data may be corrupted.
 
 Article about goals and implementation of Software Suspend for Linux
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Author: Gábor Kuti
+Author: Gábor Kuti
 Last revised: 2003-10-20 by Pavel Machek
 
 Idea and goals to achieve
@@ -66,7 +84,8 @@ swsusp saves the state of the machine into active swaps and then reboots or
 powerdowns.  You must explicitly specify the swap partition to resume from with
 ``resume='' kernel option. If signature is found it loads and restores saved
 state. If the option ``noresume'' is specified as a boot parameter, it skips
-the resuming.
+the resuming.  If the option ``hibernate=nocompress'' is specified as a boot
+parameter, it saves hibernation image without compression.
 
 In the meantime while the system is suspended you should not add/remove any
 of the hardware, write to the filesystems, etc.
@@ -191,7 +210,7 @@ Q: There don't seem to be any generally useful behavioral
 distinctions between SUSPEND and FREEZE.
 
 A: Doing SUSPEND when you are asked to do FREEZE is always correct,
-but it may be unneccessarily slow. If you want your driver to stay simple,
+but it may be unnecessarily slow. If you want your driver to stay simple,
 slowness may not matter to you. It can always be fixed later.
 
 For devices like disk it does matter, you do not want to spindown for
@@ -201,7 +220,10 @@ Q: After resuming, system is paging heavily, leading to very bad interactivity.
 
 A: Try running
 
-cat `cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u` > /dev/null
+cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u | while read file
+do
+  test -f "$file" && cat "$file" > /dev/null
+done
 
 after resume. swapoff -a; swapon -a may also be useful.
 
@@ -236,7 +258,7 @@ disk. Whole sequence goes like
 
       running system, user asks for suspend-to-disk
 
-      user processes are stopped (in common case there are none, but with resume-from-initrd, noone knows)
+      user processes are stopped (in common case there are none, but with resume-from-initrd, no one knows)
 
       read image from disk
 
@@ -320,7 +342,7 @@ Q: How can distributions ship a swsusp-supporting kernel with modular
 disk drivers (especially SATA)?
 
 A: Well, it can be done, load the drivers, then do echo into
-/sys/power/disk/resume file from initrd. Be sure not to mount
+/sys/power/resume file from initrd. Be sure not to mount
 anything, not even read-only mount, or you are going to lose your
 data.
 
diff --git a/Documentation/power/tricks.txt b/Documentation/power/tricks.txt
index 3b26bb502a4..a1b8f7249f4 100644
--- a/Documentation/power/tricks.txt
+++ b/Documentation/power/tricks.txt
@@ -1,6 +1,6 @@
 	swsusp/S3 tricks
 	~~~~~~~~~~~~~~~~
-Pavel Machek <pavel@suse.cz>
+Pavel Machek <pavel@ucw.cz>
 
 If you want to trick swsusp/S3 into working, you might want to try:
 
diff --git a/Documentation/power/userland-swsusp.txt b/Documentation/power/userland-swsusp.txt
index 7b99636564c..0e870825c1b 100644
--- a/Documentation/power/userland-swsusp.txt
+++ b/Documentation/power/userland-swsusp.txt
@@ -24,6 +24,10 @@ assumed to be in the resume mode.  The device cannot be open for simultaneous
 reading and writing.  It is also impossible to have the device open more than
 once at a time.
 
+Even opening the device has side effects. Data structures are
+allocated, and PM_HIBERNATION_PREPARE / PM_RESTORE_PREPARE chains are
+called.
+
 The ioctl() commands recognized by the device are:
 
 SNAPSHOT_FREEZE - freeze user space processes (the current process is
@@ -73,7 +77,8 @@ SNAPSHOT_SET_SWAP_AREA - set the resume partition and the offset (in <PAGE_SIZE>
 	resume_swap_area, as defined in kernel/power/suspend_ioctls.h,
 	containing the resume device specification and the offset); for swap
 	partitions the offset is always 0, but it is different from zero for
-	swap files (see Documentation/swsusp-and-swap-files.txt for details).
+	swap files (see Documentation/power/swsusp-and-swap-files.txt for
+	details).
 
 SNAPSHOT_PLATFORM_SUPPORT - enable/disable the hibernation platform support,
 	depending on the argument value (enable, if the argument is nonzero)
@@ -94,7 +99,7 @@ SNAPSHOT_S2RAM - suspend to RAM; using this call causes the kernel to
 The device's read() operation can be used to transfer the snapshot image from
 the kernel.  It has the following limitations:
 - you cannot read() more than one virtual memory page at a time
-- read()s accross page boundaries are impossible (ie. if ypu read() 1/2 of
+- read()s across page boundaries are impossible (ie. if ypu read() 1/2 of
 	a page in the previous call, you will only be able to read()
 	_at_ _most_ 1/2 of the page in the next call)
 
@@ -109,7 +114,7 @@ unfreeze user space processes frozen by SNAPSHOT_UNFREEZE if they are
 still frozen when the device is being closed).
 
 Currently it is assumed that the userland utilities reading/writing the
-snapshot image from/to the kernel will use a swap parition, called the resume
+snapshot image from/to the kernel will use a swap partition, called the resume
 partition, or a swap file as storage space (if a swap file is used, the resume
 partition is the partition that holds this file).  However, this is not really
 required, as they can use, for example, a special (blank) suspend partition or
@@ -133,7 +138,7 @@ mechanism and the userland utilities using the interface SHOULD use additional
 means, such as checksums, to ensure the integrity of the snapshot image.
 
 The suspending and resuming utilities MUST lock themselves in memory,
-preferrably using mlockall(), before calling SNAPSHOT_FREEZE.
+preferably using mlockall(), before calling SNAPSHOT_FREEZE.
 
 The suspending utility MUST check the value stored by SNAPSHOT_CREATE_IMAGE
 in the memory location pointed to by the last argument of ioctl() and proceed
@@ -143,7 +148,7 @@ in accordance with it:
 	(a)	The suspending utility MUST NOT close the snapshot device
 		_unless_ the whole suspend procedure is to be cancelled, in
 		which case, if the snapshot image has already been saved, the
-		suspending utility SHOULD destroy it, preferrably by zapping
+		suspending utility SHOULD destroy it, preferably by zapping
 		its header.  If the suspend is not to be cancelled, the
 		system MUST be powered off or rebooted after the snapshot
 		image has been saved.
diff --git a/Documentation/power/video.txt b/Documentation/power/video.txt
index 2b358498d09..3e6272bc447 100644
--- a/Documentation/power/video.txt
+++ b/Documentation/power/video.txt
@@ -67,11 +67,11 @@ There are a few types of systems where video works after S3 resume:
   POSTing bios works. Ole Rohne has patch to do just that at
   http://dev.gentoo.org/~marineam/patch-radeonfb-2.6.11-rc2-mm2.
 
-(8) on some systems, you can use the video_post utility mentioned here:
-  http://bugzilla.kernel.org/show_bug.cgi?id=3670. Do echo 3 > /sys/power/state
-  && /usr/sbin/video_post - which will initialize the display in console mode.
-  If  you are in X, you can switch to a virtual terminal and back to X using
-  CTRL+ALT+F1 - CTRL+ALT+F7 to get the display working in graphical mode again.
+(8) on some systems, you can use the video_post utility and or 
+  do echo 3 > /sys/power/state  && /usr/sbin/video_post - which will 
+  initialize the display in console mode. If you are in X, you can switch
+  to a virtual terminal and back to X using  CTRL+ALT+F1 - CTRL+ALT+F7 to get
+  the display working in graphical mode again.
 
 Now, if you pass acpi_sleep=something, and it does not work with your
 bios, you'll get a hard crash during resume. Be careful. Also it is
@@ -177,7 +177,7 @@ Mainboard	    Graphics card                 hack (or "how to do it")
 Asus A7V8X	    nVidia RIVA TNT2 model 64	  s3_bios,s3_mode (4)
 
 
-(*) from http://www.ubuntulinux.org/wiki/HoaryPMResults, not sure
+(*) from https://wiki.ubuntu.com/HoaryPMResults, not sure
     which options to use. If you know, please tell me.
 
 (***) To be tested with a newer kernel.
diff --git a/Documentation/power/video_extension.txt b/Documentation/power/video_extension.txt
deleted file mode 100644
index b2f9b1598ac..00000000000
--- a/Documentation/power/video_extension.txt
+++ /dev/null
@@ -1,37 +0,0 @@
-ACPI video extensions
-~~~~~~~~~~~~~~~~~~~~~
-
-This driver implement the ACPI Extensions For Display Adapters for
-integrated graphics devices on motherboard, as specified in ACPI 2.0
-Specification, Appendix B, allowing to perform some basic control like
-defining the video POST device, retrieving EDID information or to
-setup a video output, etc.  Note that this is an ref. implementation
-only.  It may or may not work for your integrated video device.
-
-Interfaces exposed to userland through /proc/acpi/video:
-
-VGA/info : display the supported video bus device capability like Video ROM, CRT/LCD/TV.
-VGA/ROM :  Used to get a copy of the display devices' ROM data (up to 4k).
-VGA/POST_info : Used to determine what options are implemented.
-VGA/POST : Used to get/set POST device.
-VGA/DOS : Used to get/set ownership of output switching:
-	Please refer ACPI spec B.4.1 _DOS
-VGA/CRT : CRT output
-VGA/LCD : LCD output
-VGA/TVO : TV output
-VGA/*/brightness : Used to get/set brightness of output device
-
-Notify event through /proc/acpi/event:
-
-#define ACPI_VIDEO_NOTIFY_SWITCH        0x80
-#define ACPI_VIDEO_NOTIFY_PROBE         0x81
-#define ACPI_VIDEO_NOTIFY_CYCLE         0x82
-#define ACPI_VIDEO_NOTIFY_NEXT_OUTPUT   0x83
-#define ACPI_VIDEO_NOTIFY_PREV_OUTPUT   0x84
-
-#define ACPI_VIDEO_NOTIFY_CYCLE_BRIGHTNESS      0x82
-#define ACPI_VIDEO_NOTIFY_INC_BRIGHTNESS        0x83
-#define ACPI_VIDEO_NOTIFY_DEC_BRIGHTNESS        0x84
-#define ACPI_VIDEO_NOTIFY_ZERO_BRIGHTNESS       0x85
-#define ACPI_VIDEO_NOTIFY_DISPLAY_OFF           0x86
-