PCI Power Management

Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.

An overview of concepts and the Linux kernel's interfaces related to PCI power
management.  Based on previous work by Patrick Mochel <mochel@transmeta.com>
(and others).

This document only covers the aspects of power management specific to PCI
devices.  For a general description of the kernel's interfaces related to device
power management, refer to Documentation/power/devices.txt and
Documentation/power/runtime_pm.txt.

---------------------------------------------------------------------------

1. Hardware and Platform Support for PCI Power Management
2. PCI Subsystem and Device Power Management
3. PCI Device Drivers and Power Management
4. Resources


1. Hardware and Platform Support for PCI Power Management
=========================================================

1.1. Native and Platform-Based Power Management
-----------------------------------------------
In general, power management is a feature allowing one to save energy by putting
devices into states in which they draw less power (low-power states) at the
price of reduced functionality or performance.

Usually, a device is put into a low-power state when it is underutilized or
completely inactive.  However, when it is necessary to use the device once
again, it has to be put back into the "fully functional" state (full-power
state).  This may happen when there is data for the device to handle or as a
result of an external event requiring the device to be active, which may be
signaled by the device itself.

PCI devices may be put into low-power states in two ways: by using the device
capabilities introduced by the PCI Bus Power Management Interface Specification,
or with the help of platform firmware, such as an ACPI BIOS.  In the first
approach, referred to as native PCI power management (native PCI PM) in what
follows, the device power state is changed as a result of writing a specific
value into one of its standard configuration registers.  The second approach
requires the platform firmware to provide special methods that may be used by
the kernel to change the device's power state.

Devices supporting the native PCI PM usually can generate wakeup signals called
Power Management Events (PMEs) to let the kernel know about external events
requiring the device to be active.  After receiving a PME the kernel is supposed
to put the device that sent it into the full-power state.  However, the PCI Bus
Power Management Interface Specification doesn't define any standard method of
delivering the PME from the device to the CPU and the operating system kernel.
It is assumed that the platform firmware will perform this task and therefore,
even though a PCI device is set up to generate PMEs, it also may be necessary to
prepare the platform firmware for notifying the CPU of the PMEs coming from the
device (e.g. by generating interrupts).

In turn, if the methods provided by the platform firmware are used for changing
the power state of a device, usually the platform also provides a method for
preparing the device to generate wakeup signals.  In that case, however, it
often also is necessary to prepare the device for generating PMEs using the
native PCI PM mechanism, because the method provided by the platform depends on
that.

Thus in many situations both the native and the platform-based power management
mechanisms have to be used simultaneously to obtain the desired result.

1.2. Native PCI Power Management
--------------------------------
The PCI Bus Power Management Interface Specification (PCI PM Spec) was
introduced between the PCI 2.1 and PCI 2.2 Specifications.  It defined a
standard interface for performing various operations related to power
management.

The implementation of the PCI PM Spec is optional for conventional PCI devices,
but it is mandatory for PCI Express devices.  If a device supports the PCI PM
Spec, it has an 8 byte power management capability field in its PCI
configuration space.  This field is used to describe and control the standard
features related to the native PCI power management.
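
As a rough illustration of how this capability is laid out and how a power
state change is requested through it, the following user-space sketch lists the
register offsets within the 8-byte structure and computes a new value for the
control/status register.  The identifier names used here are made up for the
example (the kernel uses its own PCI_PM_* definitions) and the helper only
computes a value; it does not access any hardware.

```c
#include <stdint.h>

/* Byte offsets within the 8-byte PM capability, relative to its start. */
enum {
	PM_CAP_ID_OFF = 0,	/* Capability ID (0x01 for power management) */
	PM_NEXT_OFF   = 1,	/* Pointer to the next capability */
	PM_PMC_OFF    = 2,	/* Power Management Capabilities, 16 bits */
	PM_PMCSR_OFF  = 4,	/* Control/Status register, 16 bits */
	PM_BSE_OFF    = 6,	/* Bridge support extensions, 8 bits */
	PM_DATA_OFF   = 7,	/* Data register, 8 bits */
};

#define PMCSR_STATE_MASK 0x0003u	/* bits 1:0 select D0 (0) .. D3hot (3) */

/*
 * Return the control/status value that requests power state 'state'
 * while leaving all of the other bits of 'pmcsr' unchanged.
 */
uint16_t pmcsr_with_state(uint16_t pmcsr, unsigned int state)
{
	return (uint16_t)((pmcsr & ~PMCSR_STATE_MASK) |
			  (state & PMCSR_STATE_MASK));
}
```

In this sketch, pmcsr_with_state(0x0100, 3) yields 0x0103, that is, a request
for D3hot with the remaining bits preserved.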

The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses
(B0-B3).  The higher the number, the less power is drawn by the device or bus
in that state.  However, the higher the number, the longer the latency for
the device or bus to return to the full-power state (D0 or B0, respectively).

There are two variants of the D3 state defined by the specification.  The first
one is D3hot, referred to as the software accessible D3, because devices can be
programmed to go into it.  The second one, D3cold, is the state that PCI devices
are in when the supply voltage (Vcc) is removed from them.  It is not possible
to program a PCI device to go into D3cold, although there may be a programmable
interface for putting the bus the device is on into a state in which Vcc is
removed from all devices on the bus.

PCI bus power management, however, is not supported by the Linux kernel at the
time of this writing and therefore it is not covered by this document.

Note that every PCI device can be in the full-power state (D0) or in D3cold,
regardless of whether or not it implements the PCI PM Spec.  In addition to
that, if the PCI PM Spec is implemented by the device, it must support D3hot
as well as D0.  The support for the D1 and D2 power states is optional.

PCI devices supporting the PCI PM Spec can be programmed to go to any of the
supported low-power states (except for D3cold).  While in D1-D3hot the
standard configuration registers of the device must be accessible to software
(i.e. the device is required to respond to PCI configuration accesses), although
its I/O and memory spaces are then disabled.  This allows the device to be
programmatically put into D0.  Thus the kernel can switch the device back and
forth between D0 and the supported low-power states (except for D3cold) and the
possible power state transitions the device can undergo are the following:

+----------------------------+
| Current State | New State  |
+----------------------------+
| D0            | D1, D2, D3 |
+----------------------------+
| D1            | D2, D3     |
+----------------------------+
| D2            | D3         |
+----------------------------+
| D1, D2, D3    | D0         |
+----------------------------+
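
The table above can be expressed as a small predicate.  This is only a sketch
with made-up names; it encodes the programmable transitions listed in the table
(D3 meaning D3hot here) and does not check whether the optional D1 and D2
states are actually supported by a given device.

```c
#include <stdbool.h>

enum d_state { D0 = 0, D1 = 1, D2 = 2, D3HOT = 3 };

/*
 * Return true if software may program the device to go from 'cur' to
 * 'next' according to the transition table: any low-power state may go
 * back to D0, and otherwise only transitions to a deeper state are listed.
 */
bool d_transition_allowed(enum d_state cur, enum d_state next)
{
	if (cur == next)
		return false;	/* not a transition at all */
	if (next == D0)
		return true;	/* D1, D2, D3 -> D0 */
	return next > cur;	/* D0 -> D1..D3, D1 -> D2..D3, D2 -> D3 */
}
```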

The transition from D3cold to D0 occurs when the supply voltage is provided to
the device (i.e. power is restored).  In that case the device returns to D0 with
a full power-on reset sequence and the power-on defaults are restored to the
device by hardware just as at initial power up.

PCI devices supporting the PCI PM Spec can be programmed to generate PMEs
while in a low-power state (D1-D3), but they are not required to be capable
of generating PMEs from all supported low-power states.  In particular, the
capability of generating PMEs from D3cold is optional and depends on the
presence of additional voltage (3.3Vaux) allowing the device to remain
sufficiently active to generate a wakeup signal.
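
Which optional states a device supports, and from which states it can generate
PMEs, is reported in the 16-bit Power Management Capabilities register of the
capability structure.  The sketch below decodes those bits in user space.  The
macro, structure and function names are chosen for this example only; the bit
positions follow the PCI PM Spec layout (D1 support in bit 9, D2 support in
bit 10, PME support in bits 15:11).

```c
#include <stdbool.h>
#include <stdint.h>

#define PMC_D1_SUPPORT	0x0200u	/* bit 9: D1 is supported */
#define PMC_D2_SUPPORT	0x0400u	/* bit 10: D2 is supported */
#define PMC_PME_MASK	0xF800u	/* bits 15:11: states PME can be sent from */
#define PMC_PME_SHIFT	11

struct pm_caps {
	bool d1;		/* D1 supported */
	bool d2;		/* D2 supported */
	uint8_t pme_support;	/* bit 0: D0, 1: D1, 2: D2, 3: D3hot, 4: D3cold */
};

/* Decode the capability bits of a raw PMC register value. */
struct pm_caps decode_pmc(uint16_t pmc)
{
	struct pm_caps caps;

	caps.d1 = (pmc & PMC_D1_SUPPORT) != 0;
	caps.d2 = (pmc & PMC_D2_SUPPORT) != 0;
	caps.pme_support = (uint8_t)((pmc & PMC_PME_MASK) >> PMC_PME_SHIFT);
	return caps;
}
```

The resulting bitmask has the same meaning as the pme_support field of struct
pci_dev shown in Section 2.1.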

1.3. ACPI Device Power Management
---------------------------------
The platform firmware support for the power management of PCI devices is
system-specific.  However, if the system in question is compliant with the
Advanced Configuration and Power Interface (ACPI) Specification, like the
majority of x86-based systems, it is supposed to implement device power
management interfaces defined by the ACPI standard.

For this purpose the ACPI BIOS provides special functions called "control
methods" that may be executed by the kernel to perform specific tasks, such as
putting a device into a low-power state.  These control methods are encoded
using a special byte-code language called the ACPI Machine Language (AML) and
stored in the machine's BIOS.  The kernel loads them from the BIOS and executes
them as needed using an AML interpreter that translates the AML byte code into
computations and memory or I/O space accesses.  This way, in theory, a BIOS
writer can provide the kernel with a means to perform actions depending
on the system design in a system-specific fashion.

ACPI control methods may be divided into global control methods, which are not
associated with any particular devices, and device control methods, which have
to be defined separately for each device supposed to be handled with the help of
the platform.  This means, in particular, that ACPI device control methods can
only be used to handle devices that the BIOS writer knew about in advance.  The
ACPI methods used for device power management fall into that category.

The ACPI specification assumes that devices can be in one of four power states
labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM
D0-D3 states (although the difference between D3hot and D3cold is not taken
into account by ACPI).  Moreover, for each power state of a device there is a
set of power resources that have to be enabled for the device to be put into
that state.  These power resources are controlled (i.e. enabled or disabled)
with the help of their own control methods, _ON and _OFF, that have to be
defined individually for each of them.

To put a device into the ACPI power state Dx (where x is a number between 0 and
3 inclusive) the kernel is supposed to (1) enable the power resources required
by the device in this state using their _ON control methods and (2) execute the
_PSx control method defined for the device.  In addition to that, if the device
is going to be put into a low-power state (D1-D3) and is supposed to generate
wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI
3.0) control method defined for it has to be executed before _PSx.  Power
resources that are not required by the device in the target power state and are
not required any more by any other device should be disabled (by executing their
_OFF control methods).  If the current power state of the device is D3, it can
only be put into D0 this way.
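
The order of operations described above may be summarized by the following
user-space sketch, which only records the steps rather than executing any ACPI
methods.  All names are hypothetical.  The text above only requires _DSW (or
_PSW) to run before _PSx; placing it after the _ON step, and placing the _OFF
step last, are assumptions made for this example.

```c
#include <stdbool.h>
#include <stddef.h>

enum acpi_step {
	STEP_RES_ON,	/* _ON for power resources needed in Dx */
	STEP_DSW,	/* _DSW (or _PSW) to arm wakeup signaling */
	STEP_PSX,	/* _PSx for the device */
	STEP_RES_OFF,	/* _OFF for resources no longer needed by anyone */
};

/*
 * Fill 'out' with the steps for putting a device into ACPI state Dx and
 * return their number (at most 4).  Wakeup is armed only for low-power
 * states (x > 0) and only if the device is supposed to signal wakeup.
 */
size_t acpi_dx_steps(unsigned int x, bool wakeup, enum acpi_step out[4])
{
	size_t n = 0;

	out[n++] = STEP_RES_ON;
	if (x > 0 && wakeup)
		out[n++] = STEP_DSW;
	out[n++] = STEP_PSX;
	out[n++] = STEP_RES_OFF;
	return n;
}
```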

However, quite often the power states of devices are changed during a
system-wide transition into a sleep state or back into the working state.  ACPI
defines four system sleep states, S1, S2, S3, and S4, and denotes the system
working state as S0.  In general, the target system sleep (or working) state
determines the highest power (lowest number) state the device can be put
into and the kernel is supposed to obtain this information by executing the
device's _SxD control method (where x is a number between 0 and 4 inclusive).
If the device is required to wake up the system from the target sleep state, the
lowest power (highest number) state it can be put into is also determined by the
target state of the system.  The kernel is then supposed to use the device's
_SxW control method to obtain the number of that state.  It also is supposed to
use the device's _PRW control method to learn which power resources need to be
enabled for the device to be able to generate wakeup signals.
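
A minimal sketch of how the values returned by _SxD and _SxW bound the choice of
the device's power state is given below.  It assumes that both values have
already been obtained from the firmware and are passed in as plain integers;
the structure and function names are hypothetical, and error handling for
missing methods is left out.

```c
#include <stdbool.h>

struct d_range {
	int d_min;	/* highest-power (lowest number) state allowed */
	int d_max;	/* lowest-power (highest number) state allowed */
};

/*
 * 'sxd' is the value returned by _SxD for the target system state and
 * 'sxw' is the value returned by _SxW.  The latter limits the range only
 * if the device is required to wake up the system.
 */
struct d_range sleep_state_range(int sxd, int sxw, bool wakeup)
{
	struct d_range r;

	r.d_min = sxd;
	r.d_max = wakeup ? sxw : 3;	/* D3 if wakeup is not needed */
	return r;
}
```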

1.4. Wakeup Signaling
---------------------
Wakeup signals generated by PCI devices, either as native PCI PMEs, or as
a result of the execution of the _DSW (or _PSW) ACPI control method before
putting the device into a low-power state, have to be caught and handled as
appropriate.  If they are sent while the system is in the working state
(ACPI S0), they should be translated into interrupts so that the kernel can
put the devices generating them into the full-power state and take care of the
events that triggered them.  In turn, if they are sent while the system is
sleeping, they should cause the system's core logic to trigger wakeup.

On ACPI-based systems wakeup signals sent by conventional PCI devices are
converted into ACPI General-Purpose Events (GPEs) which are hardware signals
from the system core logic generated in response to various events that need to
be acted upon.  Every GPE is associated with one or more sources of potentially
interesting events.  In particular, a GPE may be associated with a PCI device
capable of signaling wakeup.  The information on the connections between GPEs
and event sources is recorded in the system's ACPI BIOS from where it can be
read by the kernel.

If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE
associated with it (if there is one) is triggered.  The GPEs associated with PCI
bridges may also be triggered in response to a wakeup signal from one of the
devices below the bridge (this also is the case for root bridges) and, for
example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be
handled this way.

A GPE may be triggered when the system is sleeping (i.e. when it is in one of
the ACPI S1-S4 states), in which case system wakeup is started by its core logic
(the device that was the source of the signal causing the system wakeup to occur
may be identified later).  The GPEs used in such situations are referred to as
wakeup GPEs.

Usually, however, GPEs are also triggered when the system is in the working
state (ACPI S0) and in that case the system's core logic generates a System
Control Interrupt (SCI) to notify the kernel of the event.  Then, the SCI
handler identifies the GPE that caused the interrupt to be generated which,
in turn, allows the kernel to identify the source of the event (that may be
a PCI device signaling wakeup).  The GPEs used for notifying the kernel of
events occurring while the system is in the working state are referred to as
runtime GPEs.

Unfortunately, there is no standard way of handling wakeup signals sent by
conventional PCI devices on systems that are not ACPI-based, but there is one
for PCI Express devices.  Namely, the PCI Express Base Specification introduced
a native mechanism for converting native PCI PMEs into interrupts generated by
root ports.  For conventional PCI devices native PMEs are out-of-band, so they
are routed separately and they need not pass through bridges (in principle they
may be routed directly to the system's core logic), but for PCI Express devices
they are in-band messages that have to pass through the PCI Express hierarchy,
including the root port on the path from the device to the Root Complex.  Thus
it was possible to introduce a mechanism by which a root port generates an
interrupt whenever it receives a PME message from one of the devices below it.
The PCI Express Requester ID of the device that sent the PME message is then
recorded in one of the root port's configuration registers from where it may be
read by the interrupt handler allowing the device to be identified.  [PME
messages sent by PCI Express endpoints integrated with the Root Complex don't
pass through root ports, but instead they cause a Root Complex Event Collector
(if there is one) to generate interrupts.]
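
For illustration, the Requester ID recorded by the root port can be split into
the bus, device and function numbers of the device that sent the PME message.
The sketch below assumes the conventional 8/5/3-bit layout (i.e. no Alternative
Routing-ID Interpretation); the structure and function names are made up for
this example.

```c
#include <stdint.h>

struct pci_bdf {
	uint8_t bus;	/* bits 15:8 of the Requester ID */
	uint8_t dev;	/* bits 7:3 */
	uint8_t fn;	/* bits 2:0 */
};

/* Split a 16-bit PCI Express Requester ID into bus/device/function. */
struct pci_bdf decode_requester_id(uint16_t rid)
{
	struct pci_bdf bdf;

	bdf.bus = (uint8_t)(rid >> 8);
	bdf.dev = (uint8_t)((rid >> 3) & 0x1f);
	bdf.fn = (uint8_t)(rid & 0x07);
	return bdf;
}
```

For example, a Requester ID of 0x02A5 corresponds to bus 2, device 20,
function 5.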

In principle the native PCI Express PME signaling may also be used on ACPI-based
systems along with the GPEs, but to use it the kernel has to ask the system's
ACPI BIOS to release control of root port configuration registers.  The ACPI
BIOS, however, is not required to allow the kernel to control these registers
and if it doesn't do that, the kernel must not modify their contents.  Of course
the native PCI Express PME signaling cannot be used by the kernel in that case.


2. PCI Subsystem and Device Power Management
============================================

2.1. Device Power Management Callbacks
--------------------------------------
The PCI Subsystem participates in the power management of PCI devices in a
number of ways.  First of all, it provides an intermediate code layer between
the device power management core (PM core) and PCI device drivers.
Specifically, the pm field of the PCI subsystem's struct bus_type object,
pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing
pointers to several device power management callbacks:

const struct dev_pm_ops pci_dev_pm_ops = {
	.prepare = pci_pm_prepare,
	.complete = pci_pm_complete,
	.suspend = pci_pm_suspend,
	.resume = pci_pm_resume,
	.freeze = pci_pm_freeze,
	.thaw = pci_pm_thaw,
	.poweroff = pci_pm_poweroff,
	.restore = pci_pm_restore,
	.suspend_noirq = pci_pm_suspend_noirq,
	.resume_noirq = pci_pm_resume_noirq,
	.freeze_noirq = pci_pm_freeze_noirq,
	.thaw_noirq = pci_pm_thaw_noirq,
	.poweroff_noirq = pci_pm_poweroff_noirq,
	.restore_noirq = pci_pm_restore_noirq,
	.runtime_suspend = pci_pm_runtime_suspend,
	.runtime_resume = pci_pm_runtime_resume,
	.runtime_idle = pci_pm_runtime_idle,
};

These callbacks are executed by the PM core in various situations related to
device power management and they, in turn, execute power management callbacks
provided by PCI device drivers.  They also perform power management operations
involving some standard configuration registers of PCI devices that device
drivers need not know or care about.

The structure representing a PCI device, struct pci_dev, contains several fields
that these callbacks operate on:

struct pci_dev {
	...
	pci_power_t     current_state;  /* Current operating state. */
	int		pm_cap;		/* PM capability offset in the
					   configuration space */
	unsigned int	pme_support:5;	/* Bitmask of states from which PME#
					   can be generated */
	unsigned int	pme_interrupt:1;/* Is native PCIe PME signaling used? */
	unsigned int	d1_support:1;	/* Low power state D1 is supported */
	unsigned int	d2_support:1;	/* Low power state D2 is supported */
	unsigned int	no_d1d2:1;	/* D1 and D2 are forbidden */
	unsigned int	wakeup_prepared:1;  /* Device prepared for wake up */
	unsigned int	d3_delay;	/* D3->D0 transition time in ms */
	...
};

They also indirectly use some fields of the struct device that is embedded in
struct pci_dev.

2.2. Device Initialization
--------------------------
The PCI subsystem's first task related to device power management is to
prepare the device for power management and initialize the fields of struct
pci_dev used for this purpose.  This happens in two functions defined in
drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init().

The first of these functions checks if the device supports native PCI PM
and if that's the case the offset of its power management capability structure
in the configuration space is stored in the pm_cap field of the device's struct
pci_dev object.  Next, the function checks which PCI low-power states are
supported by the device and from which low-power states the device can generate
native PCI PMEs.  The power management fields of the device's struct pci_dev and
the struct device embedded in it are updated accordingly and the generation of
PMEs by the device is disabled.

The second function checks if the device can be prepared to signal wakeup with
the help of the platform firmware, such as the ACPI BIOS.  If that is the case,
the function updates the wakeup fields in struct device embedded in the
device's struct pci_dev and uses the firmware-provided method to prevent the
device from signaling wakeup.

At this point the device is ready for power management.  For driverless devices,
however, this functionality is limited to a few basic operations carried out
during system-wide transitions to a sleep state and back to the working state.

2.3. Runtime Device Power Management
------------------------------------
The PCI subsystem plays a vital role in the runtime power management of PCI
devices.  For this purpose it uses the general runtime power management
(runtime PM) framework described in Documentation/power/runtime_pm.txt.
Namely, it provides subsystem-level callbacks:

	pci_pm_runtime_suspend()
	pci_pm_runtime_resume()
	pci_pm_runtime_idle()

that are executed by the core runtime PM routines.  It also implements all of
the mechanics necessary for handling runtime wakeup signals from PCI devices
in low-power states, which at the time of this writing works for both the native
PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in
Section 1.

First, a PCI device is put into a low-power state, or suspended, with the help
of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call
pci_pm_runtime_suspend() to do the actual job.  For this to work, the device's
driver has to provide a pm->runtime_suspend() callback (see below), which is
run by pci_pm_runtime_suspend() as the first action.  If the driver's callback
returns successfully, the device's standard configuration registers are saved,
the device is prepared to generate wakeup signals and, finally, it is put into
the target low-power state.
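
The sequence described above can be modeled in user space as follows.  This is
not the kernel's code: struct mock_dev, its fields, the two sample driver
callbacks and the error values are all invented for the illustration, and the
"save", "arm wakeup" and "set state" steps are reduced to flag updates.

```c
#include <stdbool.h>
#include <stddef.h>

struct mock_dev {
	int (*drv_runtime_suspend)(struct mock_dev *dev); /* driver callback */
	bool state_saved;	/* standard config registers saved */
	bool wakeup_armed;	/* device prepared to signal wakeup */
	int state;		/* current power state, 0 == D0 */
};

/* Sample driver callbacks: one succeeds, one refuses to suspend. */
int drv_ok(struct mock_dev *dev) { (void)dev; return 0; }
int drv_busy(struct mock_dev *dev) { (void)dev; return -16; }

/* Model of the subsystem-level runtime suspend steps, in order. */
int mock_runtime_suspend(struct mock_dev *dev, int target)
{
	int err;

	if (dev->drv_runtime_suspend == NULL)
		return -1;		/* no driver callback, cannot suspend */

	err = dev->drv_runtime_suspend(dev);	/* driver goes first */
	if (err)
		return err;		/* device is left untouched */

	dev->state_saved = true;	/* save configuration registers */
	dev->wakeup_armed = true;	/* prepare wakeup signaling */
	dev->state = target;		/* enter the low-power state */
	return 0;
}
```

Note that in this model a failing driver callback leaves the device in its
current state with nothing saved or armed.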

The low-power state to put the device into is the lowest-power (highest number)
state from which it can signal wakeup.  The exact method of signaling wakeup is
system-dependent and is determined by the PCI subsystem on the basis of the
reported capabilities of the device and the platform firmware.  To prepare the
device for signaling wakeup and put it into the selected low-power state, the
PCI subsystem can use the platform firmware as well as the device's native PCI
PM capabilities, if supported.
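
For the native PCI PM case alone, the selection of the target state can be
sketched as a search through the pme_support bitmask, starting from D3hot and
moving toward D0 until a state from which PMEs can be generated is found.  The
function name is hypothetical and the sketch ignores the platform firmware,
which may impose a different choice on a real system.

```c
#include <stdint.h>

/*
 * 'pme_support' uses bit n for state Dn (bit 3 is D3hot).  Return the
 * lowest-power (highest number) state, from D3hot down to D0, from which
 * the device can generate PMEs.  If no bit is set, the search ends at D0.
 */
int deepest_wakeup_state(uint8_t pme_support)
{
	int state = 3;	/* start from D3hot */

	while (state > 0 && !(pme_support & (1u << state)))
		state--;
	return state;
}
```

For example, a device that can only generate PMEs from D1 and D2 (bitmask
0x06) would be put into D2 under this rule.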

It is expected that the device driver's pm->runtime_suspend() callback will
not attempt to prepare the device for signaling wakeup or to put it into a
low-power state.  The driver ought to leave these tasks to the PCI subsystem
that has all of the information necessary to perform them.

A suspended device is brought back into the "active" state, or resumed,
with the help of pm_request_resume() or pm_runtime_resume() which both call
pci_pm_runtime_resume() for PCI devices.  Again, this only works if the device's
driver provides a pm->runtime_resume() callback (see below).  However, before
the driver's callback is executed, pci_pm_runtime_resume() brings the device
back into the full-power state, prevents it from signaling wakeup while in that
state and restores its standard configuration registers.  Thus the driver's
callback need not worry about the PCI-specific aspects of the device resume.

Note that generally pci_pm_runtime_resume() may be called in two different
situat