diff options
author | Mauro Carvalho Chehab <mchehab@infradead.org> | 2006-02-06 10:43:13 -0200 |
---|---|---|
committer | Mauro Carvalho Chehab <mchehab@infradead.org> | 2006-02-06 10:43:13 -0200 |
commit | b2faf597d93bdf5e2d12d93ea0815935a73f749e (patch) | |
tree | 1876616290ff282b8a0814e2429d23e0104f3701 /Documentation | |
parent | 638e174688f58200d0deb7435093435e7d737b09 (diff) | |
parent | 410c05427a69f53851637ccb85c2212131409fbd (diff) |
Merge branch 'origin'
Diffstat (limited to 'Documentation')
30 files changed, 2607 insertions, 435 deletions
diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt index fcbcbc35b12..6221464d1a7 100644 --- a/Documentation/RCU/RTFP.txt +++ b/Documentation/RCU/RTFP.txt @@ -90,16 +90,20 @@ at OLS. The resulting abundance of RCU patches was presented the following year [McKenney02a], and use of RCU in dcache was first described that same year [Linder02a]. -Also in 2002, Michael [Michael02b,Michael02a] presented techniques -that defer the destruction of data structures to simplify non-blocking -synchronization (wait-free synchronization, lock-free synchronization, -and obstruction-free synchronization are all examples of non-blocking -synchronization). In particular, this technique eliminates locking, -reduces contention, reduces memory latency for readers, and parallelizes -pipeline stalls and memory latency for writers. However, these -techniques still impose significant read-side overhead in the form of -memory barriers. Researchers at Sun worked along similar lines in the -same timeframe [HerlihyLM02,HerlihyLMS03]. +Also in 2002, Michael [Michael02b,Michael02a] presented "hazard-pointer" +techniques that defer the destruction of data structures to simplify +non-blocking synchronization (wait-free synchronization, lock-free +synchronization, and obstruction-free synchronization are all examples of +non-blocking synchronization). In particular, this technique eliminates +locking, reduces contention, reduces memory latency for readers, and +parallelizes pipeline stalls and memory latency for writers. However, +these techniques still impose significant read-side overhead in the +form of memory barriers. Researchers at Sun worked along similar lines +in the same timeframe [HerlihyLM02,HerlihyLMS03]. These techniques +can be thought of as inside-out reference counts, where the count is +represented by the number of hazard pointers referencing a given data +structure (rather than the more conventional counter field within the +data structure itself). In 2003, the K42 group described how RCU could be used to create hot-pluggable implementations of operating-system functions. Later that @@ -113,7 +117,6 @@ number of operating-system kernels [PaulEdwardMcKenneyPhD], a paper describing how to make RCU safe for soft-realtime applications [Sarma04c], and a paper describing SELinux performance with RCU [JamesMorris04b]. - 2005 has seen further adaptation of RCU to realtime use, permitting preemption of RCU realtime critical sections [PaulMcKenney05a, PaulMcKenney05b]. diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt index e118a7c1a09..49e27cc1938 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.txt @@ -177,3 +177,9 @@ over a rather long period of time, but improvements are always welcome! If you want to wait for some of these other things, you might instead need to use synchronize_irq() or synchronize_sched(). + +12. Any lock acquired by an RCU callback must be acquired elsewhere + with irq disabled, e.g., via spin_lock_irqsave(). Failing to + disable irq on a given acquisition of that lock will result in + deadlock as soon as the RCU callback happens to interrupt that + acquisition's critical section. diff --git a/Documentation/RCU/listRCU.txt b/Documentation/RCU/listRCU.txt index f8a54fa0d8a..1fd175368a8 100644 --- a/Documentation/RCU/listRCU.txt +++ b/Documentation/RCU/listRCU.txt @@ -232,7 +232,7 @@ entry does not exist. For this to be helpful, the search function must return holding the per-entry spinlock, as ipc_lock() does in fact do. Quick Quiz: Why does the search function need to return holding the -per-entry lock for this deleted-flag technique to be helpful? + per-entry lock for this deleted-flag technique to be helpful? If the system-call audit module were to ever need to reject stale data, one way to accomplish this would be to add a "deleted" flag and a "lock" @@ -275,8 +275,8 @@ flag under the spinlock as follows: { struct audit_entry *e; - /* Do not use the _rcu iterator here, since this is the only - * deletion routine. */ + /* Do not need to use the _rcu iterator here, since this + * is the only deletion routine. */ list_for_each_entry(e, list, list) { if (!audit_compare_rule(rule, &e->rule)) { spin_lock(&e->lock); @@ -304,9 +304,12 @@ function to reject newly deleted data. Answer to Quick Quiz - -If the search function drops the per-entry lock before returning, then -the caller will be processing stale data in any case. If it is really -OK to be processing stale data, then you don't need a "deleted" flag. -If processing stale data really is a problem, then you need to hold the -per-entry lock across all of the code that uses the value looked up. + Why does the search function need to return holding the per-entry + lock for this deleted-flag technique to be helpful? + + If the search function drops the per-entry lock before returning, + then the caller will be processing stale data in any case. If it + is really OK to be processing stale data, then you don't need a + "deleted" flag. If processing stale data really is a problem, + then you need to hold the per-entry lock across all of the code + that uses the value that was returned. diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt index 6fa09225158..02e27bf1d36 100644 --- a/Documentation/RCU/rcu.txt +++ b/Documentation/RCU/rcu.txt @@ -111,6 +111,11 @@ o What are all these files in this directory? You are reading it! + rcuref.txt + + Describes how to combine use of reference counts + with RCU. + whatisRCU.txt Overview of how the RCU implementation works. Along diff --git a/Documentation/RCU/rcuref.txt b/Documentation/RCU/rcuref.txt index 3f60db41b2f..451de2ad832 100644 --- a/Documentation/RCU/rcuref.txt +++ b/Documentation/RCU/rcuref.txt @@ -1,7 +1,7 @@ -Refcounter design for elements of lists/arrays protected by RCU. +Reference-count design for elements of lists/arrays protected by RCU. -Refcounting on elements of lists which are protected by traditional -reader/writer spinlocks or semaphores are straight forward as in: +Reference counting on elements of lists which are protected by traditional +reader/writer spinlocks or semaphores are straightforward: 1. 2. add() search_and_reference() @@ -28,12 +28,12 @@ release_referenced() delete() ... } -If this list/array is made lock free using rcu as in changing the -write_lock in add() and delete() to spin_lock and changing read_lock +If this list/array is made lock free using RCU as in changing the +write_lock() in add() and delete() to spin_lock and changing read_lock in search_and_reference to rcu_read_lock(), the atomic_get in search_and_reference could potentially hold reference to an element which -has already been deleted from the list/array. atomic_inc_not_zero takes -care of this scenario. search_and_reference should look as; +has already been deleted from the list/array. Use atomic_inc_not_zero() +in this scenario as follows: 1. 2. add() search_and_reference() @@ -51,17 +51,16 @@ add() search_and_reference() release_referenced() delete() { { ... write_lock(&list_lock); - atomic_dec(&el->rc, relfunc) ... - ... delete_element -} write_unlock(&list_lock); - ... + if (atomic_dec_and_test(&el->rc)) ... + call_rcu(&el->head, el_free); delete_element + ... write_unlock(&list_lock); +} ... if (atomic_dec_and_test(&el->rc)) call_rcu(&el->head, el_free); ... } -Sometimes, reference to the element need to be obtained in the -update (write) stream. In such cases, atomic_inc_not_zero might be an -overkill since the spinlock serialising list updates are held. atomic_inc -is to be used in such cases. - +Sometimes, a reference to the element needs to be obtained in the +update (write) stream. In such cases, atomic_inc_not_zero() might be +overkill, since we hold the update-side spinlock. One might instead +use atomic_inc() in such cases. diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 15da16861fa..5ed85af8878 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -200,10 +200,11 @@ rcu_assign_pointer() the new value, and also executes any memory-barrier instructions required for a given CPU architecture. - Perhaps more important, it serves to document which pointers - are protected by RCU. That said, rcu_assign_pointer() is most - frequently used indirectly, via the _rcu list-manipulation - primitives such as list_add_rcu(). + Perhaps just as important, it serves to document (1) which + pointers are protected by RCU and (2) the point at which a + given structure becomes accessible to other CPUs. That said, + rcu_assign_pointer() is most frequently used indirectly, via + the _rcu list-manipulation primitives such as list_add_rcu(). rcu_dereference() @@ -258,9 +259,11 @@ rcu_dereference() locking. As with rcu_assign_pointer(), an important function of - rcu_dereference() is to document which pointers are protected - by RCU. And, again like rcu_assign_pointer(), rcu_dereference() - is typically used indirectly, via the _rcu list-manipulation + rcu_dereference() is to document which pointers are protected by + RCU, in particular, flagging a pointer that is subject to changing + at any time, including immediately after the rcu_dereference(). + And, again like rcu_assign_pointer(), rcu_dereference() is + typically used indirectly, via the _rcu list-manipulation primitives, such as list_for_each_entry_rcu(). The following diagram shows how each API communicates among the @@ -327,7 +330,7 @@ for specialized uses, but are relatively uncommon. 3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API? This section shows a simple use of the core RCU API to protect a -global pointer to a dynamically allocated structure. More typical +global pointer to a dynamically allocated structure. More-typical uses of RCU may be found in listRCU.txt, arrayRCU.txt, and NMI-RCU.txt. struct foo { @@ -410,6 +413,8 @@ o Use synchronize_rcu() -after- removing a data element from an data item. See checklist.txt for additional rules to follow when using RCU. +And again, more-typical uses of RCU may be found in listRCU.txt, +arrayRCU.txt, and NMI-RCU.txt. 4. WHAT IF MY UPDATING THREAD CANNOT BLOCK? @@ -513,7 +518,7 @@ production-quality implementation, and see: for papers describing the Linux kernel RCU implementation. The OLS'01 and OLS'02 papers are a good introduction, and the dissertation provides -more details on the current implementation. +more details on the current implementation as of early 2004. 5A. "TOY" IMPLEMENTATION #1: LOCKING @@ -768,7 +773,6 @@ RCU pointer/list traversal: rcu_dereference list_for_each_rcu (to be deprecated in favor of list_for_each_entry_rcu) - list_for_each_safe_rcu (deprecated, not used) list_for_each_entry_rcu list_for_each_continue_rcu (to be deprecated in favor of new list_for_each_entry_continue_rcu) @@ -807,7 +811,8 @@ Quick Quiz #1: Why is this argument naive? How could a deadlock Answer: Consider the following sequence of events: 1. CPU 0 acquires some unrelated lock, call it - "problematic_lock". + "problematic_lock", disabling irq via + spin_lock_irqsave(). 2. CPU 1 enters synchronize_rcu(), write-acquiring rcu_gp_mutex. @@ -894,7 +899,7 @@ Answer: Just as PREEMPT_RT permits preemption of spinlock ACKNOWLEDGEMENTS My thanks to the people who helped make this human-readable, including -Jon Walpole, Josh Triplett, Serge Hallyn, and Suzanne Wood. +Jon Walpole, Josh Triplett, Serge Hallyn, Suzanne Wood, and Alan Stern. For more information, see http://www.rdrop.com/users/paulmck/RCU. diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt new file mode 100644 index 00000000000..ff280e2e161 --- /dev/null +++ b/Documentation/cputopology.txt @@ -0,0 +1,41 @@ + +Export cpu topology info by sysfs. Items (attributes) are similar +to /proc/cpuinfo. + +1) /sys/devices/system/cpu/cpuX/topology/physical_package_id: +represent the physical package id of cpu X; +2) /sys/devices/system/cpu/cpuX/topology/core_id: +represent the cpu core id to cpu X; +3) /sys/devices/system/cpu/cpuX/topology/thread_siblings: +represent the thread siblings to cpu X in the same core; +4) /sys/devices/system/cpu/cpuX/topology/core_siblings: +represent the thread siblings to cpu X in the same physical package; + +To implement it in an architecture-neutral way, a new source file, +driver/base/topology.c, is to export the 5 attributes. + +If one architecture wants to support this feature, it just needs to +implement 4 defines, typically in file include/asm-XXX/topology.h. +The 4 defines are: +#define topology_physical_package_id(cpu) +#define topology_core_id(cpu) +#define topology_thread_siblings(cpu) +#define topology_core_siblings(cpu) + +The type of **_id is int. +The type of siblings is cpumask_t. + +To be consistent on all architectures, the 4 attributes should have +deafult values if their values are unavailable. Below is the rule. +1) physical_package_id: If cpu has no physical package id, -1 is the +default value. +2) core_id: If cpu doesn't support multi-core, its core id is 0. +3) thread_siblings: Just include itself, if the cpu doesn't support +HT/multi-thread. +4) core_siblings: Just include itself, if the cpu doesn't support +multi-core and HT/Multi-thread. + +So be careful when declaring the 4 defines in include/asm-XXX/topology.h. + +If an attribute isn't defined on an architecture, it won't be exported. + diff --git a/Documentation/driver-model/overview.txt b/Documentation/driver-model/overview.txt index 44662735cf8..ac4a7a737e4 100644 --- a/Documentation/driver-model/overview.txt +++ b/Documentation/driver-model/overview.txt @@ -1,50 +1,43 @@ The Linux Kernel Device Model -Patrick Mochel <mochel@osdl.org> +Patrick Mochel <mochel@digitalimplant.org> -26 August 2002 +Drafted 26 August 2002 +Updated 31 January 2006 Overview ~~~~~~~~ -This driver model is a unification of all the current, disparate driver models -that are currently in the kernel. It is intended to augment the +The Linux Kernel Driver Model is a unification of all the disparate driver +models that were previously used in the kernel. It is intended to augment the bus-specific drivers for bridges and devices by consolidating a set of data and operations into globally accessible data structures. -Current driver models implement some sort of tree-like structure (sometimes -just a list) for the devices they control. But, there is no linkage between -the different bus types. +Traditional driver models implemented some sort of tree-like structure +(sometimes just a list) for the devices they control. There wasn't any +uniformity across the different bus types. -A common data structure can provide this linkage with little overhead: when a -bus driver discovers a particular device, it can insert it into the global -tree as well as its local tree. In fact, the local tree becomes just a subset -of the global tree. - -Common data fields can also be moved out of the local bus models into the -global model. Some of the manipulations of these fields can also be -consolidated. Most likely, manipulation functions will become a set -of helper functions, which the bus drivers wrap around to include any -bus-specific items. - -The common device and bridge interface currently reflects the goals of the -modern PC: namely the ability to do seamless Plug and Play, power management, -and hot plug. (The model dictated by Intel and Microsoft (read: ACPI) ensures -us that any device in the system may fit any of these criteria.) - -In reality, not every bus will be able to support such operations. But, most -buses will support a majority of those operations, and all future buses will. -In other words, a bus that doesn't support an operation is the exception, -instead of the other way around. +The current driver model provides a comon, uniform data model for describing +a bus and the devices that can appear under the bus. The unified bus +model includes a set of common attributes which all busses carry, and a set +of common callbacks, such as device discovery during bus probing, bus +shutdown, bus power management, etc. +The common device and bridge interface reflects the goals of the modern +computer: namely the ability to do seamless device "plug and play", power +management, and hot plug. In particular, the model dictated by Intel and +Microsoft (namely ACPI) ensures that almost every device on almost any bus +on an x86-compatible system can work within this paradigm. Of course, +not every bus is able to support all such operations, although most +buses support a most of those operations. Downstream Access ~~~~~~~~~~~~~~~~~ Common data fields have been moved out of individual bus layers into a common -data structure. But, these fields must still be accessed by the bus layers, +data structure. These fields must still be accessed by the bus layers, and sometimes by the device-specific drivers. Other bus layers are encouraged to do what has been done for the PCI layer. @@ -53,7 +46,7 @@ struct pci_dev now looks like this: struct pci_dev { ... - struct device device; + struct device dev; }; Note first that it is statically allocated. This means only one allocation on @@ -64,9 +57,9 @@ the two. The PCI bus layer freely accesses the fields of struct device. It knows about the structure of struct pci_dev, and it should know the structure of struct -device. PCI devices that have been converted generally do not touch the fields -of struct device. More precisely, device-specific drivers should not touch -fields of struct device unless there is a strong compelling reason to do so. +device. Individual PCI device drivers that have been converted the the current +driver model generally do not and should not touch the fields of struct device, +unless there is a strong compelling reason to do so. This abstraction is prevention of unnecessary pain during transitional phases. If the name of the field changes or is removed, then every downstream driver diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index b4a1ea76269..4d4897c8ef9 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -148,3 +148,17 @@ Why: The 8250 serial driver now has the ability to deal with the differences brother on Alchemy SOCs. The loss of features is not considered an issue. Who: Ralf Baechle <ralf@linux-mips.org> + +--------------------------- + +What: Legacy /proc/pci interface (PCI_LEGACY_PROC) +When: March 2006 +Why: deprecated since 2.5.53 in favor of lspci(8) +Who: Adrian Bunk <bunk@stusta.de> + +--------------------------- + +What: pci_module_init(driver) +When: January 2007 +Why: Is replaced by pci_register_driver(pci_driver). +Who: Richard Knutsson <ricknu-0@student.ltu.se> and Greg Kroah-Hartman <gregkh@suse.de> diff --git a/Documentation/filesystems/configfs/configfs_example.c b/Documentation/filesystems/configfs/configfs_example.c index f3c6e4946f9..3d4713a6c20 100644 --- a/Documentation/filesystems/configfs/configfs_example.c +++ b/Documentation/filesystems/configfs/configfs_example.c @@ -320,6 +320,7 @@ static struct config_item_type simple_children_type = { .ct_item_ops = &simple_children_item_ops, .ct_group_ops = &simple_children_group_ops, .ct_attrs = simple_children_attrs, + .ct_owner = THIS_MODULE, }; static struct configfs_subsystem simple_children_subsys = { @@ -403,6 +404,7 @@ static struct config_item_type group_children_type = { .ct_item_ops = &group_children_item_ops, .ct_group_ops = &group_children_group_ops, .ct_attrs = group_children_attrs, + .ct_owner = THIS_MODULE, }; static struct configfs_subsystem group_children_subsys = { diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt index f2595caf052..4389c684a80 100644 --- a/Documentation/filesystems/ocfs2.txt +++ b/Documentation/filesystems/ocfs2.txt @@ -35,6 +35,7 @@ Features which OCFS2 does not support yet: be cluster coherent. - quotas - cluster aware flock + - cluster aware lockf - Directory change notification (F_NOTIFY) - Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease) - POSIX ACLs diff --git a/Documentation/kernel-doc-nano-HOWTO.txt b/Documentation/kernel-doc-nano-HOWTO.txt index c406ce67edd..c65233d430f 100644 --- a/Documentation/kernel-doc-nano-HOWTO.txt +++ b/Documentation/kernel-doc-nano-HOWTO.txt @@ -45,10 +45,10 @@ How to extract the documentation If you just want to read the ready-made books on the various subsystems (see Documentation/DocBook/*.tmpl), just type 'make -psdocs', or 'make pdfdocs', or 'make htmldocs', depending on your -preference. If you would rather read a different format, you can type -'make sgmldocs' and then use DocBook tools to convert -Documentation/DocBook/*.sgml to a format of your choice (for example, +psdocs', or 'make pdfdocs', or 'make htmldocs', depending on your +preference. If you would rather read a different format, you can type +'make sgmldocs' and then use DocBook tools to convert +Documentation/DocBook/*.sgml to a format of your choice (for example, 'db2html ...' if 'make htmldocs' was not defined). If you want to see man pages instead, you can do this: @@ -124,6 +124,36 @@ patterns, which are highlighted appropriately. Take a look around the source tree for examples. +kernel-doc for structs, unions, enums, and typedefs +--------------------------------------------------- + +Beside functions you can also write documentation for structs, unions, +enums and typedefs. Instead of the function name you must write the name +of the declaration; the struct/union/enum/typedef must always precede +the name. Nesting of declarations is not supported. +Use the argument mechanism to document members or constants. + +Inside a struct description, you can use the "private:" and "public:" +comment tags. Structure fields that are inside a "private:" area +are not listed in the generated output documentation. + +Example: + +/** + * struct my_struct - short description + * @a: first member + * @b: second member + * + * Longer description + */ +struct my_struct { + int a; + int b; +/* private: */ + int c; +}; + + How to make new SGML template files ----------------------------------- @@ -147,4 +177,3 @@ documentation, in <filename>, for the functions listed. Tim. */ <twaugh@redhat.com> - diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 1cbcf65b764..84370363da8 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -452,6 +452,11 @@ running once the system is up. eata= [HW,SCSI] + ec_intr= [HW,ACPI] ACPI Embedded Controller interrupt mode + Format: <int> + 0: polling mode + non-0: interrupt mode (default) + eda= [HW,PS2] edb= [HW,PS2] diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 2b7cf19a06a..26364d06ae9 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -427,6 +427,23 @@ icmp_ignore_bogus_error_responses - BOOLEAN will avoid log file clutter. Default: FALSE +icmp_errors_use_inbound_ifaddr - BOOLEAN + + If zero, icmp error messages are sent with the primary address of + the exiting interface. + + If non-zero, the message will be sent with the primary address of + the interface that received the packet that caused the icmp error. + This is the behaviour network many administrators will expect from + a router. And it can make debugging complicated network layouts + much easier. + + Note that if no primary address exists for the interface selected, + then the primary address of the first non-loopback interface that + has one will be used regarldess of this setting. + + Default: 0 + igmp_max_memberships - INTEGER Change the maximum number of multicast groups we can subscribe to. Default: 20 diff --git a/Documentation/parport-lowlevel.txt b/Documentation/parport-lowlevel.txt index 1d40008a192..8f2302415ef 100644 --- a/Documentation/parport-lowlevel.txt +++ b/Documentation/parport-lowlevel.txt @@ -1068,7 +1068,7 @@ SYNOPSIS struct parport_operations { ... - void (*write_status) (struct parport *port, unsigned char s); + void (*write_control) (struct parport *port, unsigned char s); ... }; @@ -1097,9 +1097,9 @@ SYNOPSIS struct parport_operations { ... - void (*frob_control) (struct parport *port, - unsigned char mask, - unsigned char val); + unsigned char (*frob_control) (struct parport *port, + unsigned char mask, + unsigned char val); ... }; diff --git a/Documentation/pci-error-recovery.txt b/Documentation/pci-error-recovery.txt index d089967e494..634d3e5b575 100644 --- a/Documentation/pci-error-recovery.txt +++ b/Documentation/pci-error-recovery.txt @@ -1,246 +1,396 @@ PCI Error Recovery ------------------ - May 31, 2005 - - Current document maintainer: - Linas Vepstas <linas@austin.ibm.com> - - -Some PCI bus controllers are able to detect certain "hard" PCI errors -on the bus, such as parity errors on the data and address busses, as -well as SERR and PERR errors. These chipsets are then able to disable -I/O to/from the affected device, so that, for example, a bad DMA -address doesn't end up corrupting system memory. These same chipsets -are also able to reset the affected PCI device, and return it to -working condition. This document describes a generic API form -performing error recovery. - -The core idea is that after a PCI error has been detected, there must -be a way for the kernel to coordinate with all affected device drivers -so that the pci card can be made operational again, possibly after -performing a full electrical #RST of the PCI card. The API below -provides a generic API for device drivers to be notified of PCI -errors, and to be notified of, and respond to, a reset sequence. - -Preliminary sketch of API, cut-n-pasted-n-modified email from -Ben Herrenschmidt, circa 5 april 2005 + February 2, 2006 + + Current document maintainer: + Linas Vepstas <linas@austin.ibm.com> + + +Many PCI bus controllers are able to detect a variety of hardware +PCI errors on the bus, such as parity errors on the data and address +busses, as well as SERR and PERR errors. Some of the more advanced +chipsets are able to deal with these errors; these include PCI-E chipsets, +and the PCI-host bridges found on IBM Power4 and Power5-based pSeries +boxes. A typical action taken is to disconnect the affected device, +halting all I/O to it. The goal of a disconnection is to avoid system +corruption; for example, to halt system memory corruption due to DMA's +to "wild" addresses. Typically, a reconnection mechanism is also +offered, so that the affected PCI device(s) are reset and put back +into working condition. The reset phase requires coordination +between the affected device drivers and the PCI controller chip. +This document describes a generic API for notifying device drivers +of a bus disconnection, and then performing error recovery. +This API is currently implemented in the 2.6.16 and later kernels. + +Reporting and recovery is performed in several steps. First, when +a PCI hardware error has resulted in a bus disconnect, that event +is reported as soon as possible to all affected device drivers, +including multiple instances of a device driver on multi-function +cards. This allows device drivers to avoid deadlocking in spinloops, +waiting for some i/o-space register to change, when it never will. +It also gives the drivers a chance to defer incoming I/O as +needed. + +Next, recovery is performed in several stages. Most of the complexity +is forced by the need to handle multi-function devices, that is, |