aboutsummaryrefslogtreecommitdiff
path: root/Documentation/PCI
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/PCI')
-rw-r--r--Documentation/PCI/00-INDEX4
-rw-r--r--Documentation/PCI/MSI-HOWTO.txt1055
-rw-r--r--Documentation/PCI/pci-error-recovery.txt123
-rw-r--r--Documentation/PCI/pci-iov-howto.txt135
-rw-r--r--Documentation/PCI/pci.txt38
-rw-r--r--Documentation/PCI/pcieaer-howto.txt48
6 files changed, 833 insertions, 570 deletions
diff --git a/Documentation/PCI/00-INDEX b/Documentation/PCI/00-INDEX
index 812b17fe3ed..147231f1613 100644
--- a/Documentation/PCI/00-INDEX
+++ b/Documentation/PCI/00-INDEX
@@ -2,12 +2,12 @@
- this file
MSI-HOWTO.txt
- the Message Signaled Interrupts (MSI) Driver Guide HOWTO and FAQ.
-PCI-DMA-mapping.txt
- - info for PCI drivers using DMA portably across all platforms
PCIEBUS-HOWTO.txt
- a guide describing the PCI Express Port Bus driver
pci-error-recovery.txt
- info on PCI error recovery
+pci-iov-howto.txt
+ - the PCI Express I/O Virtualization HOWTO
pci.txt
- info on the PCI subsystem for device driver authors
pcieaer-howto.txt
diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
index 256defd7e17..10a93696e55 100644
--- a/Documentation/PCI/MSI-HOWTO.txt
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -4,506 +4,593 @@
Revised Feb 12, 2004 by Martine Silbermann
email: Martine.Silbermann@hp.com
Revised Jun 25, 2004 by Tom L Nguyen
+ Revised Jul 9, 2008 by Matthew Wilcox <willy@linux.intel.com>
+ Copyright 2003, 2008 Intel Corporation
1. About this guide
-This guide describes the basics of Message Signaled Interrupts (MSI),
-the advantages of using MSI over traditional interrupt mechanisms,
-and how to enable your driver to use MSI or MSI-X. Also included is
-a Frequently Asked Questions (FAQ) section.
-
-1.1 Terminology
-
-PCI devices can be single-function or multi-function. In either case,
-when this text talks about enabling or disabling MSI on a "device
-function," it is referring to one specific PCI device and function and
-not to all functions on a PCI device (unless the PCI device has only
-one function).
-
-2. Copyright 2003 Intel Corporation
-
-3. What is MSI/MSI-X?
-
-Message Signaled Interrupt (MSI), as described in the PCI Local Bus
-Specification Revision 2.3 or later, is an optional feature, and a
-required feature for PCI Express devices. MSI enables a device function
-to request service by sending an Inbound Memory Write on its PCI bus to
-the FSB as a Message Signal Interrupt transaction. Because MSI is
-generated in the form of a Memory Write, all transaction conditions,
-such as a Retry, Master-Abort, Target-Abort or normal completion, are
-supported.
-
-A PCI device that supports MSI must also support pin IRQ assertion
-interrupt mechanism to provide backward compatibility for systems that
-do not support MSI. In systems which support MSI, the bus driver is
-responsible for initializing the message address and message data of
-the device function's MSI/MSI-X capability structure during device
-initial configuration.
-
-An MSI capable device function indicates MSI support by implementing
-the MSI/MSI-X capability structure in its PCI capability list. The
-device function may implement both the MSI capability structure and
-the MSI-X capability structure; however, the bus driver should not
-enable both.
-
-The MSI capability structure contains Message Control register,
-Message Address register and Message Data register. These registers
-provide the bus driver control over MSI. The Message Control register
-indicates the MSI capability supported by the device. The Message
-Address register specifies the target address and the Message Data
-register specifies the characteristics of the message. To request
-service, the device function writes the content of the Message Data
-register to the target address. The device and its software driver
-are prohibited from writing to these registers.
-
-The MSI-X capability structure is an optional extension to MSI. It
-uses an independent and separate capability structure. There are
-some key advantages to implementing the MSI-X capability structure
-over the MSI capability structure as described below.
-
- - Support a larger maximum number of vectors per function.
-
- - Provide the ability for system software to configure
- each vector with an independent message address and message
- data, specified by a table that resides in Memory Space.
-
- - MSI and MSI-X both support per-vector masking. Per-vector
- masking is an optional extension of MSI but a required
- feature for MSI-X. Per-vector masking provides the kernel the
- ability to mask/unmask a single MSI while running its
- interrupt service routine. If per-vector masking is
- not supported, then the device driver should provide the
- hardware/software synchronization to ensure that the device
- generates MSI when the driver wants it to do so.
-
-4. Why use MSI?
-
-As a benefit to the simplification of board design, MSI allows board
-designers to remove out-of-band interrupt routing. MSI is another
-step towards a legacy-free environment.
-
-Due to increasing pressure on chipset and processor packages to
-reduce pin count, the need for interrupt pins is expected to
-diminish over time. Devices, due to pin constraints, may implement
-messages to increase performance.
-
-PCI Express endpoints uses INTx emulation (in-band messages) instead
-of IRQ pin assertion. Using INTx emulation requires interrupt
-sharing among devices connected to the same node (PCI bridge) while
-MSI is unique (non-shared) and does not require BIOS configuration
-support. As a result, the PCI Express technology requires MSI
-support for better interrupt performance.
-
-Using MSI enables the device functions to support two or more
-vectors, which can be configured to target different CPUs to
-increase scalability.
-
-5. Configuring a driver to use MSI/MSI-X
-
-By default, the kernel will not enable MSI/MSI-X on all devices that
-support this capability. The CONFIG_PCI_MSI kernel option
-must be selected to enable MSI/MSI-X support.
-
-5.1 Including MSI/MSI-X support into the kernel
-
-To allow MSI/MSI-X capable device drivers to selectively enable
-MSI/MSI-X (using pci_enable_msi()/pci_enable_msix() as described
-below), the VECTOR based scheme needs to be enabled by setting
-CONFIG_PCI_MSI during kernel config.
-
-Since the target of the inbound message is the local APIC, providing
-CONFIG_X86_LOCAL_APIC must be enabled as well as CONFIG_PCI_MSI.
-
-5.2 Configuring for MSI support
-
-Due to the non-contiguous fashion in vector assignment of the
-existing Linux kernel, this version does not support multiple
-messages regardless of a device function is capable of supporting
-more than one vector. To enable MSI on a device function's MSI
-capability structure requires a device driver to call the function
-pci_enable_msi() explicitly.
-
-5.2.1 API pci_enable_msi
+This guide describes the basics of Message Signaled Interrupts (MSIs),
+the advantages of using MSI over traditional interrupt mechanisms, how
+to change your driver to use MSI or MSI-X and some basic diagnostics to
+try if a device doesn't support MSIs.
+
+
+2. What are MSIs?
+
+A Message Signaled Interrupt is a write from the device to a special
+address which causes an interrupt to be received by the CPU.
+
+The MSI capability was first specified in PCI 2.2 and was later enhanced
+in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X
+capability was also introduced with PCI 3.0. It supports more interrupts
+per device than MSI and allows interrupts to be independently configured.
+
+Devices may support both MSI and MSI-X, but only one can be enabled at
+a time.
+
+
+3. Why use MSIs?
+
+There are three reasons why using MSIs can give an advantage over
+traditional pin-based interrupts.
+
+Pin-based PCI interrupts are often shared amongst several devices.
+To support this, the kernel must call each interrupt handler associated
+with an interrupt, which leads to reduced performance for the system as
+a whole. MSIs are never shared, so this problem cannot arise.
+
+When a device writes data to memory, then raises a pin-based interrupt,
+it is possible that the interrupt may arrive before all the data has
+arrived in memory (this becomes more likely with devices behind PCI-PCI
+bridges). In order to ensure that all the data has arrived in memory,
+the interrupt handler must read a register on the device which raised
+the interrupt. PCI transaction ordering rules require that all the data
+arrive in memory before the value may be returned from the register.
+Using MSIs avoids this problem as the interrupt-generating write cannot
+pass the data writes, so by the time the interrupt is raised, the driver
+knows that all the data has arrived in memory.
+
+PCI devices can only support a single pin-based interrupt per function.
+Often drivers have to query the device to find out what event has
+occurred, slowing down interrupt handling for the common case. With
+MSIs, a device can support more interrupts, allowing each interrupt
+to be specialised to a different purpose. One possible design gives
+infrequent conditions (such as errors) their own interrupt which allows
+the driver to handle the normal interrupt handling path more efficiently.
+Other possible designs include giving one interrupt to each packet queue
+in a network card or each port in a storage controller.
+
+
+4. How to use MSIs
+
+PCI devices are initialised to use pin-based interrupts. The device
+driver has to set up the device to use MSI or MSI-X. Not all machines
+support MSIs correctly, and for those machines, the APIs described below
+will simply fail and the device will continue to use pin-based interrupts.
+
+4.1 Include kernel support for MSIs
+
+To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI
+option enabled. This option is only available on some architectures,
+and it may depend on some other options also being set. For example,
+on x86, you must also enable X86_UP_APIC or SMP in order to see the
+CONFIG_PCI_MSI option.
+
+4.2 Using MSI
+
+Most of the hard work is done for the driver in the PCI layer. It simply
+has to request that the PCI layer set up the MSI capability for this
+device.
+
+4.2.1 pci_enable_msi
int pci_enable_msi(struct pci_dev *dev)
-With this new API, a device driver that wants to have MSI
-enabled on its device function must call this API to enable MSI.
-A successful call will initialize the MSI capability structure
-with ONE vector, regardless of whether a device function is
-capable of supporting multiple messages. This vector replaces the
-pre-assigned dev->irq with a new MSI vector. To avoid a conflict
-of the new assigned vector with existing pre-assigned vector requires
-a device driver to call this API before calling request_irq().
+A successful call allocates ONE interrupt to the device, regardless
+of how many MSIs the device supports. The device is switched from
+pin-based interrupt mode to MSI mode. The dev->irq number is changed
+to a new number which represents the message signaled interrupt;
+consequently, this function should be called before the driver calls
+request_irq(), because an MSI is delivered via a vector that is
+different from the vector of a pin-based interrupt.
+
+4.2.2 pci_enable_msi_range
+
+int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
+
+This function allows a device driver to request any number of MSI
+interrupts within specified range from 'minvec' to 'maxvec'.
+
+If this function returns a positive number it indicates the number of
+MSI interrupts that have been successfully allocated. In this case
+the device is switched from pin-based interrupt mode to MSI mode and
+updates dev->irq to be the lowest of the new interrupts assigned to it.
+The other interrupts assigned to the device are in the range dev->irq
+to dev->irq + returned value - 1. Device driver can use the returned
+number of successfully allocated MSI interrupts to further allocate
+and initialize device resources.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to request any more MSI interrupts for
+this device.
+
+This function should be called before the driver calls request_irq(),
+because MSI interrupts are delivered via vectors that are different
+from the vector of a pin-based interrupt.
+
+It is ideal if drivers can cope with a variable number of MSI interrupts;
+there are many reasons why the platform may not be able to provide the
+exact number that a driver asks for.
+
+There could be devices that can not operate with just any number of MSI
+interrupts within a range. See chapter 4.3.1.3 to get the idea how to
+handle such devices for MSI-X - the same logic applies to MSI.
-5.2.2 API pci_disable_msi
+4.2.1.1 Maximum possible number of MSI interrupts
+
+The typical usage of MSI interrupts is to allocate as many vectors as
+possible, likely up to the limit returned by pci_msi_vec_count() function:
+
+static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
+{
+ return pci_enable_msi_range(pdev, 1, nvec);
+}
+
+Note the value of 'minvec' parameter is 1. As 'minvec' is inclusive,
+the value of 0 would be meaningless and could result in error.
+
+Some devices have a minimal limit on number of MSI interrupts.
+In this case the function could look like this:
+
+static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
+{
+ return pci_enable_msi_range(pdev, FOO_DRIVER_MINIMUM_NVEC, nvec);
+}
+
+4.2.1.2 Exact number of MSI interrupts
+
+If a driver is unable or unwilling to deal with a variable number of MSI
+interrupts it could request a particular number of interrupts by passing
+that number to pci_enable_msi_range() function as both 'minvec' and 'maxvec'
+parameters:
+
+static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
+{
+ return pci_enable_msi_range(pdev, nvec, nvec);
+}
+
+Note, unlike pci_enable_msi_exact() function, which could be also used to
+enable a particular number of MSI-X interrupts, pci_enable_msi_range()
+returns either a negative errno or 'nvec' (not negative errno or 0 - as
+pci_enable_msi_exact() does).
+
+4.2.1.3 Single MSI mode
+
+The most notorious example of the request type described above is
+enabling the single MSI mode for a device. It could be done by passing
+two 1s as 'minvec' and 'maxvec':
+
+static int foo_driver_enable_single_msi(struct pci_dev *pdev)
+{
+ return pci_enable_msi_range(pdev, 1, 1);
+}
+
+Note, unlike pci_enable_msi() function, which could be also used to
+enable the single MSI mode, pci_enable_msi_range() returns either a
+negative errno or 1 (not negative errno or 0 - as pci_enable_msi()
+does).
+
+4.2.3 pci_enable_msi_exact
+
+int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
+
+This variation on pci_enable_msi_range() call allows a device driver to
+request exactly 'nvec' MSIs.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to request any more MSI interrupts for
+this device.
+
+By contrast with pci_enable_msi_range() function, pci_enable_msi_exact()
+returns zero in case of success, which indicates MSI interrupts have been
+successfully allocated.
+
+4.2.4 pci_disable_msi
void pci_disable_msi(struct pci_dev *dev)
-This API should always be used to undo the effect of pci_enable_msi()
-when a device driver is unloading. This API restores dev->irq with
-the pre-assigned IOAPIC vector and switches a device's interrupt
-mode to PCI pin-irq assertion/INTx emulation mode.
-
-Note that a device driver should always call free_irq() on the MSI vector
-that it has done request_irq() on before calling this API. Failure to do
-so results in a BUG_ON() and a device will be left with MSI enabled and
-leaks its vector.
-
-5.2.3 MSI mode vs. legacy mode diagram
-
-The below diagram shows the events which switch the interrupt
-mode on the MSI-capable device function between MSI mode and
-PIN-IRQ assertion mode.
-
- ------------ pci_enable_msi ------------------------
- | | <=============== | |
- | MSI MODE | | PIN-IRQ ASSERTION MODE |
- | | ===============> | |
- ------------ pci_disable_msi ------------------------
-
-
-Figure 1. MSI Mode vs. Legacy Mode
-
-In Figure 1, a device operates by default in legacy mode. Legacy
-in this context means PCI pin-irq assertion or PCI-Express INTx
-emulation. A successful MSI request (using pci_enable_msi()) switches
-a device's interrupt mode to MSI mode. A pre-assigned IOAPIC vector
-stored in dev->irq will be saved by the PCI subsystem and a new
-assigned MSI vector will replace dev->irq.
-
-To return back to its default mode, a device driver should always call
-pci_disable_msi() to undo the effect of pci_enable_msi(). Note that a
-device driver should always call free_irq() on the MSI vector it has
-done request_irq() on before calling pci_disable_msi(). Failure to do
-so results in a BUG_ON() and a device will be left with MSI enabled and
-leaks its vector. Otherwise, the PCI subsystem restores a device's
-dev->irq with a pre-assigned IOAPIC vector and marks the released
-MSI vector as unused.
-
-Once being marked as unused, there is no guarantee that the PCI
-subsystem will reserve this MSI vector for a device. Depending on
-the availability of current PCI vector resources and the number of
-MSI/MSI-X requests from other drivers, this MSI may be re-assigned.
-
-For the case where the PCI subsystem re-assigns this MSI vector to
-another driver, a request to switch back to MSI mode may result
-in being assigned a different MSI vector or a failure if no more
-vectors are available.
-
-5.3 Configuring for MSI-X support
-
-Due to the ability of the system software to configure each vector of
-the MSI-X capability structure with an independent message address
-and message data, the non-contiguous fashion in vector assignment of
-the existing Linux kernel has no impact on supporting multiple
-messages on an MSI-X capable device functions. To enable MSI-X on
-a device function's MSI-X capability structure requires its device
-driver to call the function pci_enable_msix() explicitly.
-
-The function pci_enable_msix(), once invoked, enables either
-all or nothing, depending on the current availability of PCI vector
-resources. If the PCI vector resources are available for the number
-of vectors requested by a device driver, this function will configure
-the MSI-X table of the MSI-X capability structure of a device with
-requested messages. To emphasize this reason, for example, a device
-may be capable for supporting the maximum of 32 vectors while its
-software driver usually may request 4 vectors. It is recommended
-that the device driver should call this function once during the
-initialization phase of the device driver.
-
-Unlike the function pci_enable_msi(), the function pci_enable_msix()
-does not replace the pre-assigned IOAPIC dev->irq with a new MSI
-vector because the PCI subsystem writes the 1:1 vector-to-entry mapping
-into the field vector of each element contained in a second argument.
-Note that the pre-assigned IOAPIC dev->irq is valid only if the device
-operates in PIN-IRQ assertion mode. In MSI-X mode, any attempt at
-using dev->irq by the device driver to request for interrupt service
-may result in unpredictable behavior.
-
-For each MSI-X vector granted, a device driver is responsible for calling
-other functions like request_irq(), enable_irq(), etc. to enable
-this vector with its corresponding interrupt service handler. It is
-a device driver's choice to assign all vectors with the same
-interrupt service handler or each vector with a unique interrupt
-service handler.
-
-5.3.1 Handling MMIO address space of MSI-X Table
-
-The PCI 3.0 specification has implementation notes that MMIO address
-space for a device's MSI-X structure should be isolated so that the
-software system can set different pages for controlling accesses to the
-MSI-X structure. The implementation of MSI support requires the PCI
-subsystem, not a device driver, to maintain full control of the MSI-X
-table/MSI-X PBA (Pending Bit Array) and MMIO address space of the MSI-X
-table/MSI-X PBA. A device driver should not access the MMIO address
-space of the MSI-X table/MSI-X PBA.
-
-5.3.2 API pci_enable_msix
-
-int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
-
-This API enables a device driver to request the PCI subsystem
-to enable MSI-X messages on its hardware device. Depending on
-the availability of PCI vectors resources, the PCI subsystem enables
-either all or none of the requested vectors.
-
-Argument 'dev' points to the device (pci_dev) structure.
-
-Argument 'entries' is a pointer to an array of msix_entry structs.
-The number of entries is indicated in argument 'nvec'.
-struct msix_entry is defined in /driver/pci/msi.h:
+This function should be used to undo the effect of pci_enable_msi_range().
+Calling it restores dev->irq to the pin-based interrupt number and frees
+the previously allocated MSIs. The interrupts may subsequently be assigned
+to another device, so drivers should not cache the value of dev->irq.
+
+Before calling this function, a device driver must always call free_irq()
+on any interrupt for which it previously called request_irq().
+Failure to do so results in a BUG_ON(), leaving the device with
+MSI enabled and thus leaking its vector.
+
+4.2.4 pci_msi_vec_count
+
+int pci_msi_vec_count(struct pci_dev *dev)
+
+This function could be used to retrieve the number of MSI vectors the
+device requested (via the Multiple Message Capable register). The MSI
+specification only allows the returned value to be a power of two,
+up to a maximum of 2^5 (32).
+
+If this function returns a negative number, it indicates the device is
+not capable of sending MSIs.
+
+If this function returns a positive number, it indicates the maximum
+number of MSI interrupt vectors that could be allocated.
+
+4.3 Using MSI-X
+
+The MSI-X capability is much more flexible than the MSI capability.
+It supports up to 2048 interrupts, each of which can be controlled
+independently. To support this flexibility, drivers must use an array of
+`struct msix_entry':
struct msix_entry {
u16 vector; /* kernel uses to write alloc vector */
u16 entry; /* driver uses to specify entry */
};
-A device driver is responsible for initializing the field 'entry' of
-each element with a unique entry supported by MSI-X table. Otherwise,
--EINVAL will be returned as a result. A successful return of zero
-indicates the PCI subsystem completed initializing each of the requested
-entries of the MSI-X table with message address and message data.
-Last but not least, the PCI subsystem will write the 1:1
-vector-to-entry mapping into the field 'vector' of each element. A
-device driver is responsible for keeping track of allocated MSI-X
-vectors in its internal data structure.
-
-A return of zero indicates that the number of MSI-X vectors was
-successfully allocated. A return of greater than zero indicates
-MSI-X vector shortage. Or a return of less than zero indicates
-a failure. This failure may be a result of duplicate entries
-specified in second argument, or a result of no available vector,
-or a result of failing to initialize MSI-X table entries.
-
-5.3.3 API pci_disable_msix
+This allows for the device to use these interrupts in a sparse fashion;
+for example, it could use interrupts 3 and 1027 and yet allocate only a
+two-element array. The driver is expected to fill in the 'entry' value
+in each element of the array to indicate for which entries the kernel
+should assign interrupts; it is invalid to fill in two entries with the
+same number.
+
+4.3.1 pci_enable_msix_range
+
+int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
+ int minvec, int maxvec)
+
+Calling this function asks the PCI subsystem to allocate any number of
+MSI-X interrupts within specified range from 'minvec' to 'maxvec'.
+The 'entries' argument is a pointer to an array of msix_entry structs
+which should be at least 'maxvec' entries in size.
+
+On success, the device is switched into MSI-X mode and the function
+returns the number of MSI-X interrupts that have been successfully
+allocated. In this case the 'vector' member in entries numbered from
+0 to the returned value - 1 is populated with the interrupt number;
+the driver should then call request_irq() for each 'vector' that it
+decides to use. The device driver is responsible for keeping track of the
+interrupts assigned to the MSI-X vectors so it can free them again later.
+Device driver can use the returned number of successfully allocated MSI-X
+interrupts to further allocate and initialize device resources.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to allocate any more MSI-X interrupts for
+this device.
+
+This function, in contrast with pci_enable_msi_range(), does not adjust
+dev->irq. The device will not generate interrupts for this interrupt
+number once MSI-X is enabled.
+
+Device drivers should normally call this function once per device
+during the initialization phase.
+
+It is ideal if drivers can cope with a variable number of MSI-X interrupts;
+there are many reasons why the platform may not be able to provide the
+exact number that a driver asks for.
+
+There could be devices that can not operate with just any number of MSI-X
+interrupts within a range. E.g., an network adapter might need let's say
+four vectors per each queue it provides. Therefore, a number of MSI-X
+interrupts allocated should be a multiple of four. In this case interface
+pci_enable_msix_range() can not be used alone to request MSI-X interrupts
+(since it can allocate any number within the range, without any notion of
+the multiple of four) and the device driver should master a custom logic
+to request the required number of MSI-X interrupts.
+
+4.3.1.1 Maximum possible number of MSI-X interrupts
+
+The typical usage of MSI-X interrupts is to allocate as many vectors as
+possible, likely up to the limit returned by pci_msix_vec_count() function:
+
+static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
+{
+ return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
+ 1, nvec);
+}
+
+Note the value of 'minvec' parameter is 1. As 'minvec' is inclusive,
+the value of 0 would be meaningless and could result in error.
+
+Some devices have a minimal limit on number of MSI-X interrupts.
+In this case the function could look like this:
+
+static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
+{
+ return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
+ FOO_DRIVER_MINIMUM_NVEC, nvec);
+}
+
+4.3.1.2 Exact number of MSI-X interrupts
+
+If a driver is unable or unwilling to deal with a variable number of MSI-X
+interrupts it could request a particular number of interrupts by passing
+that number to pci_enable_msix_range() function as both 'minvec' and 'maxvec'
+parameters:
+
+static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
+{
+ return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
+ nvec, nvec);
+}
+
+Note, unlike pci_enable_msix_exact() function, which could be also used to
+enable a particular number of MSI-X interrupts, pci_enable_msix_range()
+returns either a negative errno or 'nvec' (not negative errno or 0 - as
+pci_enable_msix_exact() does).
+
+4.3.1.3 Specific requirements to the number of MSI-X interrupts
+
+As noted above, there could be devices that can not operate with just any
+number of MSI-X interrupts within a range. E.g., let's assume a device that
+is only capable sending the number of MSI-X interrupts which is a power of
+two. A routine that enables MSI-X mode for such device might look like this:
+
+/*
+ * Assume 'minvec' and 'maxvec' are non-zero
+ */
+static int foo_driver_enable_msix(struct foo_adapter *adapter,
+ int minvec, int maxvec)
+{
+ int rc;
+
+ minvec = roundup_pow_of_two(minvec);
+ maxvec = rounddown_pow_of_two(maxvec);
+
+ if (minvec > maxvec)
+ return -ERANGE;
+
+retry:
+ rc = pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
+ maxvec, maxvec);
+ /*
+ * -ENOSPC is the only error code allowed to be analized
+ */
+ if (rc == -ENOSPC) {
+ if (maxvec == 1)
+ return -ENOSPC;
+
+ maxvec /= 2;
+
+ if (minvec > maxvec)
+ return -ENOSPC;
+
+ goto retry;
+ }
+
+ return rc;
+}
+
+Note how pci_enable_msix_range() return value is analized for a fallback -
+any error code other than -ENOSPC indicates a fatal error and should not
+be retried.
+
+4.3.2 pci_enable_msix_exact
+
+int pci_enable_msix_exact(struct pci_dev *dev,
+ struct msix_entry *entries, int nvec)
+
+This variation on pci_enable_msix_range() call allows a device driver to
+request exactly 'nvec' MSI-Xs.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to allocate any more MSI-X interrupts for
+this device.
+
+By contrast with pci_enable_msix_range() function, pci_enable_msix_exact()
+returns zero in case of success, which indicates MSI-X interrupts have been
+successfully allocated.
+
+Another version of a routine that enables MSI-X mode for a device with
+specific requirements described in chapter 4.3.1.3 might look like this:
+
+/*
+ * Assume 'minvec' and 'maxvec' are non-zero
+ */
+static int foo_driver_enable_msix(struct foo_adapter *adapter,
+ int minvec, int maxvec)
+{
+ int rc;
+
+ minvec = roundup_pow_of_two(minvec);
+ maxvec = rounddown_pow_of_two(maxvec);
+
+ if (minvec > maxvec)
+ return -ERANGE;
+
+retry:
+ rc = pci_enable_msix_exact(adapter->pdev,
+ adapter->msix_entries, maxvec);
+
+ /*
+ * -ENOSPC is the only error code allowed to be analyzed
+ */
+ if (rc == -ENOSPC) {
+ if (maxvec == 1)
+ return -ENOSPC;
+
+ maxvec /= 2;
+
+ if (minvec > maxvec)
+ return -ENOSPC;
+
+ goto retry;
+ } else if (rc < 0) {
+ return rc;
+ }
+
+ return maxvec;
+}
+
+4.3.3 pci_disable_msix
void pci_disable_msix(struct pci_dev *dev)
-This API should always be used to undo the effect of pci_enable_msix()
-when a device driver is unloading. Note that a device driver should
-always call free_irq() on all MSI-X vectors it has done request_irq()
-on before calling this API. Failure to do so results in a BUG_ON() and
-a device will be left with MSI-X enabled and leaks its vectors.
-
-5.3.4 MSI-X mode vs. legacy mode diagram
-
-The below diagram shows the events which switch the interrupt
-mode on the MSI-X capable device function between MSI-X mode and
-PIN-IRQ assertion mode (legacy).
-
- ------------ pci_enable_msix(,,n) ------------------------
- | | <=============== | |
- | MSI-X MODE | | PIN-IRQ ASSERTION MODE |
- | | ===============> | |
- ------------ pci_disable_msix ------------------------
-
-Figure 2. MSI-X Mode vs. Legacy Mode
-
-In Figure 2, a device operates by default in legacy mode. A
-successful MSI-X request (using pci_enable_msix()) switches a
-device's interrupt mode to MSI-X mode. A pre-assigned IOAPIC vector
-stored in dev->irq will be saved by the PCI subsystem; however,
-unlike MSI mode, the PCI subsystem will not replace dev->irq with
-assigned MSI-X vector because the PCI subsystem already writes the 1:1
-vector-to-entry mapping into the field 'vector' of each element
-specified in second argument.
-
-To return back to its default mode, a device driver should always call
-pci_disable_msix() to undo the effect of pci_enable_msix(). Note that
-a device driver should always call free_irq() on all MSI-X vectors it
-has done request_irq() on before calling pci_disable_msix(). Failure
-to do so results in a BUG_ON() and a device will be left with MSI-X
-enabled and leaks its vectors. Otherwise, the PCI subsystem switches a
-device function's interrupt mode from MSI-X mode to legacy mode and
-marks all allocated MSI-X vectors as unused.
-
-Once being marked as unused, there is no guarantee that the PCI
-subsystem will reserve these MSI-X vectors for a device. Depending on
-the availability of current PCI vector resources and the number of
-MSI/MSI-X requests from other drivers, these MSI-X vectors may be
-re-assigned.
-
-For the case where the PCI subsystem re-assigned these MSI-X vectors
-to other drivers, a request to switch back to MSI-X mode may result
-being assigned with another set of MSI-X vectors or a failure if no
-more vectors are available.
-
-5.4 Handling function implementing both MSI and MSI-X capabilities
-
-For the case where a function implements both MSI and MSI-X
-capabilities, the PCI subsystem enables a device to run either in MSI
-mode or MSI-X mode but not both. A device driver determines whether it
-wants MSI or MSI-X enabled on its hardware device. Once a device
-driver requests for MSI, for example, it is prohibited from requesting
-MSI-X; in other words, a device driver is not permitted to ping-pong
-between MSI mod MSI-X mode during a run-time.
-
-5.5 Hardware requirements for MSI/MSI-X support
-
-MSI/MSI-X support requires support from both system hardware and
-individual hardware device functions.
-
-5.5.1 Required x86 hardware support
-
-Since the target of MSI address is the local APIC CPU, enabling
-MSI/MSI-X support in the Linux kernel is dependent on whether existing
-system hardware supports local APIC. Users should verify that their
-system supports local APIC operation by testing that it runs when
-CONFIG_X86_LOCAL_APIC=y.
-
-In SMP environment, CONFIG_X86_LOCAL_APIC is automatically set;
-however, in UP environment, users must manually set
-CONFIG_X86_LOCAL_APIC. Once CONFIG_X86_LOCAL_APIC=y, setting
-CONFIG_PCI_MSI enables the VECTOR based scheme and the option for
-MSI-capable device drivers to selectively enable MSI/MSI-X.
-
-Note that CONFIG_X86_IO_APIC setting is irrelevant because MSI/MSI-X
-vector is allocated new during runtime and MSI/MSI-X support does not
-depend on BIOS support. This key independency enables MSI/MSI-X
-support on future IOxAPIC free platforms.
-
-5.5.2 Device hardware support
-
-The hardware device function supports MSI by indicating the
-MSI/MSI-X capability structure on its PCI capability list. By
-default, this capability structure will not be initialized by
-the kernel to enable MSI during the system boot. In other words,
-the device function is running on its default pin assertion mode.
-Note that in many cases the hardware supporting MSI have bugs,
-which may result in system hangs. The software driver of specific
-MSI-capable hardware is responsible for deciding whether to call
-pci_enable_msi or not. A return of zero indicates the kernel
-successfully initialized the MSI/MSI-X capability structure of the
-device function. The device function is now running on MSI/MSI-X mode.
-
-5.6 How to tell whether MSI/MSI-X is enabled on device function
-
-At the driver level, a return of zero from the function call of
-pci_enable_msi()/pci_enable_msix() indicates to a device driver that
-its device function is initialized successfully and ready to run in
-MSI/MSI-X mode.
-
-At the user level, users can use the command 'cat /proc/interrupts'
-to display the vectors allocated for devices and their interrupt
-MSI/MSI-X modes ("PCI-MSI"/"PCI-MSI-X"). Below shows MSI mode is
-enabled on a SCSI Adaptec 39320D Ultra320 controller.
-
- CPU0 CPU1
- 0: 324639 0 IO-APIC-edge timer
- 1: 1186 0 IO-APIC-edge i8042
- 2: 0 0 XT-PIC cascade
- 12: 2797 0 IO-APIC-edge i8042
- 14: 6543 0 IO-APIC-edge ide0
- 15: 1 0 IO-APIC-edge ide1
-169: 0 0 IO-APIC-level uhci-hcd
-185: 0 0 IO-APIC-level uhci-hcd
-193: 138 10 PCI-MSI aic79xx
-201: 30 0 PCI-MSI aic79xx
-225: 30 0 IO-APIC-level aic7xxx
-233: 30 0 IO-APIC-level aic7xxx
-NMI: 0 0
-LOC: 324553 325068
-ERR: 0
-MIS: 0
-
-6. MSI quirks
-
-Several PCI chipsets or devices are known to not support MSI.
-The PCI stack provides 3 possible levels of MSI disabling:
-* on a single device
-* on all devices behind a specific bridge
-* globally
-
-6.1. Disabling MSI on a single device
-
-Under some circumstances it might be required to disable MSI on a
-single device. This may be achieved by either not calling pci_enable_msi()
-or all, or setting the pci_dev->no_msi flag before (most of the time
-in a quirk).
-
-6.2. Disabling MSI below a bridge
-
-The vast majority of MSI quirks are required by PCI bridges not
-being able to route MSI between busses. In this case, MSI have to be
-disabled on all devices behind this bridge. It is achieves by setting
-the PCI_BUS_FLAGS_NO_MSI flag in the pci_bus->bus_flags of the bridge
-subordinate bus. There is no need to set the same flag on bridges that
-are below the broken bridge. When pci_enable_msi() is called to enable
-MSI on a device, pci_msi_supported() takes care of checking the NO_MSI
-flag in all parent busses of the device.
-
-Some bridges actually support dynamic MSI support enabling/disabling
-by changing some bits in their PCI configuration space (especially
-the Hypertransport chipsets such as the nVidia nForce and Serverworks
-HT2000). It may then be required to update the NO_MSI flag on the
-corresponding devices in the sysfs hierarchy. To enable MSI support
-on device "0000:00:0e", do:
-
- echo 1 > /sys/bus/pci/devices/0000:00:0e/msi_bus
-
-To disable MSI support, echo 0 instead of 1. Note that it should be
-used with caution since changing this value might break interrupts.
-
-6.3. Disabling MSI globally
-
-Some extreme cases may require to disable MSI globally on the system.
-For now, the only known case is a Serverworks PCI-X chipsets (MSI are
-not supported on several busses that are not all connected to the
-chipset in the Linux PCI hierarchy). In the vast majority of other
-cases, disabling only behind a specific bridge is enough.
-
-For debugging purpose, the user may also pass pci=nomsi on the kernel
-command-line to explicitly disable MSI globally. But, once the appro-
-priate quirks are added to the kernel, this option should not be
-required anymore.
-
-6.4. Finding why MSI cannot be enabled on a device
-
-Assuming that MSI are not enabled on a device, you should look at
-dmesg to find messages that quirks may output when disabling MSI
-on some devices, some bridges or even globally.
-Then, lspci -t gives the list of bridges above a device. Reading
-/sys/bus/pci/devices/0000:00:0e/msi_bus will tell you whether MSI
-are enabled (1) or disabled (0). In 0 is found in a single bridge
-msi_bus file above the device, MSI cannot be enabled.
-
-7. FAQ
-
-Q1. Are there any limitations on using the MSI?
-
-A1. If the PCI device supports MSI and conforms to the
-specification and the platform supports the APIC local bus,
-then using MSI should work.
-
-Q2. Will it work on all the Pentium processors (P3, P4, Xeon,
-AMD processors)? In P3 IPI's are transmitted on the APIC local
-bus and in P4 and Xeon they are transmitted on the system
-bus. Are there any implications with this?
-
-A2. MSI support enables a PCI device sending an inbound
-memory write (0xfeexxxxx as target address) on its PCI bus
-directly to the FSB. Since the message address has a
-redirection hint bit cleared, it should work.
-
-Q3. The target address 0xfeexxxxx will be translated by the
-Host Bridge into an interrupt message. Are there any
-limitations on the chipsets such as Intel 8xx, Intel e7xxx,
-or VIA?
-
-A3. If these chipsets support an inbound memory write with
-target address set as 0xfeexxxxx, as conformed to PCI
-specification 2.3 or latest, then it should work.
-
-Q4. From the driver point of view, if the MSI is lost because
-of errors occurring during inbound memory write, then it may
-wait forever. Is there a mechanism for it to recover?
-
-A4. Since the target of the transaction is an inbound memory
-write, all transaction termination conditions (Retry,
-Master-Abort, Target-Abort, or normal completion) are
-supported. A device sending an MSI must abide by all the PCI
-rules and conditions regarding that inbound memory write. So,
-if a retry is signaled it must retry, etc... We believe that
-the recommendation for Abort is also a retry (refer to PCI
-specification 2.3 or latest).
+This function should be used to undo the effect of pci_enable_msix_range().
+It frees the previously allocated MSI-X interrupts. The interrupts may
+subsequently be assigned to another device, so drivers should not cache
+the value of the 'vector' elements over a call to pci_disable_msix().
+
+Before calling this function, a device driver must always call free_irq()
+on any interrupt for which it previously called request_irq().
+Failure to do so results in a BUG_ON(), leaving the device with
+MSI-X enabled and thus leaking its vector.
+
+4.3.3 The MSI-X Table
+
+The MSI-X capability specifies a BAR and offset within that BAR for the
+MSI-X Table. This address is mapped by the PCI subsystem, and should not
+be accessed directly by the device driver. If the driver wishes to
+mask or unmask an interrupt, it should call disable_irq() / enable_irq().
+
+4.3.4 pci_msix_vec_count
+
+int pci_msix_vec_count(struct pci_dev *dev)
+
+This function could be used to retrieve number of entries in the device
+MSI-X table.
+
+If this function returns a negative number, it indicates the device is
+not capable of sending MSI-Xs.
+
+If this function returns a positive number, it indicates the maximum
+number of MSI-X interrupt vectors that could be allocated.
+
+4.4 Handling devices implementing both MSI and MSI-X capabilities
+
+If a device implements both MSI and MSI-X capabilities, it can
+run in either MSI mode or MSI-X mode, but not both simultaneously.
+This is a requirement of the PCI spec, and it is enforced by the
+PCI layer. Calling pci_enable_msi_range() when MSI-X is already
+enabled or pci_enable_msix_range() when MSI is already enabled
+results in an error. If a device driver wishes to switch between MSI
+and MSI-X at runtime, it must first quiesce the device, then switch
+it back to pin-interrupt mode, before calling pci_enable_msi_range()
+or pci_enable_msix_range() and resuming operation. This is not expected
+to be a common operation but may be useful for debugging or testing
+during development.
+
+4.5 Considerations when using MSIs
+
+4.5.1 Choosing between MSI-X and MSI
+
+If your device supports both MSI-X and MSI capabilities, you should use
+the MSI-X facilities in preference to the MSI facilities. As mentioned
+above, MSI-X supports any number of interrupts between 1 and 2048.
+In constrast, MSI is restricted to a maximum of 32 interrupts (and
+must be a power of two). In addition, the MSI interrupt vectors must
+be allocated consecutively, so the system might not be able to allocate
+as many vectors for MSI as it could for MSI-X. On some platforms, MSI
+interrupts must all be targeted at the same set of CPUs whereas MSI-X
+interrupts can all be targeted at different CPUs.
+
+4.5.2 Spinlocks
+
+Most device drivers have a per-device spinlock which is taken in the
+interrupt handler. With pin-based interrupts or a single MSI, it is not
+necessary to disable interrupts (Linux guarantees the same interrupt will
+not be re-entered). If a device uses multiple interrupts, the driver
+must disable interrupts while the lock is held. If the device sends
+a different interrupt, the driver will deadlock trying to recursively
+acquire the spinlock.
+
+There are two solutions. The first is to take the lock with
+spin_lock_irqsave() or spin_lock_irq() (see
+Documentation/DocBook/kernel-locking). The second is to specify
+IRQF_DISABLED to request_irq() so that the kernel runs the entire
+interrupt routine with interrupts disabled.
+
+If your MSI interrupt routine does not hold the lock for the whole time
+it is running, the first solution may be best. The second solution is
+normally preferred as it avoids making two transitions from interrupt
+disabled to enabled and back again.
+
+4.6 How to tell whether MSI/MSI-X is enabled on a device
+
+Using 'lspci -v' (as root) may show some devices with "MSI", "Message
+Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities
+has an 'Enable' flag which is followed with either "+" (enabled)
+or "-" (disabled).
+
+
+5. MSI quirks
+
+Several PCI chipsets or devices are known not to support MSIs.
+The PCI stack provides three ways to disable MSIs:
+
+1. globally
+2. on all devices behind a specific bridge
+3. on a single device
+
+5.1. Disabling MSIs globally
+
+Some host chipsets simply don't support MSIs properly. If we're
+lucky, the manufacturer knows this and has indicated it in the ACPI
+FADT table. In this case, Linux automatically disables MSIs.
+Some boards don't include this information in the table and so we have
+to detect them ourselves. The complete list of these is found near the
+quirk_disable_all_msi() function in drivers/pci/quirks.c.
+
+If you have a board which has problems with MSIs, you can pass pci=nomsi
+on the kernel command line to disable MSIs on all devices. It would be
+in your best interests to report the problem to linux-pci@vger.kernel.org
+including a full 'lspci -v' so we can add the quirks to the kernel.
+
+5.2. Disabling MSIs below a bridge
+
+Some PCI bridges are not able to route MSIs between busses properly.
+In this case, MSIs must be disabled on all devices behind the bridge.
+
+Some bridges allow you to enable MSIs by changing some bits in their
+PCI configuration space (especially the Hypertransport chipsets such
+as the nVidia nForce and Serverworks HT2000). As with host chipsets,
+Linux mostly knows about them and automatically enables MSIs if it can.
+If you have a bridge unknown to Linux, you can enable
+MSIs in configuration space using whatever method you know works, then
+enable MSIs on that bridge by doing:
+
+ echo 1 > /sys/bus/pci/devices/$bridge/msi_bus
+
+where $bridge is the PCI address of the bridge you've enabled (eg
+0000:00:0e.0).
+
+To disable MSIs, echo 0 instead of 1. Changing this value should be
+done with caution as it could break interrupt handling for all devices
+below this bridge.
+
+Again, please notify linux-pci@vger.kernel.org of any bridges that need
+special handling.
+
+5.3. Disabling MSIs on a single device
+
+Some devices are known to have faulty MSI implementations. Usually this
+is handled in the individual device driver, but occasionally it's necessary
+to handle this with a quirk. Some drivers have an option to disable use
+of MSI. While this is a convenient workaround for the driver author,
+it is not good practise, and should not be emulated.
+
+5.4. Finding why MSIs are disabled on a device
+
+From the above three sections, you can see that there are many reasons
+why MSIs may not be enabled for a given device. Your first step should
+be to examine your dmesg carefully to determine whether MSIs are enabled
+for your machine. You should also check your .config to be sure you
+have enabled CONFIG_PCI_MSI.
+
+Then, 'lspci -t' gives the list of bridges above a device. Reading
+/sys/bus/pci/devices/*/msi_bus will tell you whether MSIs are enabled (1)
+or disabled (0). If 0 is found in any of the msi_bus files belonging
+to bridges between the PCI root and the device, MSIs are disabled.
+
+It is also worth checking the device driver to see whether it supports MSIs.
+For example, it may contain calls to pci_enable_msi_range() or
+pci_enable_msix_range().
diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt
index 6650af43252..898ded24510 100644
--- a/Documentation/PCI/pci-error-recovery.txt
+++ b/Documentation/PCI/pci-error-recovery.txt
@@ -4,15 +4,17 @@
February 2, 2006
Current document maintainer:
- Linas Vepstas <linas@austin.ibm.com>
+ Linas Vepstas <linasvepstas@gmail.com>
+ updated by Richard Lary <rlary@us.ibm.com>
+ and Mike Mason <mmlnx@us.ibm.com> on 27-Jul-2009
Many PCI bus controllers are able to detect a variety of hardware
PCI errors on the bus, such as parity errors on the data and address
busses, as well as SERR and PERR errors. Some of the more advanced
chipsets are able to deal with these errors; these include PCI-E chipsets,
-and the PCI-host bridges found on IBM Power4 and Power5-based pSeries
-boxes. A typical action taken is to disconnect the affected device,
+and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
+pSeries boxes. A typical action taken is to disconnect the affected device,
halting all I/O to it. The goal of a disconnection is to avoid system
corruption; for example, to halt system memory corruption due to DMA's
to "wild" addresses. Typically, a reconnection mechanism is also
@@ -37,10 +39,11 @@ is forced by the need to handle multi-function devices, that is,
devices that have multiple device drivers associated with them.
In the first stage, each driver is allowed to indicate what type
of reset it desires, the choices being a simple re-enabling of I/O
-or requesting a hard reset (a full electrical #RST of the PCI card).
-If any driver requests a full reset, that is what will be done.
+or requesting a slot reset.
-After a full reset and/or a re-enabling of I/O, all drivers are
+If any driver requests a slot reset, that is what will be done.
+
+After a reset and/or a re-enabling of I/O, all drivers are
again notified, so that they may then perform any device setup/config
that may be required. After these have all completed, a final
"resume normal operations" event is sent out.
@@ -101,7 +104,7 @@ if it implements any, it must implement error_detected(). If a callback
is not implemented, the corresponding feature is considered unsupported.
For example, if mmio_enabled() and resume() aren't there, then it
is assumed that the driver is not doing any direct recovery and requires
-a reset. If link_reset() is not implemented, the card is assumed as
+a slot reset. If link_reset() is not implemented, the card is assumed to
not care about link resets. Typically a driver will want to know about
a slot_reset().
@@ -111,7 +114,7 @@ sequence described below.
STEP 0: Error Event
-------------------
-PCI bus error is detect by the PCI hardware. On powerpc, the slot
+A PCI bus error is detected by the PCI hardware. On powerpc, the slot
is isolated, in that all I/O is blocked: all reads return 0xffffffff,
all writes are ignored.
@@ -139,7 +142,7 @@ The driver must return one of the following result codes:
a chance to extract some diagnostic information (see
mmio_enable, below).
- PCI_ERS_RESULT_NEED_RESET:
- Driver returns this if it can't recover without a hard
+ Driver returns this if it can't recover without a
slot reset.
- PCI_ERS_RESULT_DISCONNECT:
Driver returns this if it doesn't want to recover at all.
@@ -169,11 +172,11 @@ is STEP 6 (Permanent Failure).
>>> The current powerpc implementation doesn't much care if the device
>>> attempts I/O at this point, or not. I/O's will fail, returning
->>> a value of 0xff on read, and writes will be dropped. If the device
->>> driver attempts more than 10K I/O's to a frozen adapter, it will
->>> assume that the device driver has gone into an infinite loop, and
->>> it will panic the kernel. There doesn't seem to be any other
->>> way of stopping a device driver that insists on spinning on I/O.
+>>> a value of 0xff on read, and writes will be dropped. If more than
+>>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
+>>> assumes that the device driver has gone into an infinite loop
+>>> and prints an error to syslog. A reboot is then required to
+>>> get the device working again.
STEP 2: MMIO Enabled
-------------------
@@ -182,15 +185,14 @@ DMA), and then calls the mmio_enabled() callback on all affected
device drivers.
This is the "early recovery" call. IOs are allowed again, but DMA is
-not (hrm... to be discussed, I prefer not), with some restrictions. This
-is NOT a callback for the driver to start operations again, only to
-peek/poke at the device, extract diagnostic information, if any, and
-eventually do things like trigger a device local reset or some such,
-but not restart operations. This is callback is made if all drivers on
-a segment agree that they can try to recover and if no automatic link reset
-was performed by the HW. If the platform can't just re-enable IOs without
-a slot reset or a link reset, it wont call this callback, and instead
-will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
+not, with some restrictions. This is NOT a callback for the driver to
+start operations again, only to peek/poke at the device, extract diagnostic
+information, if any, and eventually do things like trigger a device local
+reset or some such, but not restart operations. This callback is made if
+all drivers on a segment agree that they can try to recover and if no automatic
+link reset was performed by the HW. If the platform can't just re-enable IOs
+without a slot reset or a link reset, it will not call this callback, and
+instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
>>> The following is proposed; no platform implements this yet:
>>> Proposal: All I/O's should be done _synchronously_ from within
@@ -214,7 +216,7 @@ The driver should return one of the following result codes:
- PCI_ERS_RESULT_NEED_RESET
Driver returns this if it thinks the device is not
- recoverable in it's current state and it needs a slot
+ recoverable in its current state and it needs a slot
reset to proceed.
- PCI_ERS_RESULT_DISCONNECT
@@ -228,9 +230,6 @@ proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
proceeds to STEP 4 (Slot Reset)
->>> The current powerpc implementation does not implement this callback.
-
-
STEP 3: Link Reset
------------------
The platform resets the link, and then calls the link_reset() callback
@@ -242,7 +241,7 @@ in working condition.
The driver is not supposed to restart normal driver I/O operations
at this point. It should limit itself to "probing" the device to
-check it's recoverability status. If all is right, then the platform
+check its recoverability status. If all is right, then the platform
will call resume() once all drivers have ack'd link_reset().
Result codes:
@@ -253,16 +252,33 @@ The platform then proceeds to either STEP 4 (Slot Reset) or STEP 5
>>> The current powerpc implementation does not implement this callback.
-
STEP 4: Slot Reset
------------------
-The platform performs a soft or hard reset of the device, and then
-calls the slot_reset() callback.
-A soft reset consists of asserting the adapter #RST line and then
+In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
+the platform will peform a slot reset on the requesting PCI device(s).
+The actual steps taken by a platform to perform a slot reset
+will be platform-dependent. Upon completion of slot reset, the
+platform will call the device slot_reset() callback.
+
+Powerpc platforms implement two levels of slot reset:
+soft reset(default) and fundamental(optional) reset.
+
+Powerpc soft reset consists of asserting the adapter #RST line and then
restoring the PCI BAR's and PCI configuration header to a state
that is equivalent to what it would be after a fresh system
power-on followed by power-on BIOS/system firmware initialization.
+Soft reset is also known as hot-reset.
+
+Powerpc fundamental reset is supported by PCI Express cards only
+and results in device's state machines, hardware logic, port states and
+configuration registers to initialize to their default conditions.
+
+For most PCI devices, a soft reset will be sufficient for recovery.
+Optional fundamental reset is provided to support a limited number
+of PCI Express PCI devices for which a soft reset is not sufficient
+for recovery.
+
If the platform supports PCI hotplug, then the reset might be
performed by toggling the slot electrical power off/on.
@@ -274,10 +290,12 @@ may result in hung devices, kernel panics, or silent data corruption.
This call gives drivers the chance to re-initialize the hardware
(re-download firmware, etc.). At this point, the driver may assume
-that he card is in a fresh state and is fully functional. In
-particular, interrupt generation should work normally.
+that the card is in a fresh state and is fully functional. The slot
+is unfrozen and the driver has full access to PCI config space,
+memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X)
+will also be available.
-Drivers should not yet restart normal I/O processing operations
+Drivers should not restart normal I/O processing operations
at this point. If all device drivers report success on this
callback, the platform will call resume() to complete the sequence,
and let the driver restart normal I/O processing.
@@ -302,11 +320,21 @@ driver performs device init only from PCI function 0:
- PCI_ERS_RESULT_DISCONNECT
Same as above.
+Drivers for PCI Express cards that require a fundamental reset must
+set the needs_freset bit in the pci_dev structure in their probe function.
+For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
+PCI card types:
+
++ /* Set EEH reset type to fundamental if required by hba */
++ if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
++ pdev->needs_freset = 1;
++
+
Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
Failure).
->>> The current powerpc implementation does not currently try a
->>> power-cycle reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
+>>> The current powerpc implementation does not try a power-cycle
+>>> reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
>>> However, it probably should.
@@ -348,7 +376,7 @@ software errors.
Conclusion; General Remarks
---------------------------
-The way those callbacks are called is platform policy. A platform with
+The way the callbacks are called is platform policy. A platform with
no slot reset capability may want to just "ignore" drivers that can't
recover (disconnect them) and try to let other cards on the same segment
recover. Keep in mind that in most real life cases, though, there will
@@ -361,8 +389,8 @@ That is, the recovery API only requires that:
- There is no guarantee that interrupt delivery can proceed from any
device on the segment starting from the error detection and until the
-resume callback is sent, at which point interrupts are expected to be
-fully operational.
+slot_reset callback is called, at which point interrupts are expected
+to be fully operational.
- There is no guarantee that interrupt delivery is stopped, that is,
a driver that gets an interrupt after detecting an error, or that detects
@@ -381,16 +409,23 @@ anyway :)
>>> Implementation details for the powerpc platform are discussed in
>>> the file Documentation/powerpc/eeh-pci-error-recovery.txt
->>> As of this writing, there are six device drivers with patches
->>> implementing error recovery. Not all of these patches are in
+>>> As of this writing, there is a growing list of device drivers with
+>>> patches implementing error recovery. Not all of these patches are in
>>> mainline yet. These may be used as "examples":
>>>
->>> drivers/scsi/ipr.c
->>> drivers/scsi/sym53cxx_2
+>>> drivers/scsi/ipr
+>>> drivers/scsi/sym53c8xx_2
+>>> drivers/scsi/qla2xxx
+>>> drivers/scsi/lpfc
+>>> drivers/next/bnx2.c
>>> drivers/next/e100.c
>>> drivers/net/e1000
+>>> drivers/net/e1000e
>>> drivers/net/ixgb
+>>> drivers/net/ixgbe
+>>> drivers/net/cxgb3
>>> drivers/net/s2io.c
+>>> drivers/net/qlge
The End
-------
diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt
new file mode 100644
index 00000000000..2d91ae25198
--- /dev/null
+++ b/Documentation/PCI/pci-iov-howto.txt
@@ -0,0 +1,135 @@
+ PCI Express I/O Virtualization Howto
+ Copyright (C) 2009 Intel Corporation
+ Yu Zhao <yu.zhao@intel.com>
+
+ Update: November 2012
+ -- sysfs-based SRIOV enable-/disable-ment
+ Donald Dutile <ddutile@redhat.com>
+
+1. Overview
+
+1.1 What is SR-IOV
+
+Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
+capability which makes one physical device appear as multiple virtual
+devices. The physical device is referred to as Physical Function (PF)
+while the virtual devices are referred to as Virtual Functions (VF).
+Allocation of the VF can be dynamically controlled by the PF via
+registers encapsulated in the capability. By default, this feature is
+not enabled and the PF behaves as traditional PCIe device. Once it's
+turned on, each VF's PCI configuration space can be accessed by its own
+Bus, Device and Function Number (Routing ID). And each VF also has PCI
+Memory Space, which is used to map its register set. VF device driver
+operates on the register set so it can be functional and appear as a
+real existing PCI device.
+
+2. User Guide
+
+2.1 How can I enable SR-IOV capability
+
+Multiple methods are available for SR-IOV enablement.
+In the first method, the device driver (PF driver) will control the
+enabling and disabling of the capability via API provided by SR-IOV core.
+If the hardware has SR-IOV capability, loading its PF driver would
+enable it and all VFs associated with the PF. Some PF drivers require
+a module parameter to be set to determine the number of VFs to enable.
+In the second method, a write to the sysfs file sriov_numvfs will
+enable and disable the VFs associated with a PCIe PF. This method
+enables per-PF, VF enable/disable values versus the first method,
+which applies to all PFs of the same device. Additionally, the
+PCI SRIOV core support ensures that enable/disable operations are
+valid to reduce duplication in multiple drivers for the same
+checks, e.g., check numvfs == 0 if enabling VFs, ensure
+numvfs <= totalvfs.
+The second method is the recommended method for new/future VF devices.
+
+2.2 How can I use the Virtual Functions
+
+The VF is treated as hot-plugged PCI devices in the kernel, so they
+should be able to work in the same way as real PCI devices. The VF
+requires device driver that is same as a normal PCI device's.
+
+3. Developer Guide
+
+3.1 SR-IOV API
+
+To enable SR-IOV capability:
+(a) For the first method, in the driver:
+ int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
+ 'nr_virtfn' is number of VFs to be enabled.
+(b) For the second method, from sysfs:
+ echo 'nr_virtfn' > \
+ /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
+
+To disable SR-IOV capability:
+(a) For the first method, in the driver:
+ void pci_disable_sriov(struct pci_dev *dev);
+(b) For the second method, from sysfs:
+ echo 0 > \
+ /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
+
+3.2 Usage example
+
+Following piece of code illustrates the usage of the SR-IOV API.
+
+static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
+{
+ pci_enable_sriov(dev, NR_VIRTFN);
+
+ ...
+
+ return 0;
+}
+
+static void dev_remove(struct pci_dev *dev)
+{
+ pci_disable_sriov(dev);
+
+ ...
+}
+
+static int dev_suspend(struct pci_dev *dev, pm_message_t state)
+{
+ ...
+
+ return 0;
+}
+
+static int dev_resume(struct pci_dev *dev)
+{
+ ...
+
+ return 0;
+}
+
+static void dev_shutdown(struct pci_dev *dev)
+{
+ ...
+}
+
+static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
+{
+ if (numvfs > 0) {
+ ...
+ pci_enable_sriov(dev, numvfs);
+ ...
+ return numvfs;
+ }
+ if (numvfs == 0) {
+ ....
+ pci_disable_sriov(dev);
+ ...
+ return 0;
+ }
+}
+
+static struct pci_driver dev_driver = {
+ .name = "SR-IOV Physical Function driver",
+ .id_table = dev_id_table,
+ .probe = dev_probe,
+ .remove = dev_remove,
+ .suspend = dev_suspend,
+ .resume = dev_resume,
+ .shutdown = dev_shutdown,
+ .sriov_configure = dev_sriov_configure,
+};
diff --git a/Documentation/PCI/pci.txt b/Documentation/PCI/pci.txt
index 7f6de6ea5b4..9518006f667 100644
--- a/Documentation/PCI/pci.txt
+++ b/Documentation/PCI/pci.txt
@@ -123,8 +123,10 @@ initialization with a pointer to a structure describing the driver
The ID table is an array of struct pci_device_id entries ending with an
-all-zero entry; use of the macro DEFINE_PCI_DEVICE_TABLE is the preferred
-method of declaring the table. Each entry consists of:
+all-zero entry. Definitions with static const are generally preferred.
+Use of the deprecated macro DEFINE_PCI_DEVICE_TABLE should be avoided.
+
+Each entry consists of:
vendor,device Vendor and device ID to match (or PCI_ANY_ID)
@@ -183,12 +185,6 @@ Please mark the initialization and cleanup functions where appropriate
initializes.
__exit Exit code. Ignored for non-modular drivers.
-
- __devinit Device initialization code.
- Identical to __init if the kernel is not compiled
- with CONFIG_HOTPLUG, normal function otherwise.
- __devexit The same for __exit.
-
Tips on when/where to use the above attributes:
o The module_init()/module_exit() functions (and all
initialization functions called _only_ from these)
@@ -196,20 +192,6 @@ Tips on when/where to use the above attributes:
o Do not mark the struct pci_driver.
- o The ID table array should be marked __devinitconst; this is done
- automatically if the table is declared with DEFINE_PCI_DEVICE_TABLE().
-
- o The probe() and remove() functions should be marked __devinit
- and __devexit respectively. All initialization functions
- exclusively called by the probe() routine, can be marked __devinit.
- Ditto for remove() and __devexit.
-
- o If mydriver_remove() is marked with __devexit(), then all address
- references to mydriver_remove must use __devexit_p(mydriver_remove)
- (in the struct pci_driver declaration for example).
- __devexit_p() will generate the function name _or_ NULL if the
- function will be discarded. For an example, see drivers/net/tg3.c.
-
o Do NOT mark a function if you are not sure which mark to use.
Better to not mark the function than mark the function wrong.
@@ -314,7 +296,7 @@ from the PCI device config space. Use the values in the pci_dev structure
as the PCI "bus address" might have been remapped to a "host physical"
address by the arch/chip-set specific kernel support.
-See Documentation/IO-mapping.txt for how to access device registers
+See Documentation/io-mapping.txt for how to access device registers
or device memory.
The device driver needs to call pci_request_region() to verify
@@ -545,8 +527,9 @@ corresponding register block for you.
6. Other interesting functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-pci_find_slot() Find pci_dev corresponding to given bus and
- slot numbers.
+pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain,
+ bus and slot and number. If the device is
+ found, its reference count is increased.
pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3)
pci_find_capability() Find specified capability in device's capability
list.
@@ -581,7 +564,7 @@ to be handled by platform and generic code, not individual drivers.
8. Vendor and device identifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-One is not not required to add new device ids to include/linux/pci_ids.h.
+One is not required to add new device ids to include/linux/pci_ids.h.
Please add PCI_VENDOR_ID_xxx for vendors and a hex constant for device ids.
PCI_VENDOR_ID_xxx constants are re-used. The device ids are arbitrary
@@ -602,7 +585,8 @@ having sane locking.
pci_find_device() Superseded by pci_get_device()
pci_find_subsys() Superseded by pci_get_subsys()
-pci_find_slot() Superseded by pci_get_slot()
+pci_find_slot() Superseded by pci_get_domain_bus_and_slot()
+pci_get_slot() Superseded by pci_get_domain_bus_and_slot()
The alternative is the traditional PCI device driver that walks PCI
diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
index ddeb14beacc..26d3d945c3c 100644
--- a/Documentation/PCI/pcieaer-howto.txt
+++ b/Documentation/PCI/pcieaer-howto.txt
@@ -13,7 +13,7 @@ Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
PCI Express AER driver.
-1.2 Copyright © Intel Corporation 2006.
+1.2 Copyright (C) Intel Corporation 2006.
1.3 What is the PCI Express AER Driver?
@@ -61,21 +61,21 @@ be initiated although firmwares have no _OSC support. To enable the
walkaround, pls. add aerdriver.forceload=y to kernel boot parameter line
when booting kernel. Note that forceload=n by default.
+nosourceid, another parameter of type bool, can be used when broken
+hardware (mostly chipsets) has root ports that cannot obtain the reporting
+source ID. nosourceid=n by default.
+
2.3 AER error output
When a PCI-E AER error is captured, an error message will be outputed to
console. If it's a correctable error, it is outputed as a warning.
Otherwise, it is printed as an error. So users could choose different
log level to filter out correctable error messages.
-Below shows an example.
-+------ PCI-Express Device Error -----+
-Error Severity : Uncorrected (Fatal)
-PCIE Bus Error type : Transaction Layer
-Unsupported Request : First
-Requester ID : 0500
-VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h
-TLB Header:
-04000001 00200a03 05010000 00050100
+Below shows an example:
+0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
+0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
+0000:50:00.0: [20] Unsupported Request (First)
+0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100
In the example, 'Requester ID' means the ID of the device who sends
the error message to root port. Pls. refer to pci express specs for
@@ -108,7 +108,7 @@ but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.
When AER is enabled, a PCI Express device will automatically send an
-error message to the PCIE root port above it when the device captures
+error message to the PCIe root port above it when the device captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its PCI Express
capability structure. Error information being logged includes storing
@@ -194,8 +194,9 @@ to reset link, AER port service driver is required to provide the
function to reset link. Firstly, kernel looks for if the upstream
component has an aer driver. If it has, kernel uses the reset_link
callback of the aer driver. If the upstream component has no aer driver
-and the port is downstream port, we will use the aer driver of the
-root port who reports the AER error. As for upstream ports,
+and the port is downstream port, we will perform a hot reset as the
+default by setting the Secondary Bus Reset bit of the Bridge Control
+register associated with the downstream port. As for upstream ports,
they should provide their own aer service drivers with reset_link
function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and
reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
@@ -246,3 +247,24 @@ with the PCI Express AER Root driver?
A: It could call the helper functions to enable AER in devices and
cleanup uncorrectable status register. Pls. refer to section 3.3.
+
+4. Software error injection
+
+Debugging PCIe AER error recovery code is quite difficult because it
+is hard to trigger real hardware errors. Software based error
+injection can be used to fake various kinds of PCIe errors.
+
+First you should enable PCIe AER software error injection in kernel
+configuration, that is, following item should be in your .config.
+
+CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m
+
+After reboot with new kernel or insert the module, a device file named
+/dev/aer_inject should be created.
+
+Then, you need a user space tool named aer-inject, which can be gotten
+from:
+ http://www.kernel.org/pub/linux/utils/pci/aer-inject/
+
+More information about aer-inject can be found in the document comes
+with its source code.