Diffstat (limited to 'Documentation/PCI')
-rw-r--r--  Documentation/PCI/00-INDEX               |   6
-rw-r--r--  Documentation/PCI/MSI-HOWTO.txt          | 596
-rw-r--r--  Documentation/PCI/PCIEBUS-HOWTO.txt      |   2
-rw-r--r--  Documentation/PCI/pci-error-recovery.txt | 123
-rw-r--r--  Documentation/PCI/pci-iov-howto.txt      | 135
-rw-r--r--  Documentation/PCI/pci.txt                |  45
-rw-r--r--  Documentation/PCI/pcieaer-howto.txt      |  59
7 files changed, 870 insertions, 96 deletions
diff --git a/Documentation/PCI/00-INDEX b/Documentation/PCI/00-INDEX
index 49f43946c6b..147231f1613 100644
--- a/Documentation/PCI/00-INDEX
+++ b/Documentation/PCI/00-INDEX
@@ -1,11 +1,13 @@
00-INDEX
- this file
-PCI-DMA-mapping.txt
- - info for PCI drivers using DMA portably across all platforms
+MSI-HOWTO.txt
+ - the Message Signaled Interrupts (MSI) Driver Guide HOWTO and FAQ.
PCIEBUS-HOWTO.txt
- a guide describing the PCI Express Port Bus driver
pci-error-recovery.txt
- info on PCI error recovery
+pci-iov-howto.txt
+ - the PCI Express I/O Virtualization HOWTO
pci.txt
- info on the PCI subsystem for device driver authors
pcieaer-howto.txt
diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
new file mode 100644
index 00000000000..10a93696e55
--- /dev/null
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -0,0 +1,596 @@
+ The MSI Driver Guide HOWTO
+ Tom L Nguyen tom.l.nguyen@intel.com
+ 10/03/2003
+ Revised Feb 12, 2004 by Martine Silbermann
+ email: Martine.Silbermann@hp.com
+ Revised Jun 25, 2004 by Tom L Nguyen
+ Revised Jul 9, 2008 by Matthew Wilcox <willy@linux.intel.com>
+ Copyright 2003, 2008 Intel Corporation
+
+1. About this guide
+
+This guide describes the basics of Message Signaled Interrupts (MSIs),
+the advantages of using MSI over traditional interrupt mechanisms, how
+to change your driver to use MSI or MSI-X and some basic diagnostics to
+try if a device doesn't support MSIs.
+
+
+2. What are MSIs?
+
+A Message Signaled Interrupt is a write from the device to a special
+address which causes an interrupt to be received by the CPU.
+
+The MSI capability was first specified in PCI 2.2 and was later enhanced
+in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X
+capability was also introduced with PCI 3.0. It supports more interrupts
+per device than MSI and allows interrupts to be independently configured.
+
+Devices may support both MSI and MSI-X, but only one can be enabled at
+a time.
+
+
+3. Why use MSIs?
+
+There are three reasons why using MSIs can give an advantage over
+traditional pin-based interrupts.
+
+Pin-based PCI interrupts are often shared amongst several devices.
+To support this, the kernel must call each interrupt handler associated
+with an interrupt, which leads to reduced performance for the system as
+a whole. MSIs are never shared, so this problem cannot arise.
+
+When a device writes data to memory, then raises a pin-based interrupt,
+it is possible that the interrupt may arrive before all the data has
+arrived in memory (this becomes more likely with devices behind PCI-PCI
+bridges). In order to ensure that all the data has arrived in memory,
+the interrupt handler must read a register on the device which raised
+the interrupt. PCI transaction ordering rules require that all the data
+arrive in memory before the value may be returned from the register.
+Using MSIs avoids this problem as the interrupt-generating write cannot
+pass the data writes, so by the time the interrupt is raised, the driver
+knows that all the data has arrived in memory.
+
+PCI devices can only support a single pin-based interrupt per function.
+Often drivers have to query the device to find out what event has
+occurred, slowing down interrupt handling for the common case. With
+MSIs, a device can support more interrupts, allowing each interrupt
+to be specialised to a different purpose. One possible design gives
+infrequent conditions (such as errors) their own interrupt which allows
+the driver to handle the normal interrupt handling path more efficiently.
+Other possible designs include giving one interrupt to each packet queue
+in a network card or each port in a storage controller.
+
+
+4. How to use MSIs
+
+PCI devices are initialised to use pin-based interrupts. The device
+driver has to set up the device to use MSI or MSI-X. Not all machines
+support MSIs correctly, and for those machines, the APIs described below
+will simply fail and the device will continue to use pin-based interrupts.
+
+4.1 Include kernel support for MSIs
+
+To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI
+option enabled. This option is only available on some architectures,
+and it may depend on some other options also being set. For example,
+on x86, you must also enable X86_UP_APIC or SMP in order to see the
+CONFIG_PCI_MSI option.
+
+4.2 Using MSI
+
+Most of the hard work is done for the driver in the PCI layer. The
+driver simply has to request that the PCI layer set up the MSI
+capability for this device.
+
+4.2.1 pci_enable_msi
+
+int pci_enable_msi(struct pci_dev *dev)
+
+A successful call allocates ONE interrupt to the device, regardless
+of how many MSIs the device supports. The device is switched from
+pin-based interrupt mode to MSI mode. The dev->irq number is changed
+to a new number which represents the message signaled interrupt;
+consequently, this function should be called before the driver calls
+request_irq(), because an MSI is delivered via a vector that is
+different from the vector of a pin-based interrupt.
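+
+A minimal calling sequence might look like this (a sketch; the
+foo_driver_setup_irq() and foo_interrupt() names are illustrative):
+
+static int foo_driver_setup_irq(struct pci_dev *pdev, void *ctx)
+{
+	int rc;
+
+	/* Switch to MSI mode first; dev->irq changes on success */
+	rc = pci_enable_msi(pdev);
+	if (rc)
+		return rc;	/* device remains in pin-based mode */
+
+	rc = request_irq(pdev->irq, foo_interrupt, 0, "foo", ctx);
+	if (rc)
+		pci_disable_msi(pdev);
+
+	return rc;
+}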
+
+4.2.2 pci_enable_msi_range
+
+int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
+
+This function allows a device driver to request any number of MSI
+interrupts within the specified range from 'minvec' to 'maxvec'.
+
+If this function returns a positive number, it indicates the number of
+MSI interrupts that have been successfully allocated. In this case
+the device is switched from pin-based interrupt mode to MSI mode and
+dev->irq is updated to the lowest of the new interrupts assigned to it.
+The other interrupts assigned to the device are in the range dev->irq
+to dev->irq + returned value - 1. The device driver can use the returned
+number of successfully allocated MSI interrupts to further allocate
+and initialize device resources.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to request any more MSI interrupts for
+this device.
+
+This function should be called before the driver calls request_irq(),
+because MSI interrupts are delivered via vectors that are different
+from the vector of a pin-based interrupt.
+
+It is ideal if drivers can cope with a variable number of MSI interrupts;
+there are many reasons why the platform may not be able to provide the
+exact number that a driver asks for.
+
+There could be devices that cannot operate with just any number of MSI
+interrupts within a range. See section 4.3.1.3 for an idea of how to
+handle such devices for MSI-X; the same logic applies to MSI.
+
+4.2.2.1 Maximum possible number of MSI interrupts
+
+The typical usage of MSI interrupts is to allocate as many vectors as
+possible, likely up to the limit returned by the pci_msi_vec_count()
+function:
+
+static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
+{
+ return pci_enable_msi_range(pdev, 1, nvec);
+}
+
+Note that the value of the 'minvec' parameter is 1. As 'minvec' is
+inclusive, a value of 0 would be meaningless and could result in an
+error.
+
+Some devices have a minimum limit on the number of MSI interrupts.
+In this case the function could look like this:
+
+static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
+{
+ return pci_enable_msi_range(pdev, FOO_DRIVER_MINIMUM_NVEC, nvec);
+}
+
+4.2.2.2 Exact number of MSI interrupts
+
+If a driver is unable or unwilling to deal with a variable number of MSI
+interrupts, it can request a particular number of interrupts by passing
+that number to the pci_enable_msi_range() function as both the 'minvec'
+and 'maxvec' parameters:
+
+static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
+{
+ return pci_enable_msi_range(pdev, nvec, nvec);
+}
+
+Note that, unlike the pci_enable_msi_exact() function, which can also be
+used to enable a particular number of MSI interrupts,
+pci_enable_msi_range() returns either a negative errno or 'nvec' (not
+a negative errno or 0, as pci_enable_msi_exact() does).
+
+4.2.2.3 Single MSI mode
+
+The most common example of the request type described above is
+enabling single MSI mode for a device. It can be done by passing
+1 as both 'minvec' and 'maxvec':
+
+static int foo_driver_enable_single_msi(struct pci_dev *pdev)
+{
+ return pci_enable_msi_range(pdev, 1, 1);
+}
+
+Note that, unlike the pci_enable_msi() function, which can also be used
+to enable single MSI mode, pci_enable_msi_range() returns either a
+negative errno or 1 (not a negative errno or 0, as pci_enable_msi()
+does).
+
+4.2.3 pci_enable_msi_exact
+
+int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
+
+This variation on the pci_enable_msi_range() call allows a device driver
+to request exactly 'nvec' MSI interrupts.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to request any more MSI interrupts for
+this device.
+
+In contrast with the pci_enable_msi_range() function,
+pci_enable_msi_exact() returns zero on success, indicating that MSI
+interrupts have been successfully allocated.
+
+4.2.4 pci_disable_msi
+
+void pci_disable_msi(struct pci_dev *dev)
+
+This function should be used to undo the effect of pci_enable_msi() or
+pci_enable_msi_range().
+Calling it restores dev->irq to the pin-based interrupt number and frees
+the previously allocated MSIs. The interrupts may subsequently be assigned
+to another device, so drivers should not cache the value of dev->irq.
+
+Before calling this function, a device driver must always call free_irq()
+on any interrupt for which it previously called request_irq().
+Failure to do so results in a BUG_ON(), leaving the device with
+MSI enabled and thus leaking its vector.
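+
+The required ordering might look like this (a sketch; the foo names are
+illustrative):
+
+static void foo_driver_teardown_msi(struct pci_dev *pdev, void *ctx)
+{
+	/* free_irq() must come first, or pci_disable_msi() hits the BUG_ON() */
+	free_irq(pdev->irq, ctx);
+	pci_disable_msi(pdev);
+	/* dev->irq now refers to the pin-based interrupt again */
+}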
+
+4.2.5 pci_msi_vec_count
+
+int pci_msi_vec_count(struct pci_dev *dev)
+
+This function can be used to retrieve the number of MSI vectors the
+device requested (via the Multiple Message Capable register). The MSI
+specification only allows the returned value to be a power of two,
+up to a maximum of 2^5 (32).
+
+If this function returns a negative number, it indicates the device is
+not capable of sending MSIs.
+
+If this function returns a positive number, it indicates the maximum
+number of MSI interrupt vectors that could be allocated.
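+
+For example, a driver might use it to bound the range passed to
+pci_enable_msi_range() (a sketch; the foo name is illustrative):
+
+static int foo_driver_enable_all_msi(struct pci_dev *pdev)
+{
+	int nvec = pci_msi_vec_count(pdev);
+
+	if (nvec < 0)
+		return nvec;	/* device is not MSI capable */
+
+	return pci_enable_msi_range(pdev, 1, nvec);
+}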
+
+4.3 Using MSI-X
+
+The MSI-X capability is much more flexible than the MSI capability.
+It supports up to 2048 interrupts, each of which can be controlled
+independently. To support this flexibility, drivers must use an array of
+`struct msix_entry':
+
+struct msix_entry {
+	u16	vector;	/* kernel fills in the allocated vector */
+	u16	entry;	/* driver specifies which entry to use */
+};
+
+This allows the device to use these interrupts in a sparse fashion;
+for example, it could use interrupts 3 and 1027 and yet allocate only a
+two-element array. The driver is expected to fill in the 'entry' value
+in each element of the array to indicate for which entries the kernel
+should assign interrupts; it is invalid to fill in two entries with the
+same number.
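+
+For example, the two-element array for entries 3 and 1027 mentioned
+above might be declared like this (a sketch; the foo name is
+illustrative):
+
+static struct msix_entry foo_msix_entries[] = {
+	{ .entry = 3 },		/* kernel fills in .vector on success */
+	{ .entry = 1027 },
+};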
+
+4.3.1 pci_enable_msix_range
+
+int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
+ int minvec, int maxvec)
+
+Calling this function asks the PCI subsystem to allocate any number of
+MSI-X interrupts within the specified range from 'minvec' to 'maxvec'.
+The 'entries' argument is a pointer to an array of msix_entry structs
+which should be at least 'maxvec' entries in size.
+
+On success, the device is switched into MSI-X mode and the function
+returns the number of MSI-X interrupts that have been successfully
+allocated. In this case the 'vector' member of each entry numbered
+from 0 to (returned value - 1) is populated with its interrupt number;
+the driver should then call request_irq() for each 'vector' that it
+decides to use. The device driver is responsible for keeping track of the
+interrupts assigned to the MSI-X vectors so it can free them again later.
+The device driver can use the returned number of successfully allocated
+MSI-X interrupts to further allocate and initialize device resources.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to allocate any more MSI-X interrupts for
+this device.
+
+This function, in contrast with pci_enable_msi_range(), does not adjust
+dev->irq. The device will not generate interrupts for this interrupt
+number once MSI-X is enabled.
+
+Device drivers should normally call this function once per device
+during the initialization phase.
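+
+Putting this together, setting up MSI-X and requesting the allocated
+vectors might look like this (a sketch; the foo_*() names, the
+foo_interrupt() handler and the FOO_NVEC limit are illustrative):
+
+static int foo_driver_setup_msix(struct foo_adapter *adapter)
+{
+	int i, rc, nvec;
+
+	for (i = 0; i < FOO_NVEC; i++)
+		adapter->msix_entries[i].entry = i;
+
+	nvec = pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
+				     1, FOO_NVEC);
+	if (nvec < 0)
+		return nvec;
+
+	for (i = 0; i < nvec; i++) {
+		rc = request_irq(adapter->msix_entries[i].vector,
+				 foo_interrupt, 0, "foo", adapter);
+		if (rc)
+			goto undo;
+	}
+
+	return 0;
+
+undo:
+	while (--i >= 0)
+		free_irq(adapter->msix_entries[i].vector, adapter);
+	pci_disable_msix(adapter->pdev);
+	return rc;
+}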
+
+It is ideal if drivers can cope with a variable number of MSI-X interrupts;
+there are many reasons why the platform may not be able to provide the
+exact number that a driver asks for.
+
+There could be devices that cannot operate with just any number of MSI-X
+interrupts within a range. For example, a network adapter might need,
+say, four vectors per queue it provides, in which case the number of
+MSI-X interrupts allocated should be a multiple of four. In this case,
+pci_enable_msix_range() cannot be used alone to request MSI-X interrupts
+(since it can allocate any number within the range, with no notion of
+the multiple-of-four constraint) and the device driver should implement
+custom logic to request the required number of MSI-X interrupts.
+
+4.3.1.1 Maximum possible number of MSI-X interrupts
+
+The typical usage of MSI-X interrupts is to allocate as many vectors as
+possible, likely up to the limit returned by the pci_msix_vec_count()
+function:
+
+static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
+{
+ return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
+ 1, nvec);
+}
+
+Note that the value of the 'minvec' parameter is 1. As 'minvec' is
+inclusive, a value of 0 would be meaningless and could result in an
+error.
+
+Some devices have a minimum limit on the number of MSI-X interrupts.
+In this case the function could look like this:
+
+static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
+{
+ return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
+ FOO_DRIVER_MINIMUM_NVEC, nvec);
+}
+
+4.3.1.2 Exact number of MSI-X interrupts
+
+If a driver is unable or unwilling to deal with a variable number of MSI-X
+interrupts, it can request a particular number of interrupts by passing
+that number to the pci_enable_msix_range() function as both the 'minvec'
+and 'maxvec' parameters:
+
+static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
+{
+ return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
+ nvec, nvec);
+}
+
+Note that, unlike the pci_enable_msix_exact() function, which can also be
+used to enable a particular number of MSI-X interrupts,
+pci_enable_msix_range() returns either a negative errno or 'nvec' (not
+a negative errno or 0, as pci_enable_msix_exact() does).
+
+4.3.1.3 Specific requirements on the number of MSI-X interrupts
+
+As noted above, there could be devices that cannot operate with just any
+number of MSI-X interrupts within a range. For example, assume a device
+that is only capable of sending a number of MSI-X interrupts that is a
+power of two. A routine that enables MSI-X mode for such a device might
+look like this:
+
+/*
+ * Assume 'minvec' and 'maxvec' are non-zero
+ */
+static int foo_driver_enable_msix(struct foo_adapter *adapter,
+ int minvec, int maxvec)
+{
+ int rc;
+
+ minvec = roundup_pow_of_two(minvec);
+ maxvec = rounddown_pow_of_two(maxvec);
+
+ if (minvec > maxvec)
+ return -ERANGE;
+
+retry:
+ rc = pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
+ maxvec, maxvec);
+ /*
+ * -ENOSPC is the only error code allowed to be analyzed
+ */
+ if (rc == -ENOSPC) {
+ if (maxvec == 1)
+ return -ENOSPC;
+
+ maxvec /= 2;
+
+ if (minvec > maxvec)
+ return -ENOSPC;
+
+ goto retry;
+ }
+
+ return rc;
+}
+
+Note how the pci_enable_msix_range() return value is analyzed for a
+fallback: any error code other than -ENOSPC indicates a fatal error and
+should not be retried.
+
+4.3.2 pci_enable_msix_exact
+
+int pci_enable_msix_exact(struct pci_dev *dev,
+ struct msix_entry *entries, int nvec)
+
+This variation on the pci_enable_msix_range() call allows a device driver
+to request exactly 'nvec' MSI-X interrupts.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to allocate any more MSI-X interrupts for
+this device.
+
+In contrast with the pci_enable_msix_range() function,
+pci_enable_msix_exact() returns zero on success, indicating that MSI-X
+interrupts have been successfully allocated.
+
+Another version of a routine that enables MSI-X mode for a device with
+the specific requirements described in section 4.3.1.3 might look like
+this:
+
+/*
+ * Assume 'minvec' and 'maxvec' are non-zero
+ */
+static int foo_driver_enable_msix(struct foo_adapter *adapter,
+ int minvec, int maxvec)
+{
+ int rc;
+
+ minvec = roundup_pow_of_two(minvec);
+ maxvec = rounddown_pow_of_two(maxvec);
+
+ if (minvec > maxvec)
+ return -ERANGE;
+
+retry:
+ rc = pci_enable_msix_exact(adapter->pdev,
+ adapter->msix_entries, maxvec);
+
+ /*
+ * -ENOSPC is the only error code allowed to be analyzed
+ */
+ if (rc == -ENOSPC) {
+ if (maxvec == 1)
+ return -ENOSPC;
+
+ maxvec /= 2;
+
+ if (minvec > maxvec)
+ return -ENOSPC;
+
+ goto retry;
+ } else if (rc < 0) {
+ return rc;
+ }
+
+ return maxvec;
+}
+
+4.3.3 pci_disable_msix
+
+void pci_disable_msix(struct pci_dev *dev)
+
+This function should be used to undo the effect of pci_enable_msix_range().
+It frees the previously allocated MSI-X interrupts. The interrupts may
+subsequently be assigned to another device, so drivers should not cache
+the value of the 'vector' elements over a call to pci_disable_msix().
+
+Before calling this function, a device driver must always call free_irq()
+on any interrupt for which it previously called request_irq().
+Failure to do so results in a BUG_ON(), leaving the device with
+MSI-X enabled and thus leaking its vector.
+
+4.3.4 The MSI-X Table
+
+The MSI-X capability specifies a BAR and offset within that BAR for the
+MSI-X Table. This address is mapped by the PCI subsystem, and should not
+be accessed directly by the device driver. If the driver wishes to
+mask or unmask an interrupt, it should call disable_irq() / enable_irq().
+
+4.3.5 pci_msix_vec_count
+
+int pci_msix_vec_count(struct pci_dev *dev)
+
+This function can be used to retrieve the number of entries in the
+device's MSI-X table.
+
+If this function returns a negative number, it indicates the device is
+not capable of sending MSI-X interrupts.
+
+If this function returns a positive number, it indicates the maximum
+number of MSI-X interrupt vectors that could be allocated.
+
+4.4 Handling devices implementing both MSI and MSI-X capabilities
+
+If a device implements both MSI and MSI-X capabilities, it can
+run in either MSI mode or MSI-X mode, but not both simultaneously.
+This is a requirement of the PCI spec, and it is enforced by the
+PCI layer. Calling pci_enable_msi_range() when MSI-X is already
+enabled or pci_enable_msix_range() when MSI is already enabled
+results in an error. If a device driver wishes to switch between MSI
+and MSI-X at runtime, it must first quiesce the device, then switch
+it back to pin-based interrupt mode, before calling pci_enable_msi_range()
+or pci_enable_msix_range() and resuming operation. This is not expected
+to be a common operation but may be useful for debugging or testing
+during development.
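+
+A sketch of such a switch from MSI-X to MSI, assuming the device has
+already been quiesced and its MSI-X interrupts freed with free_irq()
+(the foo names are illustrative):
+
+static int foo_driver_switch_to_msi(struct foo_adapter *adapter, int nvec)
+{
+	/* Return to pin-based interrupt mode first... */
+	pci_disable_msix(adapter->pdev);
+
+	/* ...then enable the other capability */
+	return pci_enable_msi_range(adapter->pdev, 1, nvec);
+}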
+
+4.5 Considerations when using MSIs
+
+4.5.1 Choosing between MSI-X and MSI
+
+If your device supports both MSI-X and MSI capabilities, you should use
+the MSI-X facilities in preference to the MSI facilities. As mentioned
+above, MSI-X supports any number of interrupts between 1 and 2048.
+In contrast, MSI is restricted to a maximum of 32 interrupts (and
+must be a power of two). In addition, the MSI interrupt vectors must
+be allocated consecutively, so the system might not be able to allocate
+as many vectors for MSI as it could for MSI-X. On some platforms, MSI
+interrupts must all be targeted at the same set of CPUs whereas MSI-X
+interrupts can all be targeted at different CPUs.
+
+4.5.2 Spinlocks
+
+Most device drivers have a per-device spinlock which is taken in the
+interrupt handler. With pin-based interrupts or a single MSI, it is not
+necessary to disable interrupts (Linux guarantees the same interrupt will
+not be re-entered). If a device uses multiple interrupts, the driver
+must disable interrupts while the lock is held. If the device sends
+a different interrupt, the driver will deadlock trying to recursively
+acquire the spinlock.
+
+There are two solutions. The first is to take the lock with
+spin_lock_irqsave() or spin_lock_irq() (see
+Documentation/DocBook/kernel-locking). The second is to specify
+IRQF_DISABLED to request_irq() so that the kernel runs the entire
+interrupt routine with interrupts disabled.
+
+If your MSI interrupt routine does not hold the lock for the whole time
+it is running, the first solution may be best. The second solution is
+normally preferred as it avoids making two transitions from interrupt
+disabled to enabled and back again.
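+
+A sketch of the first solution, with the per-device lock taken in the
+interrupt handler (the foo names are illustrative):
+
+static irqreturn_t foo_interrupt(int irq, void *data)
+{
+	struct foo_adapter *adapter = data;
+	unsigned long flags;
+
+	spin_lock_irqsave(&adapter->lock, flags);
+	/* Handle the event; none of the device's other interrupts can
+	 * be taken on this CPU and deadlock against the lock. */
+	spin_unlock_irqrestore(&adapter->lock, flags);
+
+	return IRQ_HANDLED;
+}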
+
+4.6 How to tell whether MSI/MSI-X is enabled on a device
+
+Using 'lspci -v' (as root) may show some devices with "MSI", "Message
+Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities
+has an 'Enable' flag which is followed by either "+" (enabled)
+or "-" (disabled).
+
+
+5. MSI quirks
+
+Several PCI chipsets or devices are known not to support MSIs.
+The PCI stack provides three ways to disable MSIs:
+
+1. globally
+2. on all devices behind a specific bridge
+3. on a single device
+
+5.1. Disabling MSIs globally
+
+Some host chipsets simply don't support MSIs properly. If we're
+lucky, the manufacturer knows this and has indicated it in the ACPI
+FADT table. In this case, Linux automatically disables MSIs.
+Some boards don't include this information in the table and so we have
+to detect them ourselves. The complete list of these is found near the
+quirk_disable_all_msi() function in drivers/pci/quirks.c.
+
+If you have a board which has problems with MSIs, you can pass pci=nomsi
+on the kernel command line to disable MSIs on all devices. It would be
+in your best interests to report the problem to linux-pci@vger.kernel.org
+including a full 'lspci -v' so we can add the quirks to the kernel.
+
+5.2. Disabling MSIs below a bridge
+
+Some PCI bridges are not able to route MSIs between busses properly.
+In this case, MSIs must be disabled on all devices behind the bridge.
+
+Some bridges allow you to enable MSIs by changing some bits in their
+PCI configuration space (especially the Hypertransport chipsets such
+as the nVidia nForce and Serverworks HT2000). As with host chipsets,
+Linux mostly knows about them and automatically enables MSIs if it can.
+If you have a bridge unknown to Linux, you can enable
+MSIs in configuration space using whatever method you know works, then
+enable MSIs on that bridge by doing:
+
+ echo 1 > /sys/bus/pci/devices/$bridge/msi_bus
+
+where $bridge is the PCI address of the bridge you've enabled (e.g.
+0000:00:0e.0).
+
+To disable MSIs, echo 0 instead of 1. Changing this value should be
+done with caution as it could break interrupt handling for all devices
+below this bridge.
+
+Again, please notify linux-pci@vger.kernel.org of any bridges that need
+special handling.
+
+5.3. Disabling MSIs on a single device
+
+Some devices are known to have faulty MSI implementations. Usually this
+is handled in the individual device driver, but occasionally it's necessary
+to handle this with a quirk. Some drivers have an option to disable use
+of MSI. While this is a convenient workaround for the driver author,
+it is not good practice, and should not be emulated.
+
+5.4. Finding why MSIs are disabled on a device
+
+From the above three sections, you can see that there are many reasons
+why MSIs may not be enabled for a given device. Your first step should
+be to examine your dmesg carefully to determine whether MSIs are enabled
+for your machine. You should also check your .config to be sure you
+have enabled CONFIG_PCI_MSI.
+
+Then, 'lspci -t' gives the list of bridges above a device. Reading
+/sys/bus/pci/devices/*/msi_bus will tell you whether MSIs are enabled (1)
+or disabled (0). If 0 is found in any of the msi_bus files belonging
+to bridges between the PCI root and the device, MSIs are disabled.
+
+It is also worth checking the device driver to see whether it supports MSIs.
+For example, it may contain calls to pci_enable_msi_range() or
+pci_enable_msix_range().
diff --git a/Documentation/PCI/PCIEBUS-HOWTO.txt b/Documentation/PCI/PCIEBUS-HOWTO.txt
index 9a07e38631b..6bd5f372ade 100644
--- a/Documentation/PCI/PCIEBUS-HOWTO.txt
+++ b/Documentation/PCI/PCIEBUS-HOWTO.txt
@@ -93,7 +93,7 @@ the PCI Express Port Bus driver from loading a service driver.
int pcie_port_service_register(struct pcie_port_service_driver *new)
-This API replaces the Linux Driver Model's pci_module_init API. A
+This API replaces the Linux Driver Model's pci_register_driver API. A
service driver should always call pcie_port_service_register at
module init. Note that after the service driver is loaded, calls
such as pci_enable_device(dev) and pci_set_master(dev) are no longer
diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt
index 6650af43252..898ded24510 100644
--- a/Documentation/PCI/pci-error-recovery.txt
+++ b/Documentation/PCI/pci-error-recovery.txt
@@ -4,15 +4,17 @@
February 2, 2006
Current document maintainer:
- Linas Vepstas <linas@austin.ibm.com>
+ Linas Vepstas <linasvepstas@gmail.com>
+ updated by Richard Lary <rlary@us.ibm.com>
+ and Mike Mason <mmlnx@us.ibm.com> on 27-Jul-2009
Many PCI bus controllers are able to detect a variety of hardware
PCI errors on the bus, such as parity errors on the data and address
busses, as well as SERR and PERR errors. Some of the more advanced
chipsets are able to deal with these errors; these include PCI-E chipsets,
-and the PCI-host bridges found on IBM Power4 and Power5-based pSeries
-boxes. A typical action taken is to disconnect the affected device,
+and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
+pSeries boxes. A typical action taken is to disconnect the affected device,
halting all I/O to it. The goal of a disconnection is to avoid system
corruption; for example, to halt system memory corruption due to DMA's
to "wild" addresses. Typically, a reconnection mechanism is also
@@ -37,10 +39,11 @@ is forced by the need to handle multi-function devices, that is,
devices that have multiple device drivers associated with them.
In the first stage, each driver is allowed to indicate what type
of reset it desires, the choices being a simple re-enabling of I/O
-or requesting a hard reset (a full electrical #RST of the PCI card).
-If any driver requests a full reset, that is what will be done.
+or requesting a slot reset.
-After a full reset and/or a re-enabling of I/O, all drivers are
+If any driver requests a slot reset, that is what will be done.
+
+After a reset and/or a re-enabling of I/O, all drivers are
again notified, so that they may then perform any device setup/config
that may be required. After these have all completed, a final
"resume normal operations" event is sent out.
@@ -101,7 +104,7 @@ if it implements any, it must implement error_detected(). If a callback
is not implemented, the corresponding feature is considered unsupported.
For example, if mmio_enabled() and resume() aren't there, then it
is assumed that the driver is not doing any direct recovery and requires
-a reset. If link_reset() is not implemented, the card is assumed as
+a slot reset. If link_reset() is not implemented, the card is assumed to
not care about link resets. Typically a driver will want to know about
a slot_reset().
@@ -111,7 +114,7 @@ sequence described below.
STEP 0: Error Event
-------------------
-PCI bus error is detect by the PCI hardware. On powerpc, the slot
+A PCI bus error is detected by the PCI hardware. On powerpc, the slot
is isolated, in that all I/O is blocked: all reads return 0xffffffff,
all writes are ignored.
@@ -139,7 +142,7 @@ The driver must return one of the following result codes:
a chance to extract some diagnostic information (see
mmio_enable, below).
- PCI_ERS_RESULT_NEED_RESET:
- Driver returns this if it can't recover without a hard
+ Driver returns this if it can't recover without a
slot reset.
- PCI_ERS_RESULT_DISCONNECT:
Driver returns this if it doesn't want to recover at all.
@@ -169,11 +172,11 @@ is STEP 6 (Permanent Failure).
>>> The current powerpc implementation doesn't much care if the device
>>> attempts I/O at this point, or not. I/O's will fail, returning
->>> a value of 0xff on read, and writes will be dropped. If the device
->>> driver attempts more than 10K I/O's to a frozen adapter, it will
->>> assume that the device driver has gone into an infinite loop, and
->>> it will panic the kernel. There doesn't seem to be any other
->>> way of stopping a device driver that insists on spinning on I/O.
+>>> a value of 0xff on read, and writes will be dropped. If more than
+>>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
+>>> assumes that the device driver has gone into an infinite loop
+>>> and prints an error to syslog. A reboot is then required to
+>>> get the device working again.
STEP 2: MMIO Enabled
-------------------
@@ -182,15 +185,14 @@ DMA), and then calls the mmio_enabled() callback on all affected
device drivers.
This is the "early recovery" call. IOs are allowed again, but DMA is
-not (hrm... to be discussed, I prefer not), with some restrictions. This
-is NOT a callback for the driver to start operations again, only to
-peek/poke at the device, extract diagnostic information, if any, and
-eventually do things like trigger a device local reset or some such,
-but not restart operations. This is callback is made if all drivers on
-a segment agree that they can try to recover and if no automatic link reset
-was performed by the HW. If the platform can't just re-enable IOs without
-a slot reset or a link reset, it wont call this callback, and instead
-will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
+not, with some restrictions. This is NOT a callback for the driver to
+start operations again, only to peek/poke at the device, extract diagnostic
+information, if any, and eventually do things like trigger a device local
+reset or some such, but not restart operations. This callback is made if
+all drivers on a segment agree that they can try to recover and if no automatic
+link reset was performed by the HW. If the platform can't just re-enable IOs
+without a slot reset or a link reset, it will not call this callback, and
+instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
>>> The following is proposed; no platform implements this yet:
>>> Proposal: All I/O's should be done _synchronously_ from within
@@ -214,7 +216,7 @@ The driver should return one of the following result codes:
- PCI_ERS_RESULT_NEED_RESET
Driver returns this if it thinks the device is not
- recoverable in it's current state and it needs a slot
+ recoverable in its current state and it needs a slot
reset to proceed.
- PCI_ERS_RESULT_DISCONNECT
@@ -228,9 +230,6 @@ proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
proceeds to STEP 4 (Slot Reset)
->>> The current powerpc implementation does not implement this callback.
-
-
STEP 3: Link Reset
------------------
The platform resets the link, and then calls the link_reset() callback
@@ -242,7 +241,7 @@ in working condition.
The driver is not supposed to restart normal driver I/O operations
at this point. It should limit itself to "probing" the device to
-check it's recoverability status. If all is right, then the platform
+check its recoverability status. If all is right, then the platform
will call resume() once all drivers have ack'd link_reset().
Result codes:
@@ -253,16 +252,33 @@ The platform then proceeds to either STEP 4 (Slot Reset) or STEP 5
>>> The current powerpc implementation does not implement this callback.
-
STEP 4: Slot Reset
------------------
-The platform performs a soft or hard reset of the device, and then
-calls the slot_reset() callback.
-A soft reset consists of asserting the adapter #RST line and then
+In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
+platform will perform a slot reset on the requesting PCI device(s).
+The actual steps taken by a platform to perform a slot reset
+will be platform-dependent. Upon completion of slot reset, the
+platform will call the device slot_reset() callback.
+
+Powerpc platforms implement two levels of slot reset:
+soft reset (default) and fundamental reset (optional).
+
+Powerpc soft reset consists of asserting the adapter #RST line and then
restoring the PCI BAR's and PCI configuration header to a state
that is equivalent to what it would be after a fresh system
power-on followed by power-on BIOS/system firmware initialization.
+Soft reset is also known as hot-reset.
+
+Powerpc fundamental reset is supported by PCI Express cards only
+and results in the device's state machines, hardware logic, port states
+and configuration registers initializing to their default conditions.
+
+For most PCI devices, a soft reset will be sufficient for recovery.
+Optional fundamental reset is provided to support a limited number
+of PCI Express devices for which a soft reset is not sufficient
+for recovery.
+
If the platform supports PCI hotplug, then the reset might be
performed by toggling the slot electrical power off/on.
@@ -274,10 +290,12 @@ may result in hung devices, kernel panics, or silent data corruption.
This call gives drivers the chance to re-initialize the hardware
(re-download firmware, etc.). At this point, the driver may assume
-that he card is in a fresh state and is fully functional. In
-particular, interrupt generation should work normally.
+that the card is in a fresh state and is fully functional. The slot
+is unfrozen and the driver has full access to PCI config space,
+memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X)
+will also be available.
-Drivers should not yet restart normal I/O processing operations
+Drivers should not restart normal I/O processing operations
at this point. If all device drivers report success on this
callback, the platform will call resume() to complete the sequence,
and let the driver restart normal I/O processing.
@@ -302,11 +320,21 @@ driver performs device init only from PCI function 0:
- PCI_ERS_RESULT_DISCONNECT
Same as above.
+Drivers for PCI Express cards that require a fundamental reset must
+set the needs_freset bit in the pci_dev structure in their probe function.
+For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
+PCI card types:
+
++ /* Set EEH reset type to fundamental if required by hba */
++ if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
++ pdev->needs_freset = 1;
++
+
Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
Failure).
->>> The current powerpc implementation does not currently try a
->>> power-cycle reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
+>>> The current powerpc implementation does not try a power-cycle
+>>> reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
>>> However, it probably should.
@@ -348,7 +376,7 @@ software errors.
Conclusion; General Remarks
---------------------------
-The way those callbacks are called is platform policy. A platform with
+The way the callbacks are called is platform policy. A platform with
no slot reset capability may want to just "ignore" drivers that can't
recover (disconnect them) and try to let other cards on the same segment
recover. Keep in mind that in most real life cases, though, there will
@@ -361,8 +389,8 @@ That is, the recovery API only requires that:
- There is no guarantee that interrupt delivery can proceed from any
device on the segment starting from the error detection and until the
-resume callback is sent, at which point interrupts are expected to be
-fully operational.
+slot_reset callback is called, at which point interrupts are expected
+to be fully operational.
- There is no guarantee that interrupt delivery is stopped, that is,
a driver that gets an interrupt after detecting an error, or that detects
@@ -381,16 +409,23 @@ anyway :)
>>> Implementation details for the powerpc platform are discussed in
>>> the file Documentation/powerpc/eeh-pci-error-recovery.txt
->>> As of this writing, there are six device drivers with patches
->>> implementing error recovery. Not all of these patches are in
+>>> As of this writing, there is a growing list of device drivers with
+>>> patches implementing error recovery. Not all of these patches are in
>>> mainline yet. These may be used as "examples":
>>>
->>> drivers/scsi/ipr.c
->>> drivers/scsi/sym53cxx_2
+>>> drivers/scsi/ipr
+>>> drivers/scsi/sym53c8xx_2
+>>> drivers/scsi/qla2xxx
+>>> drivers/scsi/lpfc
+>>> drivers/net/bnx2.c
>>> drivers/net/e100.c
>>> drivers/net/e1000
+>>> drivers/net/e1000e
>>> drivers/net/ixgb
+>>> drivers/net/ixgbe
+>>> drivers/net/cxgb3
>>> drivers/net/s2io.c
+>>> drivers/net/qlge
The End
-------
diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt
new file mode 100644
index 00000000000..2d91ae25198
--- /dev/null
+++ b/Documentation/PCI/pci-iov-howto.txt
@@ -0,0 +1,135 @@
+ PCI Express I/O Virtualization Howto
+ Copyright (C) 2009 Intel Corporation
+ Yu Zhao <yu.zhao@intel.com>
+
+ Update: November 2012
+ -- sysfs-based SR-IOV enablement/disablement
+ Donald Dutile <ddutile@redhat.com>
+
+1. Overview
+
+1.1 What is SR-IOV
+
+Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
+Capability which makes one physical device appear as multiple virtual
+devices. The physical device is referred to as a Physical Function (PF)
+while the virtual devices are referred to as Virtual Functions (VFs).
+Allocation of VFs can be dynamically controlled by the PF via
+registers encapsulated in the capability. By default, this feature is
+not enabled and the PF behaves as a traditional PCIe device. Once it's
+turned on, each VF's PCI configuration space can be accessed by its own
+Bus, Device and Function Number (Routing ID). Each VF also has PCI
+Memory Space, which is used to map its register set. The VF device
+driver operates on the register set so the VF can be functional and
+appear as a real PCI device.
+
+2. User Guide
+
+2.1 How can I enable SR-IOV capability
+
+Multiple methods are available to enable SR-IOV.
+In the first method, the device driver (PF driver) controls the
+enabling and disabling of the capability via the API provided by the
+SR-IOV core. If the hardware has SR-IOV capability, loading its PF
+driver would enable it and all VFs associated with the PF. Some PF
+drivers require a module parameter to be set to determine the number
+of VFs to enable.
+In the second method, a write to the sysfs file sriov_numvfs will
+enable and disable the VFs associated with a PCIe PF. Unlike the
+first method, which applies to all PFs of the same device, this
+method allows VFs to be enabled and disabled per PF. Additionally,
+the PCI SR-IOV core support ensures that enable/disable operations
+are valid, reducing duplication of the same checks in multiple
+drivers, e.g., checking that numvfs == 0 when enabling VFs and that
+numvfs <= totalvfs.
+The second method is the recommended method for new/future VF devices.
+
+2.2 How can I use the Virtual Functions
+
+VFs are treated as hot-plugged PCI devices in the kernel, so they
+should be able to work in the same way as real PCI devices. A VF
+requires a device driver, just as a normal PCI device does.
+
+3. Developer Guide
+
+3.1 SR-IOV API
+
+To enable SR-IOV capability:
+(a) For the first method, in the driver:
+ int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
+ 'nr_virtfn' is number of VFs to be enabled.
+(b) For the second method, from sysfs:
+ echo 'nr_virtfn' > \
+ /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
+
+To disable SR-IOV capability:
+(a) For the first method, in the driver:
+ void pci_disable_sriov(struct pci_dev *dev);
+(b) For the second method, from sysfs:
+ echo 0 > \
+ /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
+
+3.2 Usage example
+
+The following piece of code illustrates the usage of the SR-IOV API.
+
+static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
+{
+ pci_enable_sriov(dev, NR_VIRTFN);
+
+ ...
+
+ return 0;
+}
+
+static void dev_remove(struct pci_dev *dev)
+{
+ pci_disable_sriov(dev);
+
+ ...
+}
+
+static int dev_suspend(struct pci_dev *dev, pm_message_t state)
+{
+ ...
+
+ return 0;
+}
+
+static int dev_resume(struct pci_dev *dev)
+{
+ ...
+
+ return 0;
+}
+
+static void dev_shutdown(struct pci_dev *dev)
+{
+ ...
+}
+
+static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
+{
+ if (numvfs > 0) {
+ ...
+ pci_enable_sriov(dev, numvfs);
+ ...
+ return numvfs;
+ }
+	if (numvfs == 0) {
+		...
+		pci_disable_sriov(dev);
+		...
+		return 0;
+	}
+
+	/* numvfs < 0 is not expected here; return an error so the
+	 * function has a well-defined result on every path */
+	return -EINVAL;
+}
+
+static struct pci_driver dev_driver = {
+ .name = "SR-IOV Physical Function driver",
+ .id_table = dev_id_table,
+ .probe = dev_probe,
+ .remove = dev_remove,
+ .suspend = dev_suspend,
+ .resume = dev_resume,
+ .shutdown = dev_shutdown,
+ .sriov_configure = dev_sriov_configure,
+};
diff --git a/Documentation/PCI/pci.txt b/Documentation/PCI/pci.txt
index 8d4dc6250c5..9518006f667 100644
--- a/Documentation/PCI/pci.txt
+++ b/Documentation/PCI/pci.txt
@@ -123,8 +123,10 @@ initialization with a pointer to a structure describing the driver
The ID table is an array of struct pci_device_id entries ending with an
-all-zero entry; use of the macro DEFINE_PCI_DEVICE_TABLE is the preferred
-method of declaring the table. Each entry consists of:
+all-zero entry. Definitions with static const are generally preferred.
+Use of the deprecated macro DEFINE_PCI_DEVICE_TABLE should be avoided.
+
+Each entry consists of:
vendor,device Vendor and device ID to match (or PCI_ANY_ID)
@@ -163,6 +165,10 @@ need pass only as many optional fields as necessary:
o class and classmask fields default to 0
o driver_data defaults to 0UL.
+Note that driver_data must match the value used by any of the pci_device_id
+entries defined in the driver. This makes the driver_data field mandatory
+if all the pci_device_id entries have a non-zero driver_data value.
+
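+For example, a table whose entries use driver_data as an index into a
+driver-private board table might look like this (a sketch; the names
+are illustrative):
+
+static const struct pci_device_id foo_pci_tbl[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x1229),
+	  .driver_data = FOO_BOARD_A },
+	{ 0, }
+};
+MODULE_DEVICE_TABLE(pci, foo_pci_tbl);
+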
Once added, the driver probe routine will be invoked for any unclaimed
PCI devices listed in its (newly updated) pci_ids list.
@@ -179,12 +185,6 @@ Please mark the initialization and cleanup functions where appropriate
initializes.
__exit Exit code. Ignored for non-modular drivers.
-
- __devinit Device initialization code.
- Identical to __init if the kernel is not compiled
- with CONFIG_HOTPLUG, normal function otherwise.
- __devexit The same for __exit.
-
Tips on when/where to use the above attributes:
o The module_init()/module_exit() functions (and all
initialization functions called _only_ from these)
@@ -192,20 +192,6 @@ Tips on when/where to use the above attributes:
o Do not mark the struct pci_driver.
- o The ID table array should be marked __devinitconst; this is done
- automatically if the table is declared with DEFINE_PCI_DEVICE_TABLE().
-
- o The probe() and remove() functions should be marked __devinit
- and __devexit respectively. All initialization functions
- exclusively called by the probe() routine, can be marked __devinit.
- Ditto for remove() and __devexit.
-
- o If mydriver_remove() is marked with __devexit(), then all address
- references to mydriver_remove must use __devexit_p(mydriver_remove)
- (in the struct pci_driver declaration for example).
- __devexit_p() will generate the function name _or_ NULL if the
- function will be discarded. For an example, see drivers/net/tg3.c.
-
o Do NOT mark a function if you are not sure which mark to use.
Better to not mark the function than mark the function wrong.
@@ -290,7 +276,8 @@ NOTE: pci_enable_device() can fail! Check the return value.
pci_set_master() will enable DMA by setting the bus master bit
in the PCI_COMMAND register. It also fixes the latency timer value if
-it's set to something bogus by the BIOS.
+it's set to something bogus by the BIOS. pci_clear_master() will
+disable DMA by clearing the bus master bit.
If the PCI device can use the PCI Memory-Write-Invalidate transaction,
call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval
@@ -309,7 +296,7 @@ from the PCI device config space. Use the values in the pci_dev structure
as the PCI "bus address" might have been remapped to a "host physical"
address by the arch/chip-set specific kernel support.
-See Documentation/IO-mapping.txt for how to access device registers
+See Documentation/io-mapping.txt for how to access device registers
or device memory.
The device driver needs to call pci_request_region() to verify
@@ -540,8 +527,9 @@ corresponding register block for you.
6. Other interesting functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-pci_find_slot() Find pci_dev corresponding to given bus and
- slot numbers.
+pci_get_domain_bus_and_slot()	Find pci_dev corresponding to given domain,
+				bus, and device/function number. If the
+				device is found, its reference count is
+				increased.
pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3)
pci_find_capability() Find specified capability in device's capability
list.
@@ -576,7 +564,7 @@ to be handled by platform and generic code, not individual drivers.
8. Vendor and device identifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-One is not not required to add new device ids to include/linux/pci_ids.h.
+One is not required to add new device ids to include/linux/pci_ids.h.
Please add PCI_VENDOR_ID_xxx for vendors and a hex constant for device ids.
PCI_VENDOR_ID_xxx constants are re-used. The device ids are arbitrary
@@ -597,7 +585,8 @@ having sane locking.
pci_find_device() Superseded by pci_get_device()
pci_find_subsys() Superseded by pci_get_subsys()
-pci_find_slot() Superseded by pci_get_slot()
+pci_find_slot() Superseded by pci_get_domain_bus_and_slot()
+pci_get_slot() Superseded by pci_get_domain_bus_and_slot()
The alternative is the traditional PCI device driver that walks PCI
diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
index 16c251230c8..26d3d945c3c 100644
--- a/Documentation/PCI/pcieaer-howto.txt
+++ b/Documentation/PCI/pcieaer-howto.txt
@@ -13,7 +13,7 @@ Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
PCI Express AER driver.
-1.2 Copyright © Intel Corporation 2006.
+1.2 Copyright (C) Intel Corporation 2006.
1.3 What is the PCI Express AER Driver?
be initiated even though the firmware has no _OSC support. To enable this
workaround, please add aerdriver.forceload=y to the kernel boot parameter
line when booting the kernel. Note that forceload=n by default.
+nosourceid, another parameter of type bool, can be used when broken
+hardware (mostly chipsets) has root ports that cannot obtain the reporting
+source ID. nosourceid=n by default.
+
2.3 AER error output
When a PCI-E AER error is captured, an error message will be output to
the console. If it's a correctable error, it is output as a warning.
Otherwise, it is printed as an error, so users can choose a different
log level to filter out correctable error messages.
-Below shows an example.
-+------ PCI-Express Device Error -----+
-Error Severity : Uncorrected (Fatal)
-PCIE Bus Error type : Transaction Layer
-Unsupported Request : First
-Requester ID : 0500
-VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h
-TLB Header:
-04000001 00200a03 05010000 00050100
+An example is shown below:
+0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
+0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
+0000:50:00.0: [20] Unsupported Request (First)
+0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100
In the example, 'Requester ID' means the ID of the device that sends
the error message to the root port. Please refer to the PCI Express specs for
@@ -108,7 +108,7 @@ but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.
When AER is enabled, a PCI Express device will automatically send an
-error message to the PCIE root port above it when the device captures
+error message to the PCIe root port above it when the device captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its PCI Express
capability structure. Error information being logged includes storing
to reset the link, the AER port service driver is required to provide the
function to reset the link. First, the kernel checks whether the upstream
component has an AER driver. If it does, the kernel uses the reset_link
callback of that driver. If the upstream component has no AER driver
-and the port is downstream port, we will use the aer driver of the
-root port who reports the AER error. As for upstream ports,
+and the port is a downstream port, we will perform a hot reset as the
+default by setting the Secondary Bus Reset bit of the Bridge Control
+register associated with the downstream port. As for upstream ports,
they should provide their own aer service drivers with reset_link
function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and
reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
@@ -203,22 +204,17 @@ to mmio_enabled.
3.3 helper functions
-3.3.1 int pci_find_aer_capability(struct pci_dev *dev);
-pci_find_aer_capability locates the PCI Express AER capability
-in the device configuration space. If the device doesn't support
-PCI-Express AER, the function returns 0.
-
-3.3.2 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
+3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
pci_enable_pcie_error_reporting enables the device to send error
messages to the root port when an error is detected. Note that devices
don't enable error reporting by default, so device drivers need to
call this function to enable it.
-3.3.3 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
+3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
pci_disable_pcie_error_reporting disables the device from sending error
messages to the root port when an error is detected.
-3.3.4 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
+3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
pci_cleanup_aer_uncorrect_error_status cleans up the uncorrectable
error status register.
@@ -251,3 +247,24 @@ with the PCI Express AER Root driver?
A: It can call the helper functions to enable AER in devices and
clean up the uncorrectable status register. Please refer to section 3.3.
+
+4. Software error injection
+
+Debugging PCIe AER error recovery code is quite difficult because it
+is hard to trigger real hardware errors. Software based error
+injection can be used to fake various kinds of PCIe errors.
+
+First you should enable PCIe AER software error injection in the kernel
+configuration; that is, the following item should be in your .config:
+
+CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m
+
+After rebooting with the new kernel or inserting the module, a device
+file named /dev/aer_inject should be created.
+
+Then, you need a user space tool named aer-inject, which can be obtained
+from:
+ http://www.kernel.org/pub/linux/utils/pci/aer-inject/
+
+More information about aer-inject can be found in the documentation that
+comes with its source code.