6 files changed, 509 insertions, 195 deletions
diff --git a/Documentation/PCI/00-INDEX b/Documentation/PCI/00-INDEX
index 812b17fe3ed..147231f1613 100644
--- a/Documentation/PCI/00-INDEX
+++ b/Documentation/PCI/00-INDEX
@@ -2,12 +2,12 @@
 	- this file
 MSI-HOWTO.txt
 	- the Message Signaled Interrupts (MSI) Driver Guide HOWTO and FAQ.
-PCI-DMA-mapping.txt
-	- info for PCI drivers using DMA portably across all platforms
 PCIEBUS-HOWTO.txt
 	- a guide describing the PCI Express Port Bus driver
 pci-error-recovery.txt
 	- info on PCI error recovery
+pci-iov-howto.txt
+	- the PCI Express I/O Virtualization HOWTO
 pci.txt
 	- info on the PCI subsystem for device driver authors
 pcieaer-howto.txt
diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
index dcf7acc720e..10a93696e55 100644
--- a/Documentation/PCI/MSI-HOWTO.txt
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -45,7 +45,7 @@ arrived in memory (this becomes more likely with devices behind PCI-PCI
 bridges).  In order to ensure that all the data has arrived in memory,
 the interrupt handler must read a register on the device which raised
 the interrupt.  PCI transaction ordering rules require that all the data
-arrives in memory before the value can be returned from the register.
+arrive in memory before the value may be returned from the register.
 Using MSIs avoids this problem as the interrupt-generating write cannot
 pass the data writes, so by the time the interrupt is raised, the driver
 knows that all the data has arrived in memory.
@@ -86,61 +86,143 @@ device.
 
 int pci_enable_msi(struct pci_dev *dev)
 
-A successful call will allocate ONE interrupt to the device, regardless
-of how many MSIs the device supports.  The device will be switched from
+A successful call allocates ONE interrupt to the device, regardless
+of how many MSIs the device supports.  The device is switched from
 pin-based interrupt mode to MSI mode.  The dev->irq number is changed
-to a new number which represents the message signaled interrupt.
-This function should be called before the driver calls request_irq()
-since enabling MSIs disables the pin-based IRQ and the driver will not
-receive interrupts on the old interrupt.
+to a new number which represents the message signaled interrupt;
+consequently, this function should be called before the driver calls
+request_irq(), because an MSI is delivered via a vector that is
+different from the vector of a pin-based interrupt.
 
-4.2.2 pci_enable_msi_block
+4.2.2 pci_enable_msi_range
 
-int pci_enable_msi_block(struct pci_dev *dev, int count)
+int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
 
-This variation on the above call allows a device driver to request multiple
-MSIs.  The MSI specification only allows interrupts to be allocated in
-powers of two, up to a maximum of 2^5 (32).
+This function allows a device driver to request any number of MSI
+interrupts within specified range from 'minvec' to 'maxvec'.
 
-If this function returns 0, it has succeeded in allocating at least as many
-interrupts as the driver requested (it may have allocated more in order
-to satisfy the power-of-two requirement).  In this case, the function
-enables MSI on this device and updates dev->irq to be the lowest of
-the new interrupts assigned to it.  The other interrupts assigned to
-the device are in the range dev->irq to dev->irq + count - 1.
+If this function returns a positive number it indicates the number of
+MSI interrupts that have been successfully allocated.  In this case
+the device is switched from pin-based interrupt mode to MSI mode and
+updates dev->irq to be the lowest of the new interrupts assigned to it.
+The other interrupts assigned to the device are in the range dev->irq
+to dev->irq + returned value - 1.  Device driver can use the returned
+number of successfully allocated MSI interrupts to further allocate
+and initialize device resources.
 
 If this function returns a negative number, it indicates an error and
 the driver should not attempt to request any more MSI interrupts for
-this device.  If this function returns a positive number, it will be
-less than 'count' and indicate the number of interrupts that could have
-been allocated.  In neither case will the irq value have been
-updated, nor will the device have been switched into MSI mode.
-
-The device driver must decide what action to take if
-pci_enable_msi_block() returns a value less than the number asked for.
-Some devices can make use of fewer interrupts than the maximum they
-request; in this case the driver should call pci_enable_msi_block()
-again.  Note that it is not guaranteed to succeed, even when the
-'count' has been reduced to the value returned from a previous call to
-pci_enable_msi_block().  This is because there are multiple constraints
-on the number of vectors that can be allocated; pci_enable_msi_block()
-will return as soon as it finds any constraint that doesn't allow the
-call to succeed.
-
-4.2.3 pci_disable_msi
+this device.
+
+This function should be called before the driver calls request_irq(),
+because MSI interrupts are delivered via vectors that are different
+from the vector of a pin-based interrupt.
+
+It is ideal if drivers can cope with a variable number of MSI interrupts;
+there are many reasons why the platform may not be able to provide the
+exact number that a driver asks for.
+
+There could be devices that can not operate with just any number of MSI
+interrupts within a range.  See chapter 4.3.1.3 to get the idea how to
+handle such devices for MSI-X - the same logic applies to MSI.
+
+4.2.1.1 Maximum possible number of MSI interrupts
+
+The typical usage of MSI interrupts is to allocate as many vectors as
+possible, likely up to the limit returned by pci_msi_vec_count() function:
+
+static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
+{
+	return pci_enable_msi_range(pdev, 1, nvec);
+}
+
+Note the value of 'minvec' parameter is 1.  As 'minvec' is inclusive,
+the value of 0 would be meaningless and could result in error.
+
+Some devices have a minimal limit on number of MSI interrupts.
+In this case the function could look like this:
+
+static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
+{
+	return pci_enable_msi_range(pdev, FOO_DRIVER_MINIMUM_NVEC, nvec);
+}
+
+4.2.1.2 Exact number of MSI interrupts
+
+If a driver is unable or unwilling to deal with a variable number of MSI
+interrupts it could request a particular number of interrupts by passing
+that number to pci_enable_msi_range() function as both 'minvec' and 'maxvec'
+parameters:
+
+static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
+{
+	return pci_enable_msi_range(pdev, nvec, nvec);
+}
+
+Note, unlike pci_enable_msi_exact() function, which could be also used to
+enable a particular number of MSI-X interrupts, pci_enable_msi_range()
+returns either a negative errno or 'nvec' (not negative errno or 0 - as
+pci_enable_msi_exact() does).
+
+4.2.1.3 Single MSI mode
+
+The most notorious example of the request type described above is
+enabling the single MSI mode for a device.  It could be done by passing
+two 1s as 'minvec' and 'maxvec':
+
+static int foo_driver_enable_single_msi(struct pci_dev *pdev)
+{
+	return pci_enable_msi_range(pdev, 1, 1);
+}
+
+Note, unlike pci_enable_msi() function, which could be also used to
+enable the single MSI mode, pci_enable_msi_range() returns either a
+negative errno or 1 (not negative errno or 0 - as pci_enable_msi()
+does).
+
+4.2.3 pci_enable_msi_exact
+
+int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
+
+This variation on pci_enable_msi_range() call allows a device driver to
+request exactly 'nvec' MSIs.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to request any more MSI interrupts for
+this device.
+
+By contrast with pci_enable_msi_range() function, pci_enable_msi_exact()
+returns zero in case of success, which indicates MSI interrupts have been
+successfully allocated.
+
+4.2.4 pci_disable_msi
 
 void pci_disable_msi(struct pci_dev *dev)
 
-This function should be used to undo the effect of pci_enable_msi() or
-pci_enable_msi_block().  Calling it restores dev->irq to the pin-based
-interrupt number and frees the previously allocated message signaled
-interrupt(s).  The interrupt may subsequently be assigned to another
-device, so drivers should not cache the value of dev->irq.
+This function should be used to undo the effect of pci_enable_msi_range().
+Calling it restores dev->irq to the pin-based interrupt number and frees
+the previously allocated MSIs.  The interrupts may subsequently be assigned
+to another device, so drivers should not cache the value of dev->irq.
+
+Before calling this function, a device driver must always call free_irq()
+on any interrupt for which it previously called request_irq().
+Failure to do so results in a BUG_ON(), leaving the device with
+MSI enabled and thus leaking its vector.
 
-A device driver must always call free_irq() on the interrupt(s)
-for which it has called request_irq() before calling this function.
-Failure to do so will result in a BUG_ON(), the device will be left with
-MSI enabled and will leak its vector.
+4.2.4 pci_msi_vec_count
+
+int pci_msi_vec_count(struct pci_dev *dev)
+
+This function could be used to retrieve the number of MSI vectors the
+device requested (via the Multiple Message Capable register). The MSI
+specification only allows the returned value to be a power of two,
+up to a maximum of 2^5 (32).
+
+If this function returns a negative number, it indicates the device is
+not capable of sending MSIs.
+
+If this function returns a positive number, it indicates the maximum
+number of MSI interrupt vectors that could be allocated.
 
 4.3 Using MSI-X
 
@@ -155,72 +237,213 @@ struct msix_entry {
 };
 
 This allows for the device to use these interrupts in a sparse fashion;
-for example it could use interrupts 3 and 1027 and allocate only a
+for example, it could use interrupts 3 and 1027 and yet allocate only a
 two-element array.  The driver is expected to fill in the 'entry' value
-in each element of the array to indicate which entries it wants the kernel
-to assign interrupts for.  It is invalid to fill in two entries with the
+in each element of the array to indicate for which entries the kernel
+should assign interrupts; it is invalid to fill in two entries with the
 same number.
 
-4.3.1 pci_enable_msix
+4.3.1 pci_enable_msix_range
 
-int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
+int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
+			  int minvec, int maxvec)
 
-Calling this function asks the PCI subsystem to allocate 'nvec' MSIs.
+Calling this function asks the PCI subsystem to allocate any number of
+MSI-X interrupts within specified range from 'minvec' to 'maxvec'.
 The 'entries' argument is a pointer to an array of msix_entry structs
-which should be at least 'nvec' entries in size.  On success, the
-function will return 0 and the device will have been switched into
-MSI-X interrupt mode.  The 'vector' elements in each entry will have
-been filled in with the interrupt number.  The driver should then call
-request_irq() for each 'vector' that it decides to use.
+which should be at least 'maxvec' entries in size.
+
+On success, the device is switched into MSI-X mode and the function
+returns the number of MSI-X interrupts that have been successfully
+allocated.  In this case the 'vector' member in entries numbered from
+0 to the returned value - 1 is populated with the interrupt number;
+the driver should then call request_irq() for each 'vector' that it
+decides to use.  The device driver is responsible for keeping track of the
+interrupts assigned to the MSI-X vectors so it can free them again later.
+Device driver can use the returned number of successfully allocated MSI-X
+interrupts to further allocate and initialize device resources.
 
 If this function returns a negative number, it indicates an error and
 the driver should not attempt to allocate any more MSI-X interrupts for
-this device.  If it returns a positive number, it indicates the maximum
-number of interrupt vectors that could have been allocated. See example
-below.
+this device.
 
-This function, in contrast with pci_enable_msi(), does not adjust
+This function, in contrast with pci_enable_msi_range(), does not adjust
 dev->irq.  The device will not generate interrupts for this interrupt
-number once MSI-X is enabled.  The device driver is responsible for
-keeping track of the interrupts assigned to the MSI-X vectors so it can
-free them again later.
+number once MSI-X is enabled.
 
 Device drivers should normally call this function once per device
 during the initialization phase.
 
-It is ideal if drivers can cope with a variable number of MSI-X interrupts,
+It is ideal if drivers can cope with a variable number of MSI-X interrupts;
 there are many reasons why the platform may not be able to provide the
-exact number a driver asks for.
+exact number that a driver asks for.
+
+There could be devices that can not operate with just any number of MSI-X
+interrupts within a range.  E.g., an network adapter might need let's say
+four vectors per each queue it provides.  Therefore, a number of MSI-X
+interrupts allocated should be a multiple of four.  In this case interface
+pci_enable_msix_range() can not be used alone to request MSI-X interrupts
+(since it can allocate any number within the range, without any notion of
+the multiple of four) and the device driver should master a custom logic
+to request the required number of MSI-X interrupts.
+
+4.3.1.1 Maximum possible number of MSI-X interrupts
 
-A request loop to achieve that might look like:
+The typical usage of MSI-X interrupts is to allocate as many vectors as
+possible, likely up to the limit returned by pci_msix_vec_count() function:
 
 static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
 {
-	while (nvec >= FOO_DRIVER_MINIMUM_NVEC) {
-		rc = pci_enable_msix(adapter->pdev,
-				     adapter->msix_entries, nvec);
-		if (rc > 0)
-			nvec = rc;
-		else
-			return rc;
+	return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
+				     1, nvec);
+}
+
+Note the value of 'minvec' parameter is 1.  As 'minvec' is inclusive,
+the value of 0 would be meaningless and could result in error.
+
+Some devices have a minimal limit on number of MSI-X interrupts.
+In this case the function could look like this:
+
+static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
+{
+	return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
+				     FOO_DRIVER_MINIMUM_NVEC, nvec);
+}
+
+4.3.1.2 Exact number of MSI-X interrupts
+
+If a driver is unable or unwilling to deal with a variable number of MSI-X
+interrupts it could request a particular number of interrupts by passing
+that number to pci_enable_msix_range() function as both 'minvec' and 'maxvec'
+parameters:
+
+static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
+{
+	return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
+				     nvec, nvec);
+}
+
+Note, unlike pci_enable_msix_exact() function, which could be also used to
+enable a particular number of MSI-X interrupts, pci_enable_msix_range()
+returns either a negative errno or 'nvec' (not negative errno or 0 - as
+pci_enable_msix_exact() does).
+
+4.3.1.3 Specific requirements to the number of MSI-X interrupts
+
+As noted above, there could be devices that can not operate with just any
+number of MSI-X interrupts within a range.  E.g., let's assume a device that
+is only capable sending the number of MSI-X interrupts which is a power of
+two.  A routine that enables MSI-X mode for such device might look like this:
+
+/*
+ * Assume 'minvec' and 'maxvec' are non-zero
+ */
+static int foo_driver_enable_msix(struct foo_adapter *adapter,
+				  int minvec, int maxvec)
+{
+	int rc;
+
+	minvec = roundup_pow_of_two(minvec);
+	maxvec = rounddown_pow_of_two(maxvec);
+
+	if (minvec > maxvec)
+		return -ERANGE;
+
+retry:
+	rc = pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
+				   maxvec, maxvec);
+	/*
+	 * -ENOSPC is the only error code allowed to be analized
+	 */
+	if (rc == -ENOSPC) {
+		if (maxvec == 1)
+			return -ENOSPC;
+
+		maxvec /= 2;
+
+		if (minvec > maxvec)
+			return -ENOSPC;
+
+		goto retry;
 	}
 
-	return -ENOSPC;
+	return rc;
 }
 
-4.3.2 pci_disable_msix
+Note how pci_enable_msix_range() return value is analized for a fallback -
+any error code other than -ENOSPC indicates a fatal error and should not
+be retried.
+
+4.3.2 pci_enable_msix_exact
+
+int pci_enable_msix_exact(struct pci_dev *dev,
+			  struct msix_entry *entries, int nvec)
+
+This variation on pci_enable_msix_range() call allows a device driver to
+request exactly 'nvec' MSI-Xs.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to allocate any more MSI-X interrupts for
+this device.
+
+By contrast with pci_enable_msix_range() function, pci_enable_msix_exact()
+returns zero in case of success, which indicates MSI-X interrupts have been
+successfully allocated.
+
+Another version of a routine that enables MSI-X mode for a device with
+specific requirements described in chapter 4.3.1.3 might look like this:
+
+/*
+ * Assume 'minvec' and 'maxvec' are non-zero
+ */
+static int foo_driver_enable_msix(struct foo_adapter *adapter,
+				  int minvec, int maxvec)
+{
+	int rc;
+
+	minvec = roundup_pow_of_two(minvec);
+	maxvec = rounddown_pow_of_two(maxvec);
+
+	if (minvec > maxvec)
+		return -ERANGE;
+
+retry:
+	rc = pci_enable_msix_exact(adapter->pdev,
+				   adapter->msix_entries, maxvec);
+
+	/*
+	 * -ENOSPC is the only error code allowed to be analyzed
+	 */
+	if (rc == -ENOSPC) {
+		if (maxvec == 1)
+			return -ENOSPC;
+
+		maxvec /= 2;
+
+		if (minvec > maxvec)
+			return -ENOSPC;
+
+		goto retry;
+	} else if (rc < 0) {
+		return rc;
+	}
+
+	return maxvec;
+}
+
+4.3.3 pci_disable_msix
 
 void pci_disable_msix(struct pci_dev *dev)
 
-This API should be used to undo the effect of pci_enable_msix().  It frees
-the previously allocated message signaled interrupts.  The interrupts may
+This function should be used to undo the effect of pci_enable_msix_range().
+It frees the previously allocated MSI-X interrupts. The interrupts may
 subsequently be assigned to another device, so drivers should not cache
 the value of the 'vector' elements over a call to pci_disable_msix().
 
-A device driver must always call free_irq() on the interrupt(s)
-for which it has called request_irq() before calling this function.
-Failure to do so will result in a BUG_ON(), the device will be left with
-MSI enabled and will leak its vector.
+Before calling this function, a device driver must always call free_irq()
+on any interrupt for which it previously called request_irq().
+Failure to do so results in a BUG_ON(), leaving the device with
+MSI-X enabled and thus leaking its vector.
 
 4.3.3 The MSI-X Table
 
@@ -229,18 +452,32 @@ MSI-X Table.  This address is mapped by the PCI subsystem, and should not
 be accessed directly by the device driver.  If the driver wishes to
 mask or unmask an interrupt, it should call disable_irq() / enable_irq().
 
+4.3.4 pci_msix_vec_count
+
+int pci_msix_vec_count(struct pci_dev *dev)
+
+This function could be used to retrieve number of entries in the device
+MSI-X table.
+
+If this function returns a negative number, it indicates the device is
+not capable of sending MSI-Xs.
+
+If this function returns a positive number, it indicates the maximum
+number of MSI-X interrupt vectors that could be allocated.
+
 4.4 Handling devices implementing both MSI and MSI-X capabilities
 
 If a device implements both MSI and MSI-X capabilities, it can
-run in either MSI mode or MSI-X mode but not both simultaneously.
+run in either MSI mode or MSI-X mode, but not both simultaneously.
 This is a requirement of the PCI spec, and it is enforced by the
-PCI layer.  Calling pci_enable_msi() when MSI-X is already enabled or
-pci_enable_msix() when MSI is already enabled will result in an error.
-If a device driver wishes to switch between MSI and MSI-X at runtime,
-it must first quiesce the device, then switch it back to pin-interrupt
-mode, before calling pci_enable_msi() or pci_enable_msix() and resuming
-operation.  This is not expected to be a common operation but may be
-useful for debugging or testing during development.
+PCI layer.  Calling pci_enable_msi_range() when MSI-X is already
+enabled or pci_enable_msix_range() when MSI is already enabled
+results in an error.  If a device driver wishes to switch between MSI
+and MSI-X at runtime, it must first quiesce the device, then switch
+it back to pin-interrupt mode, before calling pci_enable_msi_range()
+or pci_enable_msix_range() and resuming operation.  This is not expected
+to be a common operation but may be useful for debugging or testing
+during development.
 
 4.5 Considerations when using MSIs
 
@@ -251,10 +488,10 @@ the MSI-X facilities in preference to the MSI facilities.  As mentioned
 above, MSI-X supports any number of interrupts between 1 and 2048.
 In constrast, MSI is restricted to a maximum of 32 interrupts (and
 must be a power of two).  In addition, the MSI interrupt vectors must
-be allocated consecutively, so the system may not be able to allocate
+be allocated consecutively, so the system might not be able to allocate
 as many vectors for MSI as it could for MSI-X.  On some platforms, MSI
-interrupts must all be targetted at the same set of CPUs whereas MSI-X
-interrupts can all be targetted at different CPUs.
+interrupts must all be targeted at the same set of CPUs whereas MSI-X
+interrupts can all be targeted at different CPUs.
 
 4.5.2 Spinlocks
 
@@ -281,7 +518,7 @@ disabled to enabled and back again.
 
 Using 'lspci -v' (as root) may show some devices with "MSI", "Message
 Signalled Interrupts" or "MSI-X" capabilities.  Each of these capabilities
-has an 'Enable' flag which will be followed with either "+" (enabled)
+has an 'Enable' flag which is followed with either "+" (enabled)
 or "-" (disabled).
 
 
@@ -298,7 +535,7 @@ The PCI stack provides three ways to disable MSIs:
 
 Some host chipsets simply don't support MSIs properly.  If we're
 lucky, the manufacturer knows this and has indicated it in the ACPI
-FADT table.  In this case, Linux will automatically disable MSIs.
+FADT table.  In this case, Linux automatically disables MSIs.
 Some boards don't include this information in the table and so we have
 to detect them ourselves.  The complete list of these is found near the
 quirk_disable_all_msi() function in drivers/pci/quirks.c.
@@ -317,7 +554,7 @@ Some bridges allow you to enable MSIs by changing some bits in their
 PCI configuration space (especially the Hypertransport chipsets such
 as the nVidia nForce and Serverworks HT2000).  As with host chipsets,
 Linux mostly knows about them and automatically enables MSIs if it can.
-If you have a bridge which Linux doesn't yet know about, you can enable
+If you have a bridge unknown to Linux, you can enable
 MSIs in configuration space using whatever method you know works, then
 enable MSIs on that bridge by doing:
 
@@ -327,7 +564,7 @@ where $bridge is the PCI address of the bridge you've enabled (eg
 0000:00:0e.0).
 
 To disable MSIs, echo 0 instead of 1.  Changing this value should be
-done with caution as it can break interrupt handling for all devices
+done with caution as it could break interrupt handling for all devices
 below this bridge.
 
 Again, please notify linux-pci@vger.kernel.org of any bridges that need
@@ -336,7 +573,7 @@ special handling.
 5.3. Disabling MSIs on a single device
 
 Some devices are known to have faulty MSI implementations.  Usually this
-is handled in the individual device driver but occasionally it's necessary
+is handled in the individual device driver, but occasionally it's necessary
 to handle this with a quirk.  Some drivers have an option to disable use
 of MSI.  While this is a convenient workaround for the driver author,
 it is not good practise, and should not be emulated.
@@ -350,10 +587,10 @@ for your machine.  You should also check your .config to be sure you
 have enabled CONFIG_PCI_MSI.
 
 Then, 'lspci -t' gives the list of bridges above a device.  Reading
-/sys/bus/pci/devices/*/msi_bus will tell you whether MSI are enabled (1)
+/sys/bus/pci/devices/*/msi_bus will tell you whether MSIs are enabled (1)
 or disabled (0).  If 0 is found in any of the msi_bus files belonging
 to bridges between the PCI root and the device, MSIs are disabled.
 
 It is also worth checking the device driver to see whether it supports MSIs.
-For example, it may contain calls to pci_enable_msi(), pci_enable_msix() or
-pci_enable_msi_block().
+For example, it may contain calls to pci_enable_msi_range() or
+pci_enable_msix_range().
diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt
index 6650af43252..898ded24510 100644
--- a/Documentation/PCI/pci-error-recovery.txt
+++ b/Documentation/PCI/pci-error-recovery.txt
@@ -4,15 +4,17 @@
                         February 2, 2006
 
                  Current document maintainer:
-             Linas Vepstas <linas@austin.ibm.com>
+             Linas Vepstas <linasvepstas@gmail.com>
+          updated by Richard Lary <rlary@us.ibm.com>
+       and Mike Mason <mmlnx@us.ibm.com> on 27-Jul-2009
 
 
 Many PCI bus controllers are able to detect a variety of hardware
 PCI errors on the bus, such as parity errors on the data and address
 busses, as well as SERR and PERR errors.  Some of the more advanced
 chipsets are able to deal with these errors; these include PCI-E chipsets,
-and the PCI-host bridges found on IBM Power4 and Power5-based pSeries
-boxes. A typical action taken is to disconnect the affected device,
+and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
+pSeries boxes. A typical action taken is to disconnect the affected device,
 halting all I/O to it.  The goal of a disconnection is to avoid system
 corruption; for example, to halt system memory corruption due to DMA's
 to "wild" addresses. Typically, a reconnection mechanism is also
@@ -37,10 +39,11 @@ is forced by the need to handle multi-function devices, that is,
 devices that have multiple device drivers associated with them.
 In the first stage, each driver is allowed to indicate what type
 of reset it desires, the choices being a simple re-enabling of I/O
-or requesting a hard reset (a full electrical #RST of the PCI card).
-If any driver requests a full reset, that is what will be done.
+or requesting a slot reset.
 
-After a full reset and/or a re-enabling of I/O, all drivers are
+If any driver requests a slot reset, that is what will be done.
+
+After a reset and/or a re-enabling of I/O, all drivers are
 again notified, so that they may then perform any device setup/config
 that may be required.  After these have all completed, a final
 "resume normal operations" event is sent out.
@@ -101,7 +104,7 @@ if it implements any, it must implement error_detected(). If a callback
 is not implemented, the corresponding feature is considered unsupported.
 For example, if mmio_enabled() and resume() aren't there, then it
 is assumed that the driver is not doing any direct recovery and requires
-a reset. If link_reset() is not implemented, the card is assumed as
+a slot reset. If link_reset() is not implemented, the card is assumed to
 not care about link resets. Typically a driver will want to know about
 a slot_reset().
 
@@ -111,7 +114,7 @@ sequence described below.
 
 STEP 0: Error Event
 -------------------
-PCI bus error is detect by the PCI hardware.  On powerpc, the slot
+A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
 is isolated, in that all I/O is blocked: all reads return 0xffffffff,
 all writes are ignored.
 
@@ -139,7 +142,7 @@ The driver must return one of the following result codes:
 		  a chance to extract some diagnostic information (see
 		  mmio_enable, below).
 		- PCI_ERS_RESULT_NEED_RESET:
-		  Driver returns this if it can't recover without a hard
+		  Driver returns this if it can't recover without a
 		  slot reset.
 		- PCI_ERS_RESULT_DISCONNECT:
 		  Driver returns this if it doesn't want to recover at all.
@@ -169,11 +172,11 @@ is STEP 6 (Permanent Failure).
 
 >>> The current powerpc implementation doesn't much care if the device
 >>> attempts I/O at this point, or not.  I/O's will fail, returning
->>> a value of 0xff on read, and writes will be dropped. If the device
->>> driver attempts more than 10K I/O's to a frozen adapter, it will
->>> assume that the device driver has gone into an infinite loop, and
->>> it will panic the kernel. There doesn't seem to be any other
->>> way of stopping a device driver that insists on spinning on I/O.
+>>> a value of 0xff on read, and writes will be dropped. If more than
+>>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
+>>> assumes that the device driver has gone into an infinite loop
+>>> and prints an error to syslog.  A reboot is then required to 
+>>> get the device working again.
 
 STEP 2: MMIO Enabled
 -------------------
@@ -182,15 +185,14 @@ DMA), and then calls the mmio_enabled() callback on all affected
 device drivers.
 
 This is the "early recovery" call. IOs are allowed again, but DMA is
-not (hrm... to be discussed, I prefer not), with some restrictions. This
-is NOT a callback for the driver to start operations again, only to
-peek/poke at the device, extract diagnostic information, if any, and
-eventually do things like trigger a device local reset or some such,
-but not restart operations. This is callback is made if all drivers on
-a segment agree that they can try to recover and if no automatic link reset
-was performed by the HW. If the platform can't just re-enable IOs without
-a slot reset or a link reset, it wont call this callback, and instead
-will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
+not, with some restrictions. This is NOT a callback for the driver to
+start operations again, only to peek/poke at the device, extract diagnostic
+information, if any, and eventually do things like trigger a device local
+reset or some such, but not restart operations. This callback is made if
+all drivers on a segment agree that they can try to recover and if no automatic
+link reset was performed by the HW. If the platform can't just re-enable IOs
+without a slot reset or a link reset, it will not call this callback, and
+instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
 
 >>> The following is proposed; no platform implements this yet:
 >>> Proposal: All I/O's should be done _synchronously_ from within
@@ -214,7 +216,7 @@ The driver should return one of the following result codes:
 
 		- PCI_ERS_RESULT_NEED_RESET
 		  Driver returns this if it thinks the device is not
-		  recoverable in it's current state and it needs a slot
+		  recoverable in its current state and it needs a slot
 		  reset to proceed.
 
 		- PCI_ERS_RESULT_DISCONNECT
@@ -228,9 +230,6 @@ proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
 If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
 proceeds to STEP 4 (Slot Reset)
 
->>> The current powerpc implementation does not implement this callback.
-
-
 STEP 3: Link Reset
 ------------------
 The platform resets the link, and then calls the link_reset() callback
@@ -242,7 +241,7 @@ in working condition.
 
 The driver is not supposed to restart normal driver I/O operations
 at this point.  It should limit itself to "probing" the device to
-check it's recoverability status. If all is right, then the platform
+check its recoverability status. If all is right, then the platform
 will call resume() once all drivers have ack'd link_reset().
 
 	Result codes:
@@ -253,16 +252,33 @@ The platform then proceeds to either STEP 4 (Slot Reset) or STEP 5
 
 >>> The current powerpc implementation does not implement this callback.
 
-
 STEP 4: Slot Reset
 ------------------
-The platform performs a soft or hard reset of the device, and then
-calls the slot_reset() callback.
 
-A soft reset consists of asserting the adapter #RST line and then
+In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
+the platform will peform a slot reset on the requesting PCI device(s). 
+The actual steps taken by a platform to perform a slot reset
+will be platform-dependent. Upon completion of slot reset, the
+platform will call the device slot_reset() callback.
+
+Powerpc platforms implement two levels of slot reset:
+soft reset(default) and fundamental(optional) reset.
+
+Powerpc soft reset consists of asserting the adapter #RST line and then
 restoring the PCI BAR's and PCI configuration header to a state
 that is equivalent to what it would be after a fresh system
 power-on followed by power-on BIOS/system firmware initialization.
+Soft reset is also known as hot-reset.
+
+Powerpc fundamental reset is supported by PCI Express cards only
+and results in device's state machines, hardware logic, port states and
+configuration registers to initialize to their default conditions.
+
+For most PCI devices, a soft reset will be sufficient for recovery.
+Optional fundamental reset is provided to support a limited number
+of PCI Express PCI devices  for which a soft reset is not sufficient
+for recovery.
+
 If the platform supports PCI hotplug, then the reset might be
 performed by toggling the slot electrical power off/on.
 
@@ -274,10 +290,12 @@ may result in hung devices, kernel panics, or silent data corruption.
 
 This call gives drivers the chance to re-initialize the hardware
 (re-download firmware, etc.).  At this point, the driver may assume
-that he card is in a fresh state and is fully functional. In
-particular, interrupt generation should work normally.
+that the card is in a fresh state and is fully functional. The slot
+is unfrozen and the driver has full access to PCI config space,
+memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X)
+will also be available.
 
-Drivers should not yet restart normal I/O processing operations
+Drivers should not restart normal I/O processing operations
 at this point.  If all device drivers report success on this
 callback, the platform will call resume() to complete the sequence,
 and let the driver restart normal I/O processing.
@@ -302,11 +320,21 @@ driver performs device init only from PCI function 0:
 		- PCI_ERS_RESULT_DISCONNECT
 		Same as above.
 
+Drivers for PCI Express cards that require a fundamental reset must
+set the needs_freset bit in the pci_dev structure in their probe function.  
+For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
+PCI card types:
+
++	/* Set EEH reset type to fundamental if required by hba  */
++	if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
++		pdev->needs_freset = 1;
++
+
 Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
 Failure).
 
->>> The current powerpc implementation does not currently try a
->>> power-cycle reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
+>>> The current powerpc implementation does not try a power-cycle
+>>> reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
 >>> However, it probably should.
 
 
@@ -348,7 +376,7 @@ software errors.
 
 Conclusion; General Remarks
 ---------------------------
-The way those callbacks are called is platform policy. A platform with
+The way the callbacks are called is platform policy. A platform with
 no slot reset capability may want to just "ignore" drivers that can't
 recover (disconnect them) and try to let other cards on the same segment
 recover. Keep in mind that in most real life cases, though, there will
@@ -361,8 +389,8 @@ That is, the recovery API only requires that:
 
  - There is no guarantee that interrupt delivery can proceed from any
 device on the segment starting from the error detection and until the
-resume callback is sent, at which point interrupts are expected to be
-fully operational.
+slot_reset callback is called, at which point interrupts are expected
+to be fully operational.
 
  - There is no guarantee that interrupt delivery is stopped, that is,
 a driver that gets an interrupt after detecting an error, or that detects
@@ -381,16 +409,23 @@ anyway :)
 >>> Implementation details for the powerpc platform are discussed in
 >>> the file Documentation/powerpc/eeh-pci-error-recovery.txt
 
->>> As of this writing, there are six device drivers with patches
->>> implementing error recovery. Not all of these patches are in
+>>> As of this writing, there is a growing list of device drivers with
+>>> patches implementing error recovery. Not all of these patches are in
 >>> mainline yet. These may be used as "examples":
 >>>
->>> drivers/scsi/ipr.c
->>> drivers/scsi/sym53cxx_2
+>>> drivers/scsi/ipr
+>>> drivers/scsi/sym53c8xx_2
+>>> drivers/scsi/qla2xxx
+>>> drivers/scsi/lpfc
+>>> drivers/next/bnx2.c
 >>> drivers/next/e100.c
 >>> drivers/net/e1000
+>>> drivers/net/e1000e
 >>> drivers/net/ixgb
+>>> drivers/net/ixgbe
+>>> drivers/net/cxgb3
 >>> drivers/net/s2io.c
+>>> drivers/net/qlge
 
 The End
 -------
diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt
index fc73ef5d65b..2d91ae25198 100644
--- a/Documentation/PCI/pci-iov-howto.txt
+++ b/Documentation/PCI/pci-iov-howto.txt
@@ -2,6 +2,9 @@
 		Copyright (C) 2009 Intel Corporation
 		    Yu Zhao <yu.zhao@intel.com>
 
+		Update: November 2012
+			-- sysfs-based SRIOV enable-/disable-ment
+		Donald Dutile <ddutile@redhat.com>
 
 1. Overview
 
@@ -24,10 +27,21 @@ real existing PCI device.
 
 2.1 How can I enable SR-IOV capability
 
-The device driver (PF driver) will control the enabling and disabling
-of the capability via API provided by SR-IOV core. If the hardware
-has SR-IOV capability, loading its PF driver would enable it and all
-VFs associated with the PF.
+Multiple methods are available for SR-IOV enablement.
+In the first method, the device driver (PF driver) will control the
+enabling and disabling of the capability via API provided by SR-IOV core.
+If the hardware has SR-IOV capability, loading its PF driver would
+enable it and all VFs associated with the PF.  Some PF drivers require
+a module parameter to be set to determine the number of VFs to enable.
+In the second method, a write to the sysfs file sriov_numvfs will
+enable and disable the VFs associated with a PCIe PF.  This method
+enables per-PF, VF enable/disable values versus the first method,
+which applies to all PFs of the same device.  Additionally, the
+PCI SRIOV core support ensures that enable/disable operations are
+valid to reduce duplication in multiple drivers for the same
+checks, e.g., check numvfs == 0 if enabling VFs, ensure
+numvfs <= totalvfs.
+The second method is the recommended method for new/future VF devices.
 
 2.2 How can I use the Virtual Functions
 
@@ -40,20 +54,25 @@ requires device driver that is same as a normal PCI device's.
 3.1 SR-IOV API
 
 To enable SR-IOV capability:
+(a) For the first method, in the driver:
 	int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 	'nr_virtfn' is number of VFs to be enabled.
+(b) For the second method, from sysfs:
+	echo 'nr_virtfn' > \
+        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
 
 To disable SR-IOV capability:
+(a) For the first method, in the driver:
 	void pci_disable_sriov(struct pci_dev *dev);
-
-To notify SR-IOV core of Virtual Function Migration:
-	irqreturn_t pci_sriov_migration(struct pci_dev *dev);
+(b) For the second method, from sysfs:
+	echo  0 > \
+        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
 
 3.2 Usage example
 
 Following piece of code illustrates the usage of the SR-IOV API.
 
-static int __devinit dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
+static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
 {
 	pci_enable_sriov(dev, NR_VIRTFN);
 
@@ -62,7 +81,7 @@ static int __devinit dev_probe(struct pci_dev *dev, const struct pci_device_id *
 	return 0;
 }
 
-static void __devexit dev_remove(struct pci_dev *dev)
+static void dev_remove(struct pci_dev *dev)
 {
 	pci_disable_sriov(dev);
 
@@ -88,12 +107,29 @@ static void dev_shutdown(struct pci_dev *dev)
 	...
 }
 
+static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
+{
+	if (numvfs > 0) {
+		...
+		pci_enable_sriov(dev, numvfs);
+		...
+		return numvfs;
+	}
+	if (numvfs == 0) {
+		....
+		pci_disable_sriov(dev);
+		...
+		return 0;
+	}
+}
+
 static struct pci_driver dev_driver = {
 	.name =		"SR-IOV Physical Function driver",
 	.id_table =	dev_id_table,
 	.probe =	dev_probe,
-	.remove =	__devexit_p(dev_remove),
+	.remove =	dev_remove,
 	.suspend =	dev_suspend,
 	.resume =	dev_resume,
 	.shutdown =	dev_shutdown,
+	.sriov_configure = dev_sriov_configure,
 };
diff --git a/Documentation/PCI/pci.txt b/Documentation/PCI/pci.txt
index 7f6de6ea5b4..9518006f667 100644
--- a/Documentation/PCI/pci.txt
+++ b/Documentation/PCI/pci.txt
@@ -123,8 +123,10 @@ initialization with a pointer to a structure describing the driver
 
 
 The ID table is an array of struct pci_device_id entries ending with an
-all-zero entry; use of the macro DEFINE_PCI_DEVICE_TABLE is the preferred
-method of declaring the table.  Each entry consists of:
+all-zero entry.  Definitions with static const are generally preferred.
+Use of the deprecated macro DEFINE_PCI_DEVICE_TABLE should be avoided.
+
+Each entry consists of:
 
 	vendor,device	Vendor and device ID to match (or PCI_ANY_ID)
 
@@ -183,12 +185,6 @@ Please mark the initialization and cleanup functions where appropriate
 			initializes.
 	__exit		Exit code. Ignored for non-modular drivers.
 
-
-	__devinit	Device initialization code.
-			Identical to __init if the kernel is not compiled
-			with CONFIG_HOTPLUG, normal function otherwise.
-	__devexit	The same for __exit.
-
 Tips on when/where to use the above attributes:
 	o The module_init()/module_exit() functions (and all
 	  initialization functions called _only_ from these)
@@ -196,20 +192,6 @@ Tips on when/where to use the above attributes:
 
 	o Do not mark the struct pci_driver.
 
-	o The ID table array should be marked __devinitconst; this is done
-	  automatically if the table is declared with DEFINE_PCI_DEVICE_TABLE().
-
-	o The probe() and remove() functions should be marked __devinit
-	  and __devexit respectively.  All initialization functions
-	  exclusively called by the probe() routine, can be marked __devinit.
-	  Ditto for remove() and __devexit.
-
-	o If mydriver_remove() is marked with __devexit(), then all address
-	  references to mydriver_remove must use __devexit_p(mydriver_remove)
-	  (in the struct pci_driver declaration for example).
-	  __devexit_p() will generate the function name _or_ NULL if the
-	  function will be discarded.  For an example, see drivers/net/tg3.c.
-
 	o Do NOT mark a function if you are not sure which mark to use.
 	  Better to not mark the function than mark the function wrong.
 
@@ -314,7 +296,7 @@ from the PCI device config space. Use the values in the pci_dev structure
 as the PCI "bus address" might have been remapped to a "host physical"
 address by the arch/chip-set specific kernel support.
 
-See Documentation/IO-mapping.txt for how to access device registers
+See Documentation/io-mapping.txt for how to access device registers
 or device memory.
 
 The device driver needs to call pci_request_region() to verify
@@ -545,8 +527,9 @@ corresponding register block for you.
 6. Other interesting functions
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-pci_find_slot()			Find pci_dev corresponding to given bus and
-				slot numbers.
+pci_get_domain_bus_and_slot()	Find pci_dev corresponding to given domain,
+				bus and slot and number. If the device is
+				found, its reference count is increased.
 pci_set_power_state()		Set PCI Power Management state (0=D0 ... 3=D3)
 pci_find_capability()		Find specified capability in device's capability
 				list.
@@ -581,7 +564,7 @@ to be handled by platform and generic code, not individual drivers.
 8. Vendor and device identifications
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-One is not not required to add new device ids to include/linux/pci_ids.h.
+One is not required to add new device ids to include/linux/pci_ids.h.
 Please add PCI_VENDOR_ID_xxx for vendors and a hex constant for device ids.
 
 PCI_VENDOR_ID_xxx constants are re-used. The device ids are arbitrary
@@ -602,7 +585,8 @@ having sane locking.
 
 pci_find_device()	Superseded by pci_get_device()
 pci_find_subsys()	Superseded by pci_get_subsys()
-pci_find_slot()		Superseded by pci_get_slot()
+pci_find_slot()		Superseded by pci_get_domain_bus_and_slot()
+pci_get_slot()		Superseded by pci_get_domain_bus_and_slot()
 
 
 The alternative is the traditional PCI device driver that walks PCI
diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
index ddeb14beacc..26d3d945c3c 100644
--- a/Documentation/PCI/pcieaer-howto.txt
+++ b/Documentation/PCI/pcieaer-howto.txt
@@ -13,7 +13,7 @@ Reporting (AER) driver and provides information on how to use it, as
 well as how to enable the drivers of endpoint devices to conform with
 PCI Express AER driver.
 
-1.2 Copyright � Intel Corporation 2006.
+1.2 Copyright (C) Intel Corporation 2006.
 
 1.3 What is the PCI Express AER Driver?
 
@@ -61,21 +61,21 @@ be initiated although firmwares have no _OSC support. To enable the
 walkaround, pls. add aerdriver.forceload=y to kernel boot parameter line
 when booting kernel. Note that forceload=n by default.
 
+nosourceid, another parameter of type bool, can be used when broken
+hardware (mostly chipsets) has root ports that cannot obtain the reporting
+source ID. nosourceid=n by default.
+
 2.3 AER error output
 When a PCI-E AER error is captured, an error message will be outputed to
 console. If it's a correctable error, it is outputed as a warning.
 Otherwise, it is printed as an error. So users could choose different
 log level to filter out correctable error messages.
 
-Below shows an example.
-+------ PCI-Express Device Error -----+
-Error Severity          : Uncorrected (Fatal)
-PCIE Bus Error type     : Transaction Layer
-Unsupported Request     : First
-Requester ID            : 0500
-VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h
-TLB Header:
-04000001 00200a03 05010000 00050100
+Below shows an example:
+0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
+0000:50:00.0:   device [8086:0329] error status/mask=00100000/00000000
+0000:50:00.0:    [20] Unsupported Request    (First)
+0000:50:00.0:   TLP Header: 04000001 00200a03 05010000 00050100
 
 In the example, 'Requester ID' means the ID of the device who sends
 the error message to root port. Pls. refer to pci express specs for
@@ -108,7 +108,7 @@ but the PCI Express link itself is fully functional. Fatal errors, on
 the other hand, cause the link to be unreliable.
 
 When AER is enabled, a PCI Express device will automatically send an
-error message to the PCIE root port above it when the device captures
+error message to the PCIe root port above it when the device captures
 an error. The Root Port, upon receiving an error reporting message,
 internally processes and logs the error message in its PCI Express
 capability structure. Error information being logged includes storing
@@ -194,8 +194,9 @@ to reset link, AER port service driver is required to provide the
 function to reset link. Firstly, kernel looks for if the upstream
 component has an aer driver. If it has, kernel uses the reset_link
 callback of the aer driver. If the upstream component has no aer driver
-and the port is downstream port, we will use the aer driver of the
-root port who reports the AER error. As for upstream ports,
+and the port is downstream port, we will perform a hot reset as the
+default by setting the Secondary Bus Reset bit of the Bridge Control
+register associated with the downstream port. As for upstream ports,
 they should provide their own aer service drivers with reset_link
 function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and
 reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
@@ -246,3 +247,24 @@ with the PCI Express AER Root driver?
 A: It could call the helper functions to enable AER in devices and
 cleanup uncorrectable status register. Pls. refer to section 3.3.
 
+
+4. Software error injection
+
+Debugging PCIe AER error recovery code is quite difficult because it
+is hard to trigger real hardware errors. Software based error
+injection can be used to fake various kinds of PCIe errors.
+
+First you should enable PCIe AER software error injection in kernel
+configuration, that is, following item should be in your .config.
+
+CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m
+
+After reboot with new kernel or insert the module, a device file named
+/dev/aer_inject should be created.
+
+Then, you need a user space tool named aer-inject, which can be gotten
+from:
+    http://www.kernel.org/pub/linux/utils/pci/aer-inject/
+
+More information about aer-inject can be found in the document comes
+with its source code.