diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2013-05-02 14:14:04 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2013-05-02 14:14:04 -0700 |
commit | 736a2dd2571ac56b11ed95a7814d838d5311be04 (patch) | |
tree | de10d107025970c6e51d5b6faeba799ed4b9caae | |
parent | 0b2e3b6bb4a415379f16e38fc92db42379be47a1 (diff) | |
parent | 01d779a14ef800b74684d9692add4944df052461 (diff) |
Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull virtio & lguest updates from Rusty Russell:
"Lots of virtio work which wasn't quite ready for last merge window.
Plus I dived into lguest again, reworking the pagetable code so we can
move the switcher page: our fixmaps sometimes take more than 2MB now..."
Ugh. Annoying conflicts with the tcm_vhost -> vhost_scsi rename.
Hopefully correctly resolved.
* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (57 commits)
caif_virtio: Remove bouncing email addresses
lguest: improve code readability in lg_cpu_start.
virtio-net: fill only rx queues which are being used
lguest: map Switcher below fixmap.
lguest: cache last cpu we ran on.
lguest: map Switcher text whenever we allocate a new pagetable.
lguest: don't share Switcher PTE pages between guests.
lguest: expost switcher_pages array (as lg_switcher_pages).
lguest: extract shadow PTE walking / allocating.
lguest: make check_gpte et. al return bool.
lguest: assume Switcher text is a single page.
lguest: rename switcher_page to switcher_pages.
lguest: remove RESERVE_MEM constant.
lguest: check vaddr not pgd for Switcher protection.
lguest: prepare to make SWITCHER_ADDR a variable.
virtio: console: replace EMFILE with EBUSY for already-open port
virtio-scsi: reset virtqueue affinity when doing cpu hotplug
virtio-scsi: introduce multiqueue support
virtio-scsi: push vq lock/unlock into virtscsi_vq_done
virtio-scsi: pass struct virtio_scsi to virtqueue completion function
...
60 files changed, 4481 insertions, 4074 deletions
diff --git a/Documentation/virtual/00-INDEX b/Documentation/virtual/00-INDEX index 924bd462675..e952d30bbf0 100644 --- a/Documentation/virtual/00-INDEX +++ b/Documentation/virtual/00-INDEX @@ -6,6 +6,3 @@ kvm/ - Kernel Virtual Machine. See also http://linux-kvm.org uml/ - User Mode Linux, builds/runs Linux kernel as a userspace program. -virtio.txt - - Text version of draft virtio spec. - See http://ozlabs.org/~rusty/virtio-spec diff --git a/Documentation/virtual/virtio-spec.txt b/Documentation/virtual/virtio-spec.txt deleted file mode 100644 index eb094039b50..00000000000 --- a/Documentation/virtual/virtio-spec.txt +++ /dev/null @@ -1,3210 +0,0 @@ -[Generated file: see http://ozlabs.org/~rusty/virtio-spec/] -Virtio PCI Card Specification -v0.9.5 DRAFT -- - -Rusty Russell <rusty@rustcorp.com.au> IBM Corporation (Editor) - -2012 May 7. - -Purpose and Description - -This document describes the specifications of the “virtio” family -of PCI[LaTeX Command: nomenclature] devices. These are devices -are found in virtual environments[LaTeX Command: nomenclature], -yet by design they are not all that different from physical PCI -devices, and this document treats them as such. This allows the -guest to use standard PCI drivers and discovery mechanisms. - -The purpose of virtio and this specification is that virtual -environments and guests should have a straightforward, efficient, -standard and extensible mechanism for virtual devices, rather -than boutique per-environment or per-OS mechanisms. - - Straightforward: Virtio PCI devices use normal PCI mechanisms - of interrupts and DMA which should be familiar to any device - driver author. There is no exotic page-flipping or COW - mechanism: it's just a PCI device.[footnote: -This lack of page-sharing implies that the implementation of the -device (e.g. the hypervisor or host) needs full access to the -guest memory. Communication with untrusted parties (i.e. -inter-guest communication) requires copying. -] - - Efficient: Virtio PCI devices consist of rings of descriptors - for input and output, which are neatly separated to avoid cache - effects from both guest and device writing to the same cache - lines. - - Standard: Virtio PCI makes no assumptions about the environment - in which it operates, beyond supporting PCI. In fact the virtio - devices specified in the appendices do not require PCI at all: - they have been implemented on non-PCI buses.[footnote: -The Linux implementation further separates the PCI virtio code -from the specific virtio drivers: these drivers are shared with -the non-PCI implementations (currently lguest and S/390). -] - - Extensible: Virtio PCI devices contain feature bits which are - acknowledged by the guest operating system during device setup. - This allows forwards and backwards compatibility: the device - offers all the features it knows about, and the driver - acknowledges those it understands and wishes to use. - - Virtqueues - -The mechanism for bulk data transport on virtio PCI devices is -pretentiously called a virtqueue. Each device can have zero or -more virtqueues: for example, the network device has one for -transmit and one for receive. - -Each virtqueue occupies two or more physically-contiguous pages -(defined, for the purposes of this specification, as 4096 bytes), -and consists of three parts: - - -+-------------------+-----------------------------------+-----------+ -| Descriptor Table | Available Ring (padding) | Used Ring | -+-------------------+-----------------------------------+-----------+ - - -When the driver wants to send a buffer to the device, it fills in -a slot in the descriptor table (or chains several together), and -writes the descriptor index into the available ring. It then -notifies the device. When the device has finished a buffer, it -writes the descriptor into the used ring, and sends an interrupt. - -Specification - - PCI Discovery - -Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 -through 0x103F inclusive is a virtio device[footnote: -The actual value within this range is ignored -]. The device must also have a Revision ID of 0 to match this -specification. - -The Subsystem Device ID indicates which virtio device is -supported by the device. The Subsystem Vendor ID should reflect -the PCI Vendor ID of the environment (it's currently only used -for informational purposes by the guest). - - -+----------------------+--------------------+---------------+ -| Subsystem Device ID | Virtio Device | Specification | -+----------------------+--------------------+---------------+ -+----------------------+--------------------+---------------+ -| 1 | network card | Appendix C | -+----------------------+--------------------+---------------+ -| 2 | block device | Appendix D | -+----------------------+--------------------+---------------+ -| 3 | console | Appendix E | -+----------------------+--------------------+---------------+ -| 4 | entropy source | Appendix F | -+----------------------+--------------------+---------------+ -| 5 | memory ballooning | Appendix G | -+----------------------+--------------------+---------------+ -| 6 | ioMemory | - | -+----------------------+--------------------+---------------+ -| 7 | rpmsg | Appendix H | -+----------------------+--------------------+---------------+ -| 8 | SCSI host | Appendix I | -+----------------------+--------------------+---------------+ -| 9 | 9P transport | - | -+----------------------+--------------------+---------------+ -| 10 | mac80211 wlan | - | -+----------------------+--------------------+---------------+ - - - Device Configuration - -To configure the device, we use the first I/O region of the PCI -device. This contains a virtio header followed by a -device-specific region. - -There may be different widths of accesses to the I/O region; the “ -natural” access method for each field in the virtio header must -be used (i.e. 32-bit accesses for 32-bit fields, etc), but the -device-specific region can be accessed using any width accesses, -and should obtain the same results. - -Note that this is possible because while the virtio header is PCI -(i.e. little) endian, the device-specific region is encoded in -the native endian of the guest (where such distinction is -applicable). - - Device Initialization Sequence<sub:Device-Initialization-Sequence> - -We start with an overview of device initialization, then expand -on the details of the device and how each step is preformed. - - Reset the device. This is not required on initial start up. - - The ACKNOWLEDGE status bit is set: we have noticed the device. - - The DRIVER status bit is set: we know how to drive the device. - - Device-specific setup, including reading the Device Feature - Bits, discovery of virtqueues for the device, optional MSI-X - setup, and reading and possibly writing the virtio - configuration space. - - The subset of Device Feature Bits understood by the driver is - written to the device. - - The DRIVER_OK status bit is set. - - The device can now be used (ie. buffers added to the - virtqueues)[footnote: -Historically, drivers have used the device before steps 5 and 6. -This is only allowed if the driver does not use any features -which would alter this early use of the device. -] - -If any of these steps go irrecoverably wrong, the guest should -set the FAILED status bit to indicate that it has given up on the -device (it can reset the device later to restart if desired). - -We now cover the fields required for general setup in detail. - - Virtio Header - -The virtio header looks as follows: - - -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -| Bits || 32 | 32 | 32 | 16 | 16 | 16 | 8 | 8 | -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -| Read/Write || R | R+W | R+W | R | R+W | R+W | R+W | R | -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -| Purpose || Device | Guest | Queue | Queue | Queue | Queue | Device | ISR | -| || Features bits 0:31 | Features bits 0:31 | Address | Size | Select | Notify | Status | Status | -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ - - -If MSI-X is enabled for the device, two additional fields -immediately follow this header:[footnote: -ie. once you enable MSI-X on the device, the other fields move. -If you turn it off again, they move back! -] - - -+------------++----------------+--------+ -| Bits || 16 | 16 | - +----------------+--------+ -+------------++----------------+--------+ -| Read/Write || R+W | R+W | -+------------++----------------+--------+ -| Purpose || Configuration | Queue | -| (MSI-X) || Vector | Vector | -+------------++----------------+--------+ - - -Immediately following these general headers, there may be -device-specific headers: - - -+------------++--------------------+ -| Bits || Device Specific | - +--------------------+ -+------------++--------------------+ -| Read/Write || Device Specific | -+------------++--------------------+ -| Purpose || Device Specific... | -| || | -+------------++--------------------+ - - - Device Status - -The Device Status field is updated by the guest to indicate its -progress. This provides a simple low-level diagnostic: it's most -useful to imagine them hooked up to traffic lights on the console -indicating the status of each device. - -The device can be reset by writing a 0 to this field, otherwise -at least one bit should be set: - - ACKNOWLEDGE (1) Indicates that the guest OS has found the - device and recognized it as a valid virtio device. - - DRIVER (2) Indicates that the guest OS knows how to drive the - device. Under Linux, drivers can be loadable modules so there - may be a significant (or infinite) delay before setting this - bit. - - DRIVER_OK (4) Indicates that the driver is set up and ready to - drive the device. - - FAILED (128) Indicates that something went wrong in the guest, - and it has given up on the device. This could be an internal - error, or the driver didn't like the device for some reason, or - even a fatal error during device operation. The device must be - reset before attempting to re-initialize. - - Feature Bits<sub:Feature-Bits> - -Thefirst configuration field indicates the features that the -device supports. The bits are allocated as follows: - - 0 to 23 Feature bits for the specific device type - - 24 to 32 Feature bits reserved for extensions to the queue and - feature negotiation mechanisms - -For example, feature bit 0 for a network device (i.e. Subsystem -Device ID 1) indicates that the device supports checksumming of -packets. - -The feature bits are negotiated: the device lists all the -features it understands in the Device Features field, and the -guest writes the subset that it understands into the Guest -Features field. The only way to renegotiate is to reset the -device. - -In particular, new fields in the device configuration header are -indicated by offering a feature bit, so the guest can check -before accessing that part of the configuration space. - -This allows for forwards and backwards compatibility: if the -device is enhanced with a new feature bit, older guests will not -write that feature bit back to the Guest Features field and it -can go into backwards compatibility mode. Similarly, if a guest -is enhanced with a feature that the device doesn't support, it -will not see that feature bit in the Device Features field and -can go into backwards compatibility mode (or, for poor -implementations, set the FAILED Device Status bit). - - Configuration/Queue Vectors - -When MSI-X capability is present and enabled in the device -(through standard PCI configuration space) 4 bytes at byte offset -20 are used to map configuration change and queue interrupts to -MSI-X vectors. In this case, the ISR Status field is unused, and -device specific configuration starts at byte offset 24 in virtio -header structure. When MSI-X capability is not enabled, device -specific configuration starts at byte offset 20 in virtio header. - -Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of -Configuration/Queue Vector registers, maps interrupts triggered -by the configuration change/selected queue events respectively to -the corresponding MSI-X vector. To disable interrupts for a -specific event type, unmap it by writing a special NO_VECTOR -value: - -/* Vector value used to disable MSI for queue */ - -#define VIRTIO_MSI_NO_VECTOR 0xffff - -Reading these registers returns vector mapped to a given event, -or NO_VECTOR if unmapped. All queue and configuration change -events are unmapped by default. - -Note that mapping an event to vector might require allocating -internal device resources, and might fail. Devices report such -failures by returning the NO_VECTOR value when the relevant -Vector field is read. After mapping an event to vector, the -driver must verify success by reading the Vector field value: on -success, the previously written value is returned, and on -failure, NO_VECTOR is returned. If a mapping failure is detected, -the driver can retry mapping with fewervectors, or disable MSI-X. - - Virtqueue Configuration<sec:Virtqueue-Configuration> - -As a device can have zero or more virtqueues for bulk data -transport (for example, the network driver has two), the driver -needs to configure them as part of the device-specific -configuration. - -This is done as follows, for each virtqueue a device has: - - Write the virtqueue index (first queue is 0) to the Queue - Select field. - - Read the virtqueue size from the Queue Size field, which is - always a power of 2. This controls how big the virtqueue is - (see below). If this field is 0, the virtqueue does not exist. - - Allocate and zero virtqueue in contiguous physical memory, on a - 4096 byte alignment. Write the physical address, divided by - 4096 to the Queue Address field.[footnote: -The 4096 is based on the x86 page size, but it's also large -enough to ensure that the separate parts of the virtqueue are on -separate cache lines. -] - - Optionally, if MSI-X capability is present and enabled on the - device, select a vector to use to request interrupts triggered - by virtqueue events. Write the MSI-X Table entry number - corresponding to this vector in Queue Vector field. Read the - Queue Vector field: on success, previously written value is - returned; on failure, NO_VECTOR value is returned. - -The Queue Size field controls the total number of bytes required -for the virtqueue according to the following formula: - -#define ALIGN(x) (((x) + 4095) & ~4095) - -static inline unsigned vring_size(unsigned int qsz) - -{ - - return ALIGN(sizeof(struct vring_desc)*qsz + sizeof(u16)*(2 -+ qsz)) - - + ALIGN(sizeof(struct vring_used_elem)*qsz); - -} - -This currently wastes some space with padding, but also allows -future extensions. The virtqueue layout structure looks like this -(qsz is the Queue Size field, which is a variable, so this code -won't compile): - -struct vring { - - /* The actual descriptors (16 bytes each) */ - - struct vring_desc desc[qsz]; - - - - /* A ring of available descriptor heads with free-running -index. */ - - struct vring_avail avail; - - - - // Padding to the next 4096 boundary. - - char pad[]; - - - - // A ring of used descriptor heads with free-running index. - - struct vring_used used; - -}; - - A Note on Virtqueue Endianness - -Note that the endian of these fields and everything else in the -virtqueue is the native endian of the guest, not little-endian as -PCI normally is. This makes for simpler guest code, and it is -assumed that the host already has to be deeply aware of the guest -endian so such an “endian-aware” device is not a significant -issue. - - Descriptor Table - -The descriptor table refers to the buffers the guest is using for -the device. The addresses are physical addresses, and the buffers -can be chained via the next field. Each descriptor describes a -buffer which is read-only or write-only, but a chain of -descriptors can contain both read-only and write-only buffers. - -No descriptor chain may be more than 2^32 bytes long in total.struct vring_desc { - - /* Address (guest-physical). */ - - u64 addr; - - /* Length. */ - - u32 len; - -/* This marks a buffer as continuing via the next field. */ - -#define VRING_DESC_F_NEXT 1 - -/* This marks a buffer as write-only (otherwise read-only). */ - -#define VRING_DESC_F_WRITE 2 - -/* This means the buffer contains a list of buffer descriptors. -*/ - -#define VRING_DESC_F_INDIRECT 4 - - /* The flags as indicated above. */ - - u16 flags; - - /* Next field if flags & NEXT */ - - u16 next; - -}; - -The number of descriptors in the table is specified by the Queue -Size field for this virtqueue. - - <sub:Indirect-Descriptors>Indirect Descriptors - -Some devices benefit by concurrently dispatching a large number -of large requests. The VIRTIO_RING_F_INDIRECT_DESC feature can be -used to allow this (see [cha:Reserved-Feature-Bits]). To increase -ring capacity it is possible to store a table of indirect -descriptors anywhere in memory, and insert a descriptor in main -virtqueue (with flags&INDIRECT on) that refers to memory buffer -containing this indirect descriptor table; fields addr and len -refer to the indirect table address and length in bytes, -respectively. The indirect table layout structure looks like this -(len is the length of the descriptor that refers to this table, -which is a variable, so this code won't compile): - -struct indirect_descriptor_table { - - /* The actual descriptors (16 bytes each) */ - - struct vring_desc desc[len / 16]; - -}; - -The first indirect descriptor is located at start of the indirect -descriptor table (index 0), additional indirect descriptors are -chained by next field. An indirect descriptor without next field -(with flags&NEXT off) signals the end of the indirect descriptor -table, and transfers control back to the main virtqueue. An -indirect descriptor can not refer to another indirect descriptor -table (flags&INDIRECT must be off). A single indirect descriptor -table can include both read-only and write-only descriptors; -write-only flag (flags&WRITE) in the descriptor that refers to it -is ignored. - - Available Ring - -The available ring refers to what descriptors we are offering the -device: it refers to the head of a descriptor chain. The “flags” -field is currently 0 or 1: 1 indicating that we do not need an -interrupt when the device consumes a descriptor from the -available ring. Alternatively, the guest can ask the device to -delay interrupts until an entry with an index specified by the “ -used_event” field is written in the used ring (equivalently, -until the idx field in the used ring will reach the value -used_event + 1). The method employed by the device is controlled -by the VIRTIO_RING_F_EVENT_IDX feature bit (see [cha:Reserved-Feature-Bits] -). This interrupt suppression is merely an optimization; it may -not suppress interrupts entirely. - -The “idx” field indicates where we would put the next descriptor -entry (modulo the ring size). This starts at 0, and increases. - -struct vring_avail { - -#define VRING_AVAIL_F_NO_INTERRUPT 1 - - u16 flags; - - u16 idx; - - u16 ring[qsz]; /* qsz is the Queue Size field read from device -*/ - - u16 used_event; - -}; - - Used Ring - -The used ring is where the device returns buffers once it is done -with them. The flags field can be used by the device to hint that -no notification is necessary when the guest adds to the available -ring. Alternatively, the “avail_event” field can be used by the -device to hint that no notification is necessary until an entry -with an index specified by the “avail_event” is written in the -available ring (equivalently, until the idx field in the -available ring will reach th |