diff options
60 files changed, 4481 insertions, 4074 deletions
diff --git a/Documentation/virtual/00-INDEX b/Documentation/virtual/00-INDEX index 924bd462675..e952d30bbf0 100644 --- a/Documentation/virtual/00-INDEX +++ b/Documentation/virtual/00-INDEX @@ -6,6 +6,3 @@ kvm/ - Kernel Virtual Machine. See also http://linux-kvm.org uml/ - User Mode Linux, builds/runs Linux kernel as a userspace program. -virtio.txt - - Text version of draft virtio spec. - See http://ozlabs.org/~rusty/virtio-spec diff --git a/Documentation/virtual/virtio-spec.txt b/Documentation/virtual/virtio-spec.txt deleted file mode 100644 index eb094039b50..00000000000 --- a/Documentation/virtual/virtio-spec.txt +++ /dev/null @@ -1,3210 +0,0 @@ -[Generated file: see http://ozlabs.org/~rusty/virtio-spec/] -Virtio PCI Card Specification -v0.9.5 DRAFT -- - -Rusty Russell <rusty@rustcorp.com.au> IBM Corporation (Editor) - -2012 May 7. - -Purpose and Description - -This document describes the specifications of the “virtio” family -of PCI[LaTeX Command: nomenclature] devices. These are devices -are found in virtual environments[LaTeX Command: nomenclature], -yet by design they are not all that different from physical PCI -devices, and this document treats them as such. This allows the -guest to use standard PCI drivers and discovery mechanisms. - -The purpose of virtio and this specification is that virtual -environments and guests should have a straightforward, efficient, -standard and extensible mechanism for virtual devices, rather -than boutique per-environment or per-OS mechanisms. - - Straightforward: Virtio PCI devices use normal PCI mechanisms - of interrupts and DMA which should be familiar to any device - driver author. There is no exotic page-flipping or COW - mechanism: it's just a PCI device.[footnote: -This lack of page-sharing implies that the implementation of the -device (e.g. the hypervisor or host) needs full access to the -guest memory. Communication with untrusted parties (i.e. -inter-guest communication) requires copying. -] - - Efficient: Virtio PCI devices consist of rings of descriptors - for input and output, which are neatly separated to avoid cache - effects from both guest and device writing to the same cache - lines. - - Standard: Virtio PCI makes no assumptions about the environment - in which it operates, beyond supporting PCI. In fact the virtio - devices specified in the appendices do not require PCI at all: - they have been implemented on non-PCI buses.[footnote: -The Linux implementation further separates the PCI virtio code -from the specific virtio drivers: these drivers are shared with -the non-PCI implementations (currently lguest and S/390). -] - - Extensible: Virtio PCI devices contain feature bits which are - acknowledged by the guest operating system during device setup. - This allows forwards and backwards compatibility: the device - offers all the features it knows about, and the driver - acknowledges those it understands and wishes to use. - - Virtqueues - -The mechanism for bulk data transport on virtio PCI devices is -pretentiously called a virtqueue. Each device can have zero or -more virtqueues: for example, the network device has one for -transmit and one for receive. - -Each virtqueue occupies two or more physically-contiguous pages -(defined, for the purposes of this specification, as 4096 bytes), -and consists of three parts: - - -+-------------------+-----------------------------------+-----------+ -| Descriptor Table | Available Ring (padding) | Used Ring | -+-------------------+-----------------------------------+-----------+ - - -When the driver wants to send a buffer to the device, it fills in -a slot in the descriptor table (or chains several together), and -writes the descriptor index into the available ring. It then -notifies the device. When the device has finished a buffer, it -writes the descriptor into the used ring, and sends an interrupt. - -Specification - - PCI Discovery - -Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 -through 0x103F inclusive is a virtio device[footnote: -The actual value within this range is ignored -]. The device must also have a Revision ID of 0 to match this -specification. - -The Subsystem Device ID indicates which virtio device is -supported by the device. The Subsystem Vendor ID should reflect -the PCI Vendor ID of the environment (it's currently only used -for informational purposes by the guest). - - -+----------------------+--------------------+---------------+ -| Subsystem Device ID | Virtio Device | Specification | -+----------------------+--------------------+---------------+ -+----------------------+--------------------+---------------+ -| 1 | network card | Appendix C | -+----------------------+--------------------+---------------+ -| 2 | block device | Appendix D | -+----------------------+--------------------+---------------+ -| 3 | console | Appendix E | -+----------------------+--------------------+---------------+ -| 4 | entropy source | Appendix F | -+----------------------+--------------------+---------------+ -| 5 | memory ballooning | Appendix G | -+----------------------+--------------------+---------------+ -| 6 | ioMemory | - | -+----------------------+--------------------+---------------+ -| 7 | rpmsg | Appendix H | -+----------------------+--------------------+---------------+ -| 8 | SCSI host | Appendix I | -+----------------------+--------------------+---------------+ -| 9 | 9P transport | - | -+----------------------+--------------------+---------------+ -| 10 | mac80211 wlan | - | -+----------------------+--------------------+---------------+ - - - Device Configuration - -To configure the device, we use the first I/O region of the PCI -device. This contains a virtio header followed by a -device-specific region. - -There may be different widths of accesses to the I/O region; the “ -natural” access method for each field in the virtio header must -be used (i.e. 32-bit accesses for 32-bit fields, etc), but the -device-specific region can be accessed using any width accesses, -and should obtain the same results. - -Note that this is possible because while the virtio header is PCI -(i.e. little) endian, the device-specific region is encoded in -the native endian of the guest (where such distinction is -applicable). - - Device Initialization Sequence<sub:Device-Initialization-Sequence> - -We start with an overview of device initialization, then expand -on the details of the device and how each step is preformed. - - Reset the device. This is not required on initial start up. - - The ACKNOWLEDGE status bit is set: we have noticed the device. - - The DRIVER status bit is set: we know how to drive the device. - - Device-specific setup, including reading the Device Feature - Bits, discovery of virtqueues for the device, optional MSI-X - setup, and reading and possibly writing the virtio - configuration space. - - The subset of Device Feature Bits understood by the driver is - written to the device. - - The DRIVER_OK status bit is set. - - The device can now be used (ie. buffers added to the - virtqueues)[footnote: -Historically, drivers have used the device before steps 5 and 6. -This is only allowed if the driver does not use any features -which would alter this early use of the device. -] - -If any of these steps go irrecoverably wrong, the guest should -set the FAILED status bit to indicate that it has given up on the -device (it can reset the device later to restart if desired). - -We now cover the fields required for general setup in detail. - - Virtio Header - -The virtio header looks as follows: - - -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -| Bits || 32 | 32 | 32 | 16 | 16 | 16 | 8 | 8 | -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -| Read/Write || R | R+W | R+W | R | R+W | R+W | R+W | R | -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -| Purpose || Device | Guest | Queue | Queue | Queue | Queue | Device | ISR | -| || Features bits 0:31 | Features bits 0:31 | Address | Size | Select | Notify | Status | Status | -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ - - -If MSI-X is enabled for the device, two additional fields -immediately follow this header:[footnote: -ie. once you enable MSI-X on the device, the other fields move. -If you turn it off again, they move back! -] - - -+------------++----------------+--------+ -| Bits || 16 | 16 | - +----------------+--------+ -+------------++----------------+--------+ -| Read/Write || R+W | R+W | -+------------++----------------+--------+ -| Purpose || Configuration | Queue | -| (MSI-X) || Vector | Vector | -+------------++----------------+--------+ - - -Immediately following these general headers, there may be -device-specific headers: - - -+------------++--------------------+ -| Bits || Device Specific | - +--------------------+ -+------------++--------------------+ -| Read/Write || Device Specific | -+------------++--------------------+ -| Purpose || Device Specific... | -| || | -+------------++--------------------+ - - - Device Status - -The Device Status field is updated by the guest to indicate its -progress. This provides a simple low-level diagnostic: it's most -useful to imagine them hooked up to traffic lights on the console -indicating the status of each device. - -The device can be reset by writing a 0 to this field, otherwise -at least one bit should be set: - - ACKNOWLEDGE (1) Indicates that the guest OS has found the - device and recognized it as a valid virtio device. - - DRIVER (2) Indicates that the guest OS knows how to drive the - device. Under Linux, drivers can be loadable modules so there - may be a significant (or infinite) delay before setting this - bit. - - DRIVER_OK (4) Indicates that the driver is set up and ready to - drive the device. - - FAILED (128) Indicates that something went wrong in the guest, - and it has given up on the device. This could be an internal - error, or the driver didn't like the device for some reason, or - even a fatal error during device operation. The device must be - reset before attempting to re-initialize. - - Feature Bits<sub:Feature-Bits> - -Thefirst configuration field indicates the features that the -device supports. The bits are allocated as follows: - - 0 to 23 Feature bits for the specific device type - - 24 to 32 Feature bits reserved for extensions to the queue and - feature negotiation mechanisms - -For example, feature bit 0 for a network device (i.e. Subsystem -Device ID 1) indicates that the device supports checksumming of -packets. - -The feature bits are negotiated: the device lists all the -features it understands in the Device Features field, and the -guest writes the subset that it understands into the Guest -Features field. The only way to renegotiate is to reset the -device. - -In particular, new fields in the device configuration header are -indicated by offering a feature bit, so the guest can check -before accessing that part of the configuration space. - -This allows for forwards and backwards compatibility: if the -device is enhanced with a new feature bit, older guests will not -write that feature bit back to the Guest Features field and it -can go into backwards compatibility mode. Similarly, if a guest -is enhanced with a feature that the device doesn't support, it -will not see that feature bit in the Device Features field and -can go into backwards compatibility mode (or, for poor -implementations, set the FAILED Device Status bit). - - Configuration/Queue Vectors - -When MSI-X capability is present and enabled in the device -(through standard PCI configuration space) 4 bytes at byte offset -20 are used to map configuration change and queue interrupts to -MSI-X vectors. In this case, the ISR Status field is unused, and -device specific configuration starts at byte offset 24 in virtio -header structure. When MSI-X capability is not enabled, device -specific configuration starts at byte offset 20 in virtio header. - -Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of -Configuration/Queue Vector registers, maps interrupts triggered -by the configuration change/selected queue events respectively to -the corresponding MSI-X vector. To disable interrupts for a -specific event type, unmap it by writing a special NO_VECTOR -value: - -/* Vector value used to disable MSI for queue */ - -#define VIRTIO_MSI_NO_VECTOR 0xffff - -Reading these registers returns vector mapped to a given event, -or NO_VECTOR if unmapped. All queue and configuration change -events are unmapped by default. - -Note that mapping an event to vector might require allocating -internal device resources, and might fail. Devices report such -failures by returning the NO_VECTOR value when the relevant -Vector field is read. After mapping an event to vector, the -driver must verify success by reading the Vector field value: on -success, the previously written value is returned, and on -failure, NO_VECTOR is returned. If a mapping failure is detected, -the driver can retry mapping with fewervectors, or disable MSI-X. - - Virtqueue Configuration<sec:Virtqueue-Configuration> - -As a device can have zero or more virtqueues for bulk data -transport (for example, the network driver has two), the driver -needs to configure them as part of the device-specific -configuration. - -This is done as follows, for each virtqueue a device has: - - Write the virtqueue index (first queue is 0) to the Queue - Select field. - - Read the virtqueue size from the Queue Size field, which is - always a power of 2. This controls how big the virtqueue is - (see below). If this field is 0, the virtqueue does not exist. - - Allocate and zero virtqueue in contiguous physical memory, on a - 4096 byte alignment. Write the physical address, divided by - 4096 to the Queue Address field.[footnote: -The 4096 is based on the x86 page size, but it's also large -enough to ensure that the separate parts of the virtqueue are on -separate cache lines. -] - - Optionally, if MSI-X capability is present and enabled on the - device, select a vector to use to request interrupts triggered - by virtqueue events. Write the MSI-X Table entry number - corresponding to this vector in Queue Vector field. Read the - Queue Vector field: on success, previously written value is - returned; on failure, NO_VECTOR value is returned. - -The Queue Size field controls the total number of bytes required -for the virtqueue according to the following formula: - -#define ALIGN(x) (((x) + 4095) & ~4095) - -static inline unsigned vring_size(unsigned int qsz) - -{ - - return ALIGN(sizeof(struct vring_desc)*qsz + sizeof(u16)*(2 -+ qsz)) - - + ALIGN(sizeof(struct vring_used_elem)*qsz); - -} - -This currently wastes some space with padding, but also allows -future extensions. The virtqueue layout structure looks like this -(qsz is the Queue Size field, which is a variable, so this code -won't compile): - -struct vring { - - /* The actual descriptors (16 bytes each) */ - - struct vring_desc desc[qsz]; - - - - /* A ring of available descriptor heads with free-running -index. */ - - struct vring_avail avail; - - - - // Padding to the next 4096 boundary. - - char pad[]; - - - - // A ring of used descriptor heads with free-running index. - - struct vring_used used; - -}; - - A Note on Virtqueue Endianness - -Note that the endian of these fields and everything else in the -virtqueue is the native endian of the guest, not little-endian as -PCI normally is. This makes for simpler guest code, and it is -assumed that the host already has to be deeply aware of the guest -endian so such an “endian-aware” device is not a significant -issue. - - Descriptor Table - -The descriptor table refers to the buffers the guest is using for -the device. The addresses are physical addresses, and the buffers -can be chained via the next field. Each descriptor describes a -buffer which is read-only or write-only, but a chain of -descriptors can contain both read-only and write-only buffers. - -No descriptor chain may be more than 2^32 bytes long in total.struct vring_desc { - - /* Address (guest-physical). */ - - u64 addr; - - /* Length. */ - - u32 len; - -/* This marks a buffer as continuing via the next field. */ - -#define VRING_DESC_F_NEXT 1 - -/* This marks a buffer as write-only (otherwise read-only). */ - -#define VRING_DESC_F_WRITE 2 - -/* This means the buffer contains a list of buffer descriptors. -*/ - -#define VRING_DESC_F_INDIRECT 4 - - /* The flags as indicated above. */ - - u16 flags; - - /* Next field if flags & NEXT */ - - u16 next; - -}; - -The number of descriptors in the table is specified by the Queue -Size field for this virtqueue. - - <sub:Indirect-Descriptors>Indirect Descriptors - -Some devices benefit by concurrently dispatching a large number -of large requests. The VIRTIO_RING_F_INDIRECT_DESC feature can be -used to allow this (see [cha:Reserved-Feature-Bits]). To increase -ring capacity it is possible to store a table of indirect -descriptors anywhere in memory, and insert a descriptor in main -virtqueue (with flags&INDIRECT on) that refers to memory buffer -containing this indirect descriptor table; fields addr and len -refer to the indirect table address and length in bytes, -respectively. The indirect table layout structure looks like this -(len is the length of the descriptor that refers to this table, -which is a variable, so this code won't compile): - -struct indirect_descriptor_table { - - /* The actual descriptors (16 bytes each) */ - - struct vring_desc desc[len / 16]; - -}; - -The first indirect descriptor is located at start of the indirect -descriptor table (index 0), additional indirect descriptors are -chained by next field. An indirect descriptor without next field -(with flags&NEXT off) signals the end of the indirect descriptor -table, and transfers control back to the main virtqueue. An -indirect descriptor can not refer to another indirect descriptor -table (flags&INDIRECT must be off). A single indirect descriptor -table can include both read-only and write-only descriptors; -write-only flag (flags&WRITE) in the descriptor that refers to it -is ignored. - - Available Ring - -The available ring refers to what descriptors we are offering the -device: it refers to the head of a descriptor chain. The “flags” -field is currently 0 or 1: 1 indicating that we do not need an -interrupt when the device consumes a descriptor from the -available ring. Alternatively, the guest can ask the device to -delay interrupts until an entry with an index specified by the “ -used_event” field is written in the used ring (equivalently, -until the idx field in the used ring will reach the value -used_event + 1). The method employed by the device is controlled -by the VIRTIO_RING_F_EVENT_IDX feature bit (see [cha:Reserved-Feature-Bits] -). This interrupt suppression is merely an optimization; it may -not suppress interrupts entirely. - -The “idx” field indicates where we would put the next descriptor -entry (modulo the ring size). This starts at 0, and increases. - -struct vring_avail { - -#define VRING_AVAIL_F_NO_INTERRUPT 1 - - u16 flags; - - u16 idx; - - u16 ring[qsz]; /* qsz is the Queue Size field read from device -*/ - - u16 used_event; - -}; - - Used Ring - -The used ring is where the device returns buffers once it is done -with them. The flags field can be used by the device to hint that -no notification is necessary when the guest adds to the available -ring. Alternatively, the “avail_event” field can be used by the -device to hint that no notification is necessary until an entry -with an index specified by the “avail_event” is written in the -available ring (equivalently, until the idx field in the -available ring will reach the value avail_event + 1). The method -employed by the device is controlled by the guest through the -VIRTIO_RING_F_EVENT_IDX feature bit (see [cha:Reserved-Feature-Bits] -). [footnote: -These fields are kept here because this is the only part of the -virtqueue written by the device -]. - -Each entry in the ring is a pair: the head entry of the -descriptor chain describing the buffer (this matches an entry -placed in the available ring by the guest earlier), and the total -of bytes written into the buffer. The latter is extremely useful -for guests using untrusted buffers: if you do not know exactly -how much has been written by the device, you usually have to zero -the buffer to ensure no data leakage occurs. - -/* u32 is used here for ids for padding reasons. */ - -struct vring_used_elem { - - /* Index of start of used descriptor chain. */ - - u32 id; - - /* Total length of the descriptor chain which was used -(written to) */ - - u32 len; - -}; - - - -struct vring_used { - -#define VRING_USED_F_NO_NOTIFY 1 - - u16 flags; - - u16 idx; - - struct vring_used_elem ring[qsz]; - - u16 avail_event; - -}; - - Helpers for Managing Virtqueues - -The Linux Kernel Source code contains the definitions above and -helper routines in a more usable form, in -include/linux/virtio_ring.h. This was explicitly licensed by IBM -and Red Hat under the (3-clause) BSD license so that it can be -freely used by all other projects, and is reproduced (with slight -variation to remove Linux assumptions) in Appendix A. - - Device Operation<sec:Device-Operation> - -There are two parts to device operation: supplying new buffers to -the dev |