diff options
Diffstat (limited to 'Documentation/virtual/virtio-spec.txt')
| -rw-r--r-- | Documentation/virtual/virtio-spec.txt | 2200 |
1 files changed, 0 insertions, 2200 deletions
diff --git a/Documentation/virtual/virtio-spec.txt b/Documentation/virtual/virtio-spec.txt deleted file mode 100644 index a350ae135b8..00000000000 --- a/Documentation/virtual/virtio-spec.txt +++ /dev/null @@ -1,2200 +0,0 @@ -[Generated file: see http://ozlabs.org/~rusty/virtio-spec/] -Virtio PCI Card Specification -v0.9.1 DRAFT -- - -Rusty Russell <rusty@rustcorp.com.au>IBM Corporation (Editor) - -2011 August 1. - -Purpose and Description - -This document describes the specifications of the “virtio” family -of PCI[LaTeX Command: nomenclature] devices. These are devices -are found in virtual environments[LaTeX Command: nomenclature], -yet by design they are not all that different from physical PCI -devices, and this document treats them as such. This allows the -guest to use standard PCI drivers and discovery mechanisms. - -The purpose of virtio and this specification is that virtual -environments and guests should have a straightforward, efficient, -standard and extensible mechanism for virtual devices, rather -than boutique per-environment or per-OS mechanisms. - - Straightforward: Virtio PCI devices use normal PCI mechanisms - of interrupts and DMA which should be familiar to any device - driver author. There is no exotic page-flipping or COW - mechanism: it's just a PCI device.[footnote: -This lack of page-sharing implies that the implementation of the -device (e.g. the hypervisor or host) needs full access to the -guest memory. Communication with untrusted parties (i.e. -inter-guest communication) requires copying. -] - - Efficient: Virtio PCI devices consist of rings of descriptors - for input and output, which are neatly separated to avoid cache - effects from both guest and device writing to the same cache - lines. - - Standard: Virtio PCI makes no assumptions about the environment - in which it operates, beyond supporting PCI. In fact the virtio - devices specified in the appendices do not require PCI at all: - they have been implemented on non-PCI buses.[footnote: -The Linux implementation further separates the PCI virtio code -from the specific virtio drivers: these drivers are shared with -the non-PCI implementations (currently lguest and S/390). -] - - Extensible: Virtio PCI devices contain feature bits which are - acknowledged by the guest operating system during device setup. - This allows forwards and backwards compatibility: the device - offers all the features it knows about, and the driver - acknowledges those it understands and wishes to use. - - Virtqueues - -The mechanism for bulk data transport on virtio PCI devices is -pretentiously called a virtqueue. Each device can have zero or -more virtqueues: for example, the network device has one for -transmit and one for receive. - -Each virtqueue occupies two or more physically-contiguous pages -(defined, for the purposes of this specification, as 4096 bytes), -and consists of three parts: - - -+-------------------+-----------------------------------+-----------+ -| Descriptor Table | Available Ring (padding) | Used Ring | -+-------------------+-----------------------------------+-----------+ - - -When the driver wants to send buffers to the device, it puts them -in one or more slots in the descriptor table, and writes the -descriptor indices into the available ring. It then notifies the -device. When the device has finished with the buffers, it writes -the descriptors into the used ring, and sends an interrupt. - -Specification - - PCI Discovery - -Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 -through 0x103F inclusive is a virtio device[footnote: -The actual value within this range is ignored -]. The device must also have a Revision ID of 0 to match this -specification. - -The Subsystem Device ID indicates which virtio device is -supported by the device. The Subsystem Vendor ID should reflect -the PCI Vendor ID of the environment (it's currently only used -for informational purposes by the guest). - - -+----------------------+--------------------+---------------+ -| Subsystem Device ID | Virtio Device | Specification | -+----------------------+--------------------+---------------+ -+----------------------+--------------------+---------------+ -| 1 | network card | Appendix C | -+----------------------+--------------------+---------------+ -| 2 | block device | Appendix D | -+----------------------+--------------------+---------------+ -| 3 | console | Appendix E | -+----------------------+--------------------+---------------+ -| 4 | entropy source | Appendix F | -+----------------------+--------------------+---------------+ -| 5 | memory ballooning | Appendix G | -+----------------------+--------------------+---------------+ -| 6 | ioMemory | - | -+----------------------+--------------------+---------------+ -| 9 | 9P transport | - | -+----------------------+--------------------+---------------+ - - - Device Configuration - -To configure the device, we use the first I/O region of the PCI -device. This contains a virtio header followed by a -device-specific region. - -There may be different widths of accesses to the I/O region; the “ -natural” access method for each field in the virtio header must -be used (i.e. 32-bit accesses for 32-bit fields, etc), but the -device-specific region can be accessed using any width accesses, -and should obtain the same results. - -Note that this is possible because while the virtio header is PCI -(i.e. little) endian, the device-specific region is encoded in -the native endian of the guest (where such distinction is -applicable). - - Device Initialization Sequence - -We start with an overview of device initialization, then expand -on the details of the device and how each step is preformed. - - Reset the device. This is not required on initial start up. - - The ACKNOWLEDGE status bit is set: we have noticed the device. - - The DRIVER status bit is set: we know how to drive the device. - - Device-specific setup, including reading the Device Feature - Bits, discovery of virtqueues for the device, optional MSI-X - setup, and reading and possibly writing the virtio - configuration space. - - The subset of Device Feature Bits understood by the driver is - written to the device. - - The DRIVER_OK status bit is set. - - The device can now be used (ie. buffers added to the - virtqueues)[footnote: -Historically, drivers have used the device before steps 5 and 6. -This is only allowed if the driver does not use any features -which would alter this early use of the device. -] - -If any of these steps go irrecoverably wrong, the guest should -set the FAILED status bit to indicate that it has given up on the -device (it can reset the device later to restart if desired). - -We now cover the fields required for general setup in detail. - - Virtio Header - -The virtio header looks as follows: - - -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -| Bits || 32 | 32 | 32 | 16 | 16 | 16 | 8 | 8 | -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -| Read/Write || R | R+W | R+W | R | R+W | R+W | R+W | R | -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -| Purpose || Device | Guest | Queue | Queue | Queue | Queue | Device | ISR | -| || Features bits 0:31 | Features bits 0:31 | Address | Size | Select | Notify | Status | Status | -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ - - -If MSI-X is enabled for the device, two additional fields -immediately follow this header: - - -+------------++----------------+--------+ -| Bits || 16 | 16 | - +----------------+--------+ -+------------++----------------+--------+ -| Read/Write || R+W | R+W | -+------------++----------------+--------+ -| Purpose || Configuration | Queue | -| (MSI-X) || Vector | Vector | -+------------++----------------+--------+ - - -Finally, if feature bits (VIRTIO_F_FEATURES_HI) this is -immediately followed by two additional fields: - - -+------------++----------------------+---------------------- -| Bits || 32 | 32 -+------------++----------------------+---------------------- -| Read/Write || R | R+W -+------------++----------------------+---------------------- -| Purpose || Device | Guest -| || Features bits 32:63 | Features bits 32:63 -+------------++----------------------+---------------------- - - -Immediately following these general headers, there may be -device-specific headers: - - -+------------++--------------------+ -| Bits || Device Specific | - +--------------------+ -+------------++--------------------+ -| Read/Write || Device Specific | -+------------++--------------------+ -| Purpose || Device Specific... | -| || | -+------------++--------------------+ - - - Device Status - -The Device Status field is updated by the guest to indicate its -progress. This provides a simple low-level diagnostic: it's most -useful to imagine them hooked up to traffic lights on the console -indicating the status of each device. - -The device can be reset by writing a 0 to this field, otherwise -at least one bit should be set: - - ACKNOWLEDGE (1) Indicates that the guest OS has found the - device and recognized it as a valid virtio device. - - DRIVER (2) Indicates that the guest OS knows how to drive the - device. Under Linux, drivers can be loadable modules so there - may be a significant (or infinite) delay before setting this - bit. - - DRIVER_OK (3) Indicates that the driver is set up and ready to - drive the device. - - FAILED (8) Indicates that something went wrong in the guest, - and it has given up on the device. This could be an internal - error, or the driver didn't like the device for some reason, or - even a fatal error during device operation. The device must be - reset before attempting to re-initialize. - - Feature Bits - -The least significant 31 bits of the first configuration field -indicates the features that the device supports (the high bit is -reserved, and will be used to indicate the presence of future -feature bits elsewhere). If more than 31 feature bits are -supported, the device indicates so by setting feature bit 31 (see -[cha:Reserved-Feature-Bits]). The bits are allocated as follows: - - 0 to 23 Feature bits for the specific device type - - 24 to 40 Feature bits reserved for extensions to the queue and - feature negotiation mechanisms - - 41 to 63 Feature bits reserved for future extensions - -For example, feature bit 0 for a network device (i.e. Subsystem -Device ID 1) indicates that the device supports checksumming of -packets. - -The feature bits are negotiated: the device lists all the -features it understands in the Device Features field, and the -guest writes the subset that it understands into the Guest -Features field. The only way to renegotiate is to reset the -device. - -In particular, new fields in the device configuration header are -indicated by offering a feature bit, so the guest can check -before accessing that part of the configuration space. - -This allows for forwards and backwards compatibility: if the -device is enhanced with a new feature bit, older guests will not -write that feature bit back to the Guest Features field and it -can go into backwards compatibility mode. Similarly, if a guest -is enhanced with a feature that the device doesn't support, it -will not see that feature bit in the Device Features field and -can go into backwards compatibility mode (or, for poor -implementations, set the FAILED Device Status bit). - -Access to feature bits 32 to 63 is enabled by Guest by setting -feature bit 31. If this bit is unset, Device must assume that all -feature bits > 31 are unset. - - Configuration/Queue Vectors - -When MSI-X capability is present and enabled in the device -(through standard PCI configuration space) 4 bytes at byte offset -20 are used to map configuration change and queue interrupts to -MSI-X vectors. In this case, the ISR Status field is unused, and -device specific configuration starts at byte offset 24 in virtio -header structure. When MSI-X capability is not enabled, device -specific configuration starts at byte offset 20 in virtio header. - -Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of -Configuration/Queue Vector registers, maps interrupts triggered -by the configuration change/selected queue events respectively to -the corresponding MSI-X vector. To disable interrupts for a -specific event type, unmap it by writing a special NO_VECTOR -value: - -/* Vector value used to disable MSI for queue */ - -#define VIRTIO_MSI_NO_VECTOR 0xffff - -Reading these registers returns vector mapped to a given event, -or NO_VECTOR if unmapped. All queue and configuration change -events are unmapped by default. - -Note that mapping an event to vector might require allocating -internal device resources, and might fail. Devices report such -failures by returning the NO_VECTOR value when the relevant -Vector field is read. After mapping an event to vector, the -driver must verify success by reading the Vector field value: on -success, the previously written value is returned, and on -failure, NO_VECTOR is returned. If a mapping failure is detected, -the driver can retry mapping with fewervectors, or disable MSI-X. - - Virtqueue Configuration - -As a device can have zero or more virtqueues for bulk data -transport (for example, the network driver has two), the driver -needs to configure them as part of the device-specific -configuration. - -This is done as follows, for each virtqueue a device has: - - Write the virtqueue index (first queue is 0) to the Queue - Select field. - - Read the virtqueue size from the Queue Size field, which is - always a power of 2. This controls how big the virtqueue is - (see below). If this field is 0, the virtqueue does not exist. - - Allocate and zero virtqueue in contiguous physical memory, on a - 4096 byte alignment. Write the physical address, divided by - 4096 to the Queue Address field.[footnote: -The 4096 is based on the x86 page size, but it's also large -enough to ensure that the separate parts of the virtqueue are on -separate cache lines. -] - - Optionally, if MSI-X capability is present and enabled on the - device, select a vector to use to request interrupts triggered - by virtqueue events. Write the MSI-X Table entry number - corresponding to this vector in Queue Vector field. Read the - Queue Vector field: on success, previously written value is - returned; on failure, NO_VECTOR value is returned. - -The Queue Size field controls the total number of bytes required -for the virtqueue according to the following formula: - -#define ALIGN(x) (((x) + 4095) & ~4095) - -static inline unsigned vring_size(unsigned int qsz) - -{ - - return ALIGN(sizeof(struct vring_desc)*qsz + sizeof(u16)*(2 -+ qsz)) - - + ALIGN(sizeof(struct vring_used_elem)*qsz); - -} - -This currently wastes some space with padding, but also allows -future extensions. The virtqueue layout structure looks like this -(qsz is the Queue Size field, which is a variable, so this code -won't compile): - -struct vring { - - /* The actual descriptors (16 bytes each) */ - - struct vring_desc desc[qsz]; - - - - /* A ring of available descriptor heads with free-running -index. */ - - struct vring_avail avail; - - - - // Padding to the next 4096 boundary. - - char pad[]; - - - - // A ring of used descriptor heads with free-running index. - - struct vring_used used; - -}; - - A Note on Virtqueue Endianness - -Note that the endian of these fields and everything else in the -virtqueue is the native endian of the guest, not little-endian as -PCI normally is. This makes for simpler guest code, and it is -assumed that the host already has to be deeply aware of the guest -endian so such an “endian-aware” device is not a significant -issue. - - Descriptor Table - -The descriptor table refers to the buffers the guest is using for -the device. The addresses are physical addresses, and the buffers -can be chained via the next field. Each descriptor describes a -buffer which is read-only or write-only, but a chain of -descriptors can contain both read-only and write-only buffers. - -No descriptor chain may be more than 2^32 bytes long in total.struct vring_desc { - - /* Address (guest-physical). */ - - u64 addr; - - /* Length. */ - - u32 len; - -/* This marks a buffer as continuing via the next field. */ - -#define VRING_DESC_F_NEXT 1 - -/* This marks a buffer as write-only (otherwise read-only). */ - -#define VRING_DESC_F_WRITE 2 - -/* This means the buffer contains a list of buffer descriptors. -*/ - -#define VRING_DESC_F_INDIRECT 4 - - /* The flags as indicated above. */ - - u16 flags; - - /* Next field if flags & NEXT */ - - u16 next; - -}; - -The number of descriptors in the table is specified by the Queue -Size field for this virtqueue. - - <sub:Indirect-Descriptors>Indirect Descriptors - -Some devices benefit by concurrently dispatching a large number -of large requests. The VIRTIO_RING_F_INDIRECT_DESC feature can be -used to allow this (see [cha:Reserved-Feature-Bits]). To increase -ring capacity it is possible to store a table of indirect -descriptors anywhere in memory, and insert a descriptor in main -virtqueue (with flags&INDIRECT on) that refers to memory buffer -containing this indirect descriptor table; fields addr and len -refer to the indirect table address and length in bytes, -respectively. The indirect table layout structure looks like this -(len is the length of the descriptor that refers to this table, -which is a variable, so this code won't compile): - -struct indirect_descriptor_table { - - /* The actual descriptors (16 bytes each) */ - - struct vring_desc desc[len / 16]; - -}; - -The first indirect descriptor is located at start of the indirect -descriptor table (index 0), additional indirect descriptors are -chained by next field. An indirect descriptor without next field -(with flags&NEXT off) signals the end of the indirect descriptor -table, and transfers control back to the main virtqueue. An -indirect descriptor can not refer to another indirect descriptor -table (flags&INDIRECT must be off). A single indirect descriptor -table can include both read-only and write-only descriptors; -write-only flag (flags&WRITE) in the descriptor that refers to it -is ignored. - - Available Ring - -The available ring refers to what descriptors we are offering the -device: it refers to the head of a descriptor chain. The “flags” -field is currently 0 or 1: 1 indicating that we do not need an -interrupt when the device consumes a descriptor from the -available ring. Alternatively, the guest can ask the device to -delay interrupts until an entry with an index specified by the “ -used_event” field is written in the used ring (equivalently, -until the idx field in the used ring will reach the value -used_event + 1). The method employed by the device is controlled -by the VIRTIO_RING_F_EVENT_IDX feature bit (see [cha:Reserved-Feature-Bits] -). This interrupt suppression is merely an optimization; it may -not suppress interrupts entirely. - -The “idx” field indicates where we would put the next descriptor -entry (modulo the ring size). This starts at 0, and increases. - -struct vring_avail { - -#define VRING_AVAIL_F_NO_INTERRUPT 1 - - u16 flags; - - u16 idx; - - u16 ring[qsz]; /* qsz is the Queue Size field read from device -*/ - - u16 used_event; - -}; - - Used Ring - -The used ring is where the device returns buffers once it is done -with them. The flags field can be used by the device to hint that -no notification is necessary when the guest adds to the available -ring. Alternatively, the “avail_event” field can be used by the -device to hint that no notification is necessary until an entry -with an index specified by the “avail_event” is written in the -available ring (equivalently, until the idx field in the -available ring will reach the value avail_event + 1). The method -employed by the device is controlled by the guest through the -VIRTIO_RING_F_EVENT_IDX feature bit (see [cha:Reserved-Feature-Bits] -). [footnote: -These fields are kept here because this is the only part of the -virtqueue written by the device -]. - -Each entry in the ring is a pair: the head entry of the -descriptor chain describing the buffer (this matches an entry -placed in the available ring by the guest earlier), and the total -of bytes written into the buffer. The latter is extremely useful -for guests using untrusted buffers: if you do not know exactly -how much has been written by the device, you usually have to zero -the buffer to ensure no data leakage occurs. - -/* u32 is used here for ids for padding reasons. */ - -struct vring_used_elem { - - /* Index of start of used descriptor chain. */ - - u32 id; - - /* Total length of the descriptor chain which was used -(written to) */ - - u32 len; - -}; - - - -struct vring_used { - -#define VRING_USED_F_NO_NOTIFY 1 - - u16 flags; - - u16 idx; - - struct vring_used_elem ring[qsz]; - - u16 avail_event; - -}; - - Helpers for Managing Virtqueues - -The Linux Kernel Source code contains the definitions above and -helper routines in a more usable form, in -include/linux/virtio_ring.h. This was explicitly licensed by IBM -and Red Hat under the (3-clause) BSD license so that it can be -freely used by all other projects, and is reproduced (with slight -variation to remove Linux assumptions) in Appendix A. - - Device Operation - -There are two parts to device operation: supplying new buffers to -the device, and processing used buffers from the device. As an -example, the virtio network device has two virtqueues: the -transmit virtqueue and the receive virtqueue. The driver adds -outgoing (read-only) packets to the transmit virtqueue, and then -frees them after they are used. Similarly, incoming (write-only) -buffers are added to the receive virtqueue, and processed after -they are used. - - Supplying Buffers to The Device - -Actual transfer of buffers from the guest OS to the device -operates as follows: - - Place the buffer(s) into free descriptor(s). - - If there are no free descriptors, the guest may choose to - notify the device even if notifications are suppressed (to - reduce latency).[footnote: -The Linux drivers do this only for read-only buffers: for -write-only buffers, it is assumed that the driver is merely -trying to keep the receive buffer ring full, and no notification -of this expected condition is necessary. -] - - Place the id of the buffer in the next ring entry of the - available ring. - - The steps (1) and (2) may be performed repeatedly if batching - is possible. - - A memory barrier should be executed to ensure the device sees - the updated descriptor table and available ring before the next - step. - - The available “idx” field should be increased by the number of - entries added to the available ring. - - A memory barrier should be executed to ensure that we update - the idx field before checking for notification suppression. - - If notifications are not suppressed, the device should be - notified of the new buffers. - -Note that the above code does not take precautions against the -available ring buffer wrapping around: this is not possible since -the ring buffer is the same size as the descriptor table, so step -(1) will prevent such a condition. - -In addition, the maximum queue size is 32768 (it must be a power -of 2 which fits in 16 bits), so the 16-bit “idx” value can always -distinguish between a full and empty buffer. - -Here is a description of each stage in more detail. - - Placing Buffers Into The Descriptor Table - -A buffer consists of zero or more read-only physically-contiguous -elements followed by zero or more physically-contiguous -write-only elements (it must have at least one element). This -algorithm maps it into the descriptor table: - - for each buffer element, b: - - Get the next free descriptor table entry, d - - Set d.addr to the physical address of the start of b - - Set d.len to the length of b. - - If b is write-only, set d.flags to VRING_DESC_F_WRITE, - otherwise 0. - - If there is a buffer element after this: - - Set d.next to the index of the next free descriptor element. - - Set the VRING_DESC_F_NEXT bit in d.flags. - -In practice, the d.next fields are usually used to chain free -descriptors, and a separate count kept to check there are enough -free descriptors before beginning the mappings. - - Updating The Available Ring - -The head of the buffer we mapped is the first d in the algorithm -above. A naive implementation would do the following: - -avail->ring[avail->idx % qsz] = head; - -However, in general we can add many descriptors before we update -the “idx” field (at which point they become visible to the -device), so we keep a counter of how many we've added: - -avail->ring[(avail->idx + added++) % qsz] = head; - - Updating The Index Field - -Once the idx field of the virtqueue is updated, the device will -be able to access the descriptor entries we've created and the -memory they refer to. This is why a memory barrier is generally -used before the idx update, to ensure it sees the most up-to-date -copy. - -The idx field always increments, and we let it wrap naturally at -65536: - -avail->idx += added; - - <sub:Notifying-The-Device>Notifying The Device - -Device notification occurs by writing the 16-bit virtqueue index -of this virtqueue to the Queue Notify field of the virtio header -in the first I/O region of the PCI device. This can be expensive, -however, so the device can suppress such notifications if it -doesn't need them. We have to be careful to expose the new idx -value before checking the suppression flag: it's OK to notify -gratuitously, but not to omit a required notification. So again, -we use a memory barrier here before reading the flags or the -avail_event field. - -If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated, and if -the VRING_USED_F_NOTIFY flag is not set, we go ahead and write to -the PCI configuration space. - -If the VIRTIO_F_RING_EVENT_IDX feature is negotiated, we read the -avail_event field in the available ring structure. If the -available index crossed_the avail_event field value since the -last notification, we go ahead and write to the PCI configuration -space. The avail_event field wraps naturally at 65536 as well: - -(u16)(new_idx - avail_event - 1) < (u16)(new_idx - old_idx) - - <sub:Receiving-Used-Buffers>Receiving Used Buffers From The - Device - -Once the device has used a buffer (read from or written to it, or -parts of both, depending on the nature of the virtqueue and the -device), it sends an interrupt, following an algorithm very -similar to the algorithm used for the driver to send the device a -buffer: - - Write the head descriptor number to the next field in the used - ring. - - Update the used ring idx. - - Determine whether an interrupt is necessary: - - If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated: check - if f the VRING_AVAIL_F_NO_INTERRUPT flag is not set in avail- - >flags - - If the VIRTIO_F_RING_EVENT_IDX feature is negotiated: check - whether the used index crossed the used_event field value - since the last update. The used_event field wraps naturally - at 65536 as well:(u16)(new_idx - used_event - 1) < (u16)(new_idx - old_idx) - - If an interrupt is necessary: - - If MSI-X capability is disabled: - - Set the lower bit of the ISR Status field for the device. - - Send the appropriate PCI interrupt for the device. - - If MSI-X capability is enabled: - - Request the appropriate MSI-X interrupt message for the - device, Queue Vector field sets the MSI-X Table entry - number. - - If Queue Vector field value is NO_VECTOR, no interrupt - message is requested for this event. - -The guest interrupt handler should: - - If MSI-X capability is disabled: read the ISR Status field, - which will reset it to zero. If the lower bit is zero, the - interrupt was not for this device. Otherwise, the guest driver - should look through the used rings of each virtqueue for the - device, to see if any progress has been made by the device - which requires servicing. - - If MSI-X capability is enabled: look through the used rings of - each virtqueue mapped to the specific MSI-X vector for the - device, to see if any progress has been made by the device - which requires servicing. - -For each ring, guest should then disable interrupts by writing -VRING_AVAIL_F_NO_INTERRUPT flag in avail structure, if required. -It can then process used ring entries finally enabling interrupts -by clearing the VRING_AVAIL_F_NO_INTERRUPT flag or updating the -EVENT_IDX field in the available structure, Guest should then -execute a memory barrier, and then recheck the ring empty -condition. This is necessary to handle the case where, after the -last check and before enabling interrupts, an interrupt has been -suppressed by the device: - -vring_disable_interrupts(vq); - -for (;;) { - - if (vq->last_seen_used != vring->used.idx) { - - vring_enable_interrupts(vq); - - mb(); - - if (vq->last_seen_used != vring->used.idx) - - break; - - } - - struct vring_used_elem *e = -vring.used->ring[vq->last_seen_used%vsz]; - - process_buffer(e); - - vq->last_seen_used++; - -} - - Dealing With Configuration Changes - -Some virtio PCI devices can change the device configuration -state, as reflected in the virtio header in the PCI configuration -space. In this case: - - If MSI-X capability is disabled: an interrupt is delivered and - the second highest bit is set in the ISR Status field to - indicate that the driver should re-examine the configuration - space.Note that a single interrupt can indicate both that one - or more virtqueue has been used and that the configuration - space has changed: even if the config bit is set, virtqueues - must be scanned. - - If MSI-X capability is enabled: an interrupt message is - requested. The Configuration Vector field sets the MSI-X Table - entry number to use. If Configuration Vector field value is - NO_VECTOR, no interrupt message is requested for this event. - -Creating New Device Types - -Various considerations are necessary when creating a new device -type: - - How Many Virtqueues? - -It is possible that a very simple device will operate entirely -through its configuration space, but most will need at least one -virtqueue in which it will place requests. A device with both -input and output (eg. console and network devices described here) -need two queues: one which the driver fills with buffers to -receive input, and one which the driver places buffers to -transmit output. - - What Configuration Space Layout? - -Configuration space is generally used for rarely-changing or -initialization-time parameters. But it is a limited resource, so -it might be better to use a virtqueue to update configuration -information (the network device does this for filtering, -otherwise the table in the config space could potentially be very -large). - -Note that this space is generally the guest's native endian, -rather than PCI's little-endian. - - What Device Number? - -Currently device numbers are assigned quite freely: a simple -request mail to the author of this document or the Linux -virtualization mailing list[footnote: - -https://lists.linux-foundation.org/mailman/listinfo/virtualization -] will be sufficient to secure a unique one. - -Meanwhile for experimental drivers, use 65535 and work backwards. - - How many MSI-X vectors? - -Using the optional MSI-X capability devices can speed up -interrupt processing by removing the need to read ISR Status -register by guest driver (which might be an expensive operation), -reducing interrupt sharing between devices and queues within the -device, and handling interrupts from multiple CPUs. However, some -systems impose a limit (which might be as low as 256) on the -total number of MSI-X vectors that can be allocated to all -devices. Devices and/or device drivers should take this into -account, limiting the number of vectors used unless the device is -expected to cause a high volume of interrupts. Devices can -control the number of vectors used by limiting the MSI-X Table -Size or not presenting MSI-X capability in PCI configuration -space. Drivers can control this by mapping events to as small -number of vectors as possible, or disabling MSI-X capability -altogether. - - Message Framing - -The descriptors used for a buffer should not effect the semantics -of the message, except for the total length of the buffer. For -example, a network buffer consists of a 10 byte header followed -by the network packet. Whether this is presented in the ring -descriptor chain as (say) a 10 byte buffer and a 1514 byte -buffer, or a single 1524 byte buffer, or even three buffers, -should have no effect. - -In particular, no implementation should use the descriptor -boundaries to determine the size of any header in a request.[footnote: -The current qemu device implementations mistakenly insist that -the first descriptor cover the header in these cases exactly, so -a cautious driver should arrange it so. -] - - Device Improvements - -Any change to configuration space, or new virtqueues, or -behavioural changes, should be indicated by negotiation of a new -feature bit. This establishes clarity[footnote: -Even if it does mean documenting design or implementation -mistakes! -] and avoids future expansion problems. - -Clusters of functionality which are always implemented together -can use a single bit, but if one feature makes sense without the -others they should not be gratuitously grouped together to -conserve feature bits. We can always extend the spec when the -first person needs more than 24 feature bits for their device. - -[LaTeX Command: printnomenclature] - -Appendix A: virtio_ring.h - -#ifndef VIRTIO_RING_H - -#define VIRTIO_RING_H - -/* An interface for efficient virtio implementation. - - * - - * This header is BSD licensed so anyone can use the definitions - - * to implement compatible drivers/servers. - - * - - * Copyright 2007, 2009, IBM Corporation - - * Copyright 2011, Red Hat, Inc - - * All rights reserved. - - * - - * Redistribution and use in source and binary forms, with or -without - - * modification, are permitted provided that the following -conditions - - * are met: - - * 1. Redistributions of source code must retain the above -copyright - - * notice, this list of conditions and the following -disclaimer. - - * 2. Redistributions in binary form must reproduce the above -copyright - - * notice, this list of conditions and the following -disclaimer in the - - * documentation and/or other materials provided with the -distribution. - - * 3. Neither the name of IBM nor the names of its contributors - - * may be used to endorse or promote products derived from -this software - - * without specific prior written permission. - - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND -CONTRIBUTORS ``AS IS'' AND - - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED -TO, THE - - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A -PARTICULAR PURPOSE - - * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE -LIABLE - - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -CONSEQUENTIAL - - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -SUBSTITUTE GOODS - - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -INTERRUPTION) - - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -CONTRACT, STRICT - - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING -IN ANY WAY - - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -POSSIBILITY OF - - * SUCH DAMAGE. - - */ - - - -/* This marks a buffer as continuing via the next field. */ - -#define VRING_DESC_F_NEXT 1 - -/* This marks a buffer as write-only (otherwise read-only). */ - -#define VRING_DESC_F_WRITE 2 - - - -/* The Host uses this in used->flags to advise the Guest: don't -kick me - - * when you add a buffer. It's unreliable, so it's simply an - - * optimization. Guest will still kick if it's out of buffers. -*/ - -#define VRING_USED_F_NO_NOTIFY 1 - -/* The Guest uses this in avail->flags to advise the Host: don't - - * interrupt me when you consume a buffer. It's unreliable, so -it's - - * simply an optimization. */ - -#define VRING_AVAIL_F_NO_INTERRUPT 1 - - - -/* Virtio ring descriptors: 16 bytes. - - * These can chain together via "next". */ - -struct vring_desc { - - /* Address (guest-physical). */ - - uint64_t addr; - - /* Length. */ - - uint32_t len; - - /* The flags as indicated above. */ - - uint16_t flags; - - /* We chain unused descriptors via this, too */ - - uint16_t next; - -}; - - - -struct vring_avail { - - uint16_t flags; - - uint16_t idx; - - uint16_t ring[]; - - uint16_t used_event; - -}; - - - -/* u32 is used here for ids for padding reasons. */ - -struct vring_used_elem { - - /* Index of start of used descriptor chain. */ - - uint32_t id; - - /* Total length of the descriptor chain which was written -to. */ - - uint32_t len; - -}; - - - -struct vring_used { - - uint16_t flags; - - uint16_t idx; - - struct vring_used_elem ring[]; - - uint16_t avail_event; - -}; - - - -struct vring { - - unsigned int num; - - - - struct vring_desc *desc; - - struct vring_avail *avail; - - struct vring_used *used; - -}; - - - -/* The standard layout for the ring is a continuous chunk of -memory which - - * looks like this. We assume num is a power of 2. - - * - - * struct vring { - - * // The actual descriptors (16 bytes each) - - * struct vring_desc desc[num]; - - * - - * // A ring of available descriptor heads with free-running -index. - - * __u16 avail_flags; - - * __u16 avail_idx; - - * __u16 available[num]; - - * - - * // Padding to the next align boundary. - - * char pad[]; - - * - - * // A ring of used descriptor heads with free-running -index. - - * __u16 used_flags; - - * __u16 EVENT_IDX; - - * struct vring_used_elem used[num]; - - * }; - - * Note: for virtio PCI, align is 4096. - - */ - -static inline void vring_init(struct vring *vr, unsigned int num, -void *p, - - unsigned long align) - -{ - - vr->num = num; - - vr->desc = p; - - vr->avail = p + num*sizeof(struct vring_desc); - - vr->used = (void *)(((unsigned long)&vr->avail->ring[num] - - + align-1) - - & ~(align - 1)); - -} - - - -static inline unsigned vring_size(unsigned int num, unsigned long -align) - -{ - - return ((sizeof(struct vring_desc)*num + -sizeof(uint16_t)*(2+num) - - + align - 1) & ~(align - 1)) - - + sizeof(uint16_t)*3 + sizeof(struct -vring_used_elem)*num; - -} - - - -static inline int vring_need_event(uint16_t event_idx, uint16_t -new_idx, uint16_t old_idx) - -{ - - return (uint16_t)(new_idx - event_idx - 1) < -(uint16_t)(new_idx - old_idx); - -} - -#endif /* VIRTIO_RING_H */ - -<cha:Reserved-Feature-Bits>Appendix B: Reserved Feature Bits - -Currently there are five device-independent feature bits defined: - - VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature - indicates that the driver wants an interrupt if the device runs - out of available descriptors on a virtqueue, even though - interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT - flag or the used_event field. An example of this is the - networking driver: it doesn't need to know every time a packet - is transmitted, but it does need to free the transmitted - packets a finite time after they are transmitted. It can avoid - using a timer if the device interrupts it when all the packets - are transmitted. - - VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature - indicates that the driver can use descriptors with the - VRING_DESC_F_INDIRECT flag set, as described in [sub:Indirect-Descriptors] - . - - VIRTIO_F_RING_EVENT_IDX(29) This feature enables the used_event - and the avail_event fields. If set, it indicates that the - device should ignore the flags field in the available ring - structure. Instead, the used_event field in this structure is - used by guest to suppress device interrupts. Further, the - driver should ignore the flags field in the used ring - structure. Instead, the avail_event field in this structure is - used by the device to suppress notifications. If unset, the - driver should ignore the used_event field; the device should - ignore the avail_event field; the flags field is used - - VIRTIO_F_BAD_FEATURE(30) This feature should never be - negotiated by the guest; doing so is an indication that the - guest is faulty[footnote: -An experimental virtio PCI driver contained in Linux version -2.6.25 had this problem, and this feature bit can be used to -detect it. -] - - VIRTIO_F_FEATURES_HIGH(31) This feature indicates that the - device supports feature bits 32:63. If unset, feature bits - 32:63 are unset. - -Appendix C: Network Device - -The virtio network device is a virtual ethernet card, and is the -most complex of the devices supported so far by virtio. It has -enhanced rapidly and demonstrates clearly how support for new -features should be added to an existing device. Empty buffers are -placed in one virtqueue for receiving packets, and outgoing -packets are enqueued into another for transmission in that order. -A third command queue is used to control advanced filtering -features. - - Configuration - - Subsystem Device ID 1 - - Virtqueues 0:receiveq. 1:transmitq. 2:controlq[footnote: -Only if VIRTIO_NET_F_CTRL_VQ set -] - - Feature bits - - VIRTIO_NET_F_CSUM (0) Device handles packets with partial - checksum - - VIRTIO_NET_F_GUEST_CSUM (1) Guest handles packets with partial - checksum - - VIRTIO_NET_F_MAC (5) Device has given MAC address. - - VIRTIO_NET_F_GSO (6) (Deprecated) device handles packets with - any GSO type.[footnote: -It was supposed to indicate segmentation offload support, but -upon further investigation it became clear that multiple bits -were required. -] - - VIRTIO_NET_F_GUEST_TSO4 (7) Guest can receive TSOv4. - - VIRTIO_NET_F_GUEST_TSO6 (8) Guest can receive TSOv6. - - VIRTIO_NET_F_GUEST_ECN (9) Guest can receive TSO with ECN. - - VIRTIO_NET_F_GUEST_UFO (10) Guest can receive UFO. - - VIRTIO_NET_F_HOST_TSO4 (11) Device can receive TSOv4. - - VIRTIO_NET_F_HOST_TSO6 (12) Device can receive TSOv6. - - VIRTIO_NET_F_HOST_ECN (13) Device can receive TSO with ECN. - - VIRTIO_NET_F_HOST_UFO (14) Device can receive UFO. - - VIRTIO_NET_F_MRG_RXBUF (15) Guest can merge receive buffers. - - VIRTIO_NET_F_STATUS (16) Configuration status field is - available. - - VIRTIO_NET_F_CTRL_VQ (17) Control channel is available. - - VIRTIO_NET_F_CTRL_RX (18) Control channel RX mode support. - - VIRTIO_NET_F_CTRL_VLAN (19) Control channel VLAN filtering. - - Device configuration layout Two configuration fields are - currently defined. The mac address field always exists (though - is only valid if VIRTIO_NET_F_MAC is set), and the status field - only exists if VIRTIO_NET_F_STATUS is set. Only one bit is - currently defined for the status field: VIRTIO_NET_S_LINK_UP. #define VIRTIO_NET_S_LINK_UP 1 - - - -struct virtio_net_config { - - u8 mac[6]; - - u16 status; - -}; - - Device Initialization - - The initialization routine should identify the receive and - transmission virtqueues. - - If the VIRTIO_NET_F_MAC feature bit is set, the configuration - space “mac” entry indicates the “physical” address of the the - network card, otherwise a private MAC address should be - assigned. All guests are expected to negotiate this feature if - it is set. - - If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated, identify - the control virtqueue. - - If the VIRTIO_NET_F_STATUS feature bit is negotiated, the link - status can be read from the bottom bit of the “status” config - field. Otherwise, the link should be assumed active. - - The receive virtqueue should be filled with receive buffers. - This is described in detail below in “Setting Up Receive - Buffers”. - - A driver can indicate that it will generate checksumless - packets by negotating the VIRTIO_NET_F_CSUM feature. This “ - checksum offload” is a common feature on modern network cards. - - If that feature is negotiated, a driver can use TCP or UDP - segmentation offload by negotiating the VIRTIO_NET_F_HOST_TSO4 - (IPv4 TCP), VIRTIO_NET_F_HOST_TSO6 (IPv6 TCP) and - VIRTIO_NET_F_HOST_UFO (UDP fragmentation) features. It should - not send TCP packets requiring segmentation offload which have - the Explicit Congestion Notification bit set, unless the - VIRTIO_NET_F_HOST_ECN feature is negotiated.[footnote: -This is a common restriction in real, older network cards. -] - - The converse features are also available: a driver can save the - virtual device some work by negotiating these features.[footnote: -For example, a network packet transported between two guests on -the same system may not require checksumming at all, nor -segmentation, if both guests are amenable. -] The VIRTIO_NET_F_GUEST_CSUM feature indicates that partially - checksummed packets can be received, and if it can do that then - the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, - VIRTIO_NET_F_GUEST_UFO and VIRTIO_NET_F_GUEST_ECN are the input - equivalents of the features described above. See “Receiving - Packets” below. - - Device Operation - -Packets are transmitted by placing them in the transmitq, and -buffers for incoming packets are placed in the receiveq. In each -case, the packet itself is preceeded by a header: - -struct virtio_net_hdr { - -#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 - - u8 flags; - -#define VIRTIO_NET_HDR_GSO_NONE 0 - -#define VIRTIO_NET_HDR_GSO_TCPV4 1 - -#define VIRTIO_NET_HDR_GSO_UDP 3 - -#define VIRTIO_NET_HDR_GSO_TCPV6 4 - -#define VIRTIO_NET_HDR_GSO_ECN 0x80 - - u8 gso_type; - - u16 hdr_len; - - u16 gso_size; - - u16 csum_start; - - u16 csum_offset; - -/* Only if VIRTIO_NET_F_MRG_RXBUF: */ - - u16 num_buffers - -}; - -The controlq is used to control device features such as -filtering. - - Packet Transmission - -Transmitting a single packet is simple, but varies depending on -the different features the driver negotiated. - - If the driver negotiated VIRTIO_NET_F_CSUM, and the packet has - not been fully checksummed, then the virtio_net_hdr's fields - are set as follows. Otherwise, the packet must be fully - checksummed, and flags is zero. - - flags has the VIRTIO_NET_HDR_F_NEEDS_CSUM set, - - <ite:csum_start-is-set>csum_start is set to the offset within - the packet to begin checksumming, and - - csum_offset indicates how many bytes after the csum_start the - new (16 bit ones' complement) checksum should be placed.[footnote: -For example, consider a partially checksummed TCP (IPv4) packet. -It will have a 14 byte ethernet header and 20 byte IP header -followed by the TCP header (with the TCP checksum field 16 bytes -into that header). csum_start will be 14+20 = 34 (the TCP -checksum includes the header), and csum_offset will be 16. The -value in the TCP checksum field will be the sum of the TCP pseudo -header, so that replacing it by the ones' complement checksum of -the TCP header and body will give the correct result. -] - - <enu:If-the-driver>If the driver negotiated - VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO, and the packet requires - TCP segmentation or UDP fragmentation, then the “gso_type” - field is set to VIRTIO_NET_HDR_GSO_TCPV4, TCPV6 or UDP. - (Otherwise, it is set to VIRTIO_NET_HDR_GSO_NONE). In this - case, packets larger than 1514 bytes can be transmitted: the - metadata indicates how to replicate the packet header to cut it - into smaller packets. The other gso fields are set: - - hdr_len is a hint to the device as to how much of the header - needs to be kept to copy into each packet, usually set to the - length of the headers, including the transport header.[footnote: -Due to various bugs in implementations, this field is not useful -as a guarantee of the transport header size. -] - - gso_size is the size of the packet beyond that header (ie. - MSS). - - If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the - VIRTIO_NET_HDR_GSO_ECN bit may be set in “gso_type” as well, - indicating that the TCP packet has the ECN bit set.[footnote: -This case is not handled by some older hardware, so is called out -specifically in the protocol. -] - - If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature, - the num_buffers field is set to zero. - - The header and packet are added as one output buffer to the - transmitq, and the device is notified of the new entry (see [sub:Notifying-The-Device] - ).[footnote: -Note that the header will be two bytes longer for the -VIRTIO_NET_F_MRG_RXBUF case. -] - - Packet Transmission Interrupt - -Often a driver will suppress transmission interrupts using the -VRING_AVAIL_F_NO_INTERRUPT flag (see [sub:Receiving-Used-Buffers] -) and check for used packets in the transmit path of following -packets. However, it will still receive interrupts if the -VIRTIO_F_NOTIFY_ON_EMPTY feature is negotiated, indicating that -the transmission queue is completely emptied. - -The normal behavior in this interrupt handler is to retrieve and -new descriptors from the used ring and free the corresponding -headers and packets. - - Setting Up Receive Buffers - -It is generally a good idea to keep the receive virtqueue as -fully populated as possible: if it runs out, network performance -will suffer. - -If the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6 or -VIRTIO_NET_F_GUEST_UFO features are used, the Guest will need to -accept packets of up to 65550 bytes long (the maximum size of a -TCP or UDP packet, plus the 14 byte ethernet header), otherwise -1514 bytes. So unless VIRTIO_NET_F_MRG_RXBUF is negotiated, every -buffer in the receive queue needs to be at least this length [footnote: -Obviously each one can be split across multiple descriptor -elements. -]. - -If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at -least the size of the struct virtio_net_hdr. - - Packet Receive Interrupt - -When a packet is copied into a buffer in the receiveq, the -optimal path is to disable further interrupts for the receiveq -(see [sub:Receiving-Used-Buffers]) and process packets until no -more are found, then re-enable them. - -Processing packet involves: - - If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature, - then the “num_buffers” field indicates how many descriptors - this packet is spread over (including this one). This allows - receipt of large packets without having to allocate large - buffers. In this case, there will be at least “num_buffers” in - the used ring, and they should be chained together to form a - single packet. The other buffers will not begin with a struct - virtio_net_hdr. - - If the VIRTIO_NET_F_MRG_RXBUF feature was not negotiated, or - the “num_buffers” field is one, then the entire packet will be - contained within this buffer, immediately following the struct - virtio_net_hdr. - - If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the - VIRTIO_NET_HDR_F_NEEDS_CSUM bit in the “flags” field may be - set: if so, the checksum on the packet is incomplete and the “ - csum_start” and “csum_offset” fields indicate how to calculate - it (see [ite:csum_start-is-set]). - - If the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options were - negotiated, then the “gso_type” may be something other than - VIRTIO_NET_HDR_GSO_NONE, and the “gso_size” field indicates the - desired MSS (see [enu:If-the-driver]).Control Virtqueue - -The driver uses the control virtqueue (if VIRTIO_NET_F_VTRL_VQ is -negotiated) to send commands to manipulate various features of -the device which would not easily map into the configuration -space. - -All commands are of the following form: - -struct virtio_net_ctrl { - - u8 class; - - u8 command; - - u8 command-specific-data[]; - - u8 ack; - -}; - - - -/* ack values */ - -#define VIRTIO_NET_OK 0 - -#define VIRTIO_NET_ERR 1 - -The class, command and command-specific-data are set by the -driver, and the device sets the ack byte. There is little it can -do except issue a diagnostic if the ack byte is not -VIRTIO_NET_OK. - - Packet Receive Filtering - -If the VIRTIO_NET_F_CTRL_RX feature is negotiated, the driver can -send control commands for promiscuous mode, multicast receiving, -and filtering of MAC addresses. - -Note that in general, these commands are best-effort: unwanted -packets may still arrive. - - Setting Promiscuous Mode - -#define VIRTIO_NET_CTRL_RX 0 - - #define VIRTIO_NET_CTRL_RX_PROMISC 0 - - #define VIRTIO_NET_CTRL_RX_ALLMULTI 1 - -The class VIRTIO_NET_CTRL_RX has two commands: -VIRTIO_NET_CTRL_RX_PROMISC turns promiscuous mode on and off, and -VIRTIO_NET_CTRL_RX_ALLMULTI turns all-multicast receive on and -off. The command-specific-data is one byte containing 0 (off) or -1 (on). - - Setting MAC Address Filtering - -struct virtio_net_ctrl_mac { - - u32 entries; - - u8 macs[entries][ETH_ALEN]; - -}; - - - -#define VIRTIO_NET_CTRL_MAC 1 - - #define VIRTIO_NET_CTRL_MAC_TABLE_SET 0 - -The device can filter incoming packets by any number of -destination MAC addresses.[footnote: -Since there are no guarentees, it can use a hash filter -orsilently switch to allmulti or promiscuous mode if it is given -too many addresses. -] This table is set using the class VIRTIO_NET_CTRL_MAC and the -command VIRTIO_NET_CTRL_MAC_TABLE_SET. The command-specific-data -is two variable length tables of 6-byte MAC addresses. The first -table contains unicast addresses, and the second contains -multicast addresses. - - VLAN Filtering - -If the driver negotiates the VIRTION_NET_F_CTRL_VLAN feature, it -can control a VLAN filter table in the device. - -#define VIRTIO_NET_CTRL_VLAN 2 - - #define VIRTIO_NET_CTRL_VLAN_ADD 0 - - #define VIRTIO_NET_CTRL_VLAN_DEL 1 - -Both the VIRTIO_NET_CTRL_VLAN_ADD and VIRTIO_NET_CTRL_VLAN_DEL -command take a 16-bit VLAN id as the command-specific-data. - -Appendix D: Block Device - -The virtio block device is a simple virtual block device (ie. -disk). Read and write requests (and other exotic requests) are -placed in the queue, and serviced (probably out of order) by the -device except where noted. - - Configuration - - Subsystem Device ID 2 - - Virtqueues 0:requestq. - - Feature bits - - VIRTIO_BLK_F_BARRIER (0) Host supports request barriers. - - VIRTIO_BLK_F_SIZE_MAX (1) Maximum size of any single segment is - in “size_max”. - - VIRTIO_BLK_F_SEG_MAX (2) Maximum number of segments in a - request is in “seg_max”. - - VIRTIO_BLK_F_GEOMETRY (4) Disk-style geometry specified in “ - geometry”. - - VIRTIO_BLK_F_RO (5) Device is read-only. - - VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”. - - VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands. - - VIRTIO_BLK_F_FLUSH (9) Cache flush command support. - - - - Device configuration layout The capacity of the device - (expressed in 512-byte sectors) is always present. The - availability of the others all depend on various feature bits - as indicated above. struct virtio_blk_config { - - u64 capacity; - - u32 size_max; - - u32 seg_max; - - struct virtio_blk_geometry { - - u16 cylinders; - - u8 heads; - - u8 sectors; - - } geometry; - - u32 blk_size; - - - -}; - - Device Initialization - - The device size should be read from the “capacity” - configuration field. No requests should be submitted which goes - beyond this limit. - - If the VIRTIO_BLK_F_BLK_SIZE feature is negotiated, the - blk_size field can be read to determine the optimal sector size - for the driver to use. This does not effect the units used in - the protocol (always 512 bytes), but awareness of the correct - value can effect performance. - - If the VIRTIO_BLK_F_RO feature is set by the device, any write - requests will fail. - - - - Device Operation - -The driver queues requests to the virtqueue, and they are used by -the device (not necessarily in order). Each request is of form: - -struct virtio_blk_req { - - - - u32 type; - - u32 ioprio; - - u64 sector; - - char data[][512]; - - u8 status; - -}; - -If the device has VIRTIO_BLK_F_SCSI feature, it can also support -scsi packet command requests, each of these requests is of form:struct virtio_scsi_pc_req { - - u32 type; - - u32 ioprio; - - u64 sector; - - char cmd[]; - - char data[][512]; - -#define SCSI_SENSE_BUFFERSIZE 96 - - u8 sense[SCSI_SENSE_BUFFERSIZE]; - - u32 errors; - - u32 data_len; - - u32 sense_len; - - u32 residual; - - u8 status; - -}; - -The type of the request is either a read (VIRTIO_BLK_T_IN), a -write (VIRTIO_BLK_T_OUT), a scsi packet command -(VIRTIO_BLK_T_SCSI_CMD or VIRTIO_BLK_T_SCSI_CMD_OUT[footnote: -the SCSI_CMD and SCSI_CMD_OUT types are equivalent, the device -does not distinguish between them -]) or a flush (VIRTIO_BLK_T_FLUSH or VIRTIO_BLK_T_FLUSH_OUT[footnote: -the FLUSH and FLUSH_OUT types are equivalent, the device does not -distinguish between them -]). If the device has VIRTIO_BLK_F_BARRIER feature the high bit -(VIRTIO_BLK_T_BARRIER) indicates that this request acts as a -barrier and that all preceeding requests must be complete before -this one, and all following requests must not be started until -this is complete. Note that a barrier does not flush caches in -the underlying backend device in host, and thus does not serve as -data consistency guarantee. Driver must use FLUSH request to -flush the host cache. - -#define VIRTIO_BLK_T_IN 0 - -#define VIRTIO_BLK_T_OUT 1 - -#define VIRTIO_BLK_T_SCSI_CMD 2 - -#define VIRTIO_BLK_T_SCSI_CMD_OUT 3 - -#define VIRTIO_BLK_T_FLUSH 4 - -#define VIRTIO_BLK_T_FLUSH_OUT 5 - -#define VIRTIO_BLK_T_BARRIER 0x80000000 - -The ioprio field is a hint about the relative priorities of -requests to the device: higher numbers indicate more important -requests. - -The sector number indicates the offset (multiplied by 512) where -the read or write is to occur. This field is unused and set to 0 -for scsi packet commands and for flush commands. - -The cmd field is only present for scsi packet command requests, -and indicates the command to perform. This field must reside in a -single, separate read-only buffer; command length can be derived -from the length of this buffer. - -Note that these first three (four for scsi packet commands) -fields are always read-only: the data field is either read-only -or write-only, depending on the request. The size of the read or -write can be derived from the total size of the request buffers. - -The sense field is only present for scsi packet command requests, -and indicates the buffer for scsi sense data. - -The data_len field is only present for scsi packet command -requests, this field is deprecated, and should be ignored by the -driver. Historically, devices copied data length there. - -The sense_len field is only present for scsi packet command -requests and indicates the number of bytes actually written to -the sense buffer. - -The residual field is only present for scsi packet command -requests and indicates the residual size, calculated as data -length - number of bytes actually transferred. - -The final status byte is written by the device: either -VIRTIO_BLK_S_OK for success, VIRTIO_BLK_S_IOERR for host or guest -error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host:#define VIRTIO_BLK_S_OK 0 - -#define VIRTIO_BLK_S_IOERR 1 - -#define VIRTIO_BLK_S_UNSUPP 2 - -Historically, devices assumed that the fields type, ioprio and -sector reside in a single, separate read-only buffer; the fields -errors, data_len, sense_len and residual reside in a single, -separate write-only buffer; the sense field in a separate -write-only buffer of size 96 bytes, by itself; the fields errors, -data_len, sense_len and residual in a single write-only buffer; -and the status field is a separate read-only buffer of size 1 -byte, by itself. - -Appendix E: Console Device - -The virtio console device is a simple device for data input and -output. A device may have one or more ports. Each port has a pair -of input and output virtqueues. Moreover, a device has a pair of -control IO virtqueues. The control virtqueues are used to -communicate information between the device and the driver about -ports being opened and closed on either side of the connection, -indication from the host about whether a particular port is a -console port, adding new ports, port hot-plug/unplug, etc., and -indication from the guest about whether a port or a device was -successfully added, port open/close, etc.. For data IO, one or -more empty buffers are placed in the receive queue for incoming -data and outgoing characters are placed in the transmit queue. - - Configuration - - Subsystem Device ID 3 - - Virtqueues 0:receiveq(port0). 1:transmitq(port0), 2:control - receiveq[footnote: -Ports 2 onwards only if VIRTIO_CONSOLE_F_MULTIPORT is set -], 3:control transmitq, 4:receiveq(port1), 5:transmitq(port1), - ... - - Feature bits - - VIRTIO_CONSOLE_F_SIZE (0) Configuration cols and rows fields - are valid. - - VIRTIO_CONSOLE_F_MULTIPORT(1) Device has support for multiple - ports; configuration fields nr_ports and max_nr_ports are - valid and control virtqueues will be used. - - Device configuration layout The size of the console is supplied - in the configuration space if the VIRTIO_CONSOLE_F_SIZE feature - is set. Furthermore, if the VIRTIO_CONSOLE_F_MULTIPORT feature - is set, the maximum number of ports supported by the device can - be fetched.struct virtio_console_config { - - u16 cols; - - u16 rows; - - - - u32 max_nr_ports; - -}; - - Device Initialization - - If the VIRTIO_CONSOLE_F_SIZE feature is negotiated, the driver - can read the console dimensions from the configuration fields. - - If the VIRTIO_CONSOLE_F_MULTIPORT feature is negotiated, the - driver can spawn multiple ports, not all of which may be - attached to a console. Some could be generic ports. In this - case, the control virtqueues are enabled and according to the - max_nr_ports configuration-space value, the appropriate number - of virtqueues are created. A control message indicating the - driver is ready is sent to the host. The host can then send - control messages for adding new ports to the device. After - creating and initializing each port, a - VIRTIO_CONSOLE_PORT_READY control message is sent to the host - for that port so the host can let us know of any additional - configuration options set for that port. - - The receiveq for each port is populated with one or more - receive buffers. - - Device Operation - - For output, a buffer containing the characters is placed in the - port's transmitq.[footnote: -Because this is high importance and low bandwidth, the current -Linux implementation polls for the buffer to be used, rather than -waiting for an interrupt, simplifying the implementation -significantly. However, for generic serial ports with the -O_NONBLOCK flag set, the polling limitation is relaxed and the -consumed buffers are freed upon the next write or poll call or -when a port is closed or hot-unplugged. -] - - When a buffer is used in the receiveq (signalled by an - interrupt), the contents is the input to the port associated - with the virtqueue for which the notification was received. - - If the driver negotiated the VIRTIO_CONSOLE_F_SIZE feature, a - configuration change interrupt may occur. The updated size can - be read from the configuration fields. - - If the driver negotiated the VIRTIO_CONSOLE_F_MULTIPORT - feature, active ports are announced by the host using the - VIRTIO_CONSOLE_PORT_ADD control message. The same message is - used for port hot-plug as well. - - If the host specified a port `name', a sysfs attribute is - created with the name filled in, so that udev rules can be - written that can create a symlink from the port's name to the - char device for port discovery by applications in the guest. - - Changes to ports' state are effected by control messages. - Appropriate action is taken on the port indicated in the - control message. The layout of the structure of the control - buffer and the events associated are:struct virtio_console_control { - - uint32_t id; /* Port number */ - - uint16_t event; /* The kind of control event */ - - uint16_t value; /* Extra information for the event */ - -}; - - - -/* Some events for the internal messages (control packets) */ - - - -#define VIRTIO_CONSOLE_DEVICE_READY 0 - -#define VIRTIO_CONSOLE_PORT_ADD 1 - -#define VIRTIO_CONSOLE_PORT_REMOVE 2 - -#define VIRTIO_CONSOLE_PORT_READY 3 - -#define VIRTIO_CONSOLE_CONSOLE_PORT 4 - -#define VIRTIO_CONSOLE_RESIZE 5 - -#define VIRTIO_CONSOLE_PORT_OPEN 6 - -#define VIRTIO_CONSOLE_PORT_NAME 7 - -Appendix F: Entropy Device - -The virtio entropy device supplies high-quality randomness for -guest use. - - Configuration - - Subsystem Device ID 4 - - Virtqueues 0:requestq. - - Feature bits None currently defined - - Device configuration layout None currently defined. - - Device Initialization - - The virtqueue is initialized - - Device Operation - -When the driver requires random bytes, it places the descriptor -of one or more buffers in the queue. It will be completely filled -by random data by the device. - -Appendix G: Memory Balloon Device - -The virtio memory balloon device is a primitive device for -managing guest memory: the device asks for a certain amount of -memory, and the guest supplies it (or withdraws it, if the device -has more than it asks for). This allows the guest to adapt to -changes in allowance of underlying physical memory. If the -feature is negotiated, the device can also be used to communicate -guest memory statistics to the host. - - Configuration - - Subsystem Device ID 5 - - Virtqueues 0:inflateq. 1:deflateq. 2:statsq.[footnote: -Only if VIRTIO_BALLON_F_STATS_VQ set -] - - Feature bits - - VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host must be told before - pages from the balloon are used. - - VIRTIO_BALLOON_F_STATS_VQ (1) A virtqueue for reporting guest - memory statistics is present. - - Device configuration layout Both fields of this configuration - are always available. Note that they are little endian, despite - convention that device fields are guest endian:struct virtio_balloon_config { - - u32 num_pages; - - u32 actual; - -}; - - Device Initialization - - The inflate and deflate virtqueues are identified. - - If the VIRTIO_BALLOON_F_STATS_VQ feature bit is negotiated: - - Identify the stats virtqueue. - - Add one empty buffer to the stats virtqueue and notify the - host. - -Device operation begins immediately. - - Device Operation - - Memory Ballooning The device is driven by the receipt of a - configuration change interrupt. - - The “num_pages” configuration field is examined. If this is - greater than the “actual” number of pages, memory must be given - to the balloon. If it is less than the “actual” number of - pages, memory may be taken back from the balloon for general - use. - - To supply memory to the balloon (aka. inflate): - - The driver constructs an array of addresses of unused memory - pages. These addresses are divided by 4096[footnote: -This is historical, and independent of the guest page size -] and the descriptor describing the resulting 32-bit array is - added to the inflateq. - - To remove memory from the balloon (aka. deflate): - - The driver constructs an array of addresses of memory pages it - has previously given to the balloon, as described above. This - descriptor is added to the deflateq. - - If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is set, the - guest may not use these requested pages until that descriptor - in the deflateq has been used by the device. - - Otherwise, the guest may begin to re-use pages previously given - to the balloon before the device has acknowledged their - withdrawl. [footnote: -In this case, deflation advice is merely a courtesy -] - - In either case, once the device has completed the inflation or - deflation, the “actual” field of the configuration should be - updated to reflect the new number of pages in the balloon.[footnote: -As updates to configuration space are not atomic, this field -isn't particularly reliable, but can be used to diagnose buggy -guests. -] - - Memory Statistics - -The stats virtqueue is atypical because communication is driven -by the device (not the driver). The channel becomes active at -driver initialization time when the driver adds an empty buffer -and notifies the device. A request for memory statistics proceeds -as follows: - - The device pushes the buffer onto the used ring and sends an - interrupt. - - The driver pops the used buffer and discards it. - - The driver collects memory statistics and writes them into a - new buffer. - - The driver adds the buffer to the virtqueue and notifies the - device. - - The device pops the buffer (retaining it to initiate a - subsequent request) and consumes the statistics. - - Memory Statistics Format Each statistic consists of a 16 bit - tag and a 64 bit value. Both quantities are represented in the - native endian of the guest. All statistics are optional and the - driver may choose which ones to supply. To guarantee backwards - compatibility, unsupported statistics should be omitted. - - struct virtio_balloon_stat { - -#define VIRTIO_BALLOON_S_SWAP_IN 0 - -#define VIRTIO_BALLOON_S_SWAP_OUT 1 - -#define VIRTIO_BALLOON_S_MAJFLT 2 - -#define VIRTIO_BALLOON_S_MINFLT 3 - -#define VIRTIO_BALLOON_S_MEMFREE 4 - -#define VIRTIO_BALLOON_S_MEMTOT 5 - - u16 tag; - - u64 val; - -} __attribute__((packed)); - - Tags - - VIRTIO_BALLOON_S_SWAP_IN The amount of memory that has been - swapped in (in bytes). - - VIRTIO_BALLOON_S_SWAP_OUT The amount of memory that has been - swapped out to disk (in bytes). - - VIRTIO_BALLOON_S_MAJFLT The number of major page faults that - have occurred. - - VIRTIO_BALLOON_S_MINFLT The number of minor page faults that - have occurred. - - VIRTIO_BALLOON_S_MEMFREE The amount of memory not being used - for any purpose (in bytes). - - VIRTIO_BALLOON_S_MEMTOT The total amount of memory available - (in bytes). - |
