diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2013-05-02 14:14:04 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2013-05-02 14:14:04 -0700 |
commit | 736a2dd2571ac56b11ed95a7814d838d5311be04 (patch) | |
tree | de10d107025970c6e51d5b6faeba799ed4b9caae /Documentation | |
parent | 0b2e3b6bb4a415379f16e38fc92db42379be47a1 (diff) | |
parent | 01d779a14ef800b74684d9692add4944df052461 (diff) |
Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull virtio & lguest updates from Rusty Russell:
"Lots of virtio work which wasn't quite ready for last merge window.
Plus I dived into lguest again, reworking the pagetable code so we can
move the switcher page: our fixmaps sometimes take more than 2MB now..."
Ugh. Annoying conflicts with the tcm_vhost -> vhost_scsi rename.
Hopefully correctly resolved.
* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (57 commits)
caif_virtio: Remove bouncing email addresses
lguest: improve code readability in lg_cpu_start.
virtio-net: fill only rx queues which are being used
lguest: map Switcher below fixmap.
lguest: cache last cpu we ran on.
lguest: map Switcher text whenever we allocate a new pagetable.
lguest: don't share Switcher PTE pages between guests.
lguest: expost switcher_pages array (as lg_switcher_pages).
lguest: extract shadow PTE walking / allocating.
lguest: make check_gpte et. al return bool.
lguest: assume Switcher text is a single page.
lguest: rename switcher_page to switcher_pages.
lguest: remove RESERVE_MEM constant.
lguest: check vaddr not pgd for Switcher protection.
lguest: prepare to make SWITCHER_ADDR a variable.
virtio: console: replace EMFILE with EBUSY for already-open port
virtio-scsi: reset virtqueue affinity when doing cpu hotplug
virtio-scsi: introduce multiqueue support
virtio-scsi: push vq lock/unlock into virtscsi_vq_done
virtio-scsi: pass struct virtio_scsi to virtqueue completion function
...
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/virtual/00-INDEX | 3 | ||||
-rw-r--r-- | Documentation/virtual/virtio-spec.txt | 3210 |
2 files changed, 0 insertions, 3213 deletions
diff --git a/Documentation/virtual/00-INDEX b/Documentation/virtual/00-INDEX index 924bd462675..e952d30bbf0 100644 --- a/Documentation/virtual/00-INDEX +++ b/Documentation/virtual/00-INDEX @@ -6,6 +6,3 @@ kvm/ - Kernel Virtual Machine. See also http://linux-kvm.org uml/ - User Mode Linux, builds/runs Linux kernel as a userspace program. -virtio.txt - - Text version of draft virtio spec. - See http://ozlabs.org/~rusty/virtio-spec diff --git a/Documentation/virtual/virtio-spec.txt b/Documentation/virtual/virtio-spec.txt deleted file mode 100644 index eb094039b50..00000000000 --- a/Documentation/virtual/virtio-spec.txt +++ /dev/null @@ -1,3210 +0,0 @@ -[Generated file: see http://ozlabs.org/~rusty/virtio-spec/] -Virtio PCI Card Specification -v0.9.5 DRAFT -- - -Rusty Russell <rusty@rustcorp.com.au> IBM Corporation (Editor) - -2012 May 7. - -Purpose and Description - -This document describes the specifications of the “virtio” family -of PCI[LaTeX Command: nomenclature] devices. These are devices -are found in virtual environments[LaTeX Command: nomenclature], -yet by design they are not all that different from physical PCI -devices, and this document treats them as such. This allows the -guest to use standard PCI drivers and discovery mechanisms. - -The purpose of virtio and this specification is that virtual -environments and guests should have a straightforward, efficient, -standard and extensible mechanism for virtual devices, rather -than boutique per-environment or per-OS mechanisms. - - Straightforward: Virtio PCI devices use normal PCI mechanisms - of interrupts and DMA which should be familiar to any device - driver author. There is no exotic page-flipping or COW - mechanism: it's just a PCI device.[footnote: -This lack of page-sharing implies that the implementation of the -device (e.g. the hypervisor or host) needs full access to the -guest memory. Communication with untrusted parties (i.e. -inter-guest communication) requires copying. -] - - Efficient: Virtio PCI devices consist of rings of descriptors - for input and output, which are neatly separated to avoid cache - effects from both guest and device writing to the same cache - lines. - - Standard: Virtio PCI makes no assumptions about the environment - in which it operates, beyond supporting PCI. In fact the virtio - devices specified in the appendices do not require PCI at all: - they have been implemented on non-PCI buses.[footnote: -The Linux implementation further separates the PCI virtio code -from the specific virtio drivers: these drivers are shared with -the non-PCI implementations (currently lguest and S/390). -] - - Extensible: Virtio PCI devices contain feature bits which are - acknowledged by the guest operating system during device setup. - This allows forwards and backwards compatibility: the device - offers all the features it knows about, and the driver - acknowledges those it understands and wishes to use. - - Virtqueues - -The mechanism for bulk data transport on virtio PCI devices is -pretentiously called a virtqueue. Each device can have zero or -more virtqueues: for example, the network device has one for -transmit and one for receive. - -Each virtqueue occupies two or more physically-contiguous pages -(defined, for the purposes of this specification, as 4096 bytes), -and consists of three parts: - - -+-------------------+-----------------------------------+-----------+ -| Descriptor Table | Available Ring (padding) | Used Ring | -+-------------------+-----------------------------------+-----------+ - - -When the driver wants to send a buffer to the device, it fills in -a slot in the descriptor table (or chains several together), and -writes the descriptor index into the available ring. It then -notifies the device. When the device has finished a buffer, it -writes the descriptor into the used ring, and sends an interrupt. - -Specification - - PCI Discovery - -Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 -through 0x103F inclusive is a virtio device[footnote: -The actual value within this range is ignored -]. The device must also have a Revision ID of 0 to match this -specification. - -The Subsystem Device ID indicates which virtio device is -supported by the device. The Subsystem Vendor ID should reflect -the PCI Vendor ID of the environment (it's currently only used -for informational purposes by the guest). - - -+----------------------+--------------------+---------------+ -| Subsystem Device ID | Virtio Device | Specification | -+----------------------+--------------------+---------------+ -+----------------------+--------------------+---------------+ -| 1 | network card | Appendix C | -+----------------------+--------------------+---------------+ -| 2 | block device | Appendix D | -+----------------------+--------------------+---------------+ -| 3 | console | Appendix E | -+----------------------+--------------------+---------------+ -| 4 | entropy source | Appendix F | -+----------------------+--------------------+---------------+ -| 5 | memory ballooning | Appendix G | -+----------------------+--------------------+---------------+ -| 6 | ioMemory | - | -+----------------------+--------------------+---------------+ -| 7 | rpmsg | Appendix H | -+----------------------+--------------------+---------------+ -| 8 | SCSI host | Appendix I | -+----------------------+--------------------+---------------+ -| 9 | 9P transport | - | -+----------------------+--------------------+---------------+ -| 10 | mac80211 wlan | - | -+----------------------+--------------------+---------------+ - - - Device Configuration - -To configure the device, we use the first I/O region of the PCI -device. This contains a virtio header followed by a -device-specific region. - -There may be different widths of accesses to the I/O region; the “ -natural” access method for each field in the virtio header must -be used (i.e. 32-bit accesses for 32-bit fields, etc), but the -device-specific region can be accessed using any width accesses, -and should obtain the same results. - -Note that this is possible because while the virtio header is PCI -(i.e. little) endian, the device-specific region is encoded in -the native endian of the guest (where such distinction is -applicable). - - Device Initialization Sequence<sub:Device-Initialization-Sequence> - -We start with an overview of device initialization, then expand -on the details of the device and how each step is preformed. - - Reset the device. This is not required on initial start up. - - The ACKNOWLEDGE status bit is set: we have noticed the device. - - The DRIVER status bit is set: we know how to drive the device. - - Device-specific setup, including reading the Device Feature - Bits, discovery of virtqueues for the device, optional MSI-X - setup, and reading and possibly writing the virtio - configuration space. - - The subset of Device Feature Bits understood by the driver is - written to the device. - - The DRIVER_OK status bit is set. - - The device can now be used (ie. buffers added to the - virtqueues)[footnote: -Historically, drivers have used the device before steps 5 and 6. -This is only allowed if the driver does not use any features -which would alter this early use of the device. -] - -If any of these steps go irrecoverably wrong, the guest should -set the FAILED status bit to indicate that it has given up on the -device (it can reset the device later to restart if desired). - -We now cover the fields required for general setup in detail. - - Virtio Header - -The virtio header looks as follows: - - -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -| Bits || 32 | 32 | 32 | 16 | 16 | 16 | 8 | 8 | -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -| Read/Write || R | R+W | R+W | R | R+W | R+W | R+W | R | -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -| Purpose || Device | Guest | Queue | Queue | Queue | Queue | Device | ISR | -| || Features bits 0:31 | Features bits 0:31 | Address | Size | Select | Notify | Status | Status | -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ - - -If MSI-X is enabled for the device, two additional fields -immediately follow this header:[footnote: -ie. once you enable MSI-X on the device, the other fields move. -If you turn it off again, they move back! -] - - -+------------++----------------+--------+ -| Bits || 16 | 16 | - +----------------+--------+ -+------------++----------------+--------+ -| Read/Write || R+W | R+W | -+------------++----------------+--------+ -| Purpose || Configuration | Queue | -| (MSI-X) || Vector | Vector | -+------------++----------------+--------+ - - -Immediately following these general headers, there may be -device-specific headers: - - -+------------++--------------------+ -| Bits || Device Specific | - +--------------------+ -+------------++--------------------+ -| Read/Write || Device Specific | -+------------++--------------------+ -| Purpose || Device Specific... | -| || | -+------------++--------------------+ - - - Device Status - -The Device Status field is updated by the guest to indicate its -progress. This provides a simple low-level diagnostic: it's most -useful to imagine them hooked up to traffic lights on the console -indicating the status of each device. - -The device can be reset by writing a 0 to this field, otherwise -at least one bit should be set: - - ACKNOWLEDGE (1) Indicates that the guest OS has found the - device and recognized it as a valid virtio device. - - DRIVER (2) Indicates that the guest OS knows how to drive the - device. Under Linux, drivers can be loadable modules so there - may be a significant (or infinite) delay before setting this - bit. - - DRIVER_OK (4) Indicates that the driver is set up and ready to - drive the device. - - FAILED (128) Indicates that something went wrong in the guest, - and it has given up on the device. This could be an internal - error, or the driver didn't like the device for some reason, or - even a fatal error during device operation. The device must be - reset before attempting to re-initialize. - - Feature Bits<sub:Feature-Bits> - -Thefirst configuration field indicates the features that the -device supports. The bits are allocated as follows: - - 0 to 23 Feature bits for the specific device type - - 24 to 32 Feature bits reserved for extensions to the queue and - feature negotiation mechanisms - -For example, feature bit 0 for a network device (i.e. Subsystem -Device ID 1) indicates that the device supports checksumming of -packets. - -The feature bits are negotiated: the device lists all the -features it understands in the Device Features field, and the -guest writes the subset that it understands into the Guest -Features field. The only way to renegotiate is to reset the -device. - -In particular, new fields in the device configuration header are -indicated by offering a feature bit, so the guest can check -before accessing that part of the configuration space. - -This allows for forwards and backwards compatibility: if the -device is enhanced with a new feature bit, older guests will not -write that feature bit back to the Guest Features field and it -can go into backwards compatibility mode. Similarly, if a guest -is enhanced with a feature that the device doesn't support, it -will not see that feature bit in the Device Features field and -can go into backwards compatibility mode (or, for poor -implementations, set the FAILED Device Status bit). - - Configuration/Queue Vectors - -When MSI-X capability is present and enabled in the device -(through standard PCI configuration space) 4 bytes at byte offset -20 are used to map configuration change and queue interrupts to -MSI-X vectors. In this case, the ISR Status field is unused, and -device specific configuration starts at byte offset 24 in virtio -header structure. When MSI-X capability is not enabled, device -specific configuration starts at byte offset 20 in virtio header. - -Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of -Configuration/Queue Vector registers, maps interrupts triggered -by the configuration change/selected queue events respectively to -the corresponding MSI-X vector. To disable interrupts for a -specific event type, unmap it by writing a special NO_VECTOR -value: - -/* Vector value used to disable MSI for queue */ - -#define VIRTIO_MSI_NO_VECTOR 0xffff - -Reading these registers returns vector mapped to a given event, -or NO_VECTOR if unmapped. All queue and configuration change -events are unmapped by default. - -Note that mapping an event to vector might require allocating -internal device resources, and might fail. Devices report such -failures by returning the NO_VECTOR value when the relevant -Vector field is read. After mapping an event to vector, the -driver must verify success by reading the Vector field value: on -success, the previously written value is returned, and on -failure, NO_VECTOR is returned. If a mapping failure is detected, -the driver can retry mapping with fewervectors, or disable MSI-X. - - Virtqueue Configuration<sec:Virtqueue-Configuration> - -As a device can have zero or more virtqueues for bulk data -transport (for example, the network driver has two), the driver -needs to configure them as part of the device-specific -configuration. - -This is done as follows, for each virtqueue a device has: - - Write the virtqueue index (first queue is 0) to the Queue - Select field. - - Read the virtqueue size from the Queue Size field, which is - always a power of 2. This controls how big the virtqueue is - (see below). If this field is 0, the virtqueue does not exist. - - Allocate and zero virtqueue in contiguous physical memory, on a - 4096 byte alignment. Write the physical address, divided by - 4096 to the Queue Address field.[footnote: -The 4096 is based on the x86 page size, but it's also large -enough to ensure that the separate parts of the virtqueue are on -separate cache lines. -] - - Optionally, if MSI-X capability is present and enabled on the - device, select a vector to use to request interrupts triggered - by virtqueue events. Write the MSI-X Table entry number - corresponding to this vector in Queue Vector field. Read the - Queue Vector field: on success, previously written value is - returned; on failure, NO_VECTOR value is returned. - -The Queue Size field controls the total number of bytes required -for the virtqueue according to the following formula: - -#define ALIGN(x) (((x) + 4095) & ~4095) - -static inline unsigned vring_size(unsigned int qsz) - -{ - - return ALIGN(sizeof(struct vring_desc)*qsz + sizeof(u16)*(2 -+ qsz)) - - + ALIGN(sizeof(struct vring_used_elem)*qsz); - -} - -This currently wastes some space with padding, but also allows -future extensions. The virtqueue layout structure looks like this -(qsz is the Queue Size field, which is a variable, so this code -won't compile): - -struct vring { - - /* The actual descriptors (16 bytes each) */ - - struct vring_desc desc[qsz]; - - - - /* A ring of available descriptor heads with free-running -index. */ - - struct vring_avail avail; - - - - // Padding to the next 4096 boundary. - - char pad[]; - - - - // A ring of used descriptor heads with free-running index. - - struct vring_used used; - -}; - - A Note on Virtqueue Endianness - -Note that the endian of these fields and everything else in the -virtqueue is the native endian of the guest, not little-endian as -PCI normally is. This makes for simpler guest code, and it is -assumed that the host already has to be deeply aware of the guest -endian so such an “endian-aware” device is not a significant -issue. - - Descriptor Table - -The descriptor table refers to the buffers the guest is using for -the device. The addresses are physical addresses, and the buffers -can be chained via the next field. Each descriptor describes a -buffer which is read-only or write-only, but a chain of -descriptors can contain both read-only and write-only buffers. - -No descriptor chain may be more than 2^32 bytes long in total.struct vring_desc { - - /* Address (guest-physical). */ - - u64 addr; - - /* Length. */ - - u32 len; - -/* This marks a buffer as continuing via the next field. */ - -#define VRING_DESC_F_NEXT 1 - -/* This marks a buffer as write-only (otherwise read-only). */ - -#define VRING_DESC_F_WRITE 2 - -/* This means the buffer contains a list of buffer descriptors. -*/ - -#define VRING_DESC_F_INDIRECT 4 - - /* The flags as indicated above. */ - - u16 flags; - - /* Next field if flags & NEXT */ - - u16 next; - -}; - -The number of descriptors in the table is specified by the Queue -Size field for this virtqueue. - - <sub:Indirect-Descriptors>Indirect Descriptors - -Some devices benefit by concurrently dispatching a large number -of large requests. The VIRTIO_RING_F_INDIRECT_DESC feature can be -used to allow this (see [cha:Reserved-Feature-Bits]). To increase -ring capacity it is possible to store a table of indirect -descriptors anywhere in memory, and insert a descriptor in main -virtqueue (with flags&INDIRECT on) that refers to memory buffer -containing this indirect descriptor table; fields addr and len -refer to the indirect table address and length in bytes, -respectively. The indirect table layout structure looks like this -(len is the length of the descriptor that refers to this table, -which is a variable, so this code won't compile): - -struct indirect_descriptor_table { - - /* The actual descriptors (16 bytes each) */ - - struct vring_desc desc[len / 16]; - -}; - -The first indirect descriptor is located at start of the indirect -descriptor table (index 0), additional indirect descriptors are -chained by next field. An indirect descriptor without next field -(with flags&NEXT off) signals the end of the indirect descriptor -table, and transfers control back to the main virtqueue. An -indirect descriptor can not refer to another indirect descriptor -table (flags&INDIRECT must be off). A single indirect descriptor -table can include both read-only and write-only descriptors; -write-only flag (flags&WRITE) in the descriptor that refers to it -is ignored. - - Available Ring - -The available ring refers to what descriptors we are offering the -device: it refers to the head of a descriptor chain. The “flags” -field is currently 0 or 1: 1 indicating that we do not need an -interrupt when the device consumes a descriptor from the -available ring. Alternatively, the guest can ask the device to -delay interrupts until an entry with an index specified by the “ -used_event” field is written in the used ring (equivalently, -until the idx field in the used ring will reach the value -used_event + 1). The method employed by the device is controlled -by the VIRTIO_RING_F_EVENT_IDX feature bit (see [cha:Reserved-Feature-Bits] -). This interrupt suppression is merely an optimization; it may -not suppress interrupts entirely. - -The “idx” field indicates where we would put the next descriptor -entry (modulo the ring size). This starts at 0, and increases. - -struct vring_avail { - -#define VRING_AVAIL_F_NO_INTERRUPT 1 - - u16 flags; - - u16 idx; - - u16 ring[qsz]; /* qsz is the Queue Size field read from device -*/ - - u16 used_event; - -}; - - Used Ring - -The used ring is where the device returns buffers once it is done -with them. The flags field can be used by the device to hint that -no notification is necessary when the guest adds to the available -ring. Alternatively, the “avail_event” field can be used by the -device to hint that no notification is necessary until an entry -with an index specified by the “avail_event” is written in the -available ring (equivalently, until the idx field in the -available ring will reach the value avail_event + 1). The method -employed by the device is controlled by the guest through the -VIRTIO_RING_F_EVENT_IDX feature bit (see [cha:Reserved-Feature-Bits] -). [footnote: -These fields are kept here because this is the only part of the -virtqueue written by the device -]. - -Each entry in the ring is a pair: the head entry of the -descriptor chain describing the buffer (this matches an entry -placed in the available ring by the guest earlier), and the total -of bytes written into the buffer. The latter is extremely useful -for guests using untrusted buffers: if you do not know exactly -how much has been written by the device, you usually have to zero -the buffer to ensure no data leakage occurs. - -/* u32 is used here for ids for padding reasons. */ - -struct vring_used_elem { - - /* Index of start of used descriptor chain. */ - - u32 id; - - /* Total length of the descriptor chain which was used -(written to) */ - - u32 len; - -}; - - - -struct vring_used { - -#define VRING_USED_F_NO_NOTIFY 1 - - u16 flags; - - u16 idx; - - struct vring_used_elem ring[qsz]; - - u16 avail_event; - -}; - - Helpers for Managing Virtqueues - -The Linux Kernel Source code contains the definitions above and -helper routines in a more usable form, in -include/linux/virtio_ring.h. This was explicitly licensed by IBM -and Red Hat under the (3-clause) BSD license so that it can be -freely used by all other projects, and is reproduced (with slight -variation to remove Linux assumptions) in Appendix A. - - Device Operation<sec:Device-Operation> - -There are two parts to device operation: supplying new buffers to -the device, and processing used buffers from the device. As an -example, the virtio network device has two virtqueues: the -transmit virtqueue and the receive virtqueue. The driver adds -outgoing (read-only) packets to the transmit virtqueue, and then -frees them after they are used. Similarly, incoming (write-only) -buffers are added to the receive virtqueue, and processed after -they are used. - - Supplying Buffers to The Device - -Actual transfer of buffers from the guest OS to the device -operates as follows: - - Place the buffer(s) into free descriptor(s). - - If there are no free descriptors, the guest may choose to - notify the device even if notifications are suppressed (to - reduce latency).[footnote: -The Linux drivers do this only for read-only buffers: for -write-only buffers, it is assumed that the driver is merely -trying to keep the receive buffer ring full, and no notification -of this expected condition is necessary. -] - - Place the id of the buffer in the next ring entry of the - available ring. - - The steps (1) and (2) may be performed repeatedly if batching - is possible. - - A memory barrier should be executed to ensure the device sees - the updated descriptor table and available ring before the next - step. - - The available “idx” field should be increased by the number of - entries added to the available ring. - - A memory barrier should be executed to ensure that we update - the idx field before checking for notification suppression. - - If notifications are not suppressed, the device should be - notified of the new buffers. - -Note that the above code does not take precautions against the -available ring buffer wrapping around: this is not possible since -the ring buffer is the same size as the descriptor table, so step -(1) will prevent such a condition. - -In addition, the maximum queue size is 32768 (it must be a power -of 2 which fits in 16 bits), so the 16-bit “idx” value can always -distinguish between a full and empty buffer. - -Here is a description of each stage in more detail. - - Placing Buffers Into The Descriptor Table - -A buffer consists of zero or more read-only physically-contiguous -elements followed by zero or more physically-contiguous -write-only elements (it must have at least one element). This -algorithm maps it into the descriptor table: - - for each buffer element, b: - - Get the next free descriptor table entry, d - - Set d.addr to the physical address of the start of b - - Set d.len to the length of b. - - If b is write-only, set d.flags to VRING_DESC_F_WRITE, - otherwise 0. - - If there is a buffer element after this: - - Set d.next to the index of the next free descriptor element. - - Set the VRING_DESC_F_NEXT bit in d.flags. - -In practice, the d.next fields are usually used to chain free -descriptors, and a separate count kept to check there are enough -free descriptors before beginning the mappings. - - Updating The Available Ring - -The head of the buffer we mapped is the first d in the algorithm -above. A naive implementation would do the following: - -avail->ring[avail->idx % qsz] = head; - -However, in general we can add many descriptors before we update -the “idx” field (at which point they become visible to the -device), so we keep a counter of how many we've added: - -avail->ring[(avail->idx + added++) % qsz] = head; - - Updating The Index Field - -Once the idx field of the virtqueue is updated, the device will -be able to access the descriptor entries we've created and the -memory they refer to. This is why a memory barrier is generally -used before the idx update, to ensure it sees the most up-to-date -copy. - -The idx field always increments, and we let it wrap naturally at -65536: - -avail->idx += added; - - <sub:Notifying-The-Device>Notifying The Device - -Device notification occurs by writing the 16-bit virtqueue index -of this virtqueue to the Queue Notify field of the virtio header -in the first I/O region of the PCI device. This can be expensive, -however, so the device can suppress such notifications if it -doesn't need them. We have to be careful to expose the new idx -value before checking the suppression flag: it's OK to notify -gratuitously, but not to omit a required notification. So again, -we use a memory barrier here before reading the flags or the -avail_event field. - -If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated, and if -the VRING_USED_F_NOTIFY flag is not set, we go ahead and write to -the PCI configuration space. - -If the VIRTIO_F_RING_EVENT_IDX feature is negotiated, we read the -avail_event field in the available ring structure. If the -available index crossed_the avail_event field value since the -last notification, we go ahead and write to the PCI configuration -space. The avail_event field wraps naturally at 65536 as well: - -(u16)(new_idx - avail_event - 1) < (u16)(new_idx - old_idx) - - <sub:Receiving-Used-Buffers>Receiving Used Buffers From The - Device - -Once the device has used a buffer (read from or written to it, or -parts of both, depending on the nature of the virtqueue and the -device), it sends an interrupt, following an algorithm very -similar to the algorithm used for the driver to send the device a -buffer: - - Write the head descriptor number to the next field in the used - ring. - - Update the used ring idx. - - Determine whether an interrupt is necessary: - - If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated: check - if f the VRING_AVAIL_F_NO_INTERRUPT flag is not set in avail- - >flags - - If the VIRTIO_F_RING_EVENT_IDX feature is negotiated: check - whether the used index crossed the used_event field value - since the last update. The used_event field wraps naturally - at 65536 as well:(u16)(new_idx - used_event - 1) < (u16)(new_idx - old_idx) - - If an interrupt is necessary: - - If MSI-X capability is disabled: - - Set the lower bit of the ISR Status field for the device. - - Send the appropriate PCI interrupt for the device. - - If MSI-X capability is enabled: - - Request the appropriate MSI-X interrupt message for the - device, Queue Vector field sets the MSI-X Table entry - number. - - If Queue Vector field value is NO_VECTOR, no interrupt - message is requested for this event. - -The guest interrupt handler should: - - If MSI-X capability is disabled: read the ISR Status field, - which will reset it to zero. If the lower bit is zero, the - interrupt was not for this device. Otherwise, the guest driver - should look through the used rings of each virtqueue for the - device, to see if any progress has been made by the device - which requires servicing. - - If MSI-X capability is enabled: look through the used rings of - each virtqueue mapped to the specific MSI-X vector for the - device, to see if any progress has been made by the device - which requires servicing. - -For each ring, guest should then disable interrupts by writing -VRING_AVAIL_F_NO_INTERRUPT flag in avail structure, if required. -It can then process used ring entries finally enabling interrupts -by clearing the VRING_AVAIL_F_NO_INTERRUPT flag or updating the -EVENT_IDX field in the available structure, Guest should then -execute a memory barrier, and then recheck the ring empty -condition. This is necessary to handle the case where, after the -last check and before enabling interrupts, an interrupt has been -suppressed by the device: - -vring_disable_interrupts(vq); - -for (;;) { - - if (vq->last_seen_used != vring->used.idx) { - - vring_enable_interrupts(vq); - - mb(); - - if (vq->last_seen_used != vring->used.idx) - - break; - - } - - struct vring_used_elem *e = -vring.used->ring[vq->last_seen_used%vsz]; - - process_buffer(e); - - vq->last_seen_used++; - -} - - Dealing With Configuration Changes<sub:Dealing-With-Configuration> - -Some virtio PCI devices can change the device configuration -state, as reflected in the virtio header in the PCI configuration -space. In this case: - - If MSI-X capability is disabled: an interrupt is delivered and - the second highest bit is set in the ISR Status field to - indicate that the driver should re-examine the configuration - space.Note that a single interrupt can indicate both that one - or more virtqueue has been used and that the configuration - space has changed: even if the config bit is set, virtqueues - must be scanned. - - If MSI-X capability is enabled: an interrupt message is - requested. The Configuration Vector field sets the MSI-X Table - entry number to use. If Configuration Vector field value is - NO_VECTOR, no interrupt message is requested for this event. - -Creating New Device Types - -Various considerations are necessary when creating a new device -type: - - How Many Virtqueues? - -It is possible that a very simple device will operate entirely -through its configuration space, but most will need at least one -virtqueue in which it will place requests. A device with both -input and output (eg. console and network devices described here) -need two queues: one which the driver fills with buffers to -receive input, and one which the driver places buffers to -transmit output. - - What Configuration Space Layout? - -Configuration space is generally used for rarely-changing or -initialization-time parameters. But it is a limited resource, so -it might be better to use a virtqueue to update configuration -information (the network device does this for filtering, -otherwise the table in the config space could potentially be very -large). - -Note that this space is generally the guest's native endian, -rather than PCI's little-endian. - - What Device Number? - -Currently device numbers are assigned quite freely: a simple -request mail to the author of this document or the Linux -virtualization mailing list[footnote: - -https://lists.linux-foundation.org/mailman/listinfo/virtualization -] will be sufficient to secure a unique one. - -Meanwhile for experimental drivers, use 65535 and work backwards. - - How many MSI-X vectors? - -Using the optional MSI-X capability devices can speed up -interrupt processing by removing the need to read ISR Status -register by guest driver (which might be an expensive operation), -reducing interrupt sharing between devices and queues within the -device, and handling interrupts from multiple CPUs. However, some -systems impose a limit (which might be as low as 256) on the -total number of MSI-X vectors that can be allocated to all -devices. Devices and/or device drivers should take this into -account, limiting the number of vectors used unless the device is -expected to cause a high volume of interrupts. Devices can -control the number of vectors used by limiting the MSI-X Table -Size or not presenting MSI-X capability in PCI configuration -space. Drivers can control this by mapping events to as small -number of vectors as possible, or disabling MSI-X capability -altogether. - - Message Framing - -The descriptors used for a buffer should not effect the semantics -of the message, except for the total length of the buffer. For -example, a network buffer consists of a 10 byte header followed -by the network packet. Whether this is presented in the ring -descriptor chain as (say) a 10 byte buffer and a 1514 byte -buffer, or a single 1524 byte buffer, or even three buffers, -should have no effect. - -In particular, no implementation should use the descriptor -boundaries to determine the size of any header in a request.[footnote: -The current qemu device implementations mistakenly insist that -the first descriptor cover the header in these cases exactly, so -a cautious driver should arrange it so. -] - - Device Improvements - -Any change to configuration space, or new virtqueues, or -behavioural changes, should be indicated by negotiation of a new -feature bit. This establishes clarity[footnote: -Even if it does mean documenting design or implementation -mistakes! -] and avoids future expansion problems. - -Clusters of functionality which are always implemented together -can use a single bit, but if one feature makes sense without the -others they should not be gratuitously grouped together to -conserve feature bits. We can always extend the spec when the -first person needs more than 24 feature bits for their device. - -[LaTeX Command: printnomenclature] - -Appendix A: virtio_ring.h - -#ifndef VIRTIO_RING_H - -#define VIRTIO_RING_H - -/* An interface for efficient virtio implementation. - - * - - * This header is BSD licensed so anyone can use the definitions - - * to implement compatible drivers/servers. - - * - - * Copyright 2007, 2009, IBM Corporation - - * Copyright 2011, Red Hat, Inc - - * All rights reserved. - - * - - * Redistribution and use in source and binary forms, with or -without - - * modification, are permitted provided that the following -conditions - - * are met: |