diff options
Diffstat (limited to 'Documentation/block')
-rw-r--r-- | Documentation/block/as-iosched.txt | 165 | ||||
-rw-r--r-- | Documentation/block/biodoc.txt | 1213 | ||||
-rw-r--r-- | Documentation/block/deadline-iosched.txt | 78 | ||||
-rw-r--r-- | Documentation/block/request.txt | 88 |
4 files changed, 1544 insertions, 0 deletions
diff --git a/Documentation/block/as-iosched.txt b/Documentation/block/as-iosched.txt new file mode 100644 index 00000000000..6f47332c883 --- /dev/null +++ b/Documentation/block/as-iosched.txt @@ -0,0 +1,165 @@ +Anticipatory IO scheduler +------------------------- +Nick Piggin <piggin@cyberone.com.au> 13 Sep 2003 + +Attention! Database servers, especially those using "TCQ" disks should +investigate performance with the 'deadline' IO scheduler. Any system with high +disk performance requirements should do so, in fact. + +If you see unusual performance characteristics of your disk systems, or you +see big performance regressions versus the deadline scheduler, please email +me. Database users don't bother unless you're willing to test a lot of patches +from me ;) its a known issue. + +Also, users with hardware RAID controllers, doing striping, may find +highly variable performance results with using the as-iosched. The +as-iosched anticipatory implementation is based on the notion that a disk +device has only one physical seeking head. A striped RAID controller +actually has a head for each physical device in the logical RAID device. + +However, setting the antic_expire (see tunable parameters below) produces +very similar behavior to the deadline IO scheduler. + + +Selecting IO schedulers +----------------------- +To choose IO schedulers at boot time, use the argument 'elevator=deadline'. +'noop' and 'as' (the default) are also available. IO schedulers are assigned +globally at boot time only presently. + + +Anticipatory IO scheduler Policies +---------------------------------- +The as-iosched implementation implements several layers of policies +to determine when an IO request is dispatched to the disk controller. +Here are the policies outlined, in order of application. + +1. one-way Elevator algorithm. + +The elevator algorithm is similar to that used in deadline scheduler, with +the addition that it allows limited backward movement of the elevator +(i.e. seeks backwards). A seek backwards can occur when choosing between +two IO requests where one is behind the elevator's current position, and +the other is in front of the elevator's position. If the seek distance to +the request in back of the elevator is less than half the seek distance to +the request in front of the elevator, then the request in back can be chosen. +Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors. +This favors forward movement of the elevator, while allowing opportunistic +"short" backward seeks. + +2. FIFO expiration times for reads and for writes. + +This is again very similar to the deadline IO scheduler. The expiration +times for requests on these lists is tunable using the parameters read_expire +and write_expire discussed below. When a read or a write expires in this way, +the IO scheduler will interrupt its current elevator sweep or read anticipation +to service the expired request. + +3. Read and write request batching + +A batch is a collection of read requests or a collection of write +requests. The as scheduler alternates dispatching read and write batches +to the driver. In the case a read batch, the scheduler submits read +requests to the driver as long as there are read requests to submit, and +the read batch time limit has not been exceeded (read_batch_expire). +The read batch time limit begins counting down only when there are +competing write requests pending. + +In the case of a write batch, the scheduler submits write requests to +the driver as long as there are write requests available, and the +write batch time limit has not been exceeded (write_batch_expire). +However, the length of write batches will be gradually shortened +when read batches frequently exceed their time limit. + +When changing between batch types, the scheduler waits for all requests +from the previous batch to complete before scheduling requests for the +next batch. + +The read and write fifo expiration times described in policy 2 above +are checked only when in scheduling IO of a batch for the corresponding +(read/write) type. So for example, the read FIFO timeout values are +tested only during read batches. Likewise, the write FIFO timeout +values are tested only during write batches. For this reason, +it is generally not recommended for the read batch time +to be longer than the write expiration time, nor for the write batch +time to exceed the read expiration time (see tunable parameters below). + +When the IO scheduler changes from a read to a write batch, +it begins the elevator from the request that is on the head of the +write expiration FIFO. Likewise, when changing from a write batch to +a read batch, scheduler begins the elevator from the first entry +on the read expiration FIFO. + +4. Read anticipation. + +Read anticipation occurs only when scheduling a read batch. +This implementation of read anticipation allows only one read request +to be dispatched to the disk controller at a time. In +contrast, many write requests may be dispatched to the disk controller +at a time during a write batch. It is this characteristic that can make +the anticipatory scheduler perform anomalously with controllers supporting +TCQ, or with hardware striped RAID devices. Setting the antic_expire +queue paramter (see below) to zero disables this behavior, and the anticipatory +scheduler behaves essentially like the deadline scheduler. + +When read anticipation is enabled (antic_expire is not zero), reads +are dispatched to the disk controller one at a time. +At the end of each read request, the IO scheduler examines its next +candidate read request from its sorted read list. If that next request +is from the same process as the request that just completed, +or if the next request in the queue is "very close" to the +just completed request, it is dispatched immediately. Otherwise, +statistics (average think time, average seek distance) on the process +that submitted the just completed request are examined. If it seems +likely that that process will submit another request soon, and that +request is likely to be near the just completed request, then the IO +scheduler will stop dispatching more read requests for up time (antic_expire) +milliseconds, hoping that process will submit a new request near the one +that just completed. If such a request is made, then it is dispatched +immediately. If the antic_expire wait time expires, then the IO scheduler +will dispatch the next read request from the sorted read queue. + +To decide whether an anticipatory wait is worthwhile, the scheduler +maintains statistics for each process that can be used to compute +mean "think time" (the time between read requests), and mean seek +distance for that process. One observation is that these statistics +are associated with each process, but those statistics are not associated +with a specific IO device. So for example, if a process is doing IO +on several file systems on separate devices, the statistics will be +a combination of IO behavior from all those devices. + + +Tuning the anticipatory IO scheduler +------------------------------------ +When using 'as', the anticipatory IO scheduler there are 5 parameters under +/sys/block/*/queue/iosched/. All are units of milliseconds. + +The parameters are: +* read_expire + Controls how long until a read request becomes "expired". It also controls the + interval between which expired requests are served, so set to 50, a request + might take anywhere < 100ms to be serviced _if_ it is the next on the + expired list. Obviously request expiration strategies won't make the disk + go faster. The result basically equates to the timeslice a single reader + gets in the presence of other IO. 100*((seek time / read_expire) + 1) is + very roughly the % streaming read efficiency your disk should get with + multiple readers. + +* read_batch_expire + Controls how much time a batch of reads is given before pending writes are + served. A higher value is more efficient. This might be set below read_expire + if writes are to be given higher priority than reads, but reads are to be + as efficient as possible when there are no writes. Generally though, it + should be some multiple of read_expire. + +* write_expire, and +* write_batch_expire are equivalent to the above, for writes. + +* antic_expire + Controls the maximum amount of time we can anticipate a good read (one + with a short seek distance from the most recently completed request) before + giving up. Many other factors may cause anticipation to be stopped early, + or some processes will not be "anticipated" at all. Should be a bit higher + for big seek time devices though not a linear correspondence - most + processes have only a few ms thinktime. + diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt new file mode 100644 index 00000000000..6dd274d7e1c --- /dev/null +++ b/Documentation/block/biodoc.txt @@ -0,0 +1,1213 @@ + Notes on the Generic Block Layer Rewrite in Linux 2.5 + ===================================================== + +Notes Written on Jan 15, 2002: + Jens Axboe <axboe@suse.de> + Suparna Bhattacharya <suparna@in.ibm.com> + +Last Updated May 2, 2002 +September 2003: Updated I/O Scheduler portions + Nick Piggin <piggin@cyberone.com.au> + +Introduction: + +These are some notes describing some aspects of the 2.5 block layer in the +context of the bio rewrite. The idea is to bring out some of the key +changes and a glimpse of the rationale behind those changes. + +Please mail corrections & suggestions to suparna@in.ibm.com. + +Credits: +--------- + +2.5 bio rewrite: + Jens Axboe <axboe@suse.de> + +Many aspects of the generic block layer redesign were driven by and evolved +over discussions, prior patches and the collective experience of several +people. See sections 8 and 9 for a list of some related references. + +The following people helped with review comments and inputs for this +document: + Christoph Hellwig <hch@infradead.org> + Arjan van de Ven <arjanv@redhat.com> + Randy Dunlap <rddunlap@osdl.org> + Andre Hedrick <andre@linux-ide.org> + +The following people helped with fixes/contributions to the bio patches +while it was still work-in-progress: + David S. Miller <davem@redhat.com> + + +Description of Contents: +------------------------ + +1. Scope for tuning of logic to various needs + 1.1 Tuning based on device or low level driver capabilities + - Per-queue parameters + - Highmem I/O support + - I/O scheduler modularization + 1.2 Tuning based on high level requirements/capabilities + 1.2.1 I/O Barriers + 1.2.2 Request Priority/Latency + 1.3 Direct access/bypass to lower layers for diagnostics and special + device operations + 1.3.1 Pre-built commands +2. New flexible and generic but minimalist i/o structure or descriptor + (instead of using buffer heads at the i/o layer) + 2.1 Requirements/Goals addressed + 2.2 The bio struct in detail (multi-page io unit) + 2.3 Changes in the request structure +3. Using bios + 3.1 Setup/teardown (allocation, splitting) + 3.2 Generic bio helper routines + 3.2.1 Traversing segments and completion units in a request + 3.2.2 Setting up DMA scatterlists + 3.2.3 I/O completion + 3.2.4 Implications for drivers that do not interpret bios (don't handle + multiple segments) + 3.2.5 Request command tagging + 3.3 I/O submission +4. The I/O scheduler +5. Scalability related changes + 5.1 Granular locking: Removal of io_request_lock + 5.2 Prepare for transition to 64 bit sector_t +6. Other Changes/Implications + 6.1 Partition re-mapping handled by the generic block layer +7. A few tips on migration of older drivers +8. A list of prior/related/impacted patches/ideas +9. Other References/Discussion Threads + +--------------------------------------------------------------------------- + +Bio Notes +-------- + +Let us discuss the changes in the context of how some overall goals for the +block layer are addressed. + +1. Scope for tuning the generic logic to satisfy various requirements + +The block layer design supports adaptable abstractions to handle common +processing with the ability to tune the logic to an appropriate extent +depending on the nature of the device and the requirements of the caller. +One of the objectives of the rewrite was to increase the degree of tunability +and to enable higher level code to utilize underlying device/driver +capabilities to the maximum extent for better i/o performance. This is +important especially in the light of ever improving hardware capabilities +and application/middleware software designed to take advantage of these +capabilities. + +1.1 Tuning based on low level device / driver capabilities + +Sophisticated devices with large built-in caches, intelligent i/o scheduling +optimizations, high memory DMA support, etc may find some of the +generic processing an overhead, while for less capable devices the +generic functionality is essential for performance or correctness reasons. +Knowledge of some of the capabilities or parameters of the device should be +used at the generic block layer to take the right decisions on +behalf of the driver. + +How is this achieved ? + +Tuning at a per-queue level: + +i. Per-queue limits/values exported to the generic layer by the driver + +Various parameters that the generic i/o scheduler logic uses are set at +a per-queue level (e.g maximum request size, maximum number of segments in +a scatter-gather list, hardsect size) + +Some parameters that were earlier available as global arrays indexed by +major/minor are now directly associated with the queue. Some of these may +move into the block device structure in the future. Some characteristics +have been incorporated into a queue flags field rather than separate fields +in themselves. There are blk_queue_xxx functions to set the parameters, +rather than update the fields directly + +Some new queue property settings: + + blk_queue_bounce_limit(q, u64 dma_address) + Enable I/O to highmem pages, dma_address being the + limit. No highmem default. + + blk_queue_max_sectors(q, max_sectors) + Maximum size request you can handle in units of 512 byte + sectors. 255 default. + + blk_queue_max_phys_segments(q, max_segments) + Maximum physical segments you can handle in a request. 128 + default (driver limit). (See 3.2.2) + + blk_queue_max_hw_segments(q, max_segments) + Maximum dma segments the hardware can handle in a request. 128 + default (host adapter limit, after dma remapping). + (See 3.2.2) + + blk_queue_max_segment_size(q, max_seg_size) + Maximum size of a clustered segment, 64kB default. + + blk_queue_hardsect_size(q, hardsect_size) + Lowest possible sector size that the hardware can operate + on, 512 bytes default. + +New queue flags: + + QUEUE_FLAG_CLUSTER (see 3.2.2) + QUEUE_FLAG_QUEUED (see 3.2.4) + + +ii. High-mem i/o capabilities are now considered the default + +The generic bounce buffer logic, present in 2.4, where the block layer would +by default copyin/out i/o requests on high-memory buffers to low-memory buffers +assuming that the driver wouldn't be able to handle it directly, has been +changed in 2.5. The bounce logic is now applied only for memory ranges +for which the device cannot handle i/o. A driver can specify this by +setting the queue bounce limit for the request queue for the device +(blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out +where a device is capable of handling high memory i/o. + +In order to enable high-memory i/o where the device is capable of supporting +it, the pci dma mapping routines and associated data structures have now been +modified to accomplish a direct page -> bus translation, without requiring +a virtual address mapping (unlike the earlier scheme of virtual address +-> bus translation). So this works uniformly for high-memory pages (which +do not have a correponding kernel virtual address space mapping) and +low-memory pages. + +Note: Please refer to DMA-mapping.txt for a discussion on PCI high mem DMA +aspects and mapping of scatter gather lists, and support for 64 bit PCI. + +Special handling is required only for cases where i/o needs to happen on +pages at physical memory addresses beyond what the device can support. In these +cases, a bounce bio representing a buffer from the supported memory range +is used for performing the i/o with copyin/copyout as needed depending on +the type of the operation. For example, in case of a read operation, the +data read has to be copied to the original buffer on i/o completion, so a +callback routine is set up to do this, while for write, the data is copied +from the original buffer to the bounce buffer prior to issuing the +operation. Since an original buffer may be in a high memory area that's not +mapped in kernel virtual addr, a kmap operation may be required for +performing the copy, and special care may be needed in the completion path +as it may not be in irq context. Special care is also required (by way of +GFP flags) when allocating bounce buffers, to avoid certain highmem +deadlock possibilities. + +It is also possible that a bounce buffer may be allocated from high-memory +area that's not mapped in kernel virtual addr, but within the range that the +device can use directly; so the bounce page may need to be kmapped during +copy operations. [Note: This does not hold in the current implementation, +though] + +There are some situations when pages from high memory may need to +be kmapped, even if bounce buffers are not necessary. For example a device +may need to abort DMA operations and revert to PIO for the transfer, in +which case a virtual mapping of the page is required. For SCSI it is also +done in some scenarios where the low level driver cannot be trusted to +handle a single sg entry correctly. The driver is expected to perform the +kmaps as needed on such occasions using the __bio_kmap_atomic and bio_kmap_irq +routines as appropriate. A driver could also use the blk_queue_bounce() +routine on its own to bounce highmem i/o to low memory for specific requests +if so desired. + +iii. The i/o scheduler algorithm itself can be replaced/set as appropriate + +As in 2.4, it is possible to plugin a brand new i/o scheduler for a particular +queue or pick from (copy) existing generic schedulers and replace/override +certain portions of it. The 2.5 rewrite provides improved modularization +of the i/o scheduler. There are more pluggable callbacks, e.g for init, +add request, extract request, which makes it possible to abstract specific +i/o scheduling algorithm aspects and details outside of the generic loop. +It also makes it possible to completely hide the implementation details of +the i/o scheduler from block drivers. + +I/O scheduler wrappers are to be used instead of accessing the queue directly. +See section 4. The I/O scheduler for details. + +1.2 Tuning Based on High level code capabilities + +i. Application capabilities for raw i/o + +This comes from some of the high-performance database/middleware +requirements where an application prefers to make its own i/o scheduling +decisions based on an understanding of the access patterns and i/o +characteristics + +ii. High performance filesystems or other higher level kernel code's +capabilities + +Kernel components like filesystems could also take their own i/o scheduling +decisions for optimizing performance. Journalling filesystems may need +some control over i/o ordering. + +What kind of support exists at the generic block layer for this ? + +The flags and rw fields in the bio structure can be used for some tuning +from above e.g indicating that an i/o is just a readahead request, or for +marking barrier requests (discussed next), or priority settings (currently +unused). As far as user applications are concerned they would need an +additional mechanism either via open flags or ioctls, or some other upper +level mechanism to communicate such settings to block. + +1.2.1 I/O Barriers + +There is a way to enforce strict ordering for i/os through barriers. +All requests before a barrier point must be serviced before the barrier +request and any other requests arriving after the barrier will not be +serviced until after the barrier has completed. This is useful for higher +level control on write ordering, e.g flushing a log of committed updates +to disk before the corresponding updates themselves. + +A flag in the bio structure, BIO_BARRIER is used to identify a barrier i/o. +The generic i/o scheduler would make sure that it places the barrier request and +all other requests coming after it after all the previous requests in the +queue. Barriers may be implemented in different ways depending on the +driver. A SCSI driver for example could make use of ordered tags to +preserve the necessary ordering with a lower impact on throughput. For IDE +this might be two sync cache flush: a pre and post flush when encountering +a barrier write. + +There is a provision for queues to indicate what kind of barriers they +can provide. This is as of yet unmerged, details will be added here once it +is in the kernel. + +1.2.2 Request Priority/Latency + +Todo/Under discussion: +Arjan's proposed request priority scheme allows higher levels some broad + control (high/med/low) over the priority of an i/o request vs other pending + requests in the queue. For example it allows reads for bringing in an + executable page on demand to be given a higher priority over pending write + requests which haven't aged too much on the queue. Potentially this priority + could even be exposed to applications in some manner, providing higher level + tunability. Time based aging avoids starvation of lower priority + requests. Some bits in the bi_rw flags field in the bio structure are + intended to be used for this priority information. + + +1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode) + (e.g Diagnostics, Systems Management) + +There are situations where high-level code needs to have direct access to +the low level device capabilities or requires the ability to issue commands +to the device bypassing some of the intermediate i/o layers. +These could, for example, be special control commands issued through ioctl +interfaces, or could be raw read/write commands that stress the drive's +capabilities for certain kinds of fitness tests. Having direct interfaces at +multiple levels without having to pass through upper layers makes +it possible to perform bottom up validation of the i/o path, layer by +layer, starting from the media. + +The normal i/o submission interfaces, e.g submit_bio, could be bypassed +for specially crafted requests which such ioctl or diagnostics +interfaces would typically use, and the elevator add_request routine +can instead be used to directly insert such requests in the queue or preferably +the blk_do_rq routine can be used to place the request on the queue and +wait for completion. Alternatively, sometimes the caller might just +invoke a lower level driver specific interface with the request as a +parameter. + +If the request is a means for passing on special information associated with +the command, then such information is associated with the request->special +field (rather than misuse the request->buffer field which is meant for the +request data buffer's virtual mapping). + +For passing request data, the caller must build up a bio descriptor +representing the concerned memory buffer if the underlying driver interprets +bio segments or uses the block layer end*request* functions for i/o +completion. Alternatively one could directly use the request->buffer field to +specify the virtual address of the buffer, if the driver expects buffer +addresses passed in this way and ignores bio entries for the request type +involved. In the latter case, the driver would modify and manage the +request->buffer, request->sector and request->nr_sectors or +request->current_nr_sectors fields itself rather than using the block layer +end_request or end_that_request_first completion interfaces. +(See 2.3 or Documentation/block/request.txt for a brief explanation of +the request structure fields) + +[TBD: end_that_request_last should be usable even in this case; +Perhaps an end_that_direct_request_first routine could be implemented to make +handling direct requests easier for such drivers; Also for drivers that +expect bios, a helper function could be provided for setting up a bio +corresponding to a data buffer] + +<JENS: I dont understand the above, why is end_that_request_first() not +usable? Or _last for that matter. I must be missing something> +<SUP: What I meant here was that if the request doesn't have a bio, then + end_that_request_first doesn't modify nr_sectors or current_nr_sectors, + and hence can't be used for advancing request state settings on the + completion of partial transfers. The driver has to modify these fields + directly by hand. + This is because end_that_request_first only iterates over the bio list, + and always returns 0 if there are none associated with the request. + _last works OK in this case, and is not a problem, as I mentioned earlier +> + +1.3.1 Pre-built Commands + +A request can be created with a pre-built custom command to be sent directly +to the device. The cmd block in the request structure has room for filling +in the command bytes. (i.e rq->cmd is now 16 bytes in size, and meant for +command pre-building, and the type of the request is now indicated +through rq->flags instead of via rq->cmd) + +The request structure flags can be set up to indicate the type of request +in such cases (REQ_PC: direct packet command passed to driver, REQ_BLOCK_PC: +packet command issued via blk_do_rq, REQ_SPECIAL: special request). + +It can help to pre-build device commands for requests in advance. +Drivers can now specify a request prepare function (q->prep_rq_fn) that the +block layer would invoke to pre-build device commands for a given request, +or perform other preparatory processing for the request. This is routine is +called by elv_next_request(), i.e. typically just before servicing a request. +(The prepare function would not be called for requests that have REQ_DONTPREP +enabled) + +Aside: + Pre-building could possibly even be done early, i.e before placing the + request on the queue, rather than construct the command on the fly in the + driver while servicing the request queue when it may affect latencies in + interrupt context or responsiveness in general. One way to add early + pre-building would be to do it whenever we fail to merge on a request. + Now REQ_NOMERGE is set in the request flags to skip this one in the future, + which means that it will not change before we feed it to the device. So + the pre-builder hook can be invoked there. + + +2. Flexible and generic but minimalist i/o structure/descriptor. + +2.1 Reason for a new structure and requirements addressed + +Prior to 2.5, buffer heads were used as the unit of i/o at the generic block +layer, and the low level request structure was associated with a chain of +buffer heads for a contiguous i/o request. This led to certain inefficiencies +when it came to large i/o requests and readv/writev style operations, as it +forced such requests to be broken up into small chunks before being passed +on to the generic block layer, only to be merged by the i/o scheduler +when the underlying device was capable of handling the i/o in one shot. +Also, using the buffer head as an i/o structure for i/os that didn't originate +from the buffer cache unecessarily added to the weight of the descriptors +which were generated for each such chunk. + +The following were some of the goals and expectations considered in the +redesign of the block i/o data structure in 2.5. + +i. Should be appropriate as a descriptor for both raw and buffered i/o - + avoid cache related fields which are irrelevant in the direct/page i/o path, + or filesystem block size alignment restrictions which may not be relevant + for raw i/o. +ii. Ability to represent high-memory buffers (which do not have a virtual + address mapping in kernel address space). +iii.Ability to represent large i/os w/o unecessarily breaking them up (i.e + greater than PAGE_SIZE chunks in one shot) +iv. At the same time, ability to retain independent identity of i/os from + different sources or i/o units requiring individual completion (e.g. for + latency reasons) +v. Ability to represent an i/o involving multiple physical memory segments + (including non-page aligned page fragments, as specified via readv/writev) + without unecessarily breaking it up, if the underlying device is capable of + handling it. +vi. Preferably should be based on a memory descriptor structure that can be + passed around different types of subsystems or layers, maybe even + networking, without duplication or extra copies of data/descriptor fields + themselves in the process +vii.Ability to handle the possibility of splits/merges as the structure passes + through layered drivers (lvm, md, evms), with minimal overhead. + +The solution was to define a new structure (bio) for the block layer, +instead of using the buffer head structure (bh) directly, the idea being +avoidance of some associated baggage and limitations. The bio structure +is uniformly used for all i/o at the block layer ; it forms a part of the +bh structure for buffered i/o, and in the case of raw/direct i/o kiobufs are +mapped to bio structures. + +2.2 The bio struct + +The bio structure uses a vector representation pointing to an array of tuples +of <page, offset, len> to describe the i/o buffer, and has various other +fields describing i/o parameters and state that needs to be maintained for +performing the i/o. + +Notice that this representation means that a bio has no virtual address +mapping at all (unlike buffer heads). + +struct bio_vec { + struct page *bv_page; + unsigned short bv_len; + unsigned short bv_offset; +}; + +/* + * main unit of I/O for the block layer and lower layers (ie drivers) + */ +struct bio { + sector_t bi_sector; + struct bio *bi_next; /* request queue link */ + struct block_device *bi_bdev; /* target device */ + unsigned long bi_flags; /* status, command, etc */ + unsigned long bi_rw; /* low bits: r/w, high: priority */ + + unsigned int bi_vcnt; /* how may bio_vec's */ + unsigned int bi_idx; /* current index into bio_vec array */ + + unsigned int bi_size; /* total size in bytes */ + unsigned short bi_phys_segments; /* segments after physaddr coalesce*/ + unsigned short bi_hw_segments; /* segments after DMA remapping */ + unsigned int bi_max; /* max bio_vecs we can hold + used as index into pool */ + struct bio_vec *bi_io_vec; /* the actual vec list */ + bio_end_io_t *bi_end_io; /* bi_end_io (bio) */ + atomic_t bi_cnt; /* pin count: free when it hits zero */ + void *bi_private; + bio_destructor_t *bi_destructor; /* bi_destructor (bio) */ +}; + +With this multipage bio design: + +- Large i/os can be sent down in one go using a bio_vec list consisting + of an array of <page, offset, len> fragments (similar to the way fragments + are represented in the zero-copy network code) +- Splitting of an i/o request across multiple devices (as in the case of + lvm or raid) is achieved by cloning the bio (where the clone points to + the same bi_io_vec array, but with the index and size accordingly modified) +- A linked list of bios is used as before for unrelated merges (*) - this + avoids reallocs and makes independent completions easier to handle. +- Code that traverses the req list needs to make a distinction between + segments of a request (bio_for_each_segment) and the distinct completion + units/bios (rq_for_each_bio). +- Drivers which can't process a large bio in one shot can use the bi_idx + field to keep track of the next bio_vec entry to process. + (e.g a 1MB bio_vec needs to be handled in max 128kB chunks for IDE) + [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying + bi_offset an len fields] + +(*) unrelated merges -- a request ends up containing two or more bios that + didn't originate from the same place. + +bi_end_io() i/o callback gets called on i/o completion of the entire bio. + +At a lower level, drivers build a scatter gather list from the merged bios. +The scatter gather list is in the form of an array of <page, offset, len> +entries with their corresponding dma address mappings filled in at the +appropriate time. As an optimization, contiguous physical pages can be +covered by a single entry where <page> refers to the first page and <len> +covers the range of pages (upto 16 contiguous pages could be covered this +way). There is a helper routine (blk_rq_map_sg) which drivers can use to build +the sg list. + +Note: Right now the only user of bios with more than one page is ll_rw_kio, +which in turn means that only raw I/O uses it (direct i/o may not work +right now). The intent however is to enable clustering of pages etc to +become possible. The pagebuf abstraction layer from SGI also uses multi-page +bios, but that is currently not included in the stock development kernels. +The same is true of Andrew Morton's work-in-progress multipage bio writeout +and readahead patches. + +2.3 Changes in the Request Structure + +The request structure is the structure that gets passed down to low level +drivers. The block layer make_request function builds up a request structure, +places it on the queue and invokes the drivers request_fn. The driver makes +use of block layer helper routine elv_next_request to pull the next request +off the queue. Control or diagnostic functions might bypass block and directly +invoke underlying driver entry points passing in a specially constructed +request structure. + +Only some relevant fields (mainly those which changed or may be referred +to in some of the discussion here) are listed below, not necessarily in +the order in which they occur in the structure (see include/linux/blkdev.h) +Refer to Documentation/block/request.txt for details about all the request +structure fields and a quick reference about the layers which are +supposed to use or modify those fields. + +struct request { + struct list_head queuelist; /* Not meant to be directly accessed by + the driver. + Used by q->elv_next_request_fn + rq->queue is gone + */ + . + . + unsigned char cmd[16]; /* prebuilt command data block */ + unsigned long flags; /* also includes earlier rq->cmd settings */ + . + . + sector_t sector; /* this field is now of type sector_t instead of int + preparation for 64 bit sectors */ + . + . + + /* Number of scatter-gather DMA addr+len pairs after + * physical address coalescing is performed. + */ + unsigned short nr_phys_segments; + + /* Number of scatter-gather addr+len pairs after + * physical and DMA remapping hardware coalescing is performed. + * This is the number of scatter-gather entries the driver + * will actually have to deal with after DMA mapping is done. + */ + unsigned short nr_hw_segments; + + /* Various sector counts */ + unsigned long nr_sectors; /* no. of sectors left: driver modifiable */ + unsigned long hard_nr_sectors; /* block internal copy of above */ + unsigned int current_nr_sectors; /* no. of sectors left in the + current segment:driver modifiable */ + unsigned long hard_cur_sectors; /* block internal copy of the above */ + . + . + int tag; /* command tag associated with request */ + void *special; /* same as before */ + char *buffer; /* valid only for low memory buffers upto + current_nr_sectors */ + . + . + struct bio *bio, *biotail; /* bio list instead of bh */ + struct request_list *rl; +} + +See the rq_flag_bits definitions for an explanation of the various flags +available. Some bits are used by the block layer or i/o scheduler. + +The behaviour of the various sector counts are almost the same as before, +except that since we have multi-segment bios, current_nr_sectors refers +to the numbers of sectors in the current segment being processed which could +be one of the many segments in the current bio (i.e i/o completion unit). +The nr_sectors value refers to the total number of sectors in the whole +request that remain to be transferred (no change). The purpose of the +hard_xxx values is for block to remember these counts every time it hands +over the request to the driver. These values are updated by block on +end_that_request_first, i.e. every time the driver completes a part of the +transfer and invokes block end*request helpers to mark this. The +driver should not modify these values. The block layer sets up the +nr_sectors and current_nr_sectors fields (based on the corresponding +hard_xxx values and the number of bytes transferred) and updates it on +every transfer that invokes end_that_request_first. It does the same for the +buffer, bio, bio->bi_idx fields too. + +The buffer field is just a virtual address mapping of the current segment +of the i/o buffer in cases where the buffer resides in low-memory. For high +memory i/o, this field is not valid and must not be used by drivers. + +Code that sets up its own request structures and passes them down to +a driver needs to be careful about interoperation with the block layer helper +functions which the driver uses. (Section 1.3) + +3. Using bios + +3.1 Setup/Teardown + +There are routines for managing the allocation, and reference counting, and +freeing of bios (bio_alloc, bio_get, bio_put). + +This makes use of Ingo Molnar's mempool implementation, which enables +subsystems like bio to maintain their own reserve memory pools for guaranteed +deadlock-free allocations during extreme VM load. For example, the VM +subsystem makes use of the block layer to writeout dirty pages in order to be +able to free up memory space, a case which needs careful handling. The +allocation logic draws from the preallocated emergency reserve in situations +where it cannot allocate through normal means. If the pool is empty and it +can wait, then it would trigger action that would help free up memory or +replenish the pool (without deadlocking) and wait for availability in the pool. +If it is in IRQ context, and hence not in a position to do this, allocation +could fail if the pool is empty. In general mempool always first tries to +perform allocation without having to wait, even if it means digging into the +pool as long it is not less that 50% full. + +On a free, memory is released to the pool or directly freed depending on +the current availability in the pool. The mempool interface lets the +subsystem specify the routines to be used for normal alloc and free. In the +case of bio, these routines make use of the standard slab allocator. + +The caller of bio_alloc is expected to taken certain steps to avoid +deadlocks, e.g. avoid trying to allocate more memory from the pool while +already holding memory obtained from the pool. +[TBD: This is a potential issue, though a rare possibility + in the bounce bio allocation that happens in the current code, since + it ends up allocating a second bio from the same pool while + holding the original bio ] + +Memory allocated from the pool should be released back within a limited +amount of time (in the case of bio, that would be after the i/o is completed). +This ensures that if part of the pool has been used up, some work (in this +case i/o) must already be in progress and memory would be available when it +is over. If allocating from multiple pools in the same code path, the order +or hierarchy of allocation needs to be consistent, just the way one deals +with multiple locks. + +The bio_alloc routine also needs to allocate the bio_vec_list (bvec_alloc()) +for a non-clone bio. There are the 6 pools setup for different size biovecs, +so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the +given size from these slabs. + +The bi_destructor() routine takes into account the possibility of the bio +having originated from a different source (see later discussions on +n/w to block transfers and kvec_cb) + +The bio_get() routine may be used to hold an extra reference on a bio prior +to i/o submission, if the bio fields are likely to be accessed after the +i/o is issued (si |