diff options
Diffstat (limited to 'Documentation')
36 files changed, 797 insertions, 349 deletions
diff --git a/Documentation/DocBook/Makefile b/Documentation/DocBook/Makefile index 2975291e296..7d87dd73cbe 100644 --- a/Documentation/DocBook/Makefile +++ b/Documentation/DocBook/Makefile @@ -28,7 +28,7 @@ PS_METHOD = $(prefer-db2x) ### # The targets that may be used. -.PHONY: xmldocs sgmldocs psdocs pdfdocs htmldocs mandocs installmandocs +PHONY += xmldocs sgmldocs psdocs pdfdocs htmldocs mandocs installmandocs BOOKS := $(addprefix $(obj)/,$(DOCBOOKS)) xmldocs: $(BOOKS) @@ -211,3 +211,9 @@ clean-dirs := $(patsubst %.xml,%,$(DOCBOOKS)) #man put files in man subdir - traverse down subdir- := man/ + + +# Declare the contents of the .PHONY variable as phony. We keep that +# information in a variable se we can use it in if_changed and friends. + +.PHONY: $(PHONY) diff --git a/Documentation/DocBook/deviceiobook.tmpl b/Documentation/DocBook/deviceiobook.tmpl index 6f41f2f5c6f..90ed23df1f6 100644 --- a/Documentation/DocBook/deviceiobook.tmpl +++ b/Documentation/DocBook/deviceiobook.tmpl @@ -270,25 +270,6 @@ CPU B: spin_unlock_irqrestore(&dev_lock, flags) </para> </sect1> - <sect1> - <title>ISA legacy functions</title> - <para> - On older kernels (2.2 and earlier) the ISA bus could be read or - written with these functions and without ioremap being used. This is - no longer true in Linux 2.4. A set of equivalent functions exist for - easy legacy driver porting. The functions available are prefixed - with 'isa_' and are <function>isa_readb</function>, - <function>isa_writeb</function>, <function>isa_readw</function>, - <function>isa_writew</function>, <function>isa_readl</function>, - <function>isa_writel</function>, <function>isa_memcpy_fromio</function> - and <function>isa_memcpy_toio</function> - </para> - <para> - These functions should not be used in new drivers, and will - eventually be going away. - </para> - </sect1> - </chapter> <chapter> diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 5ed85af8878..b4ea51ad361 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -360,7 +360,7 @@ uses of RCU may be found in listRCU.txt, arrayRCU.txt, and NMI-RCU.txt. struct foo *new_fp; struct foo *old_fp; - new_fp = kmalloc(sizeof(*fp), GFP_KERNEL); + new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL); spin_lock(&foo_mutex); old_fp = gbl_foo; *new_fp = *old_fp; @@ -461,7 +461,7 @@ The foo_update_a() function might then be written as follows: struct foo *new_fp; struct foo *old_fp; - new_fp = kmalloc(sizeof(*fp), GFP_KERNEL); + new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL); spin_lock(&foo_mutex); old_fp = gbl_foo; *new_fp = *old_fp; diff --git a/Documentation/arm/Booting b/Documentation/arm/Booting index fad566bb02f..76850295af8 100644 --- a/Documentation/arm/Booting +++ b/Documentation/arm/Booting @@ -118,7 +118,7 @@ to store page tables. The recommended placement is 32KiB into RAM. In either case, the following conditions must be met: -- Quiesce all DMA capable devicess so that memory does not get +- Quiesce all DMA capable devices so that memory does not get corrupted by bogus network packets or disk data. This will save you many hours of debug. diff --git a/Documentation/arm/README b/Documentation/arm/README index 5ed6f3530b8..9b9c8226fdc 100644 --- a/Documentation/arm/README +++ b/Documentation/arm/README @@ -89,7 +89,7 @@ Modules Although modularisation is supported (and required for the FP emulator), each module on an ARM2/ARM250/ARM3 machine when is loaded will take memory up to the next 32k boundary due to the size of the pages. - Therefore, modularisation on these machines really worth it? + Therefore, is modularisation on these machines really worth it? However, ARM6 and up machines allow modules to take multiples of 4k, and as such Acorn RiscPCs and other architectures using these processors can diff --git a/Documentation/arm/Setup b/Documentation/arm/Setup index 0abd0720d7e..0cb1e64bde8 100644 --- a/Documentation/arm/Setup +++ b/Documentation/arm/Setup @@ -58,7 +58,7 @@ below: video_y This describes the character position of cursor on VGA console, and - is otherwise unused. (should not used for other console types, and + is otherwise unused. (should not be used for other console types, and should not be used for other purposes). memc_control_reg diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index 30c41459953..159e2a0c3e8 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt @@ -18,7 +18,8 @@ CONTENTS: 1.4 What are exclusive cpusets ? 1.5 What does notify_on_release do ? 1.6 What is memory_pressure ? - 1.7 How do I use cpusets ? + 1.7 What is memory spread ? + 1.8 How do I use cpusets ? 2. Usage Examples and Syntax 2.1 Basic Usage 2.2 Adding/removing cpus @@ -317,7 +318,78 @@ the tasks in the cpuset, in units of reclaims attempted per second, times 1000. -1.7 How do I use cpusets ? +1.7 What is memory spread ? +--------------------------- +There are two boolean flag files per cpuset that control where the +kernel allocates pages for the file system buffers and related in +kernel data structures. They are called 'memory_spread_page' and +'memory_spread_slab'. + +If the per-cpuset boolean flag file 'memory_spread_page' is set, then +the kernel will spread the file system buffers (page cache) evenly +over all the nodes that the faulting task is allowed to use, instead +of preferring to put those pages on the node where the task is running. + +If the per-cpuset boolean flag file 'memory_spread_slab' is set, +then the kernel will spread some file system related slab caches, +such as for inodes and dentries evenly over all the nodes that the +faulting task is allowed to use, instead of preferring to put those +pages on the node where the task is running. + +The setting of these flags does not affect anonymous data segment or +stack segment pages of a task. + +By default, both kinds of memory spreading are off, and memory +pages are allocated on the node local to where the task is running, +except perhaps as modified by the tasks NUMA mempolicy or cpuset +configuration, so long as sufficient free memory pages are available. + +When new cpusets are created, they inherit the memory spread settings +of their parent. + +Setting memory spreading causes allocations for the affected page +or slab caches to ignore the tasks NUMA mempolicy and be spread +instead. Tasks using mbind() or set_mempolicy() calls to set NUMA +mempolicies will not notice any change in these calls as a result of +their containing tasks memory spread settings. If memory spreading +is turned off, then the currently specified NUMA mempolicy once again +applies to memory page allocations. + +Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag +files. By default they contain "0", meaning that the feature is off +for that cpuset. If a "1" is written to that file, then that turns +the named feature on. + +The implementation is simple. + +Setting the flag 'memory_spread_page' turns on a per-process flag +PF_SPREAD_PAGE for each task that is in that cpuset or subsequently +joins that cpuset. The page allocation calls for the page cache +is modified to perform an inline check for this PF_SPREAD_PAGE task +flag, and if set, a call to a new routine cpuset_mem_spread_node() +returns the node to prefer for the allocation. + +Similarly, setting 'memory_spread_cache' turns on the flag +PF_SPREAD_SLAB, and appropriately marked slab caches will allocate +pages from the node returned by cpuset_mem_spread_node(). + +The cpuset_mem_spread_node() routine is also simple. It uses the +value of a per-task rotor cpuset_mem_spread_rotor to select the next +node in the current tasks mems_allowed to prefer for the allocation. + +This memory placement policy is also known (in other contexts) as +round-robin or interleave. + +This policy can provide substantial improvements for jobs that need +to place thread local data on the corresponding node, but that need +to access large file system data sets that need to be spread across +the several nodes in the jobs cpuset in order to fit. Without this +policy, especially for jobs that might have one thread reading in the +data set, the memory allocation across the nodes in the jobs cpuset +can become very uneven. + + +1.8 How do I use cpusets ? -------------------------- In order to minimize the impact of cpusets on critical kernel diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index c7a4d0faab2..495858b236b 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -116,6 +116,17 @@ Who: Harald Welte <laforge@netfilter.org> --------------------------- +What: remove EXPORT_SYMBOL(kernel_thread) +When: August 2006 +Files: arch/*/kernel/*_ksyms.c +Why: kernel_thread is a low-level implementation detail. Drivers should + use the <linux/kthread.h> API instead which shields them from + implementation details and provides a higherlevel interface that + prevents bugs and code duplication +Who: Christoph Hellwig <hch@lst.de> + +--------------------------- + What: EXPORT_SYMBOL(lookup_hash) When: January 2006 Why: Too low-level interface. Use lookup_one_len or lookup_create instead. @@ -165,6 +176,18 @@ Who: Richard Knutsson <ricknu-0@student.ltu.se> and Greg Kroah-Hartman <gregkh@s --------------------------- +What: Usage of invalid timevals in setitimer +When: March 2007 +Why: POSIX requires to validate timevals in the setitimer call. This + was never done by Linux. The invalid (e.g. negative timevals) were + silently converted to more or less random timeouts and intervals. + Until the removal a per boot limited number of warnings is printed + and the timevals are sanitized. + +Who: Thomas Gleixner <tglx@linutronix.de> + +--------------------------- + What: I2C interface of the it87 driver When: January 2007 Why: The ISA interface is faster and should be always available. The I2C @@ -174,6 +197,17 @@ Who: Jean Delvare <khali@linux-fr.org> --------------------------- +What: remove EXPORT_SYMBOL(tasklist_lock) +When: August 2006 +Files: kernel/fork.c +Why: tasklist_lock protects the kernel internal task list. Modules have + no business looking at it, and all instances in drivers have been due + to use of too-lowlevel APIs. Having this symbol exported prevents + moving to more scalable locking schemes for the task list. +Who: Christoph Hellwig <hch@lst.de> + +--------------------------- + What: mount/umount uevents When: February 2007 Why: These events are not correct, and do not properly let userspace know diff --git a/Documentation/filesystems/v9fs.txt b/Documentation/filesystems/9p.txt index 24c7a9c41f0..43b89c214d2 100644 --- a/Documentation/filesystems/v9fs.txt +++ b/Documentation/filesystems/9p.txt @@ -1,5 +1,5 @@ - V9FS: 9P2000 for Linux - ====================== + v9fs: Plan 9 Resource Sharing for Linux + ======================================= ABOUT ===== @@ -9,18 +9,19 @@ v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol. This software was originally developed by Ron Minnich <rminnich@lanl.gov> and Maya Gokhale <maya@lanl.gov>. Additional development by Greg Watson <gwatson@lanl.gov> and most recently Eric Van Hensbergen -<ericvh@gmail.com> and Latchesar Ionkov <lucho@ionkov.net>. +<ericvh@gmail.com>, Latchesar Ionkov <lucho@ionkov.net> and Russ Cox +<rsc@swtch.com>. USAGE ===== For remote file server: - mount -t 9P 10.10.1.2 /mnt/9 + mount -t 9p 10.10.1.2 /mnt/9 For Plan 9 From User Space applications (http://swtch.com/plan9) - mount -t 9P `namespace`/acme /mnt/9 -o proto=unix,name=$USER + mount -t 9p `namespace`/acme /mnt/9 -o proto=unix,uname=$USER OPTIONS ======= @@ -32,7 +33,7 @@ OPTIONS fd - used passed file descriptors for connection (see rfdno and wfdno) - name=name user name to attempt mount as on the remote server. The + uname=name user name to attempt mount as on the remote server. The server may override or ignore this value. Certain user names may require authentication. @@ -42,7 +43,7 @@ OPTIONS debug=n specifies debug level. The debug level is a bitmask. 0x01 = display verbose error messages 0x02 = developer debug (DEBUG_CURRENT) - 0x04 = display 9P trace + 0x04 = display 9p trace 0x08 = display VFS trace 0x10 = display Marshalling debug 0x20 = display RPC debug @@ -53,11 +54,11 @@ OPTIONS wfdno=n the file descriptor for writing with proto=fd - maxdata=n the number of bytes to use for 9P packet payload (msize) + maxdata=n the number of bytes to use for 9p packet payload (msize) port=n port to connect to on the remote server - noextend force legacy mode (no 9P2000.u semantics) + noextend force legacy mode (no 9p2000.u semantics) uid attempt to mount as a particular uid @@ -72,7 +73,7 @@ OPTIONS RESOURCES ========= -The Linux version of the 9P server is now maintained under the npfs project +The Linux version of the 9p server is now maintained under the npfs project on sourceforge (http://sourceforge.net/projects/npfs). There are user and developer mailing lists available through the v9fs project diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 944cf109a6f..99902ae6804 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -121,7 +121,7 @@ Table 1-1: Process specific entries in /proc .............................................................................. File Content cmdline Command line arguments - cpu Current and last cpu in wich it was executed (2.4)(smp) + cpu Current and last cpu in which it was executed (2.4)(smp) cwd Link to the current working directory environ Values of environment variables exe Link to the executable of this process @@ -309,13 +309,13 @@ is the same by default: > cat /proc/irq/0/smp_affinity ffffffff -It's a bitmask, in wich you can specify wich CPUs can handle the IRQ, you can +It's a bitmask, in which you can specify which CPUs can handle the IRQ, you can set it by doing: > echo 1 > /proc/irq/prof_cpu_mask This means that only the first CPU will handle the IRQ, but you can also echo 5 -wich means that only the first and fourth CPU can handle the IRQ. +which means that only the first and fourth CPU can handle the IRQ. The way IRQs are routed is handled by the IO-APIC, and it's Round Robin between all the CPUs which are allowed to handle it. As usual the kernel has diff --git a/Documentation/filesystems/udf.txt b/Documentation/filesystems/udf.txt index e5213bc301f..511b4230c05 100644 --- a/Documentation/filesystems/udf.txt +++ b/Documentation/filesystems/udf.txt @@ -26,6 +26,20 @@ The following mount options are supported: nostrict Unset strict conformance iocharset= Set the NLS character set +The uid= and gid= options need a bit more explaining. They will accept a +decimal numeric value which will be used as the default ID for that mount. +They will also accept the string "ignore" and "forget". For files on the disk +that are owned by nobody ( -1 ), they will instead look as if they are owned +by the default ID. The ignore option causes the default ID to override all +IDs on the disk, not just -1. The forget option causes all IDs to be written +to disk as -1, so when the media is later remounted, they will appear to be +owned by whatever default ID it is mounted with at that time. + +For typical desktop use of removable media, you should set the ID to that +of the interactively logged on user, and also specify both the forget and +ignore options. This way the interactive user will always see the files +on the disk as belonging to him. + The remaining are for debugging and disaster recovery: novrs Skip volume sequence recognition diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index e56e842847d..adaa899e5c9 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -230,10 +230,15 @@ only called from a process context (i.e. not from an interrupt handler or bottom half). alloc_inode: this method is called by inode_alloc() to allocate memory - for struct inode and initialize it. + for struct inode and initialize it. If this function is not + defined, a simple 'struct inode' is allocated. Normally + alloc_inode will be used to allocate a larger structure which + contains a 'struct inode' embedded within it. destroy_inode: this method is called by destroy_inode() to release - resources allocated for struct inode. + resources allocated for struct inode. It is only required if + ->alloc_inode was defined and simply undoes anything done by + ->alloc_inode. read_inode: this method is called to read a specific inode from the mounted filesystem. The i_ino member in the struct inode is @@ -443,14 +448,81 @@ otherwise noted. The Address Space Object ======================== -The address space object is used to identify pages in the page cache. - +The address space object is used to group and manage pages in the page +cache. It can be used to keep track of the pages in a file (or +anything else) and also track the mapping of sections of the file into +process address spaces. + +There are a number of distinct yet related services that an +address-space can provide. These include communicating memory +pressure, page lookup by address, and keeping track of pages tagged as +Dirty or Writeback. + +The first can be used independently to the others. The VM can try to +either write dirty pages in order to clean them, or release clean +pages in order to reuse them. To do this it can call the ->writepage +method on dirty pages, and ->releasepage on clean pages with +PagePrivate set. Clean pages without PagePrivate and with no external +references will be released without notice being given to the +address_space. + +To achieve this functionality, pages need to be placed on an LRU with +lru_cache_add and mark_page_active needs to be called whenever the +page is used. + +Pages are normally kept in a radix tree index by ->index. This tree +maintains information about the PG_Dirty and PG_Writeback status of +each page, so that pages with either of these flags can be found +quickly. + +The Dirty tag is primarily used by mpage_writepages - the default +->writepages method. It uses the tag to find dirty pages to call +->writepage on. If mpage_writepages is not used (i.e. the address +provides its own ->writepages) , the PAGECACHE_TAG_DIRTY tag is +almost unused. write_inode_now and sync_inode do use it (through +__sync_single_inode) to check if ->writepages has been successful in +writing out the whole address_space. + +The Writeback tag is used by filemap*wait* and sync_page* functions, +via wait_on_page_writeback_range, to wait for all writeback to +complete. While waiting ->sync_page (if defined) will be called on +each page that is found to require writeback. + +An address_space handler may attach extra information to a page, +typically using the 'private' field in the 'struct page'. If such +information is attached, the PG_Private flag should be set. This will +cause various VM routines to make extra calls into the address_space +handler to deal with that data. + +An address space acts as an intermediate between storage and +application. Data is read into the address space a whole page at a +time, and provided to the application either by copying of the page, +or by memory-mapping the page. +Data is written into the address space by the application, and then +written-back to storage typically in whole pages, however the +address_space has finer control of write sizes. + +The read process essentially only requires 'readpage'. The write +process is more complicated and uses prepare_write/commit_write or +set_page_dirty to write data into the address_space, and writepage, +sync_page, and writepages to writeback data to storage. + +Adding and removing pages to/from an address_space is protected by the +inode's i_mutex. + +When data is written to a page, the PG_Dirty flag should be set. It +typically remains set until writepage asks for it to be written. This +should clear PG_Dirty and set PG_Writeback. It can be actually +written at any point after PG_Dirty is clear. Once it is known to be +safe, PG_Writeback is cleared. + +Writeback makes use of a writeback_control structure... struct address_space_operations ------------------------------- This describes how the VFS can manipulate mapping of a file to page cache in -your filesystem. As of kernel 2.6.13, the following members are defined: +your filesystem. As of kernel 2.6.16, the following members are defined: struct address_space_operations { int (*writepage)(struct page *page, struct writeback_control *wbc); @@ -469,47 +541,148 @@ struct address_space_operations { loff_t offset, unsigned long nr_segs); struct page* (*get_xip_page)(struct address_space *, sector_t, int); + /* migrate the contents of a page to the specified target */ + int (*migratepage) (struct page *, struct page *); }; - writepage: called by the VM write a dirty page to backing store. + writepage: called by the VM to write a dirty page to backing store. + This may happen for data integrity reasons (i.e. 'sync'), or + to free up memory (flush). The difference can be seen in + wbc->sync_mode. + The PG_Dirty flag has been cleared and PageLocked is true. + writepage should start writeout, should set PG_Writeback, + and should make sure the page is unlocked, either synchronously + or asynchronously when the write operation completes. + + If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to + try too hard if there are problems, and may choose to write out + other pages from the mapping if that is easier (e.g. due to + internal dependencies). If it chooses not to start writeout, it + should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep + calling ->writepage on that page. + + See the file "Locking" for more details. readpage: called by the VM to read a page from backing store. + The page will be Locked when readpage is called, and should be + unlocked and marked uptodate once the read completes. + If ->readpage discovers that it needs to unlock the page for + some reason, it can do so, and then return AOP_TRUNCATED_PAGE. + In this case, the page will be relocated, relocked and if + that all succeeds, ->readpage will be called again. sync_page: called by the VM to notify the backing store to perform all queued I/O operations for a page. I/O operations for other pages associated with this address_space object may also be performed. + This function is optional and is called only for pages with + PG_Writeback set while waiting for the writeback to complete. + writepages: called by the VM to write out pages associated with the - address_space object. + address_space object. If wbc->sync_mode is WBC_SYNC_ALL, then + the writeback_control will specify a range of pages that must be + written out. If it is WBC_SYNC_NONE, then a nr_to_write is given + and that many pages should be written if possible. + If no ->writepages is given, then mpage_writepages is used + instead. This will choose pages from the address space that are + tagged as DIRTY and will pass them to ->writepage. set_page_dirty: called by the VM to set a page dirty. + This is particularly needed if an address space attaches + private data to a page, and that data needs to be updated when + a page is dirtied. This is called, for example, when a memory + mapped page gets modified. + If defined, it should set the PageDirty flag, and the + PAGECACHE_TAG_DIRTY tag in the radix tree. readpages: called by the VM to read pages associated with the address_space - object. + object. This is essentially just a vector version of + readpage. Instead of just one page, several pages are + requested. + readpages is only used for read-ahead, so read errors are + ignored. If anything goes wrong, feel free to give up. prepare_write: called by the generic write path in VM to set up a write - request for a page. - - commit_write: called by the generic write path in VM to write page to - its backing store. + request for a page. This indicates to the address space that + the given range of bytes is about to be written. The + address_space should check that the write will be able to + complete, by allocating space if necessary and doing any other + internal housekeeping. If the write will update parts of + any basic-blocks on storage, then those blocks should be + pre-read (if they haven't been read already) so that the + updated blocks can be written out properly. + The page will be locked. If prepare_write wants to unlock the + page it, like readpage, may do so and return + AOP_TRUNCATED_PAGE. + In this case the prepare_write will be retried one the lock is + regained. + + commit_write: If prepare_write succeeds, new data will be copied + into the page and then commit_write will be called. It will + typically update the size of the file (if appropriate) and + mark the inode as dirty, and do any other related housekeeping + operations. It should avoid returning an error if possible - + errors should have been handled by prepare_write. bmap: called by the VFS to map a logical block offset within object to - physical block number. This method is use by for the legacy FIBMAP - ioctl. Other uses are discouraged. - - invalidatepage: called by the VM on truncate to disassociate a page from its - address_space mapping. - - releasepage: called by the VFS to release filesystem specific metadata from - a page. - - direct_IO: called by the VM for direct I/O writes and reads. + physical block number. This method is used by the FIBMAP + ioctl and for working with swap-files. To be able to swap to + a file, the file must have a stable mapping to a block + device. The swap system does not go through the filesystem + but instead uses bmap to find out where the blocks in the file + are and u |