Diffstat (limited to 'Documentation')
43 files changed, 881 insertions, 242 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index 40ac7759c3b..d273b557a93 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -126,18 +126,16 @@ devices.txt - plain ASCII listing of all the nodes in /dev/ with major minor #'s. digiepca.txt - info on Digi Intl. {PC,PCI,EISA}Xx and Xem series cards. -dnotify.txt - - info about directory notification in Linux. dontdiff - file containing a list of files that should never be diff'ed. driver-model/ - directory with info about Linux driver model. -drivers/ - - directory with driver documentation (currently only EDAC). dvb/ - info on Linux Digital Video Broadcast (DVB) subsystem. early-userspace/ - info about initramfs, klibc, and userspace early during boot. +edac.txt + - information on EDAC - Error Detection And Correction eisa.txt - info on EISA bus support. exception.txt @@ -334,20 +332,8 @@ rtc.txt - notes on how to use the Real Time Clock (aka CMOS clock) driver. s390/ - directory with info on using Linux on the IBM S390. -sched-arch.txt - - CPU Scheduler implementation hints for architecture specific code. -sched-coding.txt - - reference for various scheduler-related methods in the O(1) scheduler. -sched-design.txt - - goals, design and implementation of the Linux O(1) scheduler. -sched-design-CFS.txt - - goals, design and implementation of the Complete Fair Scheduler. -sched-domains.txt - - information on scheduling domains. -sched-nice-design.txt - - How and why the scheduler's nice levels are implemented. -sched-stats.txt - - information on schedstats (Linux Scheduler Statistics). +scheduler/ + - directory with info on the scheduler. scsi/ - directory with info on Linux scsi support. serial/ @@ -360,8 +346,6 @@ sgi-visws.txt - short blurb on the SGI Visual Workstations. sh/ - directory with info on porting Linux to a new architecture. -sharedsubtree.txt - - a description of shared subtrees for namespaces. smart-config.txt - description of the Smart Config makefile feature. sony-laptop.txt diff --git a/Documentation/ABI/testing/sysfs-kernel-uids b/Documentation/ABI/testing/sysfs-kernel-uids index 648d65dbc0e..28f14695a85 100644 --- a/Documentation/ABI/testing/sysfs-kernel-uids +++ b/Documentation/ABI/testing/sysfs-kernel-uids @@ -11,4 +11,4 @@ Description: example would be, if User A has shares = 1024 and user B has shares = 2048, User B will get twice the CPU bandwidth user A will. For more details refer - Documentation/sched-design-CFS.txt + Documentation/scheduler/sched-design-CFS.txt diff --git a/Documentation/BUG-HUNTING b/Documentation/BUG-HUNTING index 6c816751b86..65022a87bf1 100644 --- a/Documentation/BUG-HUNTING +++ b/Documentation/BUG-HUNTING @@ -214,6 +214,23 @@ And recompile the kernel with CONFIG_DEBUG_INFO enabled: gdb vmlinux (gdb) p vt_ioctl (gdb) l *(0x<address of vt_ioctl> + 0xda8) +or, as one command + (gdb) l *(vt_ioctl + 0xda8) + +If you have a call trace, such as :- +>Call Trace: +> [<ffffffff8802c8e9>] :jbd:log_wait_commit+0xa3/0xf5 +> [<ffffffff810482d9>] autoremove_wake_function+0x0/0x2e +> [<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee +> ... +this shows the problem in the :jbd: module. You can load that module in gdb +and list the relevant code. + gdb fs/jbd/jbd.ko + (gdb) p log_wait_commit + (gdb) l *(0x<address> + 0xa3) +or + (gdb) l *(log_wait_commit + 0xa3) + Another very useful option of the Kernel Hacking section in menuconfig is Debug memory allocations. 
This will help you see whether data has been diff --git a/Documentation/DocBook/genericirq.tmpl b/Documentation/DocBook/genericirq.tmpl index 4215f69ce7e..3a882d9a90a 100644 --- a/Documentation/DocBook/genericirq.tmpl +++ b/Documentation/DocBook/genericirq.tmpl @@ -172,7 +172,7 @@ <listitem><para>Chiplevel hardware encapsulation</para></listitem> </orderedlist> </para> - <sect1> + <sect1 id="Interrupt_control_flow"> <title>Interrupt control flow</title> <para> Each interrupt is described by an interrupt descriptor structure @@ -190,7 +190,7 @@ referenced by the assigned chip descriptor structure. </para> </sect1> - <sect1> + <sect1 id="Highlevel_Driver_API"> <title>Highlevel Driver API</title> <para> The highlevel Driver API consists of following functions: @@ -210,7 +210,7 @@ See the autogenerated function documentation for details. </para> </sect1> - <sect1> + <sect1 id="Highlevel_IRQ_flow_handlers"> <title>Highlevel IRQ flow handlers</title> <para> The generic layer provides a set of pre-defined irq-flow methods: @@ -224,9 +224,9 @@ specific) are assigned to specific interrupts by the architecture either during bootup or during device initialization. </para> - <sect2> + <sect2 id="Default_flow_implementations"> <title>Default flow implementations</title> - <sect3> + <sect3 id="Helper_functions"> <title>Helper functions</title> <para> The helper functions call the chip primitives and @@ -267,9 +267,9 @@ noop(irq) </para> </sect3> </sect2> - <sect2> + <sect2 id="Default_flow_handler_implementations"> <title>Default flow handler implementations</title> - <sect3> + <sect3 id="Default_Level_IRQ_flow_handler"> <title>Default Level IRQ flow handler</title> <para> handle_level_irq provides a generic implementation @@ -284,7 +284,7 @@ desc->chip->end(); </programlisting> </para> </sect3> - <sect3> + <sect3 id="Default_Edge_IRQ_flow_handler"> <title>Default Edge IRQ flow handler</title> <para> handle_edge_irq provides a generic implementation @@ -311,7 +311,7 @@ desc->chip->end(); </programlisting> </para> </sect3> - <sect3> + <sect3 id="Default_simple_IRQ_flow_handler"> <title>Default simple IRQ flow handler</title> <para> handle_simple_irq provides a generic implementation @@ -328,7 +328,7 @@ handle_IRQ_event(desc->action); </programlisting> </para> </sect3> - <sect3> + <sect3 id="Default_per_CPU_flow_handler"> <title>Default per CPU flow handler</title> <para> handle_percpu_irq provides a generic implementation @@ -349,7 +349,7 @@ desc->chip->end(); </para> </sect3> </sect2> - <sect2> + <sect2 id="Quirks_and_optimizations"> <title>Quirks and optimizations</title> <para> The generic functions are intended for 'clean' architectures and chips, @@ -358,7 +358,7 @@ desc->chip->end(); overriding the highlevel irq-flow handler. 
</para> </sect2> - <sect2> + <sect2 id="Delayed_interrupt_disable"> <title>Delayed interrupt disable</title> <para> This per interrupt selectable feature, which was introduced by Russell @@ -380,7 +380,7 @@ desc->chip->end(); </para> </sect2> </sect1> - <sect1> + <sect1 id="Chiplevel_hardware_encapsulation"> <title>Chiplevel hardware encapsulation</title> <para> The chip level hardware descriptor structure irq_chip diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index 77436d73501..059aaf20951 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -165,6 +165,7 @@ X!Ilib/string.c !Emm/vmalloc.c !Imm/page_alloc.c !Emm/mempool.c +!Emm/dmapool.c !Emm/page-writeback.c !Emm/truncate.c </sect1> @@ -371,7 +372,6 @@ X!Iinclude/linux/device.h !Edrivers/base/class.c !Edrivers/base/firmware_class.c !Edrivers/base/transport_class.c -!Edrivers/base/dmapool.c <!-- Cannot be included, because attribute_container_add_class_device_adapter and attribute_container_classdev_to_container diff --git a/Documentation/DocBook/kernel-locking.tmpl b/Documentation/DocBook/kernel-locking.tmpl index 01825ee7db6..2e9d6b41f03 100644 --- a/Documentation/DocBook/kernel-locking.tmpl +++ b/Documentation/DocBook/kernel-locking.tmpl @@ -717,7 +717,7 @@ used, and when it gets full, throws out the least used one. <para> For our first example, we assume that all operations are in user context (ie. from system calls), so we can sleep. This means we can -use a semaphore to protect the cache and all the objects within +use a mutex to protect the cache and all the objects within it. Here's the code: </para> @@ -725,7 +725,7 @@ it. Here's the code: #include <linux/list.h> #include <linux/slab.h> #include <linux/string.h> -#include <asm/semaphore.h> +#include <linux/mutex.h> #include <asm/errno.h> struct object @@ -737,7 +737,7 @@ struct object }; /* Protects the cache, cache_num, and the objects within it */ -static DECLARE_MUTEX(cache_lock); +static DEFINE_MUTEX(cache_lock); static LIST_HEAD(cache); static unsigned int cache_num = 0; #define MAX_CACHE_SIZE 10 @@ -789,17 +789,17 @@ int cache_add(int id, const char *name) obj->id = id; obj->popularity = 0; - down(&cache_lock); + mutex_lock(&cache_lock); __cache_add(obj); - up(&cache_lock); + mutex_unlock(&cache_lock); return 0; } void cache_delete(int id) { - down(&cache_lock); + mutex_lock(&cache_lock); __cache_delete(__cache_find(id)); - up(&cache_lock); + mutex_unlock(&cache_lock); } int cache_find(int id, char *name) @@ -807,13 +807,13 @@ int cache_find(int id, char *name) struct object *obj; int ret = -ENOENT; - down(&cache_lock); + mutex_lock(&cache_lock); obj = __cache_find(id); if (obj) { ret = 0; strcpy(name, obj->name); } - up(&cache_lock); + mutex_unlock(&cache_lock); return ret; } </programlisting> @@ -853,7 +853,7 @@ The change is shown below, in standard patch format: the int popularity; }; --static DECLARE_MUTEX(cache_lock); +-static DEFINE_MUTEX(cache_lock); +static spinlock_t cache_lock = SPIN_LOCK_UNLOCKED; static LIST_HEAD(cache); static unsigned int cache_num = 0; @@ -870,22 +870,22 @@ The change is shown below, in standard patch format: the obj->id = id; obj->popularity = 0; -- down(&cache_lock); +- mutex_lock(&cache_lock); + spin_lock_irqsave(&cache_lock, flags); __cache_add(obj); -- up(&cache_lock); +- mutex_unlock(&cache_lock); + spin_unlock_irqrestore(&cache_lock, flags); return 0; } void cache_delete(int id) { -- down(&cache_lock); +- mutex_lock(&cache_lock); + unsigned long 
flags; + + spin_lock_irqsave(&cache_lock, flags); __cache_delete(__cache_find(id)); -- up(&cache_lock); +- mutex_unlock(&cache_lock); + spin_unlock_irqrestore(&cache_lock, flags); } @@ -895,14 +895,14 @@ The change is shown below, in standard patch format: the int ret = -ENOENT; + unsigned long flags; -- down(&cache_lock); +- mutex_lock(&cache_lock); + spin_lock_irqsave(&cache_lock, flags); obj = __cache_find(id); if (obj) { ret = 0; strcpy(name, obj->name); } -- up(&cache_lock); +- mutex_unlock(&cache_lock); + spin_unlock_irqrestore(&cache_lock, flags); return ret; } diff --git a/Documentation/DocBook/lsm.tmpl b/Documentation/DocBook/lsm.tmpl index f6382219587..fe7664ce966 100644 --- a/Documentation/DocBook/lsm.tmpl +++ b/Documentation/DocBook/lsm.tmpl @@ -33,7 +33,7 @@ </authorgroup> </articleinfo> -<sect1><title>Introduction</title> +<sect1 id="Introduction"><title>Introduction</title> <para> In March 2001, the National Security Agency (NSA) gave a presentation diff --git a/Documentation/DocBook/mtdnand.tmpl b/Documentation/DocBook/mtdnand.tmpl index 957cf5c2683..8e145857fc9 100644 --- a/Documentation/DocBook/mtdnand.tmpl +++ b/Documentation/DocBook/mtdnand.tmpl @@ -80,7 +80,7 @@ struct member has a short description which is marked with an [XXX] identifier. The following chapters explain the meaning of those identifiers. </para> - <sect1> + <sect1 id="Function_identifiers_XXX"> <title>Function identifiers [XXX]</title> <para> The functions are marked with [XXX] identifiers in the short @@ -115,7 +115,7 @@ </para></listitem> </itemizedlist> </sect1> - <sect1> + <sect1 id="Struct_member_identifiers_XXX"> <title>Struct member identifiers [XXX]</title> <para> The struct members are marked with [XXX] identifiers in the @@ -159,7 +159,7 @@ basic functions and fill out some really board dependent members in the nand chip description structure. </para> - <sect1> + <sect1 id="Basic_defines"> <title>Basic defines</title> <para> At least you have to provide a mtd structure and @@ -185,7 +185,7 @@ static struct nand_chip board_chip; static unsigned long baseaddr; </programlisting> </sect1> - <sect1> + <sect1 id="Partition_defines"> <title>Partition defines</title> <para> If you want to divide your device into partitions, then @@ -204,7 +204,7 @@ static struct mtd_partition partition_info[] = { }; </programlisting> </sect1> - <sect1> + <sect1 id="Hardware_control_functions"> <title>Hardware control function</title> <para> The hardware control function provides access to the @@ -246,7 +246,7 @@ static void board_hwcontrol(struct mtd_info *mtd, int cmd) } </programlisting> </sect1> - <sect1> + <sect1 id="Device_ready_function"> <title>Device ready function</title> <para> If the hardware interface has the ready busy pin of the NAND chip connected to a @@ -257,7 +257,7 @@ static void board_hwcontrol(struct mtd_info *mtd, int cmd) the function must not be defined and the function pointer this->dev_ready is set to NULL. </para> </sect1> - <sect1> + <sect1 id="Init_function"> <title>Init function</title> <para> The init function allocates memory and sets up all the board @@ -325,7 +325,7 @@ out: module_init(board_init); </programlisting> </sect1> - <sect1> + <sect1 id="Exit_function"> <title>Exit function</title> <para> The exit function is only neccecary if the driver is @@ -359,7 +359,7 @@ module_exit(board_cleanup); driver. For a list of functions which can be overridden by the board driver see the documentation of the nand_chip structure. 
</para> - <sect1> + <sect1 id="Multiple_chip_control"> <title>Multiple chip control</title> <para> The nand driver can control chip arrays. Therefor the @@ -419,9 +419,9 @@ static void board_select_chip (struct mtd_info *mtd, int chip) } </programlisting> </sect1> - <sect1> + <sect1 id="Hardware_ECC_support"> <title>Hardware ECC support</title> - <sect2> + <sect2 id="Functions_and_constants"> <title>Functions and constants</title> <para> The nand driver supports three different types of @@ -475,7 +475,7 @@ static void board_select_chip (struct mtd_info *mtd, int chip) </itemizedlist> </para> </sect2> - <sect2> + <sect2 id="Hardware_ECC_with_syndrome_calculation"> <title>Hardware ECC with syndrome calculation</title> <para> Many hardware ECC implementations provide Reed-Solomon @@ -500,7 +500,7 @@ static void board_select_chip (struct mtd_info *mtd, int chip) </para> </sect2> </sect1> - <sect1> + <sect1 id="Bad_Block_table_support"> <title>Bad block table support</title> <para> Most NAND chips mark the bad blocks at a defined @@ -552,7 +552,7 @@ static void board_select_chip (struct mtd_info *mtd, int chip) allows faster access than always checking the bad block information on the flash chip itself. </para> - <sect2> + <sect2 id="Flash_based_tables"> <title>Flash based tables</title> <para> It may be desired or neccecary to keep a bad block table in FLASH. @@ -587,7 +587,7 @@ static void board_select_chip (struct mtd_info *mtd, int chip) </itemizedlist> </para> </sect2> - <sect2> + <sect2 id="User_defined_tables"> <title>User defined tables</title> <para> User defined tables are created by filling out a @@ -676,7 +676,7 @@ static void board_select_chip (struct mtd_info *mtd, int chip) </para> </sect2> </sect1> - <sect1> + <sect1 id="Spare_area_placement"> <title>Spare area (auto)placement</title> <para> The nand driver implements different possibilities for @@ -730,7 +730,7 @@ struct nand_oobinfo { </para></listitem> </itemizedlist> </para> - <sect2> + <sect2 id="Placement_defined_by_fs_driver"> <title>Placement defined by fs driver</title> <para> The calling function provides a pointer to a nand_oobinfo @@ -760,7 +760,7 @@ struct nand_oobinfo { done according to the given scheme in the nand_oobinfo structure. </para> </sect2> - <sect2> + <sect2 id="Automatic_placement"> <title>Automatic placement</title> <para> Automatic placement uses the built in defaults to place the @@ -774,7 +774,7 @@ struct nand_oobinfo { done according to the default builtin scheme. 
</para> </sect2> - <sect2> + <sect2 id="User_space_placement_selection"> <title>User space placement selection</title> <para> All non ecc functions like mtd->read and mtd->write use an internal @@ -789,9 +789,9 @@ struct nand_oobinfo { </para> </sect2> </sect1> - <sect1> + <sect1 id="Spare_area_autoplacement_default"> <title>Spare area autoplacement default schemes</title> - <sect2> + <sect2 id="pagesize_256"> <title>256 byte pagesize</title> <informaltable><tgroup cols="3"><tbody> <row> @@ -843,7 +843,7 @@ pages this byte is reserved</entry> </row> </tbody></tgroup></informaltable> </sect2> - <sect2> + <sect2 id="pagesize_512"> <title>512 byte pagesize</title> <informaltable><tgroup cols="3"><tbody> <row> @@ -906,7 +906,7 @@ in this page</entry> </row> </tbody></tgroup></informaltable> </sect2> - <sect2> + <sect2 id="pagesize_2048"> <title>2048 byte pagesize</title> <informaltable><tgroup cols="3"><tbody> <row> @@ -1126,9 +1126,9 @@ in this page</entry> <para> This chapter describes the constants which might be relevant for a driver developer. </para> - <sect1> + <sect1 id="Chip_option_constants"> <title>Chip option constants</title> - <sect2> + <sect2 id="Constants_for_chip_id_table"> <title>Constants for chip id table</title> <para> These constants are defined in nand.h. They are ored together to describe @@ -1153,7 +1153,7 @@ in this page</entry> </programlisting> </para> </sect2> - <sect2> + <sect2 id="Constants_for_runtime_options"> <title>Constants for runtime options</title> <para> These constants are defined in nand.h. They are ored together to describe @@ -1171,7 +1171,7 @@ in this page</entry> </sect2> </sect1> - <sect1> + <sect1 id="EEC_selection_constants"> <title>ECC selection constants</title> <para> Use these constants to select the ECC algorithm. 
@@ -1192,7 +1192,7 @@ in this page</entry> </para> </sect1> - <sect1> + <sect1 id="Hardware_control_related_constants"> <title>Hardware control related constants</title> <para> These constants describe the requested hardware access function when @@ -1218,7 +1218,7 @@ in this page</entry> </para> </sect1> - <sect1> + <sect1 id="Bad_block_table_constants"> <title>Bad block table related constants</title> <para> These constants describe the options used for bad block diff --git a/Documentation/DocBook/procfs-guide.tmpl b/Documentation/DocBook/procfs-guide.tmpl index 2de84dc195a..1fd6a1ec759 100644 --- a/Documentation/DocBook/procfs-guide.tmpl +++ b/Documentation/DocBook/procfs-guide.tmpl @@ -85,7 +85,7 @@ - <preface> + <preface id="Preface"> <title>Preface</title> <para> @@ -230,7 +230,7 @@ - <sect1> + <sect1 id="Creating_a_symlink"> <title>Creating a symlink</title> <funcsynopsis> @@ -254,7 +254,7 @@ </para> </sect1> - <sect1> + <sect1 id="Creating_a_directory"> <title>Creating a directory</title> <funcsynopsis> @@ -274,7 +274,7 @@ - <sect1> + <sect1 id="Removing_an_entry"> <title>Removing an entry</title> <funcsynopsis> @@ -340,7 +340,7 @@ entry->write_proc = write_proc_foo; - <sect1> + <sect1 id="Reading_data"> <title>Reading data</title> <para> @@ -448,7 +448,7 @@ entry->write_proc = write_proc_foo; - <sect1> + <sect1 id="Writing_data"> <title>Writing data</title> <para> @@ -579,7 +579,7 @@ int foo_read_func(char *page, char **start, off_t off, - <sect1> + <sect1 id="Modules"> <title>Modules</title> <para> @@ -599,7 +599,7 @@ entry->owner = THIS_MODULE; - <sect1> + <sect1 id="Mode_and_ownership"> <title>Mode and ownership</title> <para> diff --git a/Documentation/DocBook/rapidio.tmpl b/Documentation/DocBook/rapidio.tmpl index a8b88c47e80..b9e143e28c6 100644 --- a/Documentation/DocBook/rapidio.tmpl +++ b/Documentation/DocBook/rapidio.tmpl @@ -77,11 +77,11 @@ <chapter id="bugs"> <title>Known Bugs and Limitations</title> - <sect1> + <sect1 id="known_bugs"> <title>Bugs</title> <para>None. ;)</para> </sect1> - <sect1> + <sect1 id="Limitations"> <title>Limitations</title> <para> <orderedlist> @@ -100,7 +100,7 @@ on devices, request/map memory region resources, and manage mailboxes/doorbells. </para> - <sect1> + <sect1 id="Functions"> <title>Functions</title> !Iinclude/linux/rio_drv.h !Edrivers/rapidio/rio-driver.c @@ -116,23 +116,23 @@ subsystem. 
</para> - <sect1><title>Structures</title> + <sect1 id="Structures"><title>Structures</title> !Iinclude/linux/rio.h </sect1> - <sect1><title>Enumeration and Discovery</title> + <sect1 id="Enumeration_and_Discovery"><title>Enumeration and Discovery</title> !Idrivers/rapidio/rio-scan.c </sect1> - <sect1><title>Driver functionality</title> + <sect1 id="Driver_functionality"><title>Driver functionality</title> !Idrivers/rapidio/rio.c !Idrivers/rapidio/rio-access.c </sect1> - <sect1><title>Device model support</title> + <sect1 id="Device_model_support"><title>Device model support</title> !Idrivers/rapidio/rio-driver.c </sect1> - <sect1><title>Sysfs support</title> + <sect1 id="Sysfs_support"><title>Sysfs support</title> !Idrivers/rapidio/rio-sysfs.c </sect1> - <sect1><title>PPC32 support</title> + <sect1 id="PPC32_support"><title>PPC32 support</title> !Iarch/powerpc/kernel/rio.c !Earch/powerpc/sysdev/fsl_rio.c !Iarch/powerpc/sysdev/fsl_rio.c diff --git a/Documentation/DocBook/videobook.tmpl b/Documentation/DocBook/videobook.tmpl index b3d93ee2769..89817795e66 100644 --- a/Documentation/DocBook/videobook.tmpl +++ b/Documentation/DocBook/videobook.tmpl @@ -170,7 +170,7 @@ int __init myradio_init(struct video_init *v) <para> The types available are </para> - <table frame="all"><title>Device Types</title> + <table frame="all" id="Device_Types"><title>Device Types</title> <tgroup cols="3" align="left"> <tbody> <row> @@ -291,7 +291,7 @@ static int radio_ioctl(struct video_device *dev, unsigned int cmd, void *arg) allows the applications to find out what sort of a card they have found and to figure out what they want to do about it. The fields in the structure are </para> - <table frame="all"><title>struct video_capability fields</title> + <table frame="all" id="video_capability_fields"><title>struct video_capability fields</title> <tgroup cols="2" align="left"> <tbody> <row> @@ -365,7 +365,7 @@ static int radio_ioctl(struct video_device *dev, unsigned int cmd, void *arg) <para> The video_tuner structure has the following fields </para> - <table frame="all"><title>struct video_tuner fields</title> + <table frame="all" id="video_tuner_fields"><title>struct video_tuner fields</title> <tgroup cols="2" align="left"> <tbody> <row> @@ -398,7 +398,7 @@ static int radio_ioctl(struct video_device *dev, unsigned int cmd, void *arg) </tgroup> </table> - <table frame="all"><title>struct video_tuner flags</title> + <table frame="all" id="video_tuner_flags"><title>struct video_tuner flags</title> <tgroup cols="2" align="left"> <tbody> <row> @@ -421,7 +421,7 @@ static int radio_ioctl(struct video_device *dev, unsigned int cmd, void *arg) </tgroup> </table> - <table frame="all"><title>struct video_tuner modes</title> + <table frame="all" id="video_tuner_modes"><title>struct video_tuner modes</title> <tgroup cols="2" align="left"> <tbody> <row> @@ -572,7 +572,7 @@ static int current_volume=0; <para> Then we fill in the video_audio structure. 
This has the following format </para> - <table frame="all"><title>struct video_audio fields</title> + <table frame="all" id="video_audio_fields"><title>struct video_audio fields</title> <tgroup cols="2" align="left"> <tbody> <row> @@ -607,7 +607,7 @@ static int current_volume=0; </tgroup> </table> - <table frame="all"><title>struct video_audio flags</title> + <table frame="all" id="video_audio_flags"><title>struct video_audio flags</title> <tgroup cols="2" align="left"> <tbody> <row> @@ -625,7 +625,7 @@ static int current_volume=0; </tgroup> </table> - <table frame="all"><title>struct video_audio modes</title> + <table frame="all" id="video_audio_modes"><title>struct video_audio modes</title> <tgroup cols="2" align="left"> <tbody> <row> @@ -775,7 +775,7 @@ module_exit(cleanup); </para> </sect1> </chapter> - <chapter> + <chapter id="Video_Capture_Devices"> <title>Video Capture Devices</title> <sect1 id="introvid"> <title>Video Capture Device Types</title> @@ -855,7 +855,7 @@ static struct video_device my_camera We use the extra video capability flags that did not apply to the radio interface. The video related flags are </para> - <table frame="all"><title>Capture Capabilities</title> + <table frame="all" id="Capture_Capabilities"><title>Capture Capabilities</title> <tgroup cols="2" align="left"> <tbody> <row> @@ -1195,7 +1195,7 @@ static int camera_ioctl(struct video_device *dev, unsigned int cmd, void *arg) inputs to the video card). Our example card has a single camera input. The fields in the structure are </para> - <table frame="all"><title>struct video_channel fields</title> + <table frame="all" id="video_channel_fields"><title>struct video_channel fields</title> <tgroup cols="2" align="left"> <tbody> <row> @@ -1218,7 +1218,7 @@ static int camera_ioctl(struct video_device *dev, unsigned int cmd, void *arg) </tbody> </tgroup> </table> - <table frame="all"><title>struct video_channel flags</title> + <table frame="all" id="video_channel_flags"><title>struct video_channel flags</title> <tgroup cols="2" align="left"> <tbody> <row> @@ -1229,7 +1229,7 @@ static int camera_ioctl(struct video_device *dev, unsigned int cmd, void *arg) </tbody> </tgroup> </table> - <table frame="all"><title>struct video_channel types</title> + <table frame="all" id="video_channel_types"><title>struct video_channel types</title> <tgroup cols="2" align="left"> <tbody> <row> @@ -1242,7 +1242,7 @@ static int camera_ioctl(struct video_device *dev, unsigned int cmd, void *arg) </tbody> </tgroup> </table> - <table frame="all"><title>struct video_channel norms</title> + <table frame="all" id="video_channel_norms"><title>struct video_channel norms</title> <tgroup cols="2" align="left"> <tbody> <row> @@ -1328,7 +1328,7 @@ static int camera_ioctl(struct video_device *dev, unsigned int cmd, void *arg) for every other pixel in the image. The other common formats the interface defines are </para> - <table frame="all"><title>Framebuffer Encodings</title> + <table frame="all" id="Framebuffer_Encodings"><title>Framebuffer Encodings</title> <tgroup cols="2" align="left"> <tbody> <row> @@ -1466,7 +1466,7 @@ static struct video_buffer capture_fb; display. The video_window structure is used to describe the way the image should be displayed. 
</para> - <table frame="all"><title>struct video_window fields</title> + <table frame="all" id="video_window_fields"><title>struct video_window fields</title> <tgroup cols="2" align="left"> <tbody> <row> @@ -1503,7 +1503,7 @@ static struct video_buffer capture_fb; <para> Each clip is a struct video_clip which has the following fields </para> - <table frame="all"><title>video_clip fields</title> + <table frame="all" id="video_clip_fields"><title>video_clip fields</title> <tgroup cols="2" align="left"> <tbody> <row> diff --git a/Documentation/DocBook/z8530book.tmpl b/Documentation/DocBook/z8530book.tmpl index a507876447a..42c75ba71ba 100644 --- a/Documentation/DocBook/z8530book.tmpl +++ b/Documentation/DocBook/z8530book.tmpl @@ -77,7 +77,7 @@ </para> </chapter> - <chapter> + <chapter id="Driver_Modes"> <title>Driver Modes</title> <para> The Z85230 driver layer can drive Z8530, Z85C30 and Z85230 devices @@ -108,7 +108,7 @@ </para> </chapter> - <chapter> + <chapter id="Using_the_Z85230_driver"> <title>Using the Z85230 driver</title> <para> The Z85230 driver provides the back end interface to your board. To @@ -174,7 +174,7 @@ </para> </chapter> - <chapter> + <chapter id="Attaching_Network_Interfaces"> <title>Attaching Network Interfaces</title> <para> If you wish to use the network interface facilities of the driver, @@ -216,7 +216,7 @@ </para> </chapter> - <chapter> + <chapter id="Configuring_And_Activating_The_Port"> <title>Configuring And Activating The Port</title> <para> The Z85230 driver provides helper functions and tables to load the @@ -300,7 +300,7 @@ </para> </chapter> - <chapter> + <chapter id="Network_Layer_Functions"> <title>Network Layer Functions</title> <para> The Z8530 layer provides functions to queue packets for @@ -327,7 +327,7 @@ </para> </chapter> - <chapter> + <chapter id="Porting_The_Z8530_Driver"> <title>Porting The Z8530 Driver</title> <para> The Z8530 driver is written to be portable. In DMA mode it makes diff --git a/Documentation/cgroups.txt b/Documentation/cgroups.txt index 98a26f81fa7..42d7c4cb39c 100644 --- a/Documentation/cgroups.txt +++ b/Documentation/cgroups.txt @@ -456,7 +456,7 @@ methods are create/destroy. Any others that are null are presumed to be successful no-ops. struct cgroup_subsys_state *create(struct cgroup *cont) -LL=cgroup_mutex +(cgroup_mutex held by caller) Called to create a subsystem state object for a cgroup. The subsystem should allocate its subsystem state object for the passed @@ -471,14 +471,19 @@ it's the root of the hierarchy) and may be an appropriate place for initialization code. void destroy(struct cgroup *cont) -LL=cgroup_mutex +(cgroup_mutex held by caller) -The cgroup system is about to destroy the passed cgroup; the -subsystem should do any necessary cleanup +The cgroup system is about to destroy the passed cgroup; the subsystem +should do any necessary cleanup and free its subsystem state +object. By the time this method is called, the cgroup has already been +unlinked from the file system and from the child list of its parent; +cgroup->parent is still valid. (Note - can also be called for a +newly-created cgroup if an error occurs after this subsystem's +create() method has been called for the new cgroup). int can_attach(struct cgroup_subsys *ss, struct cgroup *cont, struct task_struct *task) -LL=cgroup_mutex +(cgroup_mutex held by caller) Called prior to moving a task into a cgroup; if the subsystem returns an error, this will abort the attach operation. 
If a NULL @@ -489,25 +494,20 @@ remain valid while the caller holds cgroup_mutex. void attach(struct cgroup_subsys *ss, struct cgroup *cont, struct cgroup *old_cont, struct task_struct *task) -LL=cgroup_mutex - Called after the task has been attached to the cgroup, to allow any post-attachment activity that requires memory allocations or blocking. void fork(struct cgroup_subsy *ss, struct task_struct *task) -LL=callback_mutex, maybe read_lock(tasklist_lock) Called when a task is forked into a cgroup. Also called during registration for all existing tasks. void exit(struct cgroup_subsys *ss, struct task_struct *task) -LL=callback_mutex Called during task exit int populate(struct cgroup_subsys *ss, struct cgroup *cont) -LL=none Called after creation of a cgroup to allow a subsystem to populate the cgroup directory with file entries. The subsystem should make @@ -524,7 +524,7 @@ example in cpusets, no task may attach before 'cpus' and 'mems' are set up. void bind(struct cgroup_subsys *ss, struct cgroup *root) -LL=callback_mutex +(cgroup_mutex held by caller) Called when a cgroup subsystem is rebound to a different hierarchy and root cgroup. Currently this will only involve movement between diff --git a/Documentation/controllers/memory.txt b/Documentation/controllers/memory.txt new file mode 100644 index 00000000000..b5bbea92a61 --- /dev/null +++ b/Documentation/controllers/memory.txt @@ -0,0 +1,279 @@ +Memory Controller + +Salient features + +a. Enable control of both RSS (mapped) and Page Cache (unmapped) pages +b. The infrastructure allows easy addition of other types of memory to control +c. Provides *zero overhead* for non memory controller users +d. Provides a double LRU: global memory pressure causes reclaim from the + global LRU; a cgroup on hitting a limit, reclaims from the per + cgroup LRU + +NOTE: Swap Cache (unmapped) is not accounted now. + +Benefits and Purpose of the memory controller + +The memory controller isolates the memory behaviour of a group of tasks +from the rest of the system. The article on LWN [12] mentions some probable +uses of the memory controller. The memory controller can be used to + +a. Isolate an application or a group of applications + Memory hungry applications can be isolated and limited to a smaller + amount of memory. +b. Create a cgroup with limited amount of memory, this can be used + as a good alternative to booting with mem=XXXX. +c. Virtualization solutions can control the amount of memory they want + to assign to a virtual machine instance. +d. A CD/DVD burner could control the amount of memory used by the + rest of the system to ensure that burning does not fail due to lack + of available memory. +e. There are several other use cases, find one or use the controller just + for fun (to learn and hack on the VM subsystem). + +1. History + +The memory controller has a long history. A request for comments for the memory +controller was posted by Balbir Singh [1]. At the time the RFC was posted +there were several implementations for memory control. The goal of the +RFC was to build consensus and agreement for the minimal features required +for memory control. The first RSS controller was posted by Balbir Singh[2] +in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the +RSS controller. At OLS, at the resource management BoF, everyone suggested +that we handle both page cache and RSS together. Another request was raised +to allow user space handling of OOM. 
The current memory controller is +at version 6; it combines both mapped (RSS) and unmapped Page +Cache Control [11]. + +2. Memory Control + +Memory is a unique resource in the sense that it is present in a limited +amount. If a task requires a lot of CPU processing, the task can spread +its processing over a period of hours, days, months or years, but with +memory, the same physical memory needs to be reused to accomplish the task. + +The memory controller implementation has been divided into phases. These +are: + +1. Memory controller +2. mlock(2) controller +3. Kernel user memory accounting and slab control +4. user mappings length controller + +The memory controller is the first controller developed. + +2.1. Design + +The core of the design is a counter called the res_counter. The res_counter +tracks the current memory usage and limit of the group of processes associated +with the controller. Each cgroup has a memory controller specific data +structure (mem_cgroup) associated with it. + +2.2. Accounting + + +--------------------+ + | mem_cgroup | + | (res_counter) | + +--------------------+ + / ^ \ + / | \ + +---------------+ | +---------------+ + | mm_struct | |.... | mm_struct | + | | | | | + +---------------+ | +---------------+ + | + + --------------+ + | + +---------------+ +------+--------+ + | page +----------> page_cgroup| + | | | | + +---------------+ +---------------+ + + (Figure 1: Hierarchy of Accounting) + + +Figure 1 shows the important aspects of the controller + +1. Accounting happens per cgroup +2. Each mm_struct knows about which cgroup it belongs to +3. Each page has a pointer to the page_cgroup, which in turn knows the + cgroup it belongs to + +The accounting is done as follows: mem_cgroup_charge() is invoked to setup +the necessary data structures and check if the cgroup that is being charged +is over its limit. If it is then reclaim is invoked on the cgroup. +More details can be found in the reclaim section of this document. +If everything goes well, a page meta-data-structure called page_cgroup is +allocated and associated with the page. This routine also adds the page to +the per cgroup LRU. + +2.2.1 Accounting details + +All mapped pages (RSS) and unmapped user pages (Page Cache) are accounted. +RSS pages are accounted at the time of page_add_*_rmap() unless they've already +been accounted for earlier. A file page will be accounted for as Page Cache; +it's mapped into the page tables of a process, duplicate accounting is carefully +avoided. Page Cache pages are accounted at the time of add_to_page_cache(). +The corresponding routines that remove a page from the page tables or removes +a page from Page Cache is used to decrement the accounting counters of the +cgroup. + +2.3 Shared Page Accounting + +Shared pages are accounted on the basis of the first touch approach. The +cgroup that first touches a page is accounted for the page. The principle +behind this approach is that a cgroup that aggressively uses a shared +page will eventually get charged for it (once it is uncharged from +the cgroup that brought it in -- this will happen on memory pressure). + +2.4 Reclaim + +Each cgroup maintains a per cgroup LRU that consists of an active +and inactive list. When a cgroup goes over its limit, we first try +to reclaim memory from the cgroup so as to make space for the new +pages that the cgroup has touched. If the reclaim is unsuccessful, +an OOM routine is invoked to select and kill the bulkiest task in the +cgroup. 
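A small, self-contained user-space hog makes the limit-then-reclaim-then-OOM behaviour described above easy to observe once a memory cgroup has been set up as shown in section 3 below; the program name and the 64 MB default are placeholders invented for this sketch, and the counters it refers to (memory.usage_in_bytes, memory.failcnt) are the interface files introduced later in this document.

/*
 * Hypothetical test program: run it from a shell that has been moved into
 * a memory cgroup whose memory.limit_in_bytes is lower than the amount
 * touched here, then watch memory.usage_in_bytes and memory.failcnt while
 * it runs.  With no swap available, the OOM path described above should
 * eventually kill it.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* default to 64 MB of anonymous memory; the size is arbitrary */
	size_t mb = (argc > 1) ? strtoul(argv[1], NULL, 0) : 64;
	size_t i;

	for (i = 0; i < mb; i++) {
		char *p = malloc(1 << 20);

		if (!p) {
			perror("malloc");
			return 1;
		}
		memset(p, 0xaa, 1 << 20);	/* touch the pages so they are actually charged */
	}
	printf("touched %zu MB of anonymous memory\n", mb);
	pause();	/* keep the charge alive until the task is killed or interrupted */
	return 0;
}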
+ +The reclaim algorithm has not been modified for cgroups, except that +pages that are selected for reclaiming come from the per cgroup LRU +list. + +2. Locking + +The memory controller uses the following hierarchy + +1. zone->lru_lock is used for selecting pages to be isolated +2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone) +3. lock_page_cgroup() is used to protect page->page_cgroup + +3. User Interface + +0. Configuration + +a. Enable CONFIG_CGROUPS +b. Enable CONFIG_RESOURCE_COUNTERS +c. Enable CONFIG_CGROUP_MEM_CONT + +1. Prepare the cgroups +# mkdir -p /cgroups +# mount -t cgroup none /cgroups -o memory + +2. Make the new group and move bash into it +# mkdir /cgroups/0 +# echo $$ > /cgroups/0/tasks + +Since now we're in the 0 cgroup, +We can alter the memory limit: +# echo -n 4M > /cgroups/0/memory.limit_in_bytes + +NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, +mega or gigabytes. + +# cat /cgroups/0/memory.limit_in_bytes +4194304 Bytes + +NOTE: The interface has now changed to display the usage in bytes +instead of pages + +We can check the usage: +# cat /cgroups/0/memory.usage_in_bytes +1216512 Bytes + +A successful write to this file does not guarantee a successful set of +this limit to the value written into the file. This can be due to a +number of factors, such as rounding up to page boundaries or the total +availability of memory on the system. The user is required to re-read +this file after a write to guarantee the value committed by the kernel. + +# echo -n 1 > memory.limit_in_bytes +# cat memory.limit_in_bytes +4096 Bytes + +The memory.failcnt field gives the number of times that the cgroup limit was +exceeded. + +The memory.stat file gives accounting information. Now, the number of +caches, RSS and Active pages/Inactive pages are shown. + +The memory.force_empty gives an interface to drop *all* charges by force. + +# echo -n 1 > memory.force_empty + +will drop all charges in cgroup. Currently, this is maintained for test. + +4. Testing + +Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11]. +Apart from that v6 has been tested with several applications and regular +daily use. The controller has also been tested on the PPC64, x86_64 and +UML platforms. + +4.1 Troubleshooting + +Sometimes a user might find that the application under a cgroup is +terminated. There are several causes for this: + +1. The cgroup limit is too low (just too low to do anything useful) +2. The user is using anonymous memory and swap is turned off or too low + +A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of +some of the pages cached in the cgroup (page cache pages). + +4.2 Task migration + +When a task migrates from one cgroup to another, it's charge is not +carried forward. The pages allocated from the original cgroup still +remain charged to it, the charge is dropped when the page is freed or +reclaimed. + +4.3 Removing a cgroup + +A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a +cgroup might have some charge associated with it, even though all +tasks have migrated away from it. Such charges are automatically dropped at +rmdir() if there are no tasks. + +4.4 Choosing what to account -- Page Cache (unmapped) vs RSS (mapped)? + +The type of memory accounted by the cgroup can be limited to just +mapped pages by writing "1" to memory.control_type field + +echo -n 1 > memory.control_type + +5. TODO + +1. Add support for accounting huge pages (as a separate controller) +2. 
Make per-cgroup scanner reclaim not-shared pages first +3. Teach controller to account for shared-pages +4. Start reclamation when the limit is lowered +5. Start reclamation in the background when the limit is + not yet hit but the usage is getting closer + +Summary + +Overall, the memory controller has been a stable controller and has been +commented and discussed quite extensively in the community. + +References + +1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/ +2. Singh, Balbir. Memory Controller (RSS Control), + http://lwn.net/Articles/222762/ +3. Emelianov, Pavel. Resource controllers based on process cgroups + http://lkml.org/lkml/2007/3/6/198 +4. Emelianov, Pavel. RSS controller based on process cgroups (v2) + http://lkml.org/lkml/2007/4/9/74 +5. Emelianov, Pavel. RSS controller based on process cgroups (v3) + http://lkml.org/lkml/2007/5/30/244 +6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/ +7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control + subsystem (v3), http://lwn.net/Articles/235534/ +8. Singh, Balbir. RSS controller V2 test results (lmbench), + http://lkml.org/lkml/2007/5/17/232 +9. Singh, Balbir. RSS controller V2 AIM9 results + http://lkml.org/lkml/2007/5/18/1 +10. Singh, Balbir. Memory controller v6 results, + http://lkml.org/lkml/2007/8/19/36 +11. Singh, Balbir. Memory controller v6, http://lkml.org/lkml/2007/8/17/69 +12. Corbet, Jonathan, Controlling memory use in cgroups, + http://lwn.net/Articles/243795/ diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index 141bef1c859..43db6fe1281 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt @@ -523,21 +523,14 @@ from one cpuset to another, then the kernel will adjust the tasks memory placement, as above, the next time that the kernel attempts to allocate a page of memory for that task. -If a cpuset has its CPUs modified, then each task using that -cpuset does _not_ change its behavior automatically. In order to -minimize the impact on the critical scheduling code in the kernel, -tasks will continue to use their prior CPU placement until they -are rebound to their cpuset, by rewriting their pid to the 'tasks' -file of their cpuset. If a task had been bound to some subset of its -cpuset using the sched_setaffinity() call, and if any of that subset -is still allowed in its new cpuset settings, then the task will be -restricted to the intersection of the CPUs it was allowed on before, -and its new cpuset CPU placement. If, on the other hand, there is -no overlap between a tasks prior placement and its new cpuset CPU -placement, then the task will be allowed to run on any CPU allowed -in its new cpuset. If a task is moved from one cpuset to another, -its CPU placement is updated in the same way as if the tasks pid is -rewritten to the 'tasks' file of its current cpuset. +If a cpuset has its 'cpus' modified, then each task in that cpuset +will have its allowed CPU placement changed immediately. Similarly, +if a tasks pid is written to a cpusets 'tasks' file, in either its +current cpuset or another cpuset, then its allowed CPU placement is +changed immediately. If such a task had been bound to some subset +of its cpuset using the sched_setaffinity() call, the task will be +allowed to run on any CPU allowed in its new cpuset, negating the +affect of the prior sched_setaffinity() call. 
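For reference, the sched_setaffinity() call mentioned above is the ordinary user-space affinity interface; a minimal sketch of its use follows. Under the new semantics described in this hunk, the mask set here is simply overridden if the task's cpuset 'cpus' file is later changed or its pid is written to another cpuset's 'tasks' file.

/*
 * Minimal sketch: pin the calling task to CPUs 0 and 1.  Per the text
 * above, a subsequent change to the task's cpuset placement negates
 * this mask.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);
	CPU_SET(1, &mask);

	if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
		perror("sched_setaffinity");
		return 1;
	}
	return 0;
}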
In summary, the memory placement of a task whose cpuset is changed is updated by the kernel, on the next allocation of a page for that task, diff --git a/Documentation/drivers/edac/edac.txt b/Documentation/edac.txt index a5c36842ece..a5c36842ece 100644 --- a/Documentation/drivers/edac/edac.txt +++ b/Documentation/edac.txt diff --git a/Documentation/email-clients.txt b/Documentation/email-clients.txt index 113165b4830..2ebb94d6ed8 100644 --- a/Documentation/email-clients.txt +++ b/Documentation/email-clients.txt @@ -170,7 +170,6 @@ Sylpheed (GUI) - Works well for inlining text (or using attachments). - Allows use of an external editor. -- Not good for IMAP. - Is slow on large folders. - Won't do TLS SMTP auth over a non-SSL connection. - Has a helpful ruler bar in the compose window. diff --git a/Documentation/fb/deferred_io.txt b/Documentation/fb/deferred_io.txt index 63883a89212..74832837025 100644 --- a/Documentation/fb/deferred_io.txt +++ b/Documentation/fb/deferred_io.txt @@ -7,10 +7,10 @@ IO. The following example may be a useful explanation of how one such setup works: - userspace app like Xfbdev mmaps framebuffer -- deferred IO and driver sets up nopage and page_mkwrite handlers +- deferred IO and driver sets up fault and page_mkwrite handlers - userspace app tries to write to mmaped vaddress -- we get pagefault and reach nopage handler -- nopage handler finds and returns physical page +- we get pagefault and reach fault handler +- fault handler finds and returns physical page - we get page_mkwrite where we add this page to a list - schedule a workqueue task to be run after a delay - app continues writing to that page with no additional cost. this is diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index a7d9d179131..17b1659bd3f 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -6,14 +6,6 @@ be removed from this file. --------------------------- -What: MXSER -When: December 2007 -Why: Old mxser driver is obsoleted by the mxser_new. Give it some time yet - and remove it. -Who: Jiri Slaby <jirislaby@gmail.com> - ---------------------------- - What: dev->power.power_state When: July 2007 Why: Broken design for runtime control over driver power states, confusing @@ -208,13 +200,6 @@ Who: Randy Dunlap <randy.dunlap@oracle.com> --------------------------- -What: drivers depending on OSS_OBSOLETE -When: options in 2.6.23, code in 2.6.25 -Why: obsolete OSS drivers -Who: Adrian Bunk <bunk@stusta.de> - ---------------------------- - What: libata spindown skipping and warning When: Dec 2008 Why: Some halt(8) implementations synchronize caches for and spin diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index 1de155e2dc3..e68021c08fb 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX @@ -32,6 +32,8 @@ directory-locking - info about the locking scheme used for directory operations. dlmfs.txt - info on the userspace interface to the OCFS2 DLM. +dnotify.txt + - info about directory notification in Linux. ecryptfs.txt - docs on eCryptfs: stacked cryptographic filesystem for Linux. ext2.txt @@ -80,6 +82,8 @@ relay.txt - info on relay, for efficient streaming from kernel to user space. romfs.txt - description of the ROMFS filesystem. +sharedsubtree.txt + - a description of shared subtrees for namespaces. smbfs.txt - info on using filesystems with the SMB protocol (Win 3.11 and NT). 
spufs.txt diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 37c10cba717..42d4b30b104 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -90,7 +90,6 @@ of the locking scheme for directory operations. prototypes: struct inode *(*alloc_inode)(struct super_block *sb); void (*destroy_inode)(struct inode *); - void (*read_inode) (struct inode *); void (*dirty_inode) (struct inode *); int (*write_inode) (struct inode *, int); void (*put_inode) (struct inode *); @@ -114,7 +113,6 @@ locking rules: BKL s_lock s_umount alloc_inode: no no no destroy_inode: no -read_inode: no (see below) dirty_inode: no (must not sleep) write_inode: no put_inode: no @@ -133,7 +131,6 @@ show_options: no (vfsmount->sem) quota_read: no no no (see below) quota_write: no no no (see below) -->read_inode() is not a method - it's a callback used in iget(). ->remount_fs() will have the s_umount lock if it's already mounted. When called from get_sb_single, it does NOT have the s_umount lock. ->quota_read() and ->quota_write() functions are both guaranteed to diff --git a/Documentation/dnotify.txt b/Documentation/filesystems/dnotify.txt index 6984fca6002..9f5d338ddbb 100644 --- a/Documentation/dnotify.txt +++ b/Documentation/filesystems/dnotify.txt @@ -69,24 +69,24 @@ Example #include <signal.h> #include <stdio.h> #include <unistd.h> - + static volatile int event_fd; - + static void handler(int sig, siginfo_t *si, void *data) { event_fd = si->si_fd; } - + int main(void) { struct sigaction act; int fd; - + act.sa_sigaction = handler; sigemptyset(&act.sa_mask); act.sa_flags = SA_SIGINFO; sigaction(SIGRTMIN + 1, &act, NULL); - + fd = open(".", O_RDONLY); fcntl(fd, F_SETSIG, SIGRTMIN + 1); fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT); diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index 0f33c77bc14..92b888d540a 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -34,8 +34,8 @@ FOO_I(inode) (see in-tree filesystems for examples). Make them ->alloc_inode and ->destroy_inode in your super_operations. -Keep in mind that now you need explicit initialization of private data - -typically in ->read_inode() and after getting an inode from new_inode(). +Keep in mind that now you need explicit initialization of private data +typically between calling iget_locked() and unlocking the inode. At some point that will become mandatory. @@ -173,10 +173,10 @@ should be a non-blocking function that initializes those parts of a newly created inode to allow the test function to succeed. 'data' is passed as an opaque value to both test and set functions. -When the inode has been created by iget5_locked(), it will be returned with -the I_NEW flag set and will still be locked. read_inode has not been -called so the file system still has to finalize the initialization. Once -the inode is initialized it must be unlocked by calling unlock_new_inode(). +When the inode has been created by iget5_locked(), it will be returned with the +I_NEW flag set and will still be locked. The filesystem then needs to finalize +the initialization. Once the inode is initialized it must be unlocked by +calling unlock_new_inode(). The filesystem is responsible for setting (and possibly testing) i_ino when appropriate. There is also a simpler iget_locked function that @@ -184,11 +184,19 @@ just takes the superblock and inode number as arguments and does the test and set for you. e.g. 
- inode = iget_locked(sb, ino); - if (inode->i_state & I_NEW) { - read_inode_from_disk(inode); - unlock_new_inode(inode); - } + inode = iget_locked(sb, ino); + if (inode->i_state & I_NEW) { + err = read_inode_from_disk(inode); + if (err < 0) { + iget_failed(inode); + return err; + } + unlock_new_inode(inode); + } + +Note that if the process of setting up a new inode fails, then iget_failed() +should be called on the inode to render it dead, and an appropriate error +should be passed back to the caller. --- [recommended] diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index e2799b5fafe..5681e2fa149 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -1029,6 +1029,14 @@ nr_inodes Denotes the number of inodes the system has allocated. This number will grow and shrink dynamically. +nr_open +------- + +Denotes the maximum number of file-handles a process can +allocate. Default value is 1024*1024 (1048576) which should be +enough for most machines. Actual limit depends on RLIMIT_NOFILE +resource limit. + nr_free_inodes -------------- diff --git a/Documentation/sharedsubtree.txt b/Documentation/filesystems/sharedsubtree.txt index 736540045dc..736540045dc 100644 --- a/Documentation/sharedsubtree.txt +++ b/Documentation/filesystems/sharedsubtree.txt diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 9d019d35728..bd55038b56f 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -203,8 +203,6 @@ struct super_operations { struct inode *(*alloc_inode)(struct super_block *sb); void (*destroy_inode)(struct inode *); - void (*read_inode) (struct inode *); - void (*dirty_inode) (struct inode *); int (*write_inode) (struct inode *, int); void (*put_inode) (struct inode *); @@ -242,15 +240,6 @@ or bottom half). ->alloc_inode was defined and simply undoes anything done by ->alloc_inode. - read_inode: this method is called to read a specific inode from the - mounted filesystem. The i_ino member in the struct inode is - initialized by the VFS to indicate which inode to read. Other - members are filled in by this method. - - You can set this to NULL and use iget5_locked() instead of iget() - to read inodes. This is necessary for filesystems for which the - inode number is not sufficient to identify an inode. - dirty_inode: this method is called by the VFS to mark an inode dirty. write_inode: this method is called when the VFS needs to write an @@ -308,9 +297,9 @@ or bottom half). quota_write: called by the VFS to write to filesystem quota file. -The read_inode() method is responsible for filling in the "i_op" -field. This is a pointer to a "struct inode_operations" which -describes the methods that can be performed on individual inodes. +Whoever sets up the inode is responsible for filling in the "i_op" field. This +is a pointer to a "struct inode_operations" which describes the methods that +can be performed on individual inodes. The Inode Object diff --git a/Documentation/kprobes.txt b/Documentation/kprobes.txt index 53a63890aea..30c101761d0 100644 --- a/Documentation/kprobes.txt +++ b/Documentation/kprobes.txt @@ -96,7 +96,9 @@ or in registers (e.g., for x86_64 or for an i386 fastcall function). The jprobe will work in either case, so long as the handler's prototype matches that of the probed function. -1.3 How Does a Return Probe Work? +1.3 Return Probes + +1.3.1 How Does a Return Probe Work? 
When you call register_kretprobe(), Kprobes establishes a kprobe at the entry to the function. When the probed function is called and this @@ -107,9 +109,9 @@ At boot time, Kprobes registers a kprobe at the trampoline. When the probed function executes its return instruction, control passes to the trampoline and that probe is hit. Kprobes' trampoline -handler calls the user-specified handler associated with the kretprobe, -then sets the saved instruction pointer to the saved return address, -and that's where execution resumes upon return from the trap. +handler calls the user-specified return handler associated with the +kretprobe, then sets the saved instruction pointer to the saved return +address, and that's where execution resumes upon return from the trap. While the probed function is executing, its return address is stored in an object of type kretprobe_instance. Before calling @@ -131,6 +133,30 @@ zero when the return probe is registered, and is incremented every time the probed function is entered but there is no kretprobe_instance object available for establishing the return probe. +1.3.2 Kretprobe entry-handler + +Kretprobes also provides an optional user-specified handler which runs +on function entry. This handler is specified by setting the entry_handler +field of the kretprobe struct. Whenever the kprobe placed by kretprobe at the +function entry is hit, the user-defined entry_handler, if any, is invoked. +If the entry_handler returns 0 (success) then a corresponding return handler +is guaranteed to be called upon function return. If the entry_handler +returns a non-zero error then Kprobes leaves the return address as is, and +the kretprobe has no further effect for that particular function instance. + +Multiple entry and return handler invocations are matched using the unique +kretprobe_instance object associated with them. Additionally, a user +may also specify per return-instance private data to be part of each +kretprobe_instance object. This is especially useful when sharing private +data between corresponding user entry and return handlers. The size of each +private data object can be specified at kretprobe registration time by +setting the data_size field of the kretprobe struct. This data can be +accessed through the data field of each kretprobe_instance object. + +In case probed function is entered but there is no kretprobe_instance +object available, then in addition to incrementing the nmissed count, +the user entry_handler invocation is also skipped. + 2. Architectures Supported Kprobes, jprobes, and return probes are implemented on the following @@ -274,6 +300,8 @@ of interest: - ret_addr: the return address - rp: points to the corresponding kretprobe object - task: points to the corresponding task struct +- data: points to per return-instance private data; see "Kretprobe + entry-handler" for details. The regs_return_value(regs) macro provides a simple abstraction to extract the return value from the appropriate register as defined by @@ -556,23 +584,52 @@ report failed calls to sys_open(). #include <linux/kernel.h> #include <linux/module.h> #include <linux/kprobes.h> +#include <linux/ktime.h> + +/* per-instance private data */ +struct my_data { + ktime_t entry_stamp; +}; static const char *probed_func = "sys_open"; -/* Return-probe handler: If the probed function fails, log the return value. */ -static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs) +/* Timestamp function entry. 
*/ +static int entry_handler(struct kretprobe_instance *ri, struct pt_regs *regs) +{ + struct my_data *data; + + if(!current->mm) + return 1; /* skip kernel threads */ + + data = (struct my_data *)ri->data; + data->entry_stamp = ktime_get(); + return 0; +} + +/* If the probed function failed, log the return value and duration. + * Duration may turn out to be zero consistently, depending upon the + * granularity of time accounting on the platform. */ +static int return_handler(struct kretprobe_instance *ri, struct pt_regs *regs) { int retval = regs_return_value(regs); + struct my_data *data = (struct my_data *)ri->data; + s64 delta; + ktime_t now; + if (retval < 0) { - printk("%s returns %d\n", probed_func, retval); + now = ktime_get(); + delta = ktime_to_ns(ktime_sub(now, data->entry_stamp)); + printk("%s: return val = %d (duration = %lld ns)\n", + probed_func, retval, delta); } return 0; } static struct kretprobe my_kretprobe = { - .handler = ret_handler, - /* Probe up to 20 instances concurrently. */ - .maxactive = 20 + .handler = return_handler, + .entry_handler = entry_handler, + .data_size = sizeof(struct my_data), + .maxactive = 20, /* probe up to 20 instances concurrently */ }; static int __init kretprobe_init(void) @@ -584,7 +641,7 @@ static int __init kretprobe_init(void) printk("register_kretprobe failed, returned %d\n", ret); return -1; } - printk("Planted return probe at %p\n", my_kretprobe.kp.addr); + printk("Kretprobe active on %s\n", my_kretprobe.kp.symbol_name); return 0; } @@ -594,7 +651,7 @@ static void __exit kretprobe_exit(void) printk("kretprobe unregistered\n"); /* nmissed > 0 suggests that maxactive was set too low. */ printk("Missed probing %d instances of %s\n", - my_kretprobe.nmissed, probed_func); + my_kretprobe.nmissed, probed_func); } module_init(kretprobe_init) diff --git a/Documentation/kref.txt b/Documentation/kref.txt index f38b59d00c6..130b6e87aa7 100644 --- a/Documentation/kref.txt +++ b/Documentation/kref.txt @@ -141,10 +141,10 @@ The last rule (rule 3) is the nastiest one to handle. Say, for instance, you have a list of items that are each kref-ed, and you wish to get the first one. You can't just pull the first item off the list and kref_get() it. That violates rule 3 because you are not already -holding a valid pointer. You must add locks or semaphores. For -instance: +holding a valid pointer. You must add a mutex (or some other lock). 
+For instance: -static DECLARE_MUTEX(sem); +static DEFINE_MUTEX(mutex); static LIST_HEAD(q); struct my_data { @@ -155,12 +155,12 @@ struct my_data static struct my_data *get_entry() { struct my_data *entry = NULL; - down(&sem); + mutex_lock(&mutex); if (!list_empty(&q)) { entry = container_of(q.next, struct my_q_entry, link); kref_get(&entry->refcount); } - up(&sem); + mutex_unlock(&mutex); return entry; } @@ -174,9 +174,9 @@ static void release_entry(struct kref *ref) static void put_entry(struct my_data *entry) { - down(&sem); + mutex_lock(&mutex); kref_put(&entry->refcount, release_entry); - up(&sem); + mutex_unlock(&mutex); } The kref_put() return value is useful if you do not want to hold the @@ -191,13 +191,13 @@ static void release_entry(struct kref *ref) static void put_entry(struct my_data *entry) { - down(&sem); + mutex_lock(&mutex); if (kref_put(&entry->refcount, release_entry)) { list_del(&entry->link); - up(&sem); + mutex_unlock(&mutex); kfree(entry); } else - up(&sem); + mutex_unlock(&mutex); } This is really more useful if you have to call other routines as part diff --git a/Documentation/md.txt b/Documentation/md.txt index 5818628207b..396cdd982c2 100644 --- a/Documentation/md.txt +++ b/Documentation/md.txt @@ -416,6 +416,16 @@ also have sectors in total that could need to be processed. The two numbers are separated by a '/' thus effectively showing one value, a fraction of the process that is complete. + A 'select' on this attribute will return when resync completes, + when it reaches the current sync_max (below) and possibly at + other times. + + sync_max + This is a number of sectors at which point a resync/recovery + process will pause. When a resync is active, the value can + only ever be increased, never decreased. The value of 'max' + effectively disables the limit. + sync_speed This shows the current actual speed, in K/sec, of the current diff --git a/Documentation/rtc.txt b/Documentation/rtc.txt index e20b19c1b60..8deffcd68cb 100644 --- a/Documentation/rtc.txt +++ b/Documentation/rtc.txt @@ -182,8 +182,8 @@ driver returns ENOIOCTLCMD. Some common examples: since the frequency is stored in the irq_freq member of the rtc_device structure. Your driver needs to initialize the irq_freq member during init. Make sure you check the requested frequency is in range of your - hardware in the irq_set_freq function. If you cannot actually change - the frequency, just return -ENOTTY. + hardware in the irq_set_freq function. If it isn't, return -EINVAL. If + you cannot actually change the frequency, do not define irq_set_freq. If all else fails, check out the rtc-test.c driver! 
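To make the irq_set_freq advice in the hunk above concrete, here is a minimal sketch of such a callback. It assumes the usual (struct device *, int freq) prototype from struct rtc_class_ops and that the driver stored its struct rtc_device as driver data at probe time; the foo_rtc name, the 2-8192 Hz power-of-two range, and the hardware write are invented for illustration and are not part of this patch.

#include <linux/rtc.h>
#include <linux/log2.h>

/* Hypothetical callback: reject rates the (imaginary) hardware cannot
 * generate with -EINVAL, otherwise program the hardware and record the
 * new rate in the irq_freq member mentioned above. */
static int foo_rtc_irq_set_freq(struct device *dev, int freq)
{
	struct rtc_device *rtc = dev_get_drvdata(dev);

	if (freq < 2 || freq > 8192 || !is_power_of_2(freq))
		return -EINVAL;

	/* ... write the new rate to the (imaginary) hardware here ... */
	rtc->irq_freq = freq;
	return 0;
}

A driver that cannot change the periodic rate at all would simply leave the irq_set_freq member of its rtc_class_ops unset, as recommended above.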
@@ -268,8 +268,8 @@ int main(int argc, char **argv) /* This read will block */ retval = read(fd, &data, sizeof(unsigned long)); if (retval == -1) { - perror("read"); - exit(errno); + perror("read"); + exit(errno); } fprintf(stderr, " %d",i); fflush(stderr); @@ -326,11 +326,11 @@ test_READ: rtc_tm.tm_sec %= 60; rtc_tm.tm_min++; } - if (rtc_tm.tm_min == 60) { + if (rtc_tm.tm_min == 60) { rtc_tm.tm_min = 0; rtc_tm.tm_hour++; } - if (rtc_tm.tm_hour == 24) + if (rtc_tm.tm_hour == 24) rtc_tm.tm_hour = 0; retval = ioctl(fd, RTC_ALM_SET, &rtc_tm); @@ -407,8 +407,8 @@ test_PIE: "\n...Periodic IRQ rate is fixed\n"); goto done; } - perror("RTC_IRQP_SET ioctl"); - exit(errno); + perror("RTC_IRQP_SET ioctl"); + exit(errno); } fprintf(stderr, "\n%ldHz:\t", tmp); @@ -417,27 +417,27 @@ test_PIE: /* Enable periodic interrupts */ retval = ioctl(fd, RTC_PIE_ON, 0); if (retval == -1) { - perror("RTC_PIE_ON ioctl"); - exit(errno); + perror("RTC_PIE_ON ioctl"); + exit(errno); } for (i=1; i<21; i++) { - /* This blocks */ - retval = read(fd, &data, sizeof(unsigned long)); - if (retval == -1) { - perror("read"); - exit(errno); - } - fprintf(stderr, " %d",i); - fflush(stderr); - irqcount++; + /* This blocks */ + retval = read(fd, &data, sizeof(unsigned long)); + if (retval == -1) { + perror("read"); + exit(errno); + } + fprintf(stderr, " %d",i); + fflush(stderr); + irqcount++; } /* Disable periodic interrupts */ retval = ioctl(fd, RTC_PIE_OFF, 0); if (retval == -1) { - perror("RTC_PIE_OFF ioctl"); - exit(errno); + perror("RTC_PIE_OFF ioctl"); + exit(errno); } } diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX new file mode 100644 index 00000000000..b5f5ca069b2 --- /dev/null +++ b/Documentation/scheduler/00-INDEX @@ -0,0 +1,16 @@ +00-INDEX + - this file. +sched-arch.txt + - CPU Scheduler implementation hints for architecture specific code. +sched-coding.txt + - reference for various scheduler-related methods in the O(1) scheduler. +sched-design.txt + - goals, design and implementation of the Linux O(1) scheduler. +sched-design-CFS.txt + - goals, design and implementation of the Complete Fair Scheduler. +sched-domains.txt + - information on scheduling domains. +sched-nice-design.txt + - How and why the scheduler's nice levels are implemented. +sched-stats.txt + - information on schedstats (Linux Scheduler Statistics). 
diff --git a/Documentation/sched-arch.txt b/Documentation/scheduler/sched-arch.txt index 941615a9769..941615a9769 100644 --- a/Documentation/sched-arch.txt +++ b/Documentation/scheduler/sched-arch.txt diff --git a/Documentation/sched-coding.txt b/Documentation/scheduler/sched-coding.txt index cbd8db752ac..cbd8db752ac 100644 --- a/Documentation/sched-coding.txt +++ b/Documentation/scheduler/sched-coding.txt diff --git a/Documentation/sched-design-CFS.txt b/Documentation/scheduler/sched-design-CFS.txt index 88bcb876733..88bcb876733 100644 --- a/Documentation/sched-design-CFS.txt +++ b/Documentation/scheduler/sched-design-CFS.txt diff --git a/Documentation/sched-design.txt b/Documentation/scheduler/sched-design.txt index 1605bf0cba8..1605bf0cba8 100644 --- a/Documentation/sched-design.txt +++ b/Documentation/scheduler/sched-design.txt diff --git a/Documentation/sched-domains.txt b/Documentation/scheduler/sched-domains.txt index a9e990ab980..a9e990ab980 100644 --- a/Documentation/sched-domains.txt +++ b/Documentation/scheduler/sched-domains.txt diff --git a/Documentation/sched-nice-design.txt b/Documentation/scheduler/sched-nice-design.txt index e2bae5a577e..e2bae5a577e 100644 --- a/Documentation/sched-nice-design.txt +++ b/Documentation/scheduler/sched-nice-design.txt diff --git a/Documentation/sched-stats.txt b/Documentation/scheduler/sched-stats.txt index 442e14d35de..442e14d35de 100644 --- a/Documentation/sched-stats.txt +++ b/Documentation/scheduler/sched-stats.txt diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt index aa986a35e99..f99254327ae 100644 --- a/Documentation/sysctl/fs.txt +++ b/Documentation/sysctl/fs.txt @@ -23,6 +23,7 @@ Currently, these files are in /proc/sys/fs: - inode-max - inode-nr - inode-state +- nr_open - overflowuid - overflowgid - suid_dumpable @@ -91,6 +92,15 @@ usage of file handles and you don't need to increase the maximum. ============================================================== +nr_open: + +This denotes the maximum number of file-handles a process can +allocate. Default value is 1024*1024 (1048576) which should be +enough for most machines. Actual limit depends on RLIMIT_NOFILE +resource limit. + +============================================================== + inode-max, inode-nr & inode-state: As with file handles, the kernel allocates the inode structures diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 24eac1bc735..8a4863c4edd 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/vm: - min_unmapped_ratio - min_slab_ratio - panic_on_oom +- oom_dump_tasks - oom_kill_allocating_task - mmap_min_address - numa_zonelist_order @@ -232,6 +233,27 @@ according to your policy of failover. ============================================================= +oom_dump_tasks + +Enables a system-wide task dump (excluding kernel threads) to be +produced when the kernel performs an OOM-killing and includes such +information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and +name. This is helpful to determine why the OOM killer was invoked +and to identify the rogue task that caused it. + +If this is set to zero, this information is suppressed. On very +large systems with thousands of tasks it may not be feasible to dump +the memory state information for each one. Such systems should not +be forced to incur a performance penalty in OOM conditions when the +information may not be desired. 
+ +If this is set to non-zero, this information is shown whenever the +OOM killer actually kills a memory-hogging task. + +The default value is 0. + +============================================================= + +oom_kill_allocating_task This enables or disables killing the OOM-triggering task in diff --git a/Documentation/unaligned-memory-access.txt b/Documentation/unaligned-memory-access.txt new file mode 100644 index 00000000000..6223eace3c0 --- /dev/null +++ b/Documentation/unaligned-memory-access.txt @@ -0,0 +1,226 @@ +UNALIGNED MEMORY ACCESSES +========================= + +Linux runs on a wide variety of architectures which have varying behaviour +when it comes to memory access. This document presents some details about +unaligned accesses, why you need to write code that doesn't cause them, +and how to write such code! + + +The definition of an unaligned access +===================================== + +Unaligned memory accesses occur when you try to read N bytes of data starting +from an address that is not evenly divisible by N (i.e. addr % N != 0). +For example, reading 4 bytes of data from address 0x10004 is fine, but +reading 4 bytes of data from address 0x10005 would be an unaligned memory +access. + +The above may seem a little vague, as memory access can happen in different +ways. The context here is at the machine code level: certain instructions read +or write a number of bytes to or from memory (e.g. movb, movw, movl in x86 +assembly). As will become clear, it is relatively easy to spot C statements +which will compile to multiple-byte memory access instructions, namely when +dealing with types such as u16, u32 and u64. + + +Natural alignment +================= + +The rule mentioned above forms what we refer to as natural alignment: +When accessing N bytes of memory, the base memory address must be evenly +divisible by N, i.e. addr % N == 0. + +When writing code, assume the target architecture has natural alignment +requirements. + +In reality, only a few architectures require natural alignment on all sizes +of memory access. However, we must consider ALL supported architectures; +writing code that satisfies natural alignment requirements is the easiest way +to achieve full portability. + + +Why unaligned access is bad +=========================== + +The effects of performing an unaligned memory access vary from architecture +to architecture. It would be easy to write a whole document on the differences +here; a summary of the common scenarios is presented below: + + - Some architectures are able to perform unaligned memory accesses + transparently, but there is usually a significant performance cost. + - Some architectures raise processor exceptions when unaligned accesses + happen. The exception handler is able to correct the unaligned access, + at significant cost to performance. + - Some architectures raise processor exceptions when unaligned accesses + happen, but the exceptions do not contain enough information for the + unaligned access to be corrected. + - Some architectures are not capable of unaligned memory access, but will + silently perform a different memory access to the one that was requested, + resulting in a subtle code bug that is hard to detect! + +It should be obvious from the above that if your code causes unaligned +memory accesses to happen, your code will not work correctly on certain +platforms and will cause performance problems on others.
+ + +Code that does not cause unaligned access +========================================= + +At first, the concepts above may seem a little hard to relate to actual +coding practice. After all, you don't have a great deal of control over +memory addresses of certain variables, etc. + +Fortunately things are not too complex, as in most cases, the compiler +ensures that things will work for you. For example, take the following +structure: + + struct foo { + u16 field1; + u32 field2; + u8 field3; + }; + +Let us assume that an instance of the above structure resides in memory +starting at address 0x10000. With a basic level of understanding, it would +not be unreasonable to expect that accessing field2 would cause an unaligned +access. You'd be expecting field2 to be located at offset 2 bytes into the +structure, i.e. address 0x10002, but that address is not evenly divisible +by 4 (remember, we're reading a 4 byte value here). + +Fortunately, the compiler understands the alignment constraints, so in the +above case it would insert 2 bytes of padding in between field1 and field2. +Therefore, for standard structure types you can always rely on the compiler +to pad structures so that accesses to fields are suitably aligned (assuming +you do not cast the field to a type of different length). + +Similarly, you can also rely on the compiler to align variables and function +parameters to a naturally aligned scheme, based on the size of the type of +the variable. + +At this point, it should be clear that accessing a single byte (u8 or char) +will never cause an unaligned access, because all memory addresses are evenly +divisible by one. + +On a related topic, with the above considerations in mind you may observe +that you could reorder the fields in the structure in order to place fields +where padding would otherwise be inserted, and hence reduce the overall +resident memory size of structure instances. The optimal layout of the +above example is: + + struct foo { + u32 field2; + u16 field1; + u8 field3; + }; + +For a natural alignment scheme, the compiler would only have to add a single +byte of padding at the end of the structure. This padding is added in order +to satisfy alignment constraints for arrays of these structures. + +Another point worth mentioning is the use of __attribute__((packed)) on a +structure type. This GCC-specific attribute tells the compiler never to +insert any padding within structures, useful when you want to use a C struct +to represent some data that comes in a fixed arrangement 'off the wire'. + +You might be inclined to believe that usage of this attribute can easily +lead to unaligned accesses when accessing fields that do not satisfy +architectural alignment requirements. However, again, the compiler is aware +of the alignment constraints and will generate extra instructions to perform +the memory access in a way that does not cause unaligned access. Of course, +the extra instructions obviously cause a loss in performance compared to the +non-packed case, so the packed attribute should only be used when avoiding +structure padding is of importance. + + +Code that causes unaligned access +================================= + +With the above in mind, let's move onto a real life example of a function +that can cause an unaligned memory access. The following function adapted +from include/linux/etherdevice.h is an optimized routine to compare two +ethernet MAC addresses for equality. 
+ +unsigned int compare_ether_addr(const u8 *addr1, const u8 *addr2) +{ + const u16 *a = (const u16 *) addr1; + const u16 *b = (const u16 *) addr2; + return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0; +} + +In the above function, the reference to a[0] causes 2 bytes (16 bits) to +be read from memory starting at address addr1. Think about what would happen +if addr1 was an odd address such as 0x10003. (Hint: it'd be an unaligned +access.) + +Despite the potential unaligned access problems with the above function, it +is included in the kernel anyway but is understood to only work on +16-bit-aligned addresses. It is up to the caller to ensure this alignment or +not use this function at all. This alignment-unsafe function is still useful +as it is a decent optimization for the cases when you can ensure alignment, +which is true almost all of the time in the ethernet networking context. + + +Here is another example of some code that could cause unaligned accesses: + void myfunc(u8 *data, u32 value) + { + [...] + *((u32 *) data) = cpu_to_le32(value); + [...] + } + +This code will cause unaligned accesses every time the data parameter points +to an address that is not evenly divisible by 4. + +In summary, the 2 main scenarios where you may run into unaligned access +problems involve: + 1. Casting variables to types of different lengths + 2. Pointer arithmetic followed by access to at least 2 bytes of data + + +Avoiding unaligned accesses +=========================== + +The easiest way to avoid unaligned access is to use the get_unaligned() and +put_unaligned() macros provided by the <asm/unaligned.h> header file. + +Going back to an earlier example of code that potentially causes unaligned +access: + + void myfunc(u8 *data, u32 value) + { + [...] + *((u32 *) data) = cpu_to_le32(value); + [...] + } + +To avoid the unaligned memory access, you would rewrite it as follows: + + void myfunc(u8 *data, u32 value) + { + [...] + value = cpu_to_le32(value); + put_unaligned(value, (u32 *) data); + [...] + } + +The get_unaligned() macro works similarly. Assuming 'data' is a pointer to +memory and you wish to avoid unaligned access, its usage is as follows: + + u32 value = get_unaligned((u32 *) data); + +These macros work for memory accesses of any length (not just 32 bits as +in the examples above). Be aware that when compared to standard access of +aligned memory, using these macros to access unaligned memory can be costly in +terms of performance. + +If use of such macros is not convenient, another option is to use memcpy(), +where the source or destination (or both) are of type u8* or unsigned char*. +Due to the byte-wise nature of this operation, unaligned accesses are avoided. + +-- +Author: Daniel Drake <dsd@gentoo.org> +With help from: Alan Cox, Avuton Olrich, Heikki Orsila, Jan Engelhardt, +Johannes Berg, Kyle McMartin, Kyle Moffett, Randy Dunlap, Robert Hancock, +Uli Kunitz, Vadim Lobanov + diff --git a/Documentation/w1/masters/00-INDEX b/Documentation/w1/masters/00-INDEX index 752613c4cea..7b0ceaaad7a 100644 --- a/Documentation/w1/masters/00-INDEX +++ b/Documentation/w1/masters/00-INDEX @@ -4,3 +4,5 @@ ds2482 - The Maxim/Dallas Semiconductor DS2482 provides 1-wire busses. ds2490 - The Maxim/Dallas Semiconductor DS2490 builds USB <-> W1 bridges. +w1-gpio + - GPIO 1-wire bus master driver.
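As a companion to the memcpy() alternative mentioned at the end of unaligned-memory-access.txt above, here is a minimal sketch of the document's myfunc() example done with memcpy(); the my_put_le32()/my_get_le32() helper names are ours, only the technique comes from the document.

#include <linux/types.h>	/* u8, u32 */
#include <linux/string.h>	/* memcpy() */
#include <asm/byteorder.h>	/* cpu_to_le32(), le32_to_cpu() */

/* Store a little-endian u32 at a possibly unaligned address. The value
 * is built in a naturally aligned local variable and then copied out
 * byte-wise, so no multi-byte access to the unaligned address occurs. */
static void my_put_le32(u8 *data, u32 value)
{
	u32 le = cpu_to_le32(value);

	memcpy(data, &le, sizeof(le));
}

/* The matching load: copy into an aligned local first, then convert. */
static u32 my_get_le32(const u8 *data)
{
	u32 le;

	memcpy(&le, data, sizeof(le));
	return le32_to_cpu(le);
}

As with get_unaligned()/put_unaligned(), expect this to be slower than an ordinary aligned access on architectures that would otherwise use a single load or store.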
diff --git a/Documentation/w1/masters/w1-gpio b/Documentation/w1/masters/w1-gpio new file mode 100644 index 00000000000..af5d3b4aa85 --- /dev/null +++ b/Documentation/w1/masters/w1-gpio @@ -0,0 +1,33 @@ +Kernel driver w1-gpio +===================== + +Author: Ville Syrjala <syrjala@sci.fi> + + +Description +----------- + +GPIO 1-wire bus master driver. The driver uses the GPIO API to control the +wire, and the GPIO pin can be specified using platform data. + + +Example (mach-at91) +------------------- + +#include <linux/w1-gpio.h> + +static struct w1_gpio_platform_data foo_w1_gpio_pdata = { + .pin = AT91_PIN_PB20, + .is_open_drain = 1, +}; + +static struct platform_device foo_w1_device = { + .name = "w1-gpio", + .id = -1, + .dev.platform_data = &foo_w1_gpio_pdata, +}; + +... + at91_set_GPIO_periph(foo_w1_gpio_pdata.pin, 1); + at91_set_multi_drive(foo_w1_gpio_pdata.pin, 1); + platform_device_register(&foo_w1_device);
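A short usage aside on the w1-gpio example above: on a board whose GPIO controller cannot drive the pin open-drain, the platform data can leave is_open_drain at 0; as far as we understand the driver, it then drives the pin low for a 0 and releases it to an external pull-up (by switching the pin to input) for a 1. The GPIO number and the CONFIG_W1_MASTER_GPIO symbol below are assumptions for illustration, not part of this patch.

#include <linux/w1-gpio.h>
#include <linux/platform_device.h>

/* Hypothetical board without open-drain GPIOs; an external pull-up
 * resistor on the 1-wire line is required. The bus master driver is
 * assumed to be enabled via CONFIG_W1_MASTER_GPIO. */
static struct w1_gpio_platform_data bar_w1_gpio_pdata = {
	.pin		= 42,	/* invented GPIO number */
	.is_open_drain	= 0,
};

static struct platform_device bar_w1_device = {
	.name			= "w1-gpio",
	.id			= -1,
	.dev.platform_data	= &bar_w1_gpio_pdata,
};

...
	platform_device_register(&bar_w1_device);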