diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/cgroups/cgroups.txt (renamed from Documentation/cgroups.txt) | 0 | ||||
-rw-r--r-- | Documentation/cgroups/freezer-subsystem.txt | 99 | ||||
-rw-r--r-- | Documentation/controllers/memory.txt | 24 | ||||
-rw-r--r-- | Documentation/cpusets.txt | 2 | ||||
-rw-r--r-- | Documentation/filesystems/ext3.txt | 5 | ||||
-rw-r--r-- | Documentation/filesystems/proc.txt | 28 | ||||
-rw-r--r-- | Documentation/filesystems/ubifs.txt | 9 | ||||
-rw-r--r-- | Documentation/kernel-parameters.txt | 14 | ||||
-rw-r--r-- | Documentation/mtd/nand_ecc.txt | 714 | ||||
-rw-r--r-- | Documentation/sysrq.txt | 5 | ||||
-rw-r--r-- | Documentation/vm/unevictable-lru.txt | 615 |
11 files changed, 1492 insertions, 23 deletions
diff --git a/Documentation/cgroups.txt b/Documentation/cgroups/cgroups.txt index d9014aa0eb6..d9014aa0eb6 100644 --- a/Documentation/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt diff --git a/Documentation/cgroups/freezer-subsystem.txt b/Documentation/cgroups/freezer-subsystem.txt new file mode 100644 index 00000000000..c50ab58b72e --- /dev/null +++ b/Documentation/cgroups/freezer-subsystem.txt @@ -0,0 +1,99 @@ + The cgroup freezer is useful to batch job management system which start +and stop sets of tasks in order to schedule the resources of a machine +according to the desires of a system administrator. This sort of program +is often used on HPC clusters to schedule access to the cluster as a +whole. The cgroup freezer uses cgroups to describe the set of tasks to +be started/stopped by the batch job management system. It also provides +a means to start and stop the tasks composing the job. + + The cgroup freezer will also be useful for checkpointing running groups +of tasks. The freezer allows the checkpoint code to obtain a consistent +image of the tasks by attempting to force the tasks in a cgroup into a +quiescent state. Once the tasks are quiescent another task can +walk /proc or invoke a kernel interface to gather information about the +quiesced tasks. Checkpointed tasks can be restarted later should a +recoverable error occur. This also allows the checkpointed tasks to be +migrated between nodes in a cluster by copying the gathered information +to another node and restarting the tasks there. + + Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping +and resuming tasks in userspace. Both of these signals are observable +from within the tasks we wish to freeze. While SIGSTOP cannot be caught, +blocked, or ignored it can be seen by waiting or ptracing parent tasks. +SIGCONT is especially unsuitable since it can be caught by the task. Any +programs designed to watch for SIGSTOP and SIGCONT could be broken by +attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can +demonstrate this problem using nested bash shells: + + $ echo $$ + 16644 + $ bash + $ echo $$ + 16690 + + From a second, unrelated bash shell: + $ kill -SIGSTOP 16690 + $ kill -SIGCONT 16990 + + <at this point 16990 exits and causes 16644 to exit too> + + This happens because bash can observe both signals and choose how it +responds to them. + + Another example of a program which catches and responds to these +signals is gdb. In fact any program designed to use ptrace is likely to +have a problem with this method of stopping and resuming tasks. + + In contrast, the cgroup freezer uses the kernel freezer code to +prevent the freeze/unfreeze cycle from becoming visible to the tasks +being frozen. This allows the bash example above and gdb to run as +expected. + + The freezer subsystem in the container filesystem defines a file named +freezer.state. Writing "FROZEN" to the state file will freeze all tasks in the +cgroup. Subsequently writing "THAWED" will unfreeze the tasks in the cgroup. +Reading will return the current state. + +* Examples of usage : + + # mkdir /containers/freezer + # mount -t cgroup -ofreezer freezer /containers + # mkdir /containers/0 + # echo $some_pid > /containers/0/tasks + +to get status of the freezer subsystem : + + # cat /containers/0/freezer.state + THAWED + +to freeze all tasks in the container : + + # echo FROZEN > /containers/0/freezer.state + # cat /containers/0/freezer.state + FREEZING + # cat /containers/0/freezer.state + FROZEN + +to unfreeze all tasks in the container : + + # echo THAWED > /containers/0/freezer.state + # cat /containers/0/freezer.state + THAWED + +This is the basic mechanism which should do the right thing for user space task +in a simple scenario. + +It's important to note that freezing can be incomplete. In that case we return +EBUSY. This means that some tasks in the cgroup are busy doing something that +prevents us from completely freezing the cgroup at this time. After EBUSY, +the cgroup will remain partially frozen -- reflected by freezer.state reporting +"FREEZING" when read. The state will remain "FREEZING" until one of these +things happens: + + 1) Userspace cancels the freezing operation by writing "THAWED" to + the freezer.state file + 2) Userspace retries the freezing operation by writing "FROZEN" to + the freezer.state file (writing "FREEZING" is not legal + and returns EIO) + 3) The tasks that blocked the cgroup from entering the "FROZEN" + state disappear from the cgroup's set of tasks. diff --git a/Documentation/controllers/memory.txt b/Documentation/controllers/memory.txt index 9b53d582736..1c07547d3f8 100644 --- a/Documentation/controllers/memory.txt +++ b/Documentation/controllers/memory.txt @@ -112,14 +112,22 @@ the per cgroup LRU. 2.2.1 Accounting details -All mapped pages (RSS) and unmapped user pages (Page Cache) are accounted. -RSS pages are accounted at the time of page_add_*_rmap() unless they've already -been accounted for earlier. A file page will be accounted for as Page Cache; -it's mapped into the page tables of a process, duplicate accounting is carefully -avoided. Page Cache pages are accounted at the time of add_to_page_cache(). -The corresponding routines that remove a page from the page tables or removes -a page from Page Cache is used to decrement the accounting counters of the -cgroup. +All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. +(some pages which never be reclaimable and will not be on global LRU + are not accounted. we just accounts pages under usual vm management.) + +RSS pages are accounted at page_fault unless they've already been accounted +for earlier. A file page will be accounted for as Page Cache when it's +inserted into inode (radix-tree). While it's mapped into the page tables of +processes, duplicate accounting is carefully avoided. + +A RSS page is unaccounted when it's fully unmapped. A PageCache page is +unaccounted when it's removed from radix-tree. + +At page migration, accounting information is kept. + +Note: we just account pages-on-lru because our purpose is to control amount +of used pages. not-on-lru pages are tend to be out-of-control from vm view. 2.3 Shared Page Accounting diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index 47e568a9370..5c86c258c79 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt @@ -48,7 +48,7 @@ hooks, beyond what is already present, required to manage dynamic job placement on large systems. Cpusets use the generic cgroup subsystem described in -Documentation/cgroup.txt. +Documentation/cgroups/cgroups.txt. Requests by a task, using the sched_setaffinity(2) system call to include CPUs in its CPU affinity mask, and using the mbind(2) and diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 295f26cd895..9dd2a3bb2ac 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt @@ -96,6 +96,11 @@ errors=remount-ro(*) Remount the filesystem read-only on an error. errors=continue Keep going on a filesystem error. errors=panic Panic and halt the machine if an error occurs. +data_err=ignore(*) Just print an error message if an error occurs + in a file data buffer in ordered mode. +data_err=abort Abort the journal if an error occurs in a file + data buffer in ordered mode. + grpid Give objects the same group ID as their creator. bsdgroups diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index c032bf39e8b..bcceb99b81d 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -1384,15 +1384,18 @@ causes the kernel to prefer to reclaim dentries and inodes. dirty_background_ratio ---------------------- -Contains, as a percentage of total system memory, the number of pages at which -the pdflush background writeback daemon will start writing out dirty data. +Contains, as a percentage of the dirtyable system memory (free pages + mapped +pages + file cache, not including locked pages and HugePages), the number of +pages at which the pdflush background writeback daemon will start writing out +dirty data. dirty_ratio ----------------- -Contains, as a percentage of total system memory, the number of pages at which -a process which is generating disk writes will itself start writing out dirty -data. +Contains, as a percentage of the dirtyable system memory (free pages + mapped +pages + file cache, not including locked pages and HugePages), the number of +pages at which a process which is generating disk writes will itself start +writing out dirty data. dirty_writeback_centisecs ------------------------- @@ -2412,24 +2415,29 @@ will be dumped when the <pid> process is dumped. coredump_filter is a bitmask of memory types. If a bit of the bitmask is set, memory segments of the corresponding memory type are dumped, otherwise they are not dumped. -The following 4 memory types are supported: +The following 7 memory types are supported: - (bit 0) anonymous private memory - (bit 1) anonymous shared memory - (bit 2) file-backed private memory - (bit 3) file-backed shared memory - (bit 4) ELF header pages in file-backed private memory areas (it is effective only if the bit 2 is cleared) + - (bit 5) hugetlb private memory + - (bit 6) hugetlb shared memory Note that MMIO pages such as frame buffer are never dumped and vDSO pages are always dumped regardless of the bitmask status. -Default value of coredump_filter is 0x3; this means all anonymous memory -segments are dumped. + Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only + effected by bit 5-6. + +Default value of coredump_filter is 0x23; this means all anonymous memory +segments and hugetlb private memory are dumped. If you don't want to dump all shared memory segments attached to pid 1234, -write 1 to the process's proc file. +write 0x21 to the process's proc file. - $ echo 0x1 > /proc/1234/coredump_filter + $ echo 0x21 > /proc/1234/coredump_filter When a new process is created, the process inherits the bitmask status from its parent. It is useful to set up coredump_filter before the program runs. diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.txt index 6a0d70a22f0..dd84ea3c10d 100644 --- a/Documentation/filesystems/ubifs.txt +++ b/Documentation/filesystems/ubifs.txt @@ -86,6 +86,15 @@ norm_unmount (*) commit on unmount; the journal is committed fast_unmount do not commit on unmount; this option makes unmount faster, but the next mount slower because of the need to replay the journal. +bulk_read read more in one go to take advantage of flash + media that read faster sequentially +no_bulk_read (*) do not bulk-read +no_chk_data_crc skip checking of CRCs on data nodes in order to + improve read performance. Use this option only + if the flash media is highly reliable. The effect + of this option is that corruption of the contents + of a file can go unnoticed. +chk_data_crc (*) do not skip checking CRCs on data nodes Quick usage instructions diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index d4f4875fc7c..0f1544f6740 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -690,7 +690,7 @@ and is between 256 and 4096 characters. It is defined in the file See Documentation/block/as-iosched.txt and Documentation/block/deadline-iosched.txt for details. - elfcorehdr= [X86-32, X86_64] + elfcorehdr= [IA64,PPC,SH,X86-32,X86_64] Specifies physical address of start of kernel core image elf header. Generally kexec loader will pass this option to capture kernel. @@ -796,6 +796,8 @@ and is between 256 and 4096 characters. It is defined in the file Defaults to the default architecture's huge page size if not specified. + hlt [BUGS=ARM,SH] + i8042.debug [HW] Toggle i8042 debug mode i8042.direct [HW] Put keyboard port into non-translated mode i8042.dumbkbd [HW] Pretend that controller can only read data from @@ -1211,6 +1213,10 @@ and is between 256 and 4096 characters. It is defined in the file mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel memory. + memchunk=nn[KMG] + [KNL,SH] Allow user to override the default size for + per-device physically contiguous DMA buffers. + memmap=exactmap [KNL,X86-32,X86_64] Enable setting of an exact E820 memory map, as specified by the user. Such memmap=exactmap lines can be constructed based on @@ -1393,6 +1399,8 @@ and is between 256 and 4096 characters. It is defined in the file nodisconnect [HW,SCSI,M68K] Disables SCSI disconnects. + nodsp [SH] Disable hardware DSP at boot time. + noefi [X86-32,X86-64] Disable EFI runtime services support. noexec [IA-64] @@ -1409,13 +1417,15 @@ and is between 256 and 4096 characters. It is defined in the file noexec32=off: disable non-executable mappings read implies executable mappings + nofpu [SH] Disable hardware FPU at boot time. + nofxsr [BUGS=X86-32] Disables x86 floating point extended register save and restore. The kernel will only save legacy floating-point registers on task switch. noclflush [BUGS=X86] Don't use the CLFLUSH instruction - nohlt [BUGS=ARM] + nohlt [BUGS=ARM,SH] no-hlt [BUGS=X86-32] Tells the kernel that the hlt instruction doesn't work correctly and not to diff --git a/Documentation/mtd/nand_ecc.txt b/Documentation/mtd/nand_ecc.txt new file mode 100644 index 00000000000..bdf93b7f0f2 --- /dev/null +++ b/Documentation/mtd/nand_ecc.txt @@ -0,0 +1,714 @@ +Introduction +============ + +Having looked at the linux mtd/nand driver and more specific at nand_ecc.c +I felt there was room for optimisation. I bashed the code for a few hours +performing tricks like table lookup removing superfluous code etc. +After that the speed was increased by 35-40%. +Still I was not too happy as I felt there was additional room for improvement. + +Bad! I was hooked. +I decided to annotate my steps in this file. Perhaps it is useful to someone +or someone learns something from it. + + +The problem +=========== + +NAND flash (at least SLC one) typically has sectors of 256 bytes. +However NAND flash is not extremely reliable so some error detection +(and sometimes correction) is needed. + +This is done by means of a Hamming code. I'll try to explain it in +laymans terms (and apologies to all the pro's in the field in case I do +not use the right terminology, my coding theory class was almost 30 +years ago, and I must admit it was not one of my favourites). + +As I said before the ecc calculation is performed on sectors of 256 +bytes. This is done by calculating several parity bits over the rows and +columns. The parity used is even parity which means that the parity bit = 1 +if the data over which the parity is calculated is 1 and the parity bit = 0 +if the data over which the parity is calculated is 0. So the total +number of bits over the data over which the parity is calculated + the +parity bit is even. (see wikipedia if you can't follow this). +Parity is often calculated by means of an exclusive or operation, +sometimes also referred to as xor. In C the operator for xor is ^ + +Back to ecc. +Let's give a small figure: + +byte 0: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp4 ... rp14 +byte 1: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp2 rp4 ... rp14 +byte 2: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp4 ... rp14 +byte 3: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp4 ... rp14 +byte 4: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp5 ... rp14 +.... +byte 254: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp5 ... rp15 +byte 255: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp5 ... rp15 + cp1 cp0 cp1 cp0 cp1 cp0 cp1 cp0 + cp3 cp3 cp2 cp2 cp3 cp3 cp2 cp2 + cp5 cp5 cp5 cp5 cp4 cp4 cp4 cp4 + +This figure represents a sector of 256 bytes. +cp is my abbreviaton for column parity, rp for row parity. + +Let's start to explain column parity. +cp0 is the parity that belongs to all bit0, bit2, bit4, bit6. +so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even. +Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7. +cp2 is the parity over bit0, bit1, bit4 and bit5 +cp3 is the parity over bit2, bit3, bit6 and bit7. +cp4 is the parity over bit0, bit1, bit2 and bit3. +cp5 is the parity over bit4, bit5, bit6 and bit7. +Note that each of cp0 .. cp5 is exactly one bit. + +Row parity actually works almost the same. +rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254) +rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255) +rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ... +(so handle two bytes, then skip 2 bytes). +rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...) +for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc. +so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...) +and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, .. +The story now becomes quite boring. I guess you get the idea. +rp6 covers 8 bytes then skips 8 etc +rp7 skips 8 bytes then covers 8 etc +rp8 covers 16 bytes then skips 16 etc +rp9 skips 16 bytes then covers 16 etc +rp10 covers 32 bytes then skips 32 etc +rp11 skips 32 bytes then covers 32 etc +rp12 covers 64 bytes then skips 64 etc +rp13 skips 64 bytes then covers 64 etc +rp14 covers 128 bytes then skips 128 +rp15 skips 128 bytes then covers 128 + +In the end the parity bits are grouped together in three bytes as +follows: +ECC Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0 +ECC 0 rp07 rp06 rp05 rp04 rp03 rp02 rp01 rp00 +ECC 1 rp15 rp14 rp13 rp12 rp11 rp10 rp09 rp08 +ECC 2 cp5 cp4 cp3 cp2 cp1 cp0 1 1 + +I detected after writing this that ST application note AN1823 +(http://www.st.com/stonline/books/pdf/docs/10123.pdf) gives a much +nicer picture.(but they use line parity as term where I use row parity) +Oh well, I'm graphically challenged, so suffer with me for a moment :-) +And I could not reuse the ST picture anyway for copyright reasons. + + +Attempt 0 +========= + +Implementing the parity calculation is pretty simple. +In C pseudocode: +for (i = 0; i < 256; i++) +{ + if (i & 0x01) + rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; + else + rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; + if (i & 0x02) + rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3; + else + rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2; + if (i & 0x04) + rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5; + else + rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4; + if (i & 0x08) + rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7; + else + rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6; + if (i & 0x10) + rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9; + else + rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8; + if (i & 0x20) + rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11; + else + rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10; + if (i & 0x40) + rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13; + else + rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12; + if (i & 0x80) + rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15; + else + rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14; + cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0; + cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1; + cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2; + cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3 + cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4 + cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5 +} + + +Analysis 0 +========== + +C does have bitwise operators but not really operators to do the above +efficiently (and most hardware has no such instructions either). +Therefore without implementing this it was clear that the code above was +not going to bring me a Nobel prize :-) + +Fortunately the exclusive or operation is commutative, so we can combine +the values in any order. So instead of calculating all the bits +individually, let us try to rearrange things. +For the column parity this is easy. We can just xor the bytes and in the +end filter out the relevant bits. This is pretty nice as it will bring +all cp calculation out of the if loop. + +Similarly we can first xor the bytes for the various rows. +This leads to: + + +Attempt 1 +========= + +const char parity[256] = { + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0 +}; + +void ecc1(const unsigned char *buf, unsigned char *code) +{ + int i; + const unsigned char *bp = buf; + unsigned char cur; + unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; + unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; + unsigned char par; + + par = 0; + rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; + rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; + rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; + rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; + + for (i = 0; i < 256; i++) + { + cur = *bp++; + par ^= cur; + if (i & 0x01) rp1 ^= cur; else rp0 ^= cur; + if (i & 0x02) rp3 ^= cur; else rp2 ^= cur; + if (i & 0x04) rp5 ^= cur; else rp4 ^= cur; + if (i & 0x08) rp7 ^= cur; else rp6 ^= cur; + if (i & 0x10) rp9 ^= cur; else rp8 ^= cur; + if (i & 0x20) rp11 ^= cur; else rp10 ^= cur; + if (i & 0x40) rp13 ^= cur; else rp12 ^= cur; + if (i & 0x80) rp15 ^= cur; else rp14 ^= cur; + } + code[0] = + (parity[rp7] << 7) | + (parity[rp6] << 6) | + (parity[rp5] << 5) | + (parity[rp4] << 4) | + (parity[rp3] << 3) | + (parity[rp2] << 2) | + (parity[rp1] << 1) | + (parity[rp0]); + code[1] = + (parity[rp15] << 7) | + (parity[rp14] << 6) | + (parity[rp13] << 5) | + (parity[rp12] << 4) | + (parity[rp11] << 3) | + (parity[rp10] << 2) | + (parity[rp9] << 1) | + (parity[rp8]); + code[2] = + (parity[par & 0xf0] << 7) | + (parity[par & 0x0f] << 6) | + (parity[par & 0xcc] << 5) | + (parity[par & 0x33] << 4) | + (parity[par & 0xaa] << 3) | + (parity[par & 0x55] << 2); + code[0] = ~code[0]; + code[1] = ~code[1]; + code[2] = ~code[2]; +} + +Still pretty straightforward. The last three invert statements are there to +give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash +all data is 0xff, so the checksum then matches. + +I also introduced the parity lookup. I expected this to be the fastest +way to calculate the parity, but I will investigate alternatives later +on. + + +Analysis 1 +========== + +The code works, but is not terribly efficient. On my system it took +almost 4 times as much time as the linux driver code. But hey, if it was +*that* easy this would have been done long before. +No pain. no gain. + +Fortunately there is plenty of room for improvement. + +In step 1 we moved from bit-wise calculation to byte-wise calculation. +However in C we can also use the unsigned long data type and virtually +every modern microprocessor supports 32 bit operations, so why not try +to write our code in such a way that we process data in 32 bit chunks. + +Of course this means some modification as the row parity is byte by +byte. A quick analysis: +for the column parity we use the par variable. When extending to 32 bits +we can in the end easily calculate p0 and p1 from it. +(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0 +respectively) +also rp2 and rp3 can be easily retrieved from par as rp3 covers the +first two bytes and rp2 the last two bytes. + +Note that of course now the loop is executed only 64 times (256/4). +And note that care must taken wrt byte ordering. The way bytes are +ordered in a long is machine dependent, and might affect us. +Anyway, if there is an issue: this code is developed on x86 (to be +precise: a DELL PC with a D920 Intel CPU) + +And of course the performance might depend on alignment, but I expect +that the I/O buffers in the nand driver are aligned properly (and +otherwise that should be fixed to get maximum performance). + +Let's give it a try... + + +Attempt 2 +========= + +extern const char parity[256]; + +void ecc2(const unsigned char *buf, unsigned char *code) +{ + int i; + const unsigned long *bp = (unsigned long *)buf; + unsigned long cur; + unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; + unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; + unsigned long par; + + par = 0; + rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; + rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; + rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; + rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; + + for (i = 0; i < 64; i++) + { + cur = *bp++; + par ^= cur; + if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; + if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; + if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; + if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; + if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; + if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; + } + /* + we need to adapt the code generation for the fact that rp vars are now + long; also the column parity calculation needs to be changed. + we'll bring rp4 to 15 back to single byte entities by shifting and + xoring + */ + rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff; + rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff; + rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff; + rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff; + rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff; + rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff; + rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff; + rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff; + rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff; + rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff; + rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff; + rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff; + rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff; + rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff; + par ^= (par >> 16); + rp1 = (par >> 8); rp1 &= 0xff; + rp0 = (par & 0xff); + par ^= (par >> 8); par &= 0xff; + + code[0] = + (parity[rp7] << 7) | + (parity[rp6] << 6) | + (parity[rp5] << 5) | + (parity[rp4] << 4) | + (parity[rp3] << 3) | + (parity[rp2] << 2) | + (parity[rp1] << 1) | + (parity[rp0]); + code[1] = + (parity[rp15] << 7) | + (parity[rp14] << 6) | + (parity[rp13] << 5) | + (parity[rp12] << 4) | + (parity[rp11] << 3) | + (parity[rp10] << 2) | + (parity[rp9] << 1) | + (parity[rp8]); + code[2] = + (parity[par & 0xf0] << 7) | + (parity[par & 0x0f] << 6) | + (parity[par & 0xcc] << 5) | + (parity[par & 0x33] << 4) | + (parity[par & 0xaa] << 3) | + (parity[par & 0x55] << 2); + code[0] = ~code[0]; + code[1] = ~code[1]; + code[2] = ~code[2]; +} + +The parity array is not shown any more. Note also that for these +examples I kinda deviated from my regular programming style by allowing +multiple statements on a line, not using { } in then and else blocks +with only a single statement and by using operators like ^= + + +Analysis 2 +========== + +The code (of course) works, and hurray: we are a little bit faster than +the linux driver code (about 15%). But wait, don't cheer too quickly. +THere is more to be gained. +If we look at e.g. rp14 and rp15 we see that we either xor our data with +rp14 or with rp15. However we also have par which goes over all data. +This means there is no need to calculate rp14 as it can be calculated from +rp15 through rp14 = par ^ rp15; +(or if desired we can avoid calculating rp15 and calculate it from +rp14). That is why some places refer to inverse parity. +Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13. +Effectively this means we can eliminate the else clause from the if +statements. Also we can optimise the calculation in the end a little bit +by going from long to byte first. Actually we can even avoid the table +lookups + +Attempt 3 +========= + +Odd replaced: + if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; + if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; + if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; + if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; + if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; + if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; +with + if (i & 0x01) rp5 ^= cur; + if (i & 0x02) rp7 ^= cur; + if (i & 0x04) rp9 ^= cur; + if (i & 0x08) rp11 ^= cur; + if (i & 0x10) rp13 ^= cur; + if (i & 0x20) rp15 ^= cur; + + and outside the loop added: + rp4 = par ^ rp5; + rp6 = par ^ rp7; + rp8 = par ^ rp9; + rp10 = par ^ rp11; + rp12 = par ^ rp13; + rp14 = par ^ rp15; + +And after that the code takes about 30% more time, although the number of +statements is reduced. This is also reflected in the assembly code. + + +Analysis 3 +========== + +Very weird. Guess it has to do with caching or instruction parallellism +or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting +observation was that this one is only 30% slower (according to time) +executing the code as my 3Ghz D920 processor. + +Well, it was expected not to be easy so maybe instead move to a +different track: let's move back to the code from attempt2 and do some +loop unrolling. This will eliminate a few if statements. I'll try +different amounts of unrolling to see what works best. + + +Attempt 4 +========= + +Unrolled the loop 1, 2, 3 and 4 times. +For 4 the code starts with: + + for (i = 0; i < 4; i++) + { + cur = *bp++; + par ^= cur; + rp4 ^= cur; + rp6 ^= cur; + rp8 ^= cur; + rp10 ^= cur; + if (i & 0x1) rp13 ^= cur; else rp12 ^= cur; + if (i & 0x2) rp15 ^= cur; else rp14 ^= cur; + cur = *bp++; + par ^= cur; + rp5 ^= cur; + rp6 ^= cur; + ... + + +Analysis 4 +========== + +Unrolling once gains about 15% +Unrolling twice keeps the gain at about 15% +Unrolling three times gives a gain of 30% compared to attempt 2. +Unrolling four times gives a marginal improvement compared to unrolling +three times. + +I decided to proceed with a four time unrolled loop anyway. It was my gut +feeling that in the next steps I would obtain additional gain from it. + +The next step was triggered by the fact that par contains the xor of all +bytes and rp4 and rp5 each contain the xor of half of the bytes. +So in effect par = rp4 ^ rp5. But as xor is commutative we can also say +that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can +eliminate rp5 (or rp4, but I already foresaw another optimisation). +The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15. + + +Attempt 5 +========= + +Effectively so all odd digit rp assignments in the loop were removed. +This included the else clause of the if statements. +Of course after the loop we need to correct things by adding code like: + rp5 = par ^ rp4; +Also the initial assignments (rp5 = 0; etc) could be removed. +Along the line I also removed the initialisation of rp0/1/2/3. + + +Analysis 5 +========== + +Measurements showed this was a good move. The run-time roughly halved +compared with attempt 4 with 4 times unrolled, and we only require 1/3rd +of the processor time compared to the current code in the linux kernel. + +However, still I thought there was more. I didn't like all the if +statements. Why not keep a running parity and only keep the last if +statement. Time for yet another version! + + +Attempt 6 +========= + +THe code within the for loop was changed to: + + for (i = 0; i < 4; i++) + { + cur = *bp++; tmppar = cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; + + cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; + + cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur; + cur = *bp++; tmppar ^= cur; rp8 ^= cur; + + cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + |