Diffstat (limited to 'Documentation/vm')
-rw-r--r--  Documentation/vm/00-INDEX | 26
-rw-r--r--  Documentation/vm/Makefile | 8
-rw-r--r--  Documentation/vm/active_mm.txt | 2
-rw-r--r--  Documentation/vm/cleancache.txt | 279
-rw-r--r--  Documentation/vm/frontswap.txt | 278
-rw-r--r--  Documentation/vm/highmem.txt | 162
-rw-r--r--  Documentation/vm/hugetlbpage.txt | 194
-rw-r--r--  Documentation/vm/hwpoison.txt | 11
-rw-r--r--  Documentation/vm/ksm.txt | 15
-rw-r--r--  Documentation/vm/locking | 130
-rw-r--r--  Documentation/vm/map_hugetlb.c | 77
-rw-r--r--  Documentation/vm/numa | 186
-rw-r--r--  Documentation/vm/numa_memory_policy.txt | 11
-rw-r--r--  Documentation/vm/overcommit-accounting | 17
-rw-r--r--  Documentation/vm/page-types.c | 1003
-rw-r--r--  Documentation/vm/pagemap.txt | 11
-rw-r--r--  Documentation/vm/remap_file_pages.txt | 28
-rw-r--r--  Documentation/vm/slabinfo.c | 1364
-rw-r--r--  Documentation/vm/slub.txt | 10
-rw-r--r--  Documentation/vm/soft-dirty.txt | 43
-rw-r--r--  Documentation/vm/split_page_table_lock | 94
-rw-r--r--  Documentation/vm/transhuge.txt | 376
-rw-r--r--  Documentation/vm/unevictable-lru.txt | 27
-rw-r--r--  Documentation/vm/zswap.txt | 68
24 files changed, 1579 insertions, 2841 deletions
diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index e57d6a9dd32..081c49777ab 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -4,12 +4,18 @@ active_mm.txt
- An explanation from Linus about tsk->active_mm vs tsk->mm.
balance
- various information on memory balancing.
+cleancache.txt
+ - Intro to cleancache and page-granularity victim cache.
+frontswap.txt
+ - Outline of frontswap, part of the transcendent memory frontend.
+highmem.txt
+ - Outline of highmem and common issues.
hugetlbpage.txt
- a brief summary of hugetlbpage support in the Linux kernel.
+hwpoison.txt
+ - explains what hwpoison is
ksm.txt
- how to use the Kernel Samepage Merging feature.
-locking
- - info on how locking and synchronization is done in the Linux vm code.
numa
- information about NUMA specific code in the Linux vm.
numa_memory_policy.txt
@@ -18,9 +24,17 @@ overcommit-accounting
- description of the Linux kernels overcommit handling modes.
page_migration
- description of page migration in NUMA systems.
-slabinfo.c
- - source code for a tool to get reports about slabs.
+pagemap.txt
+ - pagemap, from the userspace perspective
slub.txt
- a short users guide for SLUB.
-map_hugetlb.c
- - an example program that uses the MAP_HUGETLB mmap flag.
+soft-dirty.txt
+ - short explanation for soft-dirty PTEs
+split_page_table_lock
+ - Separate per-table lock to improve scalability of the old page_table_lock.
+transhuge.txt
+ - Transparent Hugepage Support, alternative way of using hugepages.
+unevictable-lru.txt
+ - Unevictable LRU infrastructure
+zswap.txt
+ - Intro to compressed cache for swap pages
diff --git a/Documentation/vm/Makefile b/Documentation/vm/Makefile
deleted file mode 100644
index 5bd269b3731..00000000000
--- a/Documentation/vm/Makefile
+++ /dev/null
@@ -1,8 +0,0 @@
-# kbuild trick to avoid linker error. Can be omitted if a module is built.
-obj- := dummy.o
-
-# List of programs to build
-hostprogs-y := slabinfo page-types
-
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
diff --git a/Documentation/vm/active_mm.txt b/Documentation/vm/active_mm.txt
index 4ee1f643d89..dbf45817405 100644
--- a/Documentation/vm/active_mm.txt
+++ b/Documentation/vm/active_mm.txt
@@ -74,7 +74,7 @@ we have a user context", and is generally done by the page fault handler
and things like that).
Anyway, I put a pre-patch-2.3.13-1 on ftp.kernel.org just a moment ago,
-because it slightly changes the interfaces to accomodate the alpha (who
+because it slightly changes the interfaces to accommodate the alpha (who
would have thought it, but the alpha actually ends up having one of the
ugliest context switch codes - unlike the other architectures where the MM
and register state is separate, the alpha PALcode joins the two, and you
diff --git a/Documentation/vm/cleancache.txt b/Documentation/vm/cleancache.txt
new file mode 100644
index 00000000000..142fbb0f325
--- /dev/null
+++ b/Documentation/vm/cleancache.txt
@@ -0,0 +1,279 @@
+MOTIVATION
+
+Cleancache is a new optional feature provided by the VFS layer that
+potentially dramatically increases page cache effectiveness for
+many workloads in many environments at a negligible cost.
+
+Cleancache can be thought of as a page-granularity victim cache for clean
+pages that the kernel's pageframe replacement algorithm (PFRA) would like
+to keep around, but can't since there isn't enough memory. So when the
+PFRA "evicts" a page, it first attempts to use cleancache code to
+put the data contained in that page into "transcendent memory", memory
+that is not directly accessible or addressable by the kernel and is
+of unknown and possibly time-varying size.
+
+Later, when a cleancache-enabled filesystem wishes to access a page
+in a file on disk, it first checks cleancache to see if it already
+contains it; if it does, the page of data is copied into the kernel
+and a disk access is avoided.
+
+Transcendent memory "drivers" for cleancache are currently implemented
+in Xen (using hypervisor memory) and zcache (using in-kernel compressed
+memory) and other implementations are in development.
+
+FAQs are included below.
+
+IMPLEMENTATION OVERVIEW
+
+A cleancache "backend" that provides transcendent memory registers itself
+to the kernel's cleancache "frontend" by calling cleancache_register_ops,
+passing a pointer to a cleancache_ops structure with funcs set appropriately.
+Note that cleancache_register_ops returns the previous settings so that
+chaining can be performed if desired. The functions provided must conform to
+certain semantics as follows:
+
+Most important, cleancache is "ephemeral". Pages which are copied into
+cleancache have an indefinite lifetime which is completely unknowable
+by the kernel and so may or may not still be in cleancache at any later time.
+Thus, as its name implies, cleancache is not suitable for dirty pages.
+Cleancache has complete discretion over what pages to preserve and what
+pages to discard and when.
+
+Mounting a cleancache-enabled filesystem should call "init_fs" to obtain a
+pool id which, if positive, must be saved in the filesystem's superblock;
+a negative return value indicates failure. A "put_page" will copy a
+(presumably about-to-be-evicted) page into cleancache and associate it with
+the pool id, a file key, and a page index into the file. (The combination
+of a pool id, a file key, and an index is sometimes called a "handle".)
+A "get_page" will copy the page, if found, from cleancache into kernel memory.
+An "invalidate_page" will ensure the page no longer is present in cleancache;
+an "invalidate_inode" will invalidate all pages associated with the specified
+file; and, when a filesystem is unmounted, an "invalidate_fs" will invalidate
+all pages in all files specified by the given pool id and also surrender
+the pool id.
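+
+As an illustration, a backend might register itself roughly as follows.
+This is a sketch only: the "ramdrv_*" handlers are hypothetical and not
+shown here, and the cleancache_ops definition in the kernel headers
+(include/linux/cleancache.h) is authoritative for the exact prototypes:
+
+    static struct cleancache_ops ramdrv_cleancache_ops = {
+        .init_fs          = ramdrv_init_fs,        /* returns pool id, or < 0 on failure */
+        .init_shared_fs   = ramdrv_init_shared_fs, /* pool keyed by a 128-bit UUID */
+        .put_page         = ramdrv_put_page,       /* copy an evicted clean page in */
+        .get_page         = ramdrv_get_page,       /* copy a page back out, if still present */
+        .invalidate_page  = ramdrv_invalidate_page,
+        .invalidate_inode = ramdrv_invalidate_inode,
+        .invalidate_fs    = ramdrv_invalidate_fs,
+    };
+
+    static int __init ramdrv_init(void)
+    {
+        /* the previously registered ops are returned so that a backend
+         * can chain to them if desired */
+        struct cleancache_ops old_ops =
+            cleancache_register_ops(&ramdrv_cleancache_ops);
+
+        return 0;
+    }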
+
+An "init_shared_fs", like init_fs, obtains a pool id but tells cleancache
+to treat the pool as shared using a 128-bit UUID as a key. On systems
+that may run multiple kernels (such as hard partitioned or virtualized
+systems) that may share a clustered filesystem, and where cleancache
+may be shared among those kernels, calls to init_shared_fs that specify the
+same UUID will receive the same pool id, thus allowing the pages to
+be shared. Note that any security requirements must be imposed outside
+of the kernel (e.g. by "tools" that control cleancache). Or a
+cleancache implementation can simply disable shared_init by always
+returning a negative value.
+
+If a get_page is successful on a non-shared pool, the page is invalidated
+(thus making cleancache an "exclusive" cache). On a shared pool, the page
+is NOT invalidated on a successful get_page so that it remains accessible to
+other sharers. The kernel is responsible for ensuring coherency between
+cleancache (shared or not), the page cache, and the filesystem, using
+cleancache invalidate operations as required.
+
+Note that cleancache must enforce put-put-get coherency and get-get
+coherency. For the former, if two puts are made to the same handle but
+with different data, say AAA by the first put and BBB by the second, a
+subsequent get can never return the stale data (AAA). For get-get coherency,
+if a get for a given handle fails, subsequent gets for that handle will
+never succeed unless preceded by a successful put with that handle.
+
+Last, cleancache provides no SMP serialization guarantees; if two
+different Linux threads are simultaneously putting and invalidating a page
+with the same handle, the results are indeterminate. Callers must
+lock the page to ensure serial behavior.
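+
+For example, the VFS hook that runs when a page leaves the page cache does
+roughly the following, with the page lock already held by the caller (a
+simplified sketch of the logic in mm/filemap.c, not the literal code):
+
+    if (PageUptodate(page) && PageMappedToDisk(page))
+        cleancache_put_page(page);                 /* preserve the clean, up-to-date data */
+    else
+        cleancache_invalidate_page(mapping, page); /* never leave stale data behind */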
+
+CLEANCACHE PERFORMANCE METRICS
+
+If properly configured, monitoring of cleancache is done via debugfs in
+the /sys/kernel/debug/mm/cleancache directory. The effectiveness of cleancache
+can be measured (across all filesystems) with:
+
+succ_gets - number of gets that were successful
+failed_gets - number of gets that failed
+puts - number of puts attempted (all "succeed")
+invalidates - number of invalidates attempted
+
+A backend implementation may provide additional metrics.
+
+FAQ
+
+1) Where's the value? (Andrew Morton)
+
+Cleancache provides a significant performance benefit to many workloads
+in many environments with negligible overhead by improving the
+effectiveness of the pagecache. Clean pagecache pages are
+saved in transcendent memory (RAM that is otherwise not directly
+addressable to the kernel); fetching those pages later avoids "refaults"
+and thus disk reads.
+
+Cleancache (and its sister code "frontswap") provide interfaces for
+this transcendent memory (aka "tmem"), which conceptually lies between
+fast kernel-directly-addressable RAM and slower DMA/asynchronous devices.
+Disallowing direct kernel or userland reads/writes to tmem
+is ideal when data is transformed to a different form and size (such
+as with compression) or secretly moved (as might be useful for write-
+balancing for some RAM-like devices). Evicted page-cache pages (and
+swap pages) are a great use for this kind of slower-than-RAM-but-much-
+faster-than-disk transcendent memory, and the cleancache (and frontswap)
+"page-object-oriented" specification provides a nice way to read and
+write -- and indirectly "name" -- the pages.
+
+In the virtual case, the whole point of virtualization is to statistically
+multiplex physical resources across the varying demands of multiple
+virtual machines. This is really hard to do with RAM and efforts to
+do it well with no kernel change have essentially failed (except in some
+well-publicized special-case workloads). Cleancache -- and frontswap --
+with a fairly small impact on the kernel, provide a huge amount
+of flexibility for more dynamic, flexible RAM multiplexing.
+Specifically, the Xen Transcendent Memory backend allows otherwise
+"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
+virtual machines, but the pages can be compressed and deduplicated to
+optimize RAM utilization. And when guest OS's are induced to surrender
+underutilized RAM (e.g. with "self-ballooning"), page cache pages
+are the first to go, and cleancache allows those pages to be
+saved and reclaimed if overall host system memory conditions allow.
+
+And the identical interface used for cleancache can be used in
+physical systems as well. The zcache driver acts as a memory-hungry
+device that stores pages of data in a compressed state. And
+the proposed "RAMster" driver shares RAM across multiple physical
+systems.
+
+2) Why does cleancache have its sticky fingers so deep inside the
+ filesystems and VFS? (Andrew Morton and Christoph Hellwig)
+
+The core hooks for cleancache in VFS are in most cases a single line
+and the minimum set are placed precisely where needed to maintain
+coherency (via cleancache_invalidate operations) between cleancache,
+the page cache, and disk. All hooks compile into nothingness if
+cleancache is config'ed off and turn into a function-pointer-
+compare-to-NULL if config'ed on but no backend claims the ops
+functions, or to a compare-struct-element-to-negative if a
+backend claims the ops functions but a filesystem doesn't enable
+cleancache.
+
+Some filesystems are built entirely on top of VFS and the hooks
+in VFS are sufficient, so don't require an "init_fs" hook; the
+initial implementation of cleancache didn't provide this hook.
+But for some filesystems (such as btrfs), the VFS hooks are
+incomplete and one or more hooks in fs-specific code are required.
+And for some other filesystems, such as tmpfs, cleancache may
+be counterproductive. So it seemed prudent to require a filesystem
+to "opt in" to use cleancache, which requires adding a hook in
+each filesystem. Not all filesystems are supported by cleancache
+only because they haven't been tested. The existing set should
+be sufficient to validate the concept, the opt-in approach means
+that untested filesystems are not affected, and the hooks in the
+existing filesystems should make it very easy to add more
+filesystems in the future.
+
+The total impact of the hooks to existing fs and mm files is only
+about 40 lines added (not counting comments and blank lines).
+
+3) Why not make cleancache asynchronous and batched so it can
+ more easily interface with real devices with DMA instead
+ of copying each individual page? (Minchan Kim)
+
+The one-page-at-a-time copy semantics simplifies the implementation
+on both the frontend and backend and also allows the backend to
+do fancy things on-the-fly like page compression and
+page deduplication. And since the data is "gone" (copied into/out
+of the pageframe) before the cleancache get/put call returns,
+a great deal of race conditions and potential coherency issues
+are avoided. While the interface seems odd for a "real device"
+or for real kernel-addressable RAM, it makes perfect sense for
+transcendent memory.
+
+4) Why is non-shared cleancache "exclusive"? And where is the
+ page "invalidated" after a "get"? (Minchan Kim)
+
+The main reason is to free up space in transcendent memory and
+to avoid unnecessary cleancache_invalidate calls. If you want inclusive,
+the page can be "put" immediately following the "get". If
+put-after-get for inclusive becomes common, the interface could
+be easily extended to add a "get_no_invalidate" call.
+
+The invalidate is done by the cleancache backend implementation.
+
+5) What's the performance impact?
+
+Performance analysis has been presented at OLS'09 and LCA'10.
+Briefly, performance gains can be significant on most workloads,
+especially when memory pressure is high (e.g. when RAM is
+overcommitted in a virtual workload); and because the hooks are
+invoked primarily in place of or in addition to a disk read/write,
+overhead is negligible even in worst case workloads. Basically
+cleancache replaces I/O with memory-copy-CPU-overhead; on older
+single-core systems with slow memory-copy speeds, cleancache
+has little value, but in newer multicore machines, especially
+consolidated/virtualized machines, it has great value.
+
+6) How do I add cleancache support for filesystem X? (Boaz Harrash)
+
+Filesystems that are well-behaved and conform to certain
+restrictions can utilize cleancache simply by making a call to
+cleancache_init_fs at mount time. Unusual, misbehaving, or
+poorly layered filesystems must either add additional hooks
+and/or undergo extensive additional testing... or should just
+not enable the optional cleancache.
+
+Some points for a filesystem to consider:
+
+- The FS should be block-device-based (e.g. a ram-based FS such
+ as tmpfs should not enable cleancache)
+- To ensure coherency/correctness, the FS must ensure that all
+ file removal or truncation operations either go through VFS or
+ add hooks to do the equivalent cleancache "invalidate" operations
+- To ensure coherency/correctness, either inode numbers must
+ be unique across the lifetime of the on-disk file OR the
+ FS must provide an "encode_fh" function.
+- The FS must call the VFS superblock alloc and deactivate routines
+ or add hooks to do the equivalent cleancache calls done there.
+- To maximize performance, all pages fetched from the FS should
+ go through the do_mpage_readpage routine or the FS should add
+ hooks to do the equivalent (cf. btrfs)
+- Currently, the FS blocksize must be the same as PAGESIZE. This
+ is not an architectural restriction, but no backends currently
+ support anything different.
+- A clustered FS should invoke the "shared_init_fs" cleancache
+ hook to get best performance for some backends.
+
+7) Why not use the KVA of the inode as the key? (Christoph Hellwig)
+
+If cleancache would use the inode virtual address instead of
+inode/filehandle, the pool id could be eliminated. But, this
+won't work because cleancache retains pagecache data pages
+persistently even when the inode has been pruned from the
+inode unused list, and only invalidates the data page if the file
+gets removed/truncated. So if cleancache used the inode kva,
+there would be potential coherency issues if/when the inode
+kva is reused for a different file. Alternately, if cleancache
+invalidated the pages when the inode kva was freed, much of the value
+of cleancache would be lost because the cache of pages in cleancache
+is potentially much larger than the kernel pagecache and is most
+useful if the pages survive inode cache removal.
+
+8) Why is a global variable required?
+
+The cleancache_enabled flag is checked in all of the frequently-used
+cleancache hooks. The alternative is a function call to check a static
+variable. Since cleancache is enabled dynamically at runtime, systems
+that don't enable cleancache would suffer thousands (possibly
+tens-of-thousands) of unnecessary function calls per second. So the
+global variable allows cleancache to be enabled by default at compile
+time, but have insignificant performance impact when cleancache remains
+disabled at runtime.
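+
+The hook pattern, simplified from the kernel's cleancache header, is roughly:
+
+    static inline void cleancache_put_page(struct page *page)
+    {
+        if (cleancache_enabled)
+            __cleancache_put_page(page);
+    }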
+
+9) Does cleancache work with KVM?
+
+The memory model of KVM is sufficiently different that a cleancache
+backend may have less value for KVM. This remains to be tested,
+especially in an overcommitted system.
+
+10) Does cleancache work in userspace? It sounds useful for
+ memory hungry caches like web browsers. (Jamie Lokier)
+
+No plans yet, though we agree it sounds useful, at least for
+apps that bypass the page cache (e.g. O_DIRECT).
+
+Last updated: Dan Magenheimer, April 13 2011
diff --git a/Documentation/vm/frontswap.txt b/Documentation/vm/frontswap.txt
new file mode 100644
index 00000000000..c71a019be60
--- /dev/null
+++ b/Documentation/vm/frontswap.txt
@@ -0,0 +1,278 @@
+Frontswap provides a "transcendent memory" interface for swap pages.
+In some environments, dramatic performance savings may be obtained because
+swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.
+
+(Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends"
+and the only necessary changes to the core kernel for transcendent memory;
+all other supporting code -- the "backends" -- is implemented as drivers.
+See the LWN.net article "Transcendent memory in a nutshell" for a detailed
+overview of frontswap and related kernel parts:
+https://lwn.net/Articles/454795/ )
+
+Frontswap is so named because it can be thought of as the opposite of
+a "backing" store for a swap device. The storage is assumed to be
+a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
+to the requirements of transcendent memory (such as Xen's "tmem", or
+in-kernel compressed memory, aka "zcache", or future RAM-like devices);
+this pseudo-RAM device is not directly accessible or addressable by the
+kernel and is of unknown and possibly time-varying size. The driver
+links itself to frontswap by calling frontswap_register_ops to set the
+frontswap_ops funcs appropriately and the functions it provides must
+conform to certain policies as follows:
+
+An "init" prepares the device to receive frontswap pages associated
+with the specified swap device number (aka "type"). A "store" will
+copy the page to transcendent memory and associate it with the type and
+offset associated with the page. A "load" will copy the page, if found,
+from transcendent memory into kernel memory, but will NOT remove the page
+from transcendent memory. An "invalidate_page" will remove the page
+from transcendent memory and an "invalidate_area" will remove ALL pages
+associated with the swap type (e.g., like swapoff) and notify the "device"
+to refuse further stores with that swap type.
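+
+As an illustration, a backend might register itself roughly as follows
+(sketch only: the "zbackend_*" handlers are hypothetical and not shown,
+and the frontswap_ops definition in the kernel headers
+(include/linux/frontswap.h) is authoritative for the exact prototypes):
+
+    static struct frontswap_ops zbackend_frontswap_ops = {
+        .init            = zbackend_init,            /* called at swapon time */
+        .store           = zbackend_store,           /* returns 0 on success, nonzero to reject */
+        .load            = zbackend_load,            /* returns 0 if found and copied back */
+        .invalidate_page = zbackend_invalidate_page,
+        .invalidate_area = zbackend_invalidate_area, /* e.g. at swapoff time */
+    };
+
+    /* typically from the backend's init code; the previously registered
+     * ops are returned so they can be chained if desired */
+    frontswap_register_ops(&zbackend_frontswap_ops);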
+
+Once a page is successfully stored, a matching load on the page will normally
+succeed. So when the kernel finds itself in a situation where it needs
+to swap out a page, it first attempts to use frontswap. If the store returns
+success, the data has been successfully saved to transcendent memory and
+a disk write and, if the data is later read back, a disk read are avoided.
+If a store returns failure, transcendent memory has rejected the data, and the
+page can be written to swap as usual.
+
+If a backend chooses, frontswap can be configured as a "writethrough
+cache" by calling frontswap_writethrough(). In this mode, the reduction
+in swap device writes is lost (and also a non-trivial performance advantage)
+in order to allow the backend to arbitrarily "reclaim" space used to
+store frontswap pages to more completely manage its memory usage.
+
+Note that if a page is stored and the page already exists in transcendent memory
+(a "duplicate" store), either the store succeeds and the data is overwritten,
+or the store fails AND the page is invalidated. This ensures stale data may
+never be obtained from frontswap.
+
+If properly configured, monitoring of frontswap is done via debugfs in
+the /sys/kernel/debug/frontswap directory. The effectiveness of
+frontswap can be measured (across all swap devices) with:
+
+failed_stores - how many store attempts have failed
+loads - how many loads were attempted (all should succeed)
+succ_stores - how many store attempts have succeeded
+invalidates - how many invalidates were attempted
+
+A backend implementation may provide additional metrics.
+
+FAQ
+
+1) Where's the value?
+
+When a workload starts swapping, performance falls through the floor.
+Frontswap significantly increases performance in many such workloads by
+providing a clean, dynamic interface to read and write swap pages to
+"transcendent memory" that is otherwise not directly addressable to the kernel.
+This interface is ideal when data is transformed to a different form
+and size (such as with compression) or secretly moved (as might be
+useful for write-balancing for some RAM-like devices). Swap pages (and
+evicted page-cache pages) are a great use for this kind of slower-than-RAM-
+but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and
+cleancache) interface to transcendent memory provides a nice way to read
+and write -- and indirectly "name" -- the pages.
+
+Frontswap -- and cleancache -- with a fairly small impact on the kernel,
+provides a huge amount of flexibility for more dynamic, flexible RAM
+utilization in various system configurations:
+
+In the single kernel case, aka "zcache", pages are compressed and
+stored in local memory, thus increasing the total anonymous pages
+that can be safely kept in RAM. Zcache essentially trades off CPU
+cycles used in compression/decompression for better memory utilization.
+Benchmarks have shown little or no impact when memory pressure is
+low while providing a significant performance improvement (25%+)
+on some workloads under high memory pressure.
+
+"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
+support for clustered systems. Frontswap pages are locally compressed
+as in zcache, but then "remotified" to another system's RAM. This
+allows RAM to be dynamically load-balanced back-and-forth as needed,
+i.e. when system A is overcommitted, it can swap to system B, and
+vice versa. RAMster can also be configured as a memory server so
+many servers in a cluster can swap, dynamically as needed, to a single
+server configured with a large amount of RAM... without pre-configuring
+how much of the RAM is available for each of the clients!
+
+In the virtual case, the whole point of virtualization is to statistically
+multiplex physical resources across the varying demands of multiple
+virtual machines. This is really hard to do with RAM and efforts to do
+it well with no kernel changes have essentially failed (except in some
+well-publicized special-case workloads).
+Specifically, the Xen Transcendent Memory backend allows otherwise
+"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
+virtual machines, but the pages can be compressed and deduplicated to
+optimize RAM utilization. And when guest OS's are induced to surrender
+underutilized RAM (e.g. with "selfballooning"), sudden unexpected
+memory pressure may result in swapping; frontswap allows those pages
+to be swapped to and from hypervisor RAM (if overall host system memory
+conditions allow), thus mitigating the potentially awful performance impact
+of unplanned swapping.
+
+A KVM implementation is underway and has been RFC'ed to lkml. And,
+using frontswap, investigation is also underway on the use of NVM as
+a memory extension technology.
+
+2) Sure there may be performance advantages in some situations, but
+ what's the space/time overhead of frontswap?
+
+If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
+nothingness and the only overhead is a few extra bytes per swapon'ed
+swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
+registers, there is one extra comparison of a global variable against
+zero for every swap page read or written. If CONFIG_FRONTSWAP is enabled
+AND a frontswap backend registers AND the backend fails every "store"
+request (i.e. provides no memory despite claiming it might),
+CPU overhead is still negligible -- and since every frontswap fail
+precedes a swap page write-to-disk, the system is highly likely
+to be I/O bound and using a small fraction of a percent of a CPU
+will be irrelevant anyway.
+
+As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
+registers, one bit is allocated for every swap page for every swap
+device that is swapon'd. This is added to the EIGHT bits (which
+were sixteen until about 2.6.34) that the kernel already allocates
+for every swap page for every swap device that is swapon'd. (Hugh
+Dickins has observed that frontswap could probably steal one of
+the existing eight bits, but let's worry about that minor optimization
+later.) For very large swap disks (which are rare) on a standard
+4K pagesize, this is 1MB per 32GB swap.
+
+When swap pages are stored in transcendent memory instead of written
+out to disk, there is a side effect that this may create more memory
+pressure that can potentially outweigh the other advantages. A
+backend, such as zcache, must implement policies to carefully (but
+dynamically) manage memory limits to ensure this doesn't happen.
+
+3) OK, how about a quick overview of what this frontswap patch does
+ in terms that a kernel hacker can grok?
+
+Let's assume that a frontswap "backend" has registered during
+kernel initialization; this registration indicates that this
+frontswap backend has access to some "memory" that is not directly
+accessible by the kernel. Exactly how much memory it provides is
+entirely dynamic and random.
+
+Whenever a swap-device is swapon'd frontswap_init() is called,
+passing the swap device number (aka "type") as a parameter.
+This notifies frontswap to expect attempts to "store" swap pages
+associated with that number.
+
+Whenever the swap subsystem is readying a page to write to a swap
+device (cf. swap_writepage()), frontswap_store is called. Frontswap
+consults with the frontswap backend and if the backend says it does NOT
+have room, frontswap_store returns -1 and the kernel swaps the page
+to the swap device as normal. Note that the response from the frontswap
+backend is unpredictable to the kernel; it may choose to never accept a
+page, it could accept every ninth page, or it might accept every
+page. But if the backend does accept a page, the data from the page
+has already been copied and associated with the type and offset,
+and the backend guarantees the persistence of the data. In this case,
+frontswap sets a bit in the "frontswap_map" for the swap device
+corresponding to the page offset on the swap device to which it would
+otherwise have written the data.
+
+When the swap subsystem needs to swap-in a page (swap_readpage()),
+it first calls frontswap_load() which checks the frontswap_map to
+see if the page was earlier accepted by the frontswap backend. If
+it was, the page of data is filled from the frontswap backend and
+the swap-in is complete. If not, the normal swap-in code is
+executed to obtain the page of data from the real swap device.
+
+So every time the frontswap backend accepts a page, a swap device write
+and (potentially) a later swap device read are replaced by a "frontswap
+backend store" and (possibly) a "frontswap backend load", which are
+presumably much faster.
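+
+In pseudo-code terms (illustrative only, not the literal kernel code;
+swap_out_one_page() and write_to_swap_device() are hypothetical helpers):
+
+    int swap_out_one_page(struct page *page)
+    {
+        if (frontswap_store(page) == 0) {
+            /* accepted: the data now lives in transcendent memory, the
+             * corresponding bit in frontswap_map is set, and the disk
+             * write is skipped entirely */
+            return 0;
+        }
+        /* rejected: fall back to the normal block I/O swap write path */
+        return write_to_swap_device(page);
+    }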
+
+4) Can't frontswap be configured as a "special" swap device that is
+ just higher priority than any real swap device (e.g. like zswap,
+ or maybe swap-over-nbd/NFS)?
+
+No. First, the existing swap subsystem doesn't allow for any kind of
+swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy,
+but this would require fairly drastic changes. Even if it were
+rewritten, the existing swap subsystem uses the block I/O layer which
+assumes a swap device is fixed size and any page in it is linearly
+addressable. Frontswap barely touches the existing swap subsystem,
+and works around the constraints of the block I/O subsystem to provide
+a great deal of flexibility and dynamicity.
+
+For example, the acceptance of any swap page by the frontswap backend is
+entirely unpredictable. This is critical to the definition of frontswap
+backends because it grants completely dynamic discretion to the
+backend. In zcache, one cannot know a priori how compressible a page is.
+"Poorly" compressible pages can be rejected, and "poorly" can itself be
+defined dynamically depending on current memory constraints.
+
+Further, frontswap is entirely synchronous whereas a real swap
+device is, by definition, asynchronous and uses block I/O. The
+block I/O layer is not only unnecessary, but may perform "optimizations"
+that are inappropriate for a RAM-oriented device including delaying
+the write of some pages for a significant amount of time. Synchrony is
+required to ensure the dynamicity of the backend and to avoid thorny race
+conditions that would unnecessarily and greatly complicate frontswap
+and/or the block I/O subsystem. That said, only the initial "store"
+and "load" operations need be synchronous. A separate asynchronous thread
+is free to manipulate the pages stored by frontswap. For example,
+the "remotification" thread in RAMster uses standard asynchronous
+kernel sockets to move compressed frontswap pages to a remote machine.
+Similarly, a KVM guest-side implementation could do in-guest compression
+and use "batched" hypercalls.
+
+In a virtualized environment, the dynamicity allows the hypervisor
+(or host OS) to do "intelligent overcommit". For example, it can
+choose to accept pages only until host-swapping might be imminent,
+then force guests to do their own swapping.
+
+There is a downside to the transcendent memory specifications for
+frontswap: Since any "store" might fail, there must always be a real
+slot on a real swap device to swap the page. Thus frontswap must be
+implemented as a "shadow" to every swapon'd device with the potential
+capability of holding every page that the swap device might have held
+and the possibility that it might hold no pages at all. This means
+that frontswap cannot contain more pages than the total of swapon'd
+swap devices. For example, if NO swap device is configured on some
+installation, frontswap is useless. Swapless portable devices
+can still use frontswap but a backend for such devices must configure
+some kind of "ghost" swap device and ensure that it is never used.
+
+5) Why this weird definition about "duplicate stores"? If a page
+ has been previously successfully stored, can't it always be
+ successfully overwritten?
+
+Nearly always it can, but no, sometimes it cannot. Consider an example
+where data is compressed and the original 4K page has been compressed
+to 1K. Now an attempt is made to overwrite the page with data that
+is non-compressible and so would take the entire 4K. But the backend
+has no more space. In this case, the store must be rejected. Whenever
+frontswap rejects a store that would overwrite, it also must invalidate
+the old data and ensure that it is no longer accessible. Since the
+swap subsystem then writes the new data to the real swap device,
+this is the correct course of action to ensure coherency.
+
+6) What is frontswap_shrink for?
+
+When the (non-frontswap) swap subsystem swaps out a page to a real
+swap device, that page is only taking up low-value pre-allocated disk
+space. But if frontswap has placed a page in transcendent memory, that
+page may be taking up valuable real estate. The frontswap_shrink
+routine allows code outside of the swap subsystem to force pages out
+of the memory managed by frontswap and back into kernel-addressable memory.
+For example, in RAMster, a "suction driver" thread will attempt
+to "repatriate" pages sent to a remote machine back to the local machine;
+this is driven using the frontswap_shrink mechanism when memory pressure
+subsides.
+
+7) Why does the frontswap patch create the new include file swapfile.h?
+
+The frontswap code depends on some swap-subsystem-internal data
+structures that have, over the years, moved back and forth between
+static and global. This seemed a reasonable compromise: Define
+them as global but declare them in a new include file that isn't
+included by the large number of source files that include swap.h.
+
+Dan Magenheimer, last updated April 9, 2012
diff --git a/Documentation/vm/highmem.txt b/Documentation/vm/highmem.txt
new file mode 100644
index 00000000000..4324d24ffac
--- /dev/null
+++ b/Documentation/vm/highmem.txt
@@ -0,0 +1,162 @@
+
+ ====================
+ HIGH MEMORY HANDLING
+ ====================
+
+By: Peter Zijlstra <a.p.zijlstra@chello.nl>
+
+Contents:
+
+ (*) What is high memory?
+
+ (*) Temporary virtual mappings.
+
+ (*) Using kmap_atomic.
+
+ (*) Cost of temporary mappings.
+
+ (*) i386 PAE.
+
+
+====================
+WHAT IS HIGH MEMORY?
+====================
+
+High memory (highmem) is used when the size of physical memory approaches or
+exceeds the maximum size of virtual memory. At that point it becomes
+impossible for the kernel to keep all of the available physical memory mapped
+at all times. This means the kernel needs to start using temporary mappings of
+the pieces of physical memory that it wants to access.
+
+The part of (physical) memory not covered by a permanent mapping is what we
+refer to as 'highmem'. There are various architecture dependent constraints on
+where exactly that border lies.
+
+In the i386 arch, for example, we choose to map the kernel into every process's
+VM space so that we don't have to pay the full TLB invalidation costs for
+kernel entry/exit. This means the available virtual memory space (4GiB on
+i386) has to be divided between user and kernel space.
+
+The traditional split for architectures using this approach is 3:1, 3GiB for
+userspace and the top 1GiB for kernel space:
+
+ +--------+ 0xffffffff
+ | Kernel |
+ +--------+ 0xc0000000
+ | |
+ | User |
+ | |
+ +--------+ 0x00000000
+
+This means that the kernel can at most map 1GiB of physical memory at any one
+time, but because we need virtual address space for other things - including
+temporary maps to access the rest of the physical memory - the actual direct
+map will typically be less (usually around ~896MiB).
+
+Other architectures that have mm context tagged TLBs can have separate kernel
+and user maps. Some hardware (like some ARMs), however, have limited virtual
+space when they use mm context tags.
+
+
+==========================
+TEMPORARY VIRTUAL MAPPINGS
+==========================
+
+The kernel contains several ways of creating temporary mappings:
+
+ (*) vmap(). This can be used to make a long duration mapping of multiple
+ physical pages into a contiguous virtual space. It needs global
+ synchronization to unmap.
+
+ (*) kmap(). This permits a short duration mapping of a single page. It needs
+ global synchronization, but is amortized somewhat. It is also prone to
+ deadlocks when using in a nested fashion, and so it is not recommended for
+ new code.
+
+ (*) kmap_atomic(). This permits a very short duration mapping of a single
+ page. Since the mapping is restricted to the CPU that issued it, it
+ performs well, but the issuing task is therefore required to stay on that
+ CPU until it has finished, lest some other task displace its mappings.
+
+ kmap_atomic() may also be used by interrupt contexts, since it does not
+ sleep and the caller may not sleep until after kunmap_atomic() is called.
+
+ It may be assumed that k[un]map_atomic() won't fail.
+
+
+=================
+USING KMAP_ATOMIC
+=================
+
+When and where to use kmap_atomic() is straightforward. It is used when code
+wants to access the contents of a page that might be allocated from high memory
+(see __GFP_HIGHMEM), for example a page in the pagecache. The API has two
+functions, and they can be used in a manner similar to the following:
+
+ /* Find the page of interest. */
+ struct page *page = find_get_page(mapping, offset);
+
+ /* Gain access to the contents of that page. */
+ void *vaddr = kmap_atomic(page);
+
+ /* Do something to the contents of that page. */
+ memset(vaddr, 0, PAGE_SIZE);
+
+ /* Unmap that page. */
+ kunmap_atomic(vaddr);
+
+Note that the kunmap_atomic() call takes the result of the kmap_atomic() call
+not the argument.
+
+If you need to map two pages because you want to copy from one page to
+another you need to keep the kmap_atomic calls strictly nested, like:
+
+ vaddr1 = kmap_atomic(page1);
+ vaddr2 = kmap_atomic(page2);
+
+ memcpy(vaddr1, vaddr2, PAGE_SIZE);
+
+ kunmap_atomic(vaddr2);
+ kunmap_atomic(vaddr1);
+
+
+==========================
+COST OF TEMPORARY MAPPINGS
+==========================
+
+The cost of creating temporary mappings can be quite high. The arch has to
+manipulate the kernel's page tables, the data TLB and/or the MMU's registers.
+
+If CONFIG_HIGHMEM is not set, then the kernel will try to create a mapping
+simply with a bit of arithmetic that will convert the page struct address into
+a pointer to the page contents rather than juggling mappings about. In such a
+case, the unmap operation may be a null operation.
+
+If CONFIG_MMU is not set, then there can be no temporary mappings and no
+highmem. In such a case, the arithmetic approach will also be used.
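+
+Conceptually, the arithmetic approach is just a direct-map calculation
+(illustrative only; in the kernel the real helper is page_address()):
+
+    /* page frame number -> physical address -> direct-map virtual address */
+    void *vaddr = __va(page_to_pfn(page) << PAGE_SHIFT);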
+
+
+========
+i386 PAE
+========
+
+The i386 arch, under some circumstances, will permit you to stick up to 64GiB
+of RAM into your 32-bit machine. This has a number of consequences:
+
+ (*) Linux needs a page-frame structure for each page in the system and the
+ pageframes need to live in the permanent mapping, which means:
+
+ (*) you can have 896M/sizeof(struct page) page-frames at most; with struct
+ page being 32-bytes that would end up being something in the order of 112G
+ worth of pages; the kernel, however, needs to store more than just
+ page-frames in that memory...
+
+ (*) PAE makes your page tables larger - which slows the system down as more
+ data has to be accessed to traverse page tables in TLB fills and the like. One
+ advantage is that PAE has more PTE bits and can provide advanced features
+ like NX and PAT.
+
+The general recommendation is that you don't use more than 8GiB on a 32-bit
+machine - although more might work for you and your workload, you're pretty
+much on your own - don't expect kernel developers to really care much if things
+come apart.
diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt
index bc31636973e..bdd4bb97fff 100644
--- a/Documentation/vm/hugetlbpage.txt
+++ b/Documentation/vm/hugetlbpage.txt
@@ -72,7 +72,7 @@ number of huge pages requested. This is the most reliable method of
allocating huge pages as memory has not yet become fragmented.
Some platforms support multiple huge page sizes. To allocate huge pages
-of a specific size, one must preceed the huge pages boot command parameters
+of a specific size, one must precede the huge pages boot command parameters
with a huge page size selection parameter "hugepagesz=<size>". <size> must
be specified in bytes with optional scale suffix [kKmMgG]. The default huge
page size may be selected with the "default_hugepagesz=<size>" boot parameter.
@@ -165,6 +165,7 @@ which function as described above for the default huge page-sized case.
Interaction of Task Memory Policy with Huge Page Allocation/Freeing
+===================================================================
Whether huge pages are allocated and freed via the /proc interface or
the /sysfs interface using the nr_hugepages_mempolicy attribute, the NUMA
@@ -229,6 +230,7 @@ resulting effect on persistent huge page allocation is as follows:
of huge pages over all on-lines nodes with memory.
Per Node Hugepages Attributes
+=============================
A subset of the contents of the root huge page control directory in sysfs,
described above, will be replicated under each the system device of each
@@ -258,6 +260,7 @@ applied, from which node the huge page allocation will be attempted.
Using Huge Pages
+================
If the user applications are going to request huge pages using mmap system
call, then it is required that system administrator mount a file system of
@@ -296,179 +299,16 @@ calls, though the mount of filesystem will be required for using mmap calls
without MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
map_hugetlb.c.
-*******************************************************************
-
-/*
- * Example of using huge page memory in a user application using Sys V shared
- * memory system calls. In this example the app is requesting 256MB of
- * memory that is backed by huge pages. The application uses the flag
- * SHM_HUGETLB in the shmget system call to inform the kernel that it is
- * requesting huge pages.
- *
- * For the ia64 architecture, the Linux kernel reserves Region number 4 for
- * huge pages. That means that if one requires a fixed address, a huge page
- * aligned address starting with 0x800000... will be required. If a fixed
- * address is not required, the kernel will select an address in the proper
- * range.
- * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
- *
- * Note: The default shared memory limit is quite low on many kernels,
- * you may need to increase it via:
- *
- * echo 268435456 > /proc/sys/kernel/shmmax
- *
- * This will increase the maximum size per shared memory segment to 256MB.
- * The other limit that you will hit eventually is shmall which is the
- * total amount of shared memory in pages. To set it to 16GB on a system
- * with a 4kB pagesize do:
- *
- * echo 4194304 > /proc/sys/kernel/shmall
- */
-#include <stdlib.h>
-#include <stdio.h>
-#include <sys/types.h>
-#include <sys/ipc.h>
-#include <sys/shm.h>
-#include <sys/mman.h>
-
-#ifndef SHM_HUGETLB
-#define SHM_HUGETLB 04000
-#endif
-
-#define LENGTH (256UL*1024*1024)
-
-#define dprintf(x) printf(x)
-
-#define ADDR (void *)(0x0UL) /* let kernel choose address */
-#define SHMAT_FLAGS (0)
-
-int main(void)
-{
- int shmid;
- unsigned long i;
- char *shmaddr;
-
- if ((shmid = shmget(2, LENGTH,
- SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W)) < 0) {
- perror("shmget");
- exit(1);
- }
- printf("shmid: 0x%x\n", shmid);
-
- shmaddr = shmat(shmid, ADDR, SHMAT_FLAGS);
- if (shmaddr == (char *)-1) {
- perror("Shared memory attach failure");
- shmctl(shmid, IPC_RMID, NULL);
- exit(2);
- }
- printf("shmaddr: %p\n", shmaddr);
-
- dprintf("Starting the writes:\n");
- for (i = 0; i < LENGTH; i++) {
- shmaddr[i] = (char)(i);
- if (!(i % (1024 * 1024)))
- dprintf(".");
- }
- dprintf("\n");
-
- dprintf("Starting the Check...");
- for (i = 0; i < LENGTH; i++)
- if (shmaddr[i] != (char)i)
- printf("\nIndex %lu mismatched\n", i);
- dprintf("Done.\n");
-
- if (shmdt((const void *)shmaddr) != 0) {
- perror("Detach failure");
- shmctl(shmid, IPC_RMID, NULL);
- exit(3);
- }
-
- shmctl(shmid, IPC_RMID, NULL);
-
- return 0;
-}
-
-*******************************************************************
-
-/*
- * Example of using huge page memory in a user application using the mmap
- * system call. Before running this application, make sure that the
- * administrator has mounted the hugetlbfs filesystem (on some directory
- * like /mnt) using the command mount -t hugetlbfs nodev /mnt. In this
- * example, the app is requesting memory of size 256MB that is backed by
- * huge pages.
- *
- * For the ia64 architecture, the Linux kernel reserves Region number 4 for
- * huge pages. That means that if one requires a fixed address, a huge page
- * aligned address starting with 0x800000... will be required. If a fixed
- * address is not required, the kernel will select an address in the proper
- * range.
- * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
- */
-#include <stdlib.h>
-#include <stdio.h>
-#include <unistd.h>
-#include <sys/mman.h>
-#include <fcntl.h>
-
-#define FILE_NAME "/mnt/hugepagefile"
-#define LENGTH (256UL*1024*1024)
-#define PROTECTION (PROT_READ | PROT_WRITE)
-
-#define ADDR (void *)(0x0UL) /* let kernel choose address */
-#define FLAGS (MAP_SHARED)
-
-void check_bytes(char *addr)
-{
- printf("First hex is %x\n", *((unsigned int *)addr));
-}
-
-void write_bytes(char *addr)
-{
- unsigned long i;
-
- for (i = 0; i < LENGTH; i++)
- *(addr + i) = (char)i;
-}
-
-void read_bytes(char *addr)
-{
- unsigned long i;
-
- check_bytes(addr);
- for (i = 0; i < LENGTH; i++)
- if (*(addr + i) != (char)i) {
- printf("Mismatch at %lu\n", i);
- break;
- }
-}
-
-int main(void)
-{
- void *addr;
- int fd;
-
- fd = open(FILE_NAME, O_CREAT | O_RDWR, 0755);
- if (fd < 0) {
- perror("Open failed");
- exit(1);
- }
-
- addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, fd, 0);
- if (addr == MAP_FAILED) {
- perror("mmap");
- unlink(FILE_NAME);
- exit(1);
- }
-
- printf("Returned address is %p\n", addr);
- check_bytes(addr);
- write_bytes(addr);
- read_bytes(addr);
-
- munmap(addr, LENGTH);
- close(fd);
- unlink(FILE_NAME);
-
- return 0;
-}
+Examples
+========
+
+1) map_hugetlb: see tools/testing/selftests/vm/map_hugetlb.c
+
+2) hugepage-shm: see tools/testing/selftests/vm/hugepage-shm.c
+
+3) hugepage-mmap: see tools/testing/selftests/vm/hugepage-mmap.c
+
+4) The libhugetlbfs (http://libhugetlbfs.sourceforge.net) library provides a
+ wide range of userspace tools to help with huge page usability, environment
+ setup, and control. Furthermore it provides useful test cases that should be
+ used when modifying code to ensure no regressions are introduced.
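+
+For quick reference, a minimal sketch of the MAP_HUGETLB case looks like the
+following; the selftests above remain the complete, maintained versions:
+
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <sys/mman.h>
+
+    #ifndef MAP_HUGETLB
+    #define MAP_HUGETLB 0x40000     /* value on most architectures */
+    #endif
+
+    #define LENGTH (256UL * 1024 * 1024)
+
+    int main(void)
+    {
+        void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
+                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
+        if (addr == MAP_FAILED) {
+            perror("mmap");
+            exit(1);
+        }
+        printf("Returned address is %p\n", addr);
+        munmap(addr, LENGTH);
+        return 0;
+    }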
diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt
index 12f9ba20ccb..6ae89a9edf2 100644
--- a/Documentation/vm/hwpoison.txt
+++ b/Documentation/vm/hwpoison.txt
@@ -84,6 +84,11 @@ PR_MCE_KILL
PR_MCE_KILL_EARLY: Early kill
PR_MCE_KILL_LATE: Late kill
PR_MCE_KILL_DEFAULT: Use system global default
+ Note that if you want to have a dedicated thread which handles
+ the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
+ call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise,
+ the SIGBUS is sent to the main thread.
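+
+ A dedicated handler thread would typically issue something like the
+ following before any poison event occurs (sketch; error checking omitted):
+
+    #include <sys/prctl.h>
+
+    /* run in the designated handler thread */
+    prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);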
+
PR_MCE_KILL_GET
return current mode
@@ -129,12 +134,12 @@ Limit injection to pages owned by memgroup. Specified by inode number
of the memcg.
Example:
- mkdir /cgroup/hwpoison
+ mkdir /sys/fs/cgroup/mem/hwpoison
usemem -m 100 -s 1000 &
- echo `jobs -p` > /cgroup/hwpoison/tasks
+ echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks
- memcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ')
+ memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg
page-types -p `pidof init` --hwpoison # shall do nothing
diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
index b392e496f81..f34a8ee6f86 100644
--- a/Documentation/vm/ksm.txt
+++ b/Documentation/vm/ksm.txt
@@ -58,6 +58,21 @@ sleep_millisecs - how many milliseconds ksmd should sleep before next scan
e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
Default: 20 (chosen for demonstration purposes)
+merge_across_nodes - specifies if pages from different NUMA nodes can be merged.
+ When set to 0, ksm merges only pages which physically
+ reside in the memory area of the same NUMA node. That brings
+ lower latency when accessing shared pages. Systems with more
+ nodes, at significant NUMA distances, are likely to benefit
+ from the lower latency of setting 0. Smaller systems, which
+ need to minimize memory usage, are likely to benefit from
+ the greater sharing of setting 1 (default). You may wish to
+ compare how your system performs under each setting, before
+ deciding on which to use. The merge_across_nodes setting can
+ only be changed when there are no ksm shared pages in the
+ system: set run to 2 to unmerge pages first, then to 1 after changing
+ merge_across_nodes, to remerge according to the new setting.
+ Default: 1 (merging across nodes as in earlier releases)
+
run - set 0 to stop ksmd from running but keep merged pages,
set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
set 2 to stop ksmd and unmerge all pages currently merged,
diff --git a/Documentation/vm/locking b/Documentation/vm/locking
deleted file mode 100644
index 25fadb44876..00000000000
--- a/Documentation/vm/locking
+++ /dev/null
@@ -1,130 +0,0 @@
-Started Oct 1999 by Kanoj Sarcar <kanojsarcar@yahoo.com>
-
-The intent of this file is to have an uptodate, running commentary
-from different people about how locking and synchronization is done
-in the Linux vm code.
-
-page_table_lock & mmap_sem
---------------------------------------
-
-Page stealers pick processes out of the process pool and scan for
-the best process to steal pages from. To guarantee the existence
-of the victim mm, a mm_count inc and a mmdrop are done in swap_out().
-Page stealers hold kernel_lock to protect against a bunch of races.
-The vma list of the victim mm is also scanned by the stealer,
-and the page_table_lock is used to preserve list sanity against the
-process adding/deleting to the list. This also guarantees existence
-of the vma. Vma existence is not guaranteed once try_to_swap_out()
-drops the page_table_lock. To guarantee the existence of the underlying
-file structure, a get_file is done before the swapout() method is
-invoked. The page passed into swapout() is guaranteed not to be reused
-for a different purpose because the page reference count due to being
-present in the user's pte is not released till after swapout() returns.
-
-Any code that modifies the vmlist, or the vm_start/vm_end/
-vm_flags:VM_LOCKED/vm_next of any vma *in the list* must prevent
-kswapd from looking at the chain.
-
-The rules are:
-1. To scan the vmlist (look but don't touch) you must hold the
- mmap_sem with read bias, i.e. down_read(&mm->mmap_sem)
-2. To modify the vmlist you need to hold the mmap_sem with
- read&write bias, i.e. down_write(&mm->mmap_sem) *AND*
- you need to take the page_table_lock.
-3. The swapper takes _just_ the page_table_lock, this is done
- because the mmap_sem can be an extremely long lived lock
- and the swapper just cannot sleep on that.
-4. The exception to this rule is expand_stack, which just
- takes the read lock and the page_table_lock, this is ok
- because it doesn't really modify fields anybody relies on.
-5. You must be able to guarantee that while holding page_table_lock
- or page_table_lock of mm A, you will not try to get either lock
- for mm B.
-
-The caveats are:
-1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
-The update of mmap_cache is racy (page stealer can race with other code
-that invokes find_vma with mmap_sem held), but that is okay, since it
-is a hint. This can be fixed, if desired, by having find_vma grab the
-page_table_lock.
-
-
-Code that add/delete elements from the vmlist chain are
-1. callers of insert_vm_struct
-2. callers of merge_segments
-3. callers of avl_remove
-
-Code that changes vm_start/vm_end/vm_flags:VM_LOCKED of vma's on
-the list:
-1. expand_stack
-2. mprotect
-3. mlock
-4. mremap
-
-It is advisable that changes to vm_start/vm_end be protected, although
-in some cases it is not really needed. Eg, vm_start is modified by
-expand_stack(), it is hard to come up with a destructive scenario without
-having the vmlist protection in this case.
-
-The page_table_lock nests with the inode i_mmap_lock and the kmem cache
-c_spinlock spinlocks. This is okay, since the kmem code asks for pages after
-dropping c_spinlock. The page_table_lock also nests with pagecache_lock and
-pagemap_lru_lock spinlocks, and no code asks for memory with these locks
-held.
-
-The page_table_lock is grabbed while holding the kernel_lock spinning monitor.
-
-The page_table_lock is a spin lock.
-
-Note: PTL can also be used to guarantee that no new clones using the
-mm start up ... this is a loose form of stability on mm_users. For
-example, it is used in copy_mm to protect against a racing tlb_gather_mmu
-single address space optimization, so that the zap_page_range (from
-truncate) does not lose sending ipi's to cloned threads that might
-be spawned underneath it and go to user mode to drag in pte's into tlbs.
-
-swap_lock
---------------
-The swap devices are chained in priority order from the "swap_list" header.
-The "swap_list" is used for the round-robin swaphandle allocation strategy.
-The #free swaphandles is maintained in "nr_swap_pages". These two together
-are protected by the swap_lock.
-
-The swap_lock also protects all the device reference counts on the
-corresponding swaphandles, maintained in the "swap_map" array, and the
-"highest_bit" and "lowest_bit" fields.
-
-The swap_lock is a spinlock, and is never acquired from intr level.
-
-To prevent races between swap space deletion or async readahead swapins
-deciding whether a swap handle is being used, ie worthy of being read in
-from disk, and an unmap -> swap_free making the handle unused, the swap
-delete and readahead code grabs a temp reference on the swaphandle to
-prevent warning messages from swap_duplicate <- read_swap_cache_async.
-
-Swap cache locking
-------------------
-Pages are added into the swap cache with kernel_lock held, to make sure
-that multiple pages are not being added (and hence lost) by associating
-all of them with the same swaphandle.
-
-Pages are guaranteed not to be removed from the scache if the page is
-"shared": ie, other processes hold reference on the page or the associated
-swap handle. The only code that does not follow this rule is shrink_mmap,
-which deletes pages from the swap cache if no process has a reference on
-the page (multiple processes might have references on the corresponding
-swap handle though). lookup_swap_cache() races with shrink_mmap, when
-establishing a reference on a scache page, so, it must check whether the
-page it located is still in the swapcache, or shrink_mmap deleted it.
-(This race is due to the fact that shrink_mmap looks at the page ref
-count with pagecache_lock, but then drops pagecache_lock before deleting
-the page from the scache).
-
-do_wp_page and do_swap_page have MP races in them while trying to figure
-out whether a page is "shared", by looking at the page_count + swap_count.
-To preserve the sum of the counts, the page lock _must_ be acquired before
-calling is_page_shared (else processes might switch their swap_count refs
-to the page count refs, after the page count ref has been snapshotted).
-
-Swap device deletion code currently breaks all the scache assumptions,
-since it grabs neither mmap_sem nor page_table_lock.
diff --git a/Documentation/vm/map_hugetlb.c b/Documentation/vm/map_hugetlb.c
deleted file mode 100644
index e2bdae37f49..00000000000
--- a/Documentation/vm/map_hugetlb.c
+++ /dev/null
@@ -1,77 +0,0 @@
-/*
- * Example of using hugepage memory in a user application using the mmap
- * system call with MAP_HUGETLB flag. Before running this program make
- * sure the administrator has allocated enough default sized huge pages
- * to cover the 256 MB allocation.
- *
- * For ia64 architecture, Linux kernel reserves Region number 4 for hugepages.
- * That means the addresses starting with 0x800000... will need to be
- * specified. Specifying a fixed address is not required on ppc64, i386
- * or x86_64.
- */
-#include <stdlib.h>
-#include <stdio.h>
-#include <unistd.h>
-#include <sys/mman.h>
-#include <fcntl.h>
-
-#define LENGTH (256UL*1024*1024)
-#define PROTECTION (PROT_READ | PROT_WRITE)
-
-#ifndef MAP_HUGETLB
-#define MAP_HUGETLB 0x40
-#endif
-
-/* Only ia64 requires this */
-#ifdef __ia64__
-#define ADDR (void *)(0x8000000000000000UL)
-#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_FIXED)
-#else
-#define ADDR (void *)(0x0UL)
-#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)
-#endif
-
-void check_bytes(char *addr)
-{
- printf("First hex is %x\n", *((unsigned int *)addr));
-}
-
-void write_bytes(char *addr)
-{
- unsigned long i;
-
- for (i = 0; i < LENGTH; i++)
- *(addr + i) = (char)i;
-}
-
-void read_bytes(char *addr)
-{
- unsigned long i;
-
- check_bytes(addr);
- for (i = 0; i < LENGTH; i++)
- if (*(addr + i) != (char)i) {
- printf("Mismatch at %lu\n", i);
- break;
- }
-}
-
-int main(void)
-{
- void *addr;
-
- addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, 0, 0);
- if (addr == MAP_FAILED) {
- perror("mmap");
- exit(1);
- }
-
- printf("Returned address is %p\n", addr);
- check_bytes(addr);
- write_bytes(addr);
- read_bytes(addr);
-
- munmap(addr, LENGTH);
-
- return 0;
-}
diff --git a/Documentation/vm/numa b/Documentation/vm/numa
index e93ad9425e2..ade01274212 100644
--- a/Documentation/vm/numa
+++ b/Documentation/vm/numa
@@ -1,41 +1,149 @@
Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>
-The intent of this file is to have an uptodate, running commentary
-from different people about NUMA specific code in the Linux vm.
-
-What is NUMA? It is an architecture where the memory access times
-for different regions of memory from a given processor varies
-according to the "distance" of the memory region from the processor.
-Each region of memory to which access times are the same from any
-cpu, is called a node. On such architectures, it is beneficial if
-the kernel tries to minimize inter node communications. Schemes
-for this range from kernel text and read-only data replication
-across nodes, and trying to house all the data structures that
-key components of the kernel need on memory on that node.
-
-Currently, all the numa support is to provide efficient handling
-of widely discontiguous physical memory, so architectures which
-are not NUMA but can have huge holes in the physical address space
-can use the same code. All this code is bracketed by CONFIG_DISCONTIGMEM.
-
-The initial port includes NUMAizing the bootmem allocator code by
-encapsulating all the pieces of information into a bootmem_data_t
-structure. Node specific calls have been added to the allocator.
-In theory, any platform which uses the bootmem allocator should
-be able to put the bootmem and mem_map data structures anywhere
-it deems best.
-
-Each node's page allocation data structures have also been encapsulated
-into a pg_data_t. The bootmem_data_t is just one part of this. To
-make the code look uniform between NUMA and regular UMA platforms,
-UMA platforms have a statically allocated pg_data_t too (contig_page_data).
-For the sake of uniformity, the function num_online_nodes() is also defined
-for all platforms. As we run benchmarks, we might decide to NUMAize
-more variables like low_on_memory, nr_free_pages etc into the pg_data_t.
-
-The NUMA aware page allocation code currently tries to allocate pages
-from different nodes in a round robin manner. This will be changed to
-do concentratic circle search, starting from current node, once the
-NUMA port achieves more maturity. The call alloc_pages_node has been
-added, so that drivers can make the call and not worry about whether
-it is running on a NUMA or UMA platform.
+What is NUMA?
+
+This question can be answered from a couple of perspectives: the
+hardware view and the Linux software view.
+
+From the hardware perspective, a NUMA system is a computer platform that
+comprises multiple components or assemblies each of which may contain 0
+or more CPUs, local memory, and/or IO buses. For brevity and to
+disambiguate the hardware view of these physical components/assemblies
+from the software abstraction thereof, we'll call the components/assemblies
+'cells' in this document.
+
+Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset
+of the system--although some components necessary for a stand-alone SMP system
+may not be populated on any given cell. The cells of the NUMA system are
+connected together with some sort of system interconnect--e.g., a crossbar or
+point-to-point link are common types of NUMA system interconnects. Both of
+these types of interconnects can be aggregated to create NUMA platforms with
+cells at multiple distances from other cells.
+
+For Linux, the NUMA platforms of interest are primarily what is known as Cache
+Coherent NUMA or ccNUMA systems. With ccNUMA systems, all memory is visible
+to and accessible from any CPU attached to any cell and cache coherency
+is handled in hardware by the processor caches and/or the system interconnect.
+
+Memory access time and effective memory bandwidth vary depending on how far
+away the cell containing the CPU or IO bus making the memory access is from the
+cell containing the target memory. For example, access to memory by CPUs
+attached to the same cell will experience faster access times and higher
+bandwidths than accesses to memory on other, remote cells. NUMA platforms
+can have cells at multiple remote distances from any given cell.
+
+Platform vendors don't build NUMA systems just to make software developers'
+lives interesting. Rather, this architecture is a means to provide scalable
+memory bandwidth. However, to achieve scalable memory bandwidth, system and
+application software must arrange for a large majority of the memory references
+[cache misses] to be to "local" memory--memory on the same cell, if any--or
+to the closest cell with memory.
+
+This leads to the Linux software view of a NUMA system:
+
+Linux divides the system's hardware resources into multiple software
+abstractions called "nodes". Linux maps the nodes onto the physical cells
+of the hardware platform, abstracting away some of the details for some
+architectures. As with physical cells, software nodes may contain 0 or more
+CPUs, memory and/or IO buses. And, again, memory accesses to memory on
+"closer" nodes--nodes that map to closer cells--will generally experience
+faster access times and higher effective bandwidth than accesses to more
+remote cells.
+
+For some architectures, such as x86, Linux will "hide" any node representing a
+physical cell that has no memory attached, and reassign any CPUs attached to
+that cell to a node representing a cell that does have memory. Thus, on
+these architectures, one cannot assume that all CPUs that Linux associates with
+a given node will see the same local memory access times and bandwidth.
+
+In addition, for some architectures, again x86 is an example, Linux supports
+the emulation of additional nodes. For NUMA emulation, Linux will carve up
+the existing nodes--or the system memory for non-NUMA platforms--into multiple
+nodes. Each emulated node will manage a fraction of the underlying cells'
+physical memory. NUMA emulation is useful for testing NUMA kernel and
+application features on non-NUMA platforms, and as a sort of memory resource
+management mechanism when used together with cpusets.
+[see Documentation/cgroups/cpusets.txt]
+
+For each node with memory, Linux constructs an independent memory management
+subsystem, complete with its own free page lists, in-use page lists, usage
+statistics and locks to mediate access. In addition, Linux constructs for
+each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE],
+an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a
+selected zone/node cannot satisfy the allocation request. This situation,
+when a zone has no available memory to satisfy a request, is called
+"overflow" or "fallback".
+
+Because some nodes contain multiple zones containing different types of
+memory, Linux must decide whether to order the zonelists such that allocations
+fall back to the same zone type on a different node, or to a different zone
+type on the same node. This is an important consideration because some zones,
+such as DMA or DMA32, represent relatively scarce resources. Linux chooses
+a default zonelist order based on the sizes of the various zone types relative
+to the total memory of the node and the total memory of the system. The
+default zonelist order may be overridden using the numa_zonelist_order kernel
+boot parameter or sysctl. [see Documentation/kernel-parameters.txt and
+Documentation/sysctl/vm.txt]
+
+By default, Linux will attempt to satisfy memory allocation requests from the
+node to which the CPU that executes the request is assigned. Specifically,
+Linux will attempt to allocate from the first node in the appropriate zonelist
+for the node where the request originates. This is called "local allocation."
+If the "local" node cannot satisfy the request, the kernel will examine other
+nodes' zones in the selected zonelist looking for the first zone in the list
+that can satisfy the request.
+
+Local allocation will tend to keep subsequent access to the allocated memory
+"local" to the underlying physical resources and off the system interconnect--
+as long as the task on whose behalf the kernel allocated some memory does not
+later migrate away from that memory. The Linux scheduler is aware of the
+NUMA topology of the platform--embodied in the "scheduling domains" data
+structures [see Documentation/scheduler/sched-domains.txt]--and the scheduler
+attempts to minimize task migration to distant scheduling domains. However,
+the scheduler does not take a task's NUMA footprint into account directly.
+Thus, under sufficient imbalance, tasks can migrate between nodes, remote
+from their initial node and kernel data structures.
+
+System administrators and application designers can restrict a task's migration
+to improve NUMA locality using various CPU affinity command line interfaces,
+such as taskset(1) and numactl(1), and program interfaces such as
+sched_setaffinity(2). Further, one can modify the kernel's default local
+allocation behavior using Linux NUMA memory policy.
+[see Documentation/vm/numa_memory_policy.txt.]
+
+System administrators can restrict the CPUs and nodes' memories that a non-
+privileged user can specify in the scheduling or NUMA commands and functions
+using control groups and CPUsets. [see Documentation/cgroups/cpusets.txt]
+
+On architectures that do not hide memoryless nodes, Linux will include only
+zones [nodes] with memory in the zonelists. This means that for a memoryless
+node the "local memory node"--the node of the first zone in the CPU's node's
+zonelist--will not be the node itself.  Rather, it will be the node that the
+kernel selected as the nearest node with memory when it built the zonelists.
+So, default local allocations will succeed, with the kernel supplying the
+closest available memory. This is a consequence of the same mechanism that
+allows such allocations to fallback to other nearby nodes when a node that
+does contain memory overflows.
+
+Some kernel allocations do not want or cannot tolerate this allocation fallback
+behavior. Rather, they want to be sure they get memory from the specified node
+or get notified that the node has no free memory. This is usually the case when
+a subsystem allocates per CPU memory resources, for example.
+
+A typical model for making such an allocation is to obtain the node id of the
+node to which the "current CPU" is attached using one of the kernel's
+numa_node_id() or cpu_to_node() functions and then request memory from only
+the node id returned. When such an allocation fails, the requesting subsystem
+may revert to its own fallback path. The slab kernel memory allocator is an
+example of this. Or, the subsystem may choose to disable or not to enable
+itself on allocation failure. The kernel profiling subsystem is an example of
+this.
+
+If the architecture supports--does not hide--memoryless nodes, then CPUs
+attached to memoryless nodes would always incur the fallback path overhead
+or some subsystems would fail to initialize if they attempted to allocate
+memory exclusively from a node without memory. To support such
+architectures transparently, kernel subsystems can use the numa_mem_id()
+or cpu_to_mem() function to locate the "local memory node" for the calling or
+specified CPU. Again, this is the same node from which default, local page
+allocations will be attempted.
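+
+Below is a minimal, hypothetical sketch of the allocation pattern described
+above (kernel-internal code, for illustration only; alloc_local_buffer() is
+not an existing kernel function, and real subsystems apply their own
+fallback or disable policy on failure):
+
+	#include <linux/gfp.h>
+	#include <linux/slab.h>
+	#include <linux/topology.h>
+
+	static void *alloc_local_buffer(size_t size)
+	{
+		/* Nearest node with memory for the calling CPU; on
+		 * architectures with memoryless nodes this may differ
+		 * from numa_node_id(). */
+		int nid = numa_mem_id();
+		void *buf;
+
+		/* __GFP_THISNODE restricts the allocation to that node
+		 * rather than falling back through the zonelist. */
+		buf = kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE, nid);
+
+		return buf;	/* NULL means the caller must handle fallback */
+	}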
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index be45dbb9d7f..badb0507608 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -45,7 +45,7 @@ most general to most specific:
to establish the task policy for a child task exec()'d from an
executable image that has no awareness of memory policy. See the
MEMORY POLICY APIS section, below, for an overview of the system call
- that a task may use to set/change it's task/process policy.
+ that a task may use to set/change its task/process policy.
In a multi-threaded task, task policies apply only to the thread
[Linux kernel task] that installs the policy and any threads
@@ -174,7 +174,6 @@ Components of Memory Policies
allocation fails, the kernel will search other nodes, in order of
increasing distance from the preferred node based on information
provided by the platform firmware.
- containing the cpu where the allocation takes place.
Internally, the Preferred policy uses a single node--the
preferred_node member of struct mempolicy. When the internal
@@ -275,9 +274,9 @@ Components of Memory Policies
For example, consider a task that is attached to a cpuset with
mems 2-5 that sets an Interleave policy over the same set with
MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
- interleave now occurs over nodes 3,5-6. If the cpuset's mems
+ interleave now occurs over nodes 3,5-7. If the cpuset's mems
then change to 0,2-3,5, then the interleave occurs over nodes
- 0,3,5.
+ 0,2-3,5.
Thanks to the consistent remapping, applications preparing
nodemasks to specify memory policies using this flag should
@@ -301,7 +300,7 @@ decrement this reference count, respectively. mpol_put() will only free
the structure back to the mempolicy kmem cache when the reference count
goes to zero.
-When a new memory policy is allocated, it's reference count is initialized
+When a new memory policy is allocated, its reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy. When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
@@ -424,7 +423,7 @@ a command line tool, numactl(8), exists that allows one to:
+ set the shared policy for a shared memory segment via mbind(2)
-The numactl(8) tool is packages with the run-time version of the library
+The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers. Some distributions
package the headers and compile-time libraries in a separate development
package.
diff --git a/Documentation/vm/overcommit-accounting b/Documentation/vm/overcommit-accounting
index 21c7b1f8f32..cbfaaa67411 100644
--- a/Documentation/vm/overcommit-accounting
+++ b/Documentation/vm/overcommit-accounting
@@ -4,23 +4,30 @@ The Linux kernel supports the following overcommit handling modes
address space are refused. Used for a typical system. It
ensures a seriously wild allocation fails while allowing
overcommit to reduce swap usage. root is allowed to
- allocate slighly more memory in this mode. This is the
+ allocate slightly more memory in this mode. This is the
default.
1 - Always overcommit. Appropriate for some scientific
- applications.
+		applications. A classic example is code using sparse arrays
+		that relies on the virtual memory consisting almost entirely
+		of zero pages (a short example follows below).
2 - Don't overcommit. The total address space commit
for the system is not permitted to exceed swap + a
- configurable percentage (default is 50) of physical RAM.
- Depending on the percentage you use, in most situations
+ configurable amount (default is 50%) of physical RAM.
+ Depending on the amount you use, in most situations
this means a process will not be killed while accessing
pages but will receive errors on memory allocation as
appropriate.
+ Useful for applications that want to guarantee their
+ memory allocations will be available in the future
+ without having to initialize every page.
+
The overcommit policy is set via the sysctl `vm.overcommit_memory'.
-The overcommit percentage is set via `vm.overcommit_ratio'.
+The overcommit amount can be set via `vm.overcommit_ratio' (percentage)
+or `vm.overcommit_kbytes' (absolute value).
The current overcommit limit and amount committed are viewable in
/proc/meminfo as CommitLimit and Committed_AS respectively.
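+
+As a rough, hypothetical illustration of the sparse-array case mentioned for
+mode 1 (assuming a 64-bit system; not part of the kernel sources): the
+program below commits a large range of virtual address space but only ever
+touches two pages of it. Under mode 1 the mmap() always succeeds; under
+mode 2 it may be refused even though almost no memory is actually used.
+
+	#include <stdio.h>
+	#include <string.h>
+	#include <sys/mman.h>
+
+	int main(void)
+	{
+		size_t len = 16UL << 30;	/* 16 GB of virtual space */
+		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
+			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+		if (p == MAP_FAILED) {
+			perror("mmap");
+			return 1;
+		}
+		/* Only two pages are ever faulted in; the rest of the
+		 * mapping remains untouched zero pages. */
+		strcpy(p, "hello");
+		strcpy(p + (8UL << 30), "world");
+		printf("%s %s\n", p, p + (8UL << 30));
+		return munmap(p, len);
+	}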
diff --git a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c
deleted file mode 100644
index 66e9358e214..00000000000
--- a/Documentation/vm/page-types.c
+++ /dev/null
@@ -1,1003 +0,0 @@
-/*
- * page-types: Tool for querying page flags
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; version 2.
- *
- * This program is distributed in the hope that it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
- * more details.
- *
- * You should find a copy of v2 of the GNU General Public License somewhere on
- * your Linux system; if not, write to the Free Software Foundation, Inc., 59
- * Temple Place, Suite 330, Boston, MA 02111-1307 USA.
- *
- * Copyright (C) 2009 Intel corporation
- *
- * Authors: Wu Fengguang <fengguang.wu@intel.com>
- */
-
-#define _LARGEFILE64_SOURCE
-#include <stdio.h>
-#include <stdlib.h>
-#include <unistd.h>
-#include <stdint.h>
-#include <stdarg.h>
-#include <string.h>
-#include <getopt.h>
-#include <limits.h>
-#include <assert.h>
-#include <sys/types.h>
-#include <sys/errno.h>
-#include <sys/fcntl.h>
-
-
-/*
- * pagemap kernel ABI bits
- */
-
-#define PM_ENTRY_BYTES sizeof(uint64_t)
-#define PM_STATUS_BITS 3
-#define PM_STATUS_OFFSET (64 - PM_STATUS_BITS)
-#define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
-#define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)
-#define PM_PSHIFT_BITS 6
-#define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
-#define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
-#define PM_PSHIFT(x) (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
-#define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1)
-#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
-
-#define PM_PRESENT PM_STATUS(4LL)
-#define PM_SWAP PM_STATUS(2LL)
-
-
-/*
- * kernel page flags
- */
-
-#define KPF_BYTES 8
-#define PROC_KPAGEFLAGS "/proc/kpageflags"
-
-/* copied from kpageflags_read() */
-#define KPF_LOCKED 0
-#define KPF_ERROR 1
-#define KPF_REFERENCED 2
-#define KPF_UPTODATE 3
-#define KPF_DIRTY 4
-#define KPF_LRU 5
-#define KPF_ACTIVE 6
-#define KPF_SLAB 7
-#define KPF_WRITEBACK 8
-#define KPF_RECLAIM 9
-#define KPF_BUDDY 10
-
-/* [11-20] new additions in 2.6.31 */
-#define KPF_MMAP 11
-#define KPF_ANON 12
-#define KPF_SWAPCACHE 13
-#define KPF_SWAPBACKED 14
-#define KPF_COMPOUND_HEAD 15
-#define KPF_COMPOUND_TAIL 16
-#define KPF_HUGE 17
-#define KPF_UNEVICTABLE 18
-#define KPF_HWPOISON 19
-#define KPF_NOPAGE 20
-#define KPF_KSM 21
-
-/* [32-] kernel hacking assistances */
-#define KPF_RESERVED 32
-#define KPF_MLOCKED 33
-#define KPF_MAPPEDTODISK 34
-#define KPF_PRIVATE 35
-#define KPF_PRIVATE_2 36
-#define KPF_OWNER_PRIVATE 37
-#define KPF_ARCH 38
-#define KPF_UNCACHED 39
-
-/* [48-] take some arbitrary free slots for expanding overloaded flags
- * not part of kernel API
- */
-#define KPF_READAHEAD 48
-#define KPF_SLOB_FREE 49
-#define KPF_SLUB_FROZEN 50
-#define KPF_SLUB_DEBUG 51
-
-#define KPF_ALL_BITS ((uint64_t)~0ULL)
-#define KPF_HACKERS_BITS (0xffffULL << 32)
-#define KPF_OVERLOADED_BITS (0xffffULL << 48)
-#define BIT(name) (1ULL << KPF_##name)
-#define BITS_COMPOUND (BIT(COMPOUND_HEAD) | BIT(COMPOUND_TAIL))
-
-static const char *page_flag_names[] = {
- [KPF_LOCKED] = "L:locked",
- [KPF_ERROR] = "E:error",
- [KPF_REFERENCED] = "R:referenced",
- [KPF_UPTODATE] = "U:uptodate",
- [KPF_DIRTY] = "D:dirty",
- [KPF_LRU] = "l:lru",
- [KPF_ACTIVE] = "A:active",
- [KPF_SLAB] = "S:slab",
- [KPF_WRITEBACK] = "W:writeback",
- [KPF_RECLAIM] = "I:reclaim",
- [KPF_BUDDY] = "B:buddy",
-
- [KPF_MMAP] = "M:mmap",
- [KPF_ANON] = "a:anonymous",
- [KPF_SWAPCACHE] = "s:swapcache",
- [KPF_SWAPBACKED] = "b:swapbacked",
- [KPF_COMPOUND_HEAD] = "H:compound_head",
- [KPF_COMPOUND_TAIL] = "T:compound_tail",
- [KPF_HUGE] = "G:huge",
- [KPF_UNEVICTABLE] = "u:unevictable",
- [KPF_HWPOISON] = "X:hwpoison",
- [KPF_NOPAGE] = "n:nopage",
- [KPF_KSM] = "x:ksm",
-
- [KPF_RESERVED] = "r:reserved",
- [KPF_MLOCKED] = "m:mlocked",
- [KPF_MAPPEDTODISK] = "d:mappedtodisk",
- [KPF_PRIVATE] = "P:private",
- [KPF_PRIVATE_2] = "p:private_2",
- [KPF_OWNER_PRIVATE] = "O:owner_private",
- [KPF_ARCH] = "h:arch",
- [KPF_UNCACHED] = "c:uncached",
-
- [KPF_READAHEAD] = "I:readahead",
- [KPF_SLOB_FREE] = "P:slob_free",
- [KPF_SLUB_FROZEN] = "A:slub_frozen",
- [KPF_SLUB_DEBUG] = "E:slub_debug",
-};
-
-
-/*
- * data structures
- */
-
-static int opt_raw; /* for kernel developers */
-static int opt_list; /* list pages (in ranges) */
-static int opt_no_summary; /* don't show summary */
-static pid_t opt_pid; /* process to walk */
-
-#define MAX_ADDR_RANGES 1024
-static int nr_addr_ranges;
-static unsigned long opt_offset[MAX_ADDR_RANGES];
-static unsigned long opt_size[MAX_ADDR_RANGES];
-
-#define MAX_VMAS 10240
-static int nr_vmas;
-static unsigned long pg_start[MAX_VMAS];
-static unsigned long pg_end[MAX_VMAS];
-
-#define MAX_BIT_FILTERS 64
-static int nr_bit_filters;
-static uint64_t opt_mask[MAX_BIT_FILTERS];
-static uint64_t opt_bits[MAX_BIT_FILTERS];
-
-static int page_size;
-
-static int pagemap_fd;
-static int kpageflags_fd;
-
-static int opt_hwpoison;
-static int opt_unpoison;
-
-static const char hwpoison_debug_fs[] = "/debug/hwpoison";
-static int hwpoison_inject_fd;
-static int hwpoison_forget_fd;
-
-#define HASH_SHIFT 13
-#define HASH_SIZE (1 << HASH_SHIFT)
-#define HASH_MASK (HASH_SIZE - 1)
-#define HASH_KEY(flags) (flags & HASH_MASK)
-
-static unsigned long total_pages;
-static unsigned long nr_pages[HASH_SIZE];
-static uint64_t page_flags[HASH_SIZE];
-
-
-/*
- * helper functions
- */
-
-#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
-
-#define min_t(type, x, y) ({ \
- type __min1 = (x); \
- type __min2 = (y); \
- __min1 < __min2 ? __min1 : __min2; })
-
-#define max_t(type, x, y) ({ \
- type __max1 = (x); \
- type __max2 = (y); \
- __max1 > __max2 ? __max1 : __max2; })
-
-static unsigned long pages2mb(unsigned long pages)
-{
- return (pages * page_size) >> 20;
-}
-
-static void fatal(const char *x, ...)
-{
- va_list ap;
-
- va_start(ap, x);
- vfprintf(stderr, x, ap);
- va_end(ap);
- exit(EXIT_FAILURE);
-}
-
-static int checked_open(const char *pathname, int flags)
-{
- int fd = open(pathname, flags);
-
- if (fd < 0) {
- perror(pathname);
- exit(EXIT_FAILURE);
- }
-
- return fd;
-}
-
-/*
- * pagemap/kpageflags routines
- */
-
-static unsigned long do_u64_read(int fd, char *name,
- uint64_t *buf,
- unsigned long index,
- unsigned long count)
-{
- long bytes;
-
- if (index > ULONG_MAX / 8)
- fatal("index overflow: %lu\n", index);
-
- if (lseek(fd, index * 8, SEEK_SET) < 0) {
- perror(name);
- exit(EXIT_FAILURE);
- }
-
- bytes = read(fd, buf, count * 8);
- if (bytes < 0) {
- perror(name);
- exit(EXIT_FAILURE);
- }
- if (bytes % 8)
- fatal("partial read: %lu bytes\n", bytes);
-
- return bytes / 8;
-}
-
-static unsigned long kpageflags_read(uint64_t *buf,
- unsigned long index,
- unsigned long pages)
-{
- return do_u64_read(kpageflags_fd, PROC_KPAGEFLAGS, buf, index, pages);
-}
-
-static unsigned long pagemap_read(uint64_t *buf,
- unsigned long index,
- unsigned long pages)
-{
- return do_u64_read(pagemap_fd, "/proc/pid/pagemap", buf, index, pages);
-}
-
-static unsigned long pagemap_pfn(uint64_t val)
-{
- unsigned long pfn;
-
- if (val & PM_PRESENT)
- pfn = PM_PFRAME(val);
- else
- pfn = 0;
-
- return pfn;
-}
-
-
-/*
- * page flag names
- */
-
-static char *page_flag_name(uint64_t flags)
-{
- static char buf[65];
- int present;
- int i, j;
-
- for (i = 0, j = 0; i < ARRAY_SIZE(page_flag_names); i++) {
- present = (flags >> i) & 1;
- if (!page_flag_names[i]) {
- if (present)
- fatal("unknown flag bit %d\n", i);
- continue;
- }
- buf[j++] = present ? page_flag_names[i][0] : '_';
- }
-
- return buf;
-}
-
-static char *page_flag_longname(uint64_t flags)
-{
- static char buf[1024];
- int i, n;
-
- for (i = 0, n = 0; i < ARRAY_SIZE(page_flag_names); i++) {
- if (!page_flag_names[i])
- continue;
- if ((flags >> i) & 1)
- n += snprintf(buf + n, sizeof(buf) - n, "%s,",
- page_flag_names[i] + 2);
- }
- if (n)
- n--;
- buf[n] = '\0';
-
- return buf;
-}
-
-
-/*
- * page list and summary
- */
-
-static void show_page_range(unsigned long voffset,
- unsigned long offset, uint64_t flags)
-{
- static uint64_t flags0;
- static unsigned long voff;
- static unsigned long index;
- static unsigned long count;
-
- if (flags == flags0 && offset == index + count &&
- (!opt_pid || voffset == voff + count)) {
- count++;
- return;
- }
-
- if (count) {
- if (opt_pid)
- printf("%lx\t", voff);
- printf("%lx\t%lx\t%s\n",
- index, count, page_flag_name(flags0));
- }
-
- flags0 = flags;
- index = offset;
- voff = voffset;
- count = 1;
-}
-
-static void show_page(unsigned long voffset,
- unsigned long offset, uint64_t flags)
-{
- if (opt_pid)
- printf("%lx\t", voffset);
- printf("%lx\t%s\n", offset, page_flag_name(flags));
-}
-
-static void show_summary(void)
-{
- int i;
-
- printf(" flags\tpage-count MB"
- " symbolic-flags\t\t\tlong-symbolic-flags\n");
-
- for (i = 0; i < ARRAY_SIZE(nr_pages); i++) {
- if (nr_pages[i])
- printf("0x%016llx\t%10lu %8lu %s\t%s\n",
- (unsigned long long)page_flags[i],
- nr_pages[i],
- pages2mb(nr_pages[i]),
- page_flag_name(page_flags[i]),
- page_flag_longname(page_flags[i]));
- }
-
- printf(" total\t%10lu %8lu\n",
- total_pages, pages2mb(total_pages));
-}
-
-
-/*
- * page flag filters
- */
-
-static int bit_mask_ok(uint64_t flags)
-{
- int i;
-
- for (i = 0; i < nr_bit_filters; i++) {
- if (opt_bits[i] == KPF_ALL_BITS) {
- if ((flags & opt_mask[i]) == 0)
- return 0;
- } else {
- if ((flags & opt_mask[i]) != opt_bits[i])
- return 0;
- }
- }
-
- return 1;
-}
-
-static uint64_t expand_overloaded_flags(uint64_t flags)
-{
- /* SLOB/SLUB overload several page flags */
- if (flags & BIT(SLAB)) {
- if (flags & BIT(PRIVATE))
- flags ^= BIT(PRIVATE) | BIT(SLOB_FREE);
- if (flags & BIT(ACTIVE))
- flags ^= BIT(ACTIVE) | BIT(SLUB_FROZEN);
- if (flags & BIT(ERROR))
- flags ^= BIT(ERROR) | BIT(SLUB_DEBUG);
- }
-
- /* PG_reclaim is overloaded as PG_readahead in the read path */
- if ((flags & (BIT(RECLAIM) | BIT(WRITEBACK))) == BIT(RECLAIM))
- flags ^= BIT(RECLAIM) | BIT(READAHEAD);
-
- return flags;
-}
-
-static uint64_t well_known_flags(uint64_t flags)
-{
- /* hide flags intended only for kernel hacker */
- flags &= ~KPF_HACKERS_BITS;
-
- /* hide non-hugeTLB compound pages */
- if ((flags & BITS_COMPOUND) && !(flags & BIT(HUGE)))
- flags &= ~BITS_COMPOUND;
-
- return flags;
-}
-
-static uint64_t kpageflags_flags(uint64_t flags)
-{
- flags = expand_overloaded_flags(flags);
-
- if (!opt_raw)
- flags = well_known_flags(flags);
-
- return flags;
-}
-
-/*
- * page actions
- */
-
-static void prepare_hwpoison_fd(void)
-{
- char buf[100];
-
- if (opt_hwpoison && !hwpoison_inject_fd) {
- sprintf(buf, "%s/corrupt-pfn", hwpoison_debug_fs);
- hwpoison_inject_fd = checked_open(buf, O_WRONLY);
- }
-
- if (opt_unpoison && !hwpoison_forget_fd) {
- sprintf(buf, "%s/renew-pfn", hwpoison_debug_fs);
- hwpoison_forget_fd = checked_open(buf, O_WRONLY);
- }
-}
-
-static int hwpoison_page(unsigned long offset)
-{
- char buf[100];
- int len;
-
- len = sprintf(buf, "0x%lx\n", offset);
- len = write(hwpoison_inject_fd, buf, len);
- if (len < 0) {
- perror("hwpoison inject");
- return len;
- }
- return 0;
-}
-
-static int unpoison_page(unsigned long offset)
-{
- char buf[100];
- int len;
-
- len = sprintf(buf, "0x%lx\n", offset);
- len = write(hwpoison_forget_fd, buf, len);
- if (len < 0) {
- perror("hwpoison forget");
- return len;
- }
- return 0;
-}
-
-/*
- * page frame walker
- */
-
-static int hash_slot(uint64_t flags)
-{
- int k = HASH_KEY(flags);
- int i;
-
- /* Explicitly reserve slot 0 for flags 0: the following logic
- * cannot distinguish an unoccupied slot from slot (flags==0).
- */
- if (flags == 0)
- return 0;
-
- /* search through the remaining (HASH_SIZE-1) slots */
- for (i = 1; i < ARRAY_SIZE(page_flags); i++, k++) {
- if (!k || k >= ARRAY_SIZE(page_flags))
- k = 1;
- if (page_flags[k] == 0) {
- page_flags[k] = flags;
- return k;
- }
- if (page_flags[k] == flags)
- return k;
- }
-
- fatal("hash table full: bump up HASH_SHIFT?\n");
- exit(EXIT_FAILURE);
-}
-
-static void add_page(unsigned long voffset,
- unsigned long offset, uint64_t flags)
-{
- flags = kpageflags_flags(flags);
-
- if (!bit_mask_ok(flags))
- return;
-
- if (opt_hwpoison)
- hwpoison_page(offset);
- if (opt_unpoison)
- unpoison_page(offset);
-
- if (opt_list == 1)
- show_page_range(voffset, offset, flags);
- else if (opt_list == 2)
- show_page(voffset, offset, flags);
-
- nr_pages[hash_slot(flags)]++;
- total_pages++;
-}
-
-#define KPAGEFLAGS_BATCH (64 << 10) /* 64k pages */
-static void walk_pfn(unsigned long voffset,
- unsigned long index,
- unsigned long count)
-{
- uint64_t buf[KPAGEFLAGS_BATCH];
- unsigned long batch;
- long pages;
- unsigned long i;
-
- while (count) {
- batch = min_t(unsigned long, count, KPAGEFLAGS_BATCH);
- pages = kpageflags_read(buf, index, batch);
- if (pages == 0)
- break;
-
- for (i = 0; i < pages; i++)
- add_page(voffset + i, index + i, buf[i]);
-
- index += pages;
- count -= pages;
- }
-}
-
-#define PAGEMAP_BATCH (64 << 10)
-static void walk_vma(unsigned long index, unsigned long count)
-{
- uint64_t buf[PAGEMAP_BATCH];
- unsigned long batch;
- unsigned long pages;
- unsigned long pfn;
- unsigned long i;
-
- while (count) {
- batch = min_t(unsigned long, count, PAGEMAP_BATCH);
- pages = pagemap_read(buf, index, batch);
- if (pages == 0)
- break;
-
- for (i = 0; i < pages; i++) {
- pfn = pagemap_pfn(buf[i]);
- if (pfn)
- walk_pfn(index + i, pfn, 1);
- }
-
- index += pages;
- count -= pages;
- }
-}
-
-static void walk_task(unsigned long index, unsigned long count)
-{
- const unsigned long end = index + count;
- unsigned long start;
- int i = 0;
-
- while (index < end) {
-
- while (pg_end[i] <= index)
- if (++i >= nr_vmas)
- return;
- if (pg_start[i] >= end)
- return;
-
- start = max_t(unsigned long, pg_start[i], index);
- index = min_t(unsigned long, pg_end[i], end);
-
- assert(start < index);
- walk_vma(start, index - start);
- }
-}
-
-static void add_addr_range(unsigned long offset, unsigned long size)
-{
- if (nr_addr_ranges >= MAX_ADDR_RANGES)
- fatal("too many addr ranges\n");
-
- opt_offset[nr_addr_ranges] = offset;
- opt_size[nr_addr_ranges] = min_t(unsigned long, size, ULONG_MAX-offset);
- nr_addr_ranges++;
-}
-
-static void walk_addr_ranges(void)
-{
- int i;
-
- kpageflags_fd = checked_open(PROC_KPAGEFLAGS, O_RDONLY);
-
- if (!nr_addr_ranges)
- add_addr_range(0, ULONG_MAX);
-
- for (i = 0; i < nr_addr_ranges; i++)
- if (!opt_pid)
- walk_pfn(0, opt_offset[i], opt_size[i]);
- else
- walk_task(opt_offset[i], opt_size[i]);
-
- close(kpageflags_fd);
-}
-
-
-/*
- * user interface
- */
-
-static const char *page_flag_type(uint64_t flag)
-{
- if (flag & KPF_HACKERS_BITS)
- return "(r)";
- if (flag & KPF_OVERLOADED_BITS)
- return "(o)";
- return " ";
-}
-
-static void usage(void)
-{
- int i, j;
-
- printf(
-"page-types [options]\n"
-" -r|--raw Raw mode, for kernel developers\n"
-" -d|--describe flags Describe flags\n"
-" -a|--addr addr-spec Walk a range of pages\n"
-" -b|--bits bits-spec Walk pages with specified bits\n"
-" -p|--pid pid Walk process address space\n"
-#if 0 /* planned features */
-" -f|--file filename Walk file address space\n"
-#endif
-" -l|--list Show page details in ranges\n"
-" -L|--list-each Show page details one by one\n"
-" -N|--no-summary Don't show summay info\n"
-" -X|--hwpoison hwpoison pages\n"
-" -x|--unpoison unpoison pages\n"
-" -h|--help Show this usage message\n"
-"flags:\n"
-" 0x10 bitfield format, e.g.\n"
-" anon bit-name, e.g.\n"
-" 0x10,anon comma-separated list, e.g.\n"
-"addr-spec:\n"
-" N one page at offset N (unit: pages)\n"
-" N+M pages range from N to N+M-1\n"
-" N,M pages range from N to M-1\n"
-" N, pages range from N to end\n"
-" ,M pages range from 0 to M-1\n"
-"bits-spec:\n"
-" bit1,bit2 (flags & (bit1|bit2)) != 0\n"
-" bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1\n"
-" bit1,~bit2 (flags & (bit1|bit2)) == bit1\n"
-" =bit1,bit2 flags == (bit1|bit2)\n"
-"bit-names:\n"
- );
-
- for (i = 0, j = 0; i < ARRAY_SIZE(page_flag_names); i++) {
- if (!page_flag_names[i])
- continue;
- printf("%16s%s", page_flag_names[i] + 2,
- page_flag_type(1ULL << i));
- if (++j > 3) {
- j = 0;
- putchar('\n');
- }
- }
- printf("\n "
- "(r) raw mode bits (o) overloaded bits\n");
-}
-
-static unsigned long long parse_number(const char *str)
-{
- unsigned long long n;
-
- n = strtoll(str, NULL, 0);
-
- if (n == 0 && str[0] != '0')
- fatal("invalid name or number: %s\n", str);
-
- return n;
-}
-
-static void parse_pid(const char *str)
-{
- FILE *file;
- char buf[5000];
-
- opt_pid = parse_number(str);
-
- sprintf(buf, "/proc/%d/pagemap", opt_pid);
- pagemap_fd = checked_open(buf, O_RDONLY);
-
- sprintf(buf, "/proc/%d/maps", opt_pid);
- file = fopen(buf, "r");
- if (!file) {
- perror(buf);
- exit(EXIT_FAILURE);
- }
-
- while (fgets(buf, sizeof(buf), file) != NULL) {
- unsigned long vm_start;
- unsigned long vm_end;
- unsigned long long pgoff;
- int major, minor;
- char r, w, x, s;
- unsigned long ino;
- int n;
-
- n = sscanf(buf, "%lx-%lx %c%c%c%c %llx %x:%x %lu",
- &vm_start,
- &vm_end,
- &r, &w, &x, &s,
- &pgoff,
- &major, &minor,
- &ino);
- if (n < 10) {
- fprintf(stderr, "unexpected line: %s\n", buf);
- continue;
- }
- pg_start[nr_vmas] = vm_start / page_size;
- pg_end[nr_vmas] = vm_end / page_size;
- if (++nr_vmas >= MAX_VMAS) {
- fprintf(stderr, "too many VMAs\n");
- break;
- }
- }
- fclose(file);
-}
-
-static void parse_file(const char *name)
-{
-}
-
-static void parse_addr_range(const char *optarg)
-{
- unsigned long offset;
- unsigned long size;
- char *p;
-
- p = strchr(optarg, ',');
- if (!p)
- p = strchr(optarg, '+');
-
- if (p == optarg) {
- offset = 0;
- size = parse_number(p + 1);
- } else if (p) {
- offset = parse_number(optarg);
- if (p[1] == '\0')
- size = ULONG_MAX;
- else {
- size = parse_number(p + 1);
- if (*p == ',') {
- if (size < offset)
- fatal("invalid range: %lu,%lu\n",
- offset, size);
- size -= offset;
- }
- }
- } else {
- offset = parse_number(optarg);
- size = 1;
- }
-
- add_addr_range(offset, size);
-}
-
-static void add_bits_filter(uint64_t mask, uint64_t bits)
-{
- if (nr_bit_filters >= MAX_BIT_FILTERS)
- fatal("too much bit filters\n");
-
- opt_mask[nr_bit_filters] = mask;
- opt_bits[nr_bit_filters] = bits;
- nr_bit_filters++;
-}
-
-static uint64_t parse_flag_name(const char *str, int len)
-{
- int i;
-
- if (!*str || !len)
- return 0;
-
- if (len <= 8 && !strncmp(str, "compound", len))
- return BITS_COMPOUND;
-
- for (i = 0; i < ARRAY_SIZE(page_flag_names); i++) {
- if (!page_flag_names[i])
- continue;
- if (!strncmp(str, page_flag_names[i] + 2, len))
- return 1ULL << i;
- }
-
- return parse_number(str);
-}
-
-static uint64_t parse_flag_names(const char *str, int all)
-{
- const char *p = str;
- uint64_t flags = 0;
-
- while (1) {
- if (*p == ',' || *p == '=' || *p == '\0') {
- if ((*str != '~') || (*str == '~' && all && *++str))
- flags |= parse_flag_name(str, p - str);
- if (*p != ',')
- break;
- str = p + 1;
- }
- p++;
- }
-
- return flags;
-}
-
-static void parse_bits_mask(const char *optarg)
-{
- uint64_t mask;
- uint64_t bits;
- const char *p;
-
- p = strchr(optarg, '=');
- if (p == optarg) {
- mask = KPF_ALL_BITS;
- bits = parse_flag_names(p + 1, 0);
- } else if (p) {
- mask = parse_flag_names(optarg, 0);
- bits = parse_flag_names(p + 1, 0);
- } else if (strchr(optarg, '~')) {
- mask = parse_flag_names(optarg, 1);
- bits = parse_flag_names(optarg, 0);
- } else {
- mask = parse_flag_names(optarg, 0);
- bits = KPF_ALL_BITS;
- }
-
- add_bits_filter(mask, bits);
-}
-
-static void describe_flags(const char *optarg)
-{
- uint64_t flags = parse_flag_names(optarg, 0);
-
- printf("0x%016llx\t%s\t%s\n",
- (unsigned long long)flags,
- page_flag_name(flags),
- page_flag_longname(flags));
-}
-
-static const struct option opts[] = {
- { "raw" , 0, NULL, 'r' },
- { "pid" , 1, NULL, 'p' },
- { "file" , 1, NULL, 'f' },
- { "addr" , 1, NULL, 'a' },
- { "bits" , 1, NULL, 'b' },
- { "describe" , 1, NULL, 'd' },
- { "list" , 0, NULL, 'l' },
- { "list-each" , 0, NULL, 'L' },
- { "no-summary", 0, NULL, 'N' },
- { "hwpoison" , 0, NULL, 'X' },
- { "unpoison" , 0, NULL, 'x' },
- { "help" , 0, NULL, 'h' },
- { NULL , 0, NULL, 0 }
-};
-
-int main(int argc, char *argv[])
-{
- int c;
-
- page_size = getpagesize();
-
- while ((c = getopt_long(argc, argv,
- "rp:f:a:b:d:lLNXxh", opts, NULL)) != -1) {
- switch (c) {
- case 'r':
- opt_raw = 1;
- break;
- case 'p':
- parse_pid(optarg);
- break;
- case 'f':
- parse_file(optarg);
- break;
- case 'a':
- parse_addr_range(optarg);
- break;
- case 'b':
- parse_bits_mask(optarg);
- break;
- case 'd':
- describe_flags(optarg);
- exit(0);
- case 'l':
- opt_list = 1;
- break;
- case 'L':
- opt_list = 2;
- break;
- case 'N':
- opt_no_summary = 1;
- break;
- case 'X':
- opt_hwpoison = 1;
- prepare_hwpoison_fd();
- break;
- case 'x':
- opt_unpoison = 1;
- prepare_hwpoison_fd();
- break;
- case 'h':
- usage();
- exit(0);
- default:
- usage();
- exit(1);
- }
- }
-
- if (opt_list && opt_pid)
- printf("voffset\t");
- if (opt_list == 1)
- printf("offset\tlen\tflags\n");
- if (opt_list == 2)
- printf("offset\tflags\n");
-
- walk_addr_ranges();
-
- if (opt_list == 1)
- show_page_range(0, 0, 0); /* drain the buffer */
-
- if (opt_no_summary)
- return 0;
-
- if (opt_list)
- printf("\n\n");
-
- show_summary();
-
- return 0;
-}
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index df09b9650a8..5948e455c4d 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -15,8 +15,9 @@ There are three components to pagemap:
* Bits 0-54 page frame number (PFN) if present
* Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped
- * Bits 55-60 page shift (page size = 1<<page shift)
- * Bit 61 reserved for future use
+ * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
+ * Bits 56-60 zero
+ * Bit 61 page is file-page or shared-anon
* Bit 62 page swapped
* Bit 63 page present
@@ -60,6 +61,7 @@ There are three components to pagemap:
19. HWPOISON
20. NOPAGE
21. KSM
+ 22. THP
Short descriptions to the page flags:
@@ -97,6 +99,9 @@ Short descriptions to the page flags:
21. KSM
identical memory pages dynamically shared between one or more processes
+22. THP
+ contiguous pages which construct transparent hugepages
+
[IO related page flags]
1. ERROR IO error occurred
3. UPTODATE page has up-to-date data
@@ -143,5 +148,5 @@ once.
Other notes:
Reading from any of the files will return -EINVAL if you are not starting
-the read on an 8-byte boundary (e.g., if you seeked an odd number of bytes
+the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
into the file), or if the size of the read is not a multiple of 8 bytes.
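+
+As an illustration of the pagemap entry bit layout shown earlier in this
+file, a minimal, hypothetical snippet that decodes a single 64-bit entry
+read from /proc/pid/pagemap:
+
+	#include <stdint.h>
+	#include <stdio.h>
+
+	static void decode_pagemap_entry(uint64_t e)
+	{
+		int present     = (e >> 63) & 1;
+		int swapped     = (e >> 62) & 1;
+		int file_shared = (e >> 61) & 1; /* file-page or shared-anon */
+		int soft_dirty  = (e >> 55) & 1;
+
+		if (present)
+			printf("pfn 0x%llx",
+			       (unsigned long long)(e & ((1ULL << 55) - 1)));
+		else if (swapped)
+			printf("swap type %llu offset 0x%llx",
+			       (unsigned long long)(e & 0x1f),
+			       (unsigned long long)((e >> 5) & ((1ULL << 50) - 1)));
+		printf(" soft-dirty=%d file/shared-anon=%d\n",
+		       soft_dirty, file_shared);
+	}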
diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.txt
new file mode 100644
index 00000000000..560e4363a55
--- /dev/null
+++ b/Documentation/vm/remap_file_pages.txt
@@ -0,0 +1,28 @@
+The remap_file_pages() system call is used to create a nonlinear mapping,
+that is, a mapping in which the pages of the file are mapped into a
+nonsequential order in memory. The advantage of using remap_file_pages()
+over using repeated calls to mmap(2) is that the former approach does not
+require the kernel to create additional VMA (Virtual Memory Area) data
+structures.
+
+Supporting nonlinear mappings requires a significant amount of non-trivial
+code in the kernel virtual memory subsystem, including hot paths. Also, to
+make nonlinear mappings work, the kernel needs a way to distinguish normal
+page table entries from entries with a file offset (pte_file). The kernel
+reserves a flag in the PTE for this purpose. PTE flags are a scarce
+resource, especially on some CPU architectures. It would be nice to free up
+the flag for other uses.
+
+Fortunately, there are not many users of remap_file_pages() in the wild.
+The only known user is an enterprise RDBMS implementation that uses the
+syscall on 32-bit systems to map files bigger than can be linearly fitted
+into the 32-bit virtual address space. This use case is no longer critical
+since 64-bit systems are widely available.
+
+The plan is to deprecate the syscall and replace it with an emulation.
+The emulation will create new VMAs instead of nonlinear mappings. It will
+be slower for the rare users of remap_file_pages(), but the ABI is
+preserved.
+
+One side effect of the emulation (apart from performance) is that users can
+hit the vm.max_map_count limit more easily due to the additional VMAs. See
+the comment on DEFAULT_MAX_MAP_COUNT for more details on the limit.
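+
+For reference, a minimal, hypothetical user of the syscall: it maps the
+first two pages of a file (which is assumed to be at least two pages long)
+and then rebinds them in swapped order, producing a nonlinear mapping:
+
+	#define _GNU_SOURCE
+	#include <stdio.h>
+	#include <unistd.h>
+	#include <fcntl.h>
+	#include <sys/mman.h>
+
+	int main(int argc, char **argv)
+	{
+		long psz = sysconf(_SC_PAGESIZE);
+		int fd;
+		char *p;
+
+		if (argc < 2)
+			return 1;
+		fd = open(argv[1], O_RDWR);
+		if (fd < 0) {
+			perror("open");
+			return 1;
+		}
+		/* The window must be a shared file mapping. */
+		p = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE,
+			 MAP_SHARED, fd, 0);
+		if (p == MAP_FAILED) {
+			perror("mmap");
+			return 1;
+		}
+		/* Rebind window page 0 to file page 1 and window page 1
+		 * to file page 0 -- a nonlinear mapping. */
+		if (remap_file_pages(p, psz, 0, 1, 0) ||
+		    remap_file_pages(p + psz, psz, 0, 0, 0)) {
+			perror("remap_file_pages");
+			return 1;
+		}
+		return 0;
+	}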
diff --git a/Documentation/vm/slabinfo.c b/Documentation/vm/slabinfo.c
deleted file mode 100644
index 92e729f4b67..00000000000
--- a/Documentation/vm/slabinfo.c
+++ /dev/null
@@ -1,1364 +0,0 @@
-/*
- * Slabinfo: Tool to get reports about slabs
- *
- * (C) 2007 sgi, Christoph Lameter
- *
- * Compile by:
- *
- * gcc -o slabinfo slabinfo.c
- */
-#include <stdio.h>
-#include <stdlib.h>
-#include <sys/types.h>
-#include <dirent.h>
-#include <strings.h>
-#include <string.h>
-#include <unistd.h>
-#include <stdarg.h>
-#include <getopt.h>
-#include <regex.h>
-#include <errno.h>
-
-#define MAX_SLABS 500
-#define MAX_ALIASES 500
-#define MAX_NODES 1024
-
-struct slabinfo {
- char *name;
- int alias;
- int refs;
- int aliases, align, cache_dma, cpu_slabs, destroy_by_rcu;
- int hwcache_align, object_size, objs_per_slab;
- int sanity_checks, slab_size, store_user, trace;
- int order, poison, reclaim_account, red_zone;
- unsigned long partial, objects, slabs, objects_partial, objects_total;
- unsigned long alloc_fastpath, alloc_slowpath;
- unsigned long free_fastpath, free_slowpath;
- unsigned long free_frozen, free_add_partial, free_remove_partial;
- unsigned long alloc_from_partial, alloc_slab, free_slab, alloc_refill;
- unsigned long cpuslab_flush, deactivate_full, deactivate_empty;
- unsigned long deactivate_to_head, deactivate_to_tail;
- unsigned long deactivate_remote_frees, order_fallback;
- int numa[MAX_NODES];
- int numa_partial[MAX_NODES];
-} slabinfo[MAX_SLABS];
-
-struct aliasinfo {
- char *name;
- char *ref;
- struct slabinfo *slab;
-} aliasinfo[MAX_ALIASES];
-
-int slabs = 0;
-int actual_slabs = 0;
-int aliases = 0;
-int alias_targets = 0;
-int highest_node = 0;
-
-char buffer[4096];
-
-int show_empty = 0;
-int show_report = 0;
-int show_alias = 0;
-int show_slab = 0;
-int skip_zero = 1;
-int show_numa = 0;
-int show_track = 0;
-int show_first_alias = 0;
-int validate = 0;
-int shrink = 0;
-int show_inverted = 0;
-int show_single_ref = 0;
-int show_totals = 0;
-int sort_size = 0;
-int sort_active = 0;
-int set_debug = 0;
-int show_ops = 0;
-int show_activity = 0;
-
-/* Debug options */
-int sanity = 0;
-int redzone = 0;
-int poison = 0;
-int tracking = 0;
-int tracing = 0;
-
-int page_size;
-
-regex_t pattern;
-
-static void fatal(const char *x, ...)
-{
- va_list ap;
-
- va_start(ap, x);
- vfprintf(stderr, x, ap);
- va_end(ap);
- exit(EXIT_FAILURE);
-}
-
-static void usage(void)
-{
- printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
- "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
- "-a|--aliases Show aliases\n"
- "-A|--activity Most active slabs first\n"
- "-d<options>|--debug=<options> Set/Clear Debug options\n"
- "-D|--display-active Switch line format to activity\n"
- "-e|--empty Show empty slabs\n"
- "-f|--first-alias Show first alias\n"
- "-h|--help Show usage information\n"
- "-i|--inverted Inverted list\n"
- "-l|--slabs Show slabs\n"
- "-n|--numa Show NUMA information\n"
- "-o|--ops Show kmem_cache_ops\n"
- "-s|--shrink Shrink slabs\n"
- "-r|--report Detailed report on single slabs\n"
- "-S|--Size Sort by size\n"
- "-t|--tracking Show alloc/free information\n"
- "-T|--Totals Show summary information\n"
- "-v|--validate Validate slabs\n"
- "-z|--zero Include empty slabs\n"
- "-1|--1ref Single reference\n"
- "\nValid debug options (FZPUT may be combined)\n"
- "a / A Switch on all debug options (=FZUP)\n"
- "- Switch off all debug options\n"
- "f / F Sanity Checks (SLAB_DEBUG_FREE)\n"
- "z / Z Redzoning\n"
- "p / P Poisoning\n"
- "u / U Tracking\n"
- "t / T Tracing\n"
- );
-}
-
-static unsigned long read_obj(const char *name)
-{
- FILE *f = fopen(name, "r");
-
- if (!f)
- buffer[0] = 0;
- else {
- if (!fgets(buffer, sizeof(buffer), f))
- buffer[0] = 0;
- fclose(f);
- if (buffer[strlen(buffer)] == '\n')
- buffer[strlen(buffer)] = 0;
- }
- return strlen(buffer);
-}
-
-
-/*
- * Get the contents of an attribute
- */
-static unsigned long get_obj(const char *name)
-{
- if (!read_obj(name))
- return 0;
-
- return atol(buffer);
-}
-
-static unsigned long get_obj_and_str(const char *name, char **x)
-{
- unsigned long result = 0;
- char *p;
-
- *x = NULL;
-
- if (!read_obj(name)) {
- x = NULL;
- return 0;
- }
- result = strtoul(buffer, &p, 10);
- while (*p == ' ')
- p++;
- if (*p)
- *x = strdup(p);
- return result;
-}
-
-static void set_obj(struct slabinfo *s, const char *name, int n)
-{
- char x[100];
- FILE *f;
-
- snprintf(x, 100, "%s/%s", s->name, name);
- f = fopen(x, "w");
- if (!f)
- fatal("Cannot write to %s\n", x);
-
- fprintf(f, "%d\n", n);
- fclose(f);
-}
-
-static unsigned long read_slab_obj(struct slabinfo *s, const char *name)
-{
- char x[100];
- FILE *f;
- size_t l;
-
- snprintf(x, 100, "%s/%s", s->name, name);
- f = fopen(x, "r");
- if (!f) {
- buffer[0] = 0;
- l = 0;
- } else {
- l = fread(buffer, 1, sizeof(buffer), f);
- buffer[l] = 0;
- fclose(f);
- }
- return l;
-}
-
-
-/*
- * Put a size string together
- */
-static int store_size(char *buffer, unsigned long value)
-{
- unsigned long divisor = 1;
- char trailer = 0;
- int n;
-
- if (value > 1000000000UL) {
- divisor = 100000000UL;
- trailer = 'G';
- } else if (value > 1000000UL) {
- divisor = 100000UL;
- trailer = 'M';
- } else if (value > 1000UL) {
- divisor = 100;
- trailer = 'K';
- }
-
- value /= divisor;
- n = sprintf(buffer, "%ld",value);
- if (trailer) {
- buffer[n] = trailer;
- n++;
- buffer[n] = 0;
- }
- if (divisor != 1) {
- memmove(buffer + n - 2, buffer + n - 3, 4);
- buffer[n-2] = '.';
- n++;
- }
- return n;
-}
-
-static void decode_numa_list(int *numa, char *t)
-{
- int node;
- int nr;
-
- memset(numa, 0, MAX_NODES * sizeof(int));
-
- if (!t)
- return;
-
- while (*t == 'N') {
- t++;
- node = strtoul(t, &t, 10);
- if (*t == '=') {
- t++;
- nr = strtoul(t, &t, 10);
- numa[node] = nr;
- if (node > highest_node)
- highest_node = node;
- }
- while (*t == ' ')
- t++;
- }
-}
-
-static void slab_validate(struct slabinfo *s)
-{
- if (strcmp(s->name, "*") == 0)
- return;
-
- set_obj(s, "validate", 1);
-}
-
-static void slab_shrink(struct slabinfo *s)
-{
- if (strcmp(s->name, "*") == 0)
- return;
-
- set_obj(s, "shrink", 1);
-}
-
-int line = 0;
-
-static void first_line(void)
-{
- if (show_activity)
- printf("Name Objects Alloc Free %%Fast Fallb O\n");
- else
- printf("Name Objects Objsize Space "
- "Slabs/Part/Cpu O/S O %%Fr %%Ef Flg\n");
-}
-
-/*
- * Find the shortest alias of a slab
- */
-static struct aliasinfo *find_one_alias(struct slabinfo *find)
-{
- struct aliasinfo *a;
- struct aliasinfo *best = NULL;
-
- for(a = aliasinfo;a < aliasinfo + aliases; a++) {
- if (a->slab == find &&
- (!best || strlen(best->name) < strlen(a->name))) {
- best = a;
- if (strncmp(a->name,"kmall", 5) == 0)
- return best;
- }
- }
- return best;
-}
-
-static unsigned long slab_size(struct slabinfo *s)
-{
- return s->slabs * (page_size << s->order);
-}
-
-static unsigned long slab_activity(struct slabinfo *s)
-{
- return s->alloc_fastpath + s->free_fastpath +
- s->alloc_slowpath + s->free_slowpath;
-}
-
-static void slab_numa(struct slabinfo *s, int mode)
-{
- int node;
-
- if (strcmp(s->name, "*") == 0)
- return;
-
- if (!highest_node) {
- printf("\n%s: No NUMA information available.\n", s->name);
- return;
- }
-
- if (skip_zero && !s->slabs)
- return;
-
- if (!line) {
- printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
- for(node = 0; node <= highest_node; node++)
- printf(" %4d", node);
- printf("\n----------------------");
- for(node = 0; node <= highest_node; node++)
- printf("-----");
- printf("\n");
- }
- printf("%-21s ", mode ? "All slabs" : s->name);
- for(node = 0; node <= highest_node; node++) {
- char b[20];
-
- store_size(b, s->numa[node]);
- printf(" %4s", b);
- }
- printf("\n");
- if (mode) {
- printf("%-21s ", "Partial slabs");
- for(node = 0; node <= highest_node; node++) {
- char b[20];
-
- store_size(b, s->numa_partial[node]);
- printf(" %4s", b);
- }
- printf("\n");
- }
- line++;
-}
-
-static void show_tracking(struct slabinfo *s)
-{
- printf("\n%s: Kernel object allocation\n", s->name);
- printf("-----------------------------------------------------------------------\n");
- if (read_slab_obj(s, "alloc_calls"))
- printf(buffer);
- else
- printf("No Data\n");
-
- printf("\n%s: Kernel object freeing\n", s->name);
- printf("------------------------------------------------------------------------\n");
- if (read_slab_obj(s, "free_calls"))
- printf(buffer);
- else
- printf("No Data\n");
-
-}
-
-static void ops(struct slabinfo *s)
-{
- if (strcmp(s->name, "*") == 0)
- return;
-
- if (read_slab_obj(s, "ops")) {
- printf("\n%s: kmem_cache operations\n", s->name);
- printf("--------------------------------------------\n");
- printf(buffer);
- } else
- printf("\n%s has no kmem_cache operations\n", s->name);
-}
-
-static const char *onoff(int x)
-{
- if (x)
- return "On ";
- return "Off";
-}
-
-static void slab_stats(struct slabinfo *s)
-{
- unsigned long total_alloc;
- unsigned long total_free;
- unsigned long total;
-
- if (!s->alloc_slab)
- return;
-
- total_alloc = s->alloc_fastpath + s->alloc_slowpath;
- total_free = s->free_fastpath + s->free_slowpath;
-
- if (!total_alloc)
- return;
-
- printf("\n");
- printf("Slab Perf Counter Alloc Free %%Al %%Fr\n");
- printf("--------------------------------------------------\n");
- printf("Fastpath %8lu %8lu %3lu %3lu\n",
- s->alloc_fastpath, s->free_fastpath,
- s->alloc_fastpath * 100 / total_alloc,
- s->free_fastpath * 100 / total_free);
- printf("Slowpath %8lu %8lu %3lu %3lu\n",
- total_alloc - s->alloc_fastpath, s->free_slowpath,
- (total_alloc - s->alloc_fastpath) * 100 / total_alloc,
- s->free_slowpath * 100 / total_free);
- printf("Page Alloc %8lu %8lu %3lu %3lu\n",
- s->alloc_slab, s->free_slab,
- s->alloc_slab * 100 / total_alloc,
- s->free_slab * 100 / total_free);
- printf("Add partial %8lu %8lu %3lu %3lu\n",
- s->deactivate_to_head + s->deactivate_to_tail,
- s->free_add_partial,
- (s->deactivate_to_head + s->deactivate_to_tail) * 100 / total_alloc,
- s->free_add_partial * 100 / total_free);
- printf("Remove partial %8lu %8lu %3lu %3lu\n",
- s->alloc_from_partial, s->free_remove_partial,
- s->alloc_from_partial * 100 / total_alloc,
- s->free_remove_partial * 100 / total_free);
-
- printf("RemoteObj/SlabFrozen %8lu %8lu %3lu %3lu\n",
- s->deactivate_remote_frees, s->free_frozen,
- s->deactivate_remote_frees * 100 / total_alloc,
- s->free_frozen * 100 / total_free);
-
- printf("Total %8lu %8lu\n\n", total_alloc, total_free);
-
- if (s->cpuslab_flush)
- printf("Flushes %8lu\n", s->cpuslab_flush);
-
- if (s->alloc_refill)
- printf("Refill %8lu\n", s->alloc_refill);
-
- total = s->deactivate_full + s->deactivate_empty +
- s->deactivate_to_head + s->deactivate_to_tail;
-
- if (total)
- printf("Deactivate Full=%lu(%lu%%) Empty=%lu(%lu%%) "
- "ToHead=%lu(%lu%%) ToTail=%lu(%lu%%)\n",
- s->deactivate_full, (s->deactivate_full * 100) / total,
- s->deactivate_empty, (s->deactivate_empty * 100) / total,
- s->deactivate_to_head, (s->deactivate_to_head * 100) / total,
- s->deactivate_to_tail, (s->deactivate_to_tail * 100) / total);
-}
-
-static void report(struct slabinfo *s)
-{
- if (strcmp(s->name, "*") == 0)
- return;
-
- printf("\nSlabcache: %-20s Aliases: %2d Order : %2d Objects: %lu\n",
- s->name, s->aliases, s->order, s->objects);
- if (s->hwcache_align)
- printf("** Hardware cacheline aligned\n");
- if (s->cache_dma)
- printf("** Memory is allocated in a special DMA zone\n");
- if (s->destroy_by_rcu)
- printf("** Slabs are destroyed via RCU\n");
- if (s->reclaim_account)
- printf("** Reclaim accounting active\n");
-
- printf("\nSizes (bytes) Slabs Debug Memory\n");
- printf("------------------------------------------------------------------------\n");
- printf("Object : %7d Total : %7ld Sanity Checks : %s Total: %7ld\n",
- s->object_size, s->slabs, onoff(s->sanity_checks),
- s->slabs * (page_size << s->order));
- printf("SlabObj: %7d Full : %7ld Redzoning : %s Used : %7ld\n",
- s->slab_size, s->slabs - s->partial - s->cpu_slabs,
- onoff(s->red_zone), s->objects * s->object_size);
- printf("SlabSiz: %7d Partial: %7ld Poisoning : %s Loss : %7ld\n",
- page_size << s->order, s->partial, onoff(s->poison),
- s->slabs * (page_size << s->order) - s->objects * s->object_size);
- printf("Loss : %7d CpuSlab: %7d Tracking : %s Lalig: %7ld\n",
- s->slab_size - s->object_size, s->cpu_slabs, onoff(s->store_user),
- (s->slab_size - s->object_size) * s->objects);
- printf("Align : %7d Objects: %7d Tracing : %s Lpadd: %7ld\n",
- s->align, s->objs_per_slab, onoff(s->trace),
- ((page_size << s->order) - s->objs_per_slab * s->slab_size) *
- s->slabs);
-
- ops(s);
- show_tracking(s);
- slab_numa(s, 1);
- slab_stats(s);
-}
-
-static void slabcache(struct slabinfo *s)
-{
- char size_str[20];
- char dist_str[40];
- char flags[20];
- char *p = flags;
-
- if (strcmp(s->name, "*") == 0)
- return;
-
- if (actual_slabs == 1) {
- report(s);
- return;
- }
-
- if (skip_zero && !show_empty && !s->slabs)
- return;
-
- if (show_empty && s->slabs)
- return;
-
- store_size(size_str, slab_size(s));
- snprintf(dist_str, 40, "%lu/%lu/%d", s->slabs - s->cpu_slabs,
- s->partial, s->cpu_slabs);
-
- if (!line++)
- first_line();
-
- if (s->aliases)
- *p++ = '*';
- if (s->cache_dma)
- *p++ = 'd';
- if (s->hwcache_align)
- *p++ = 'A';
- if (s->poison)
- *p++ = 'P';
- if (s->reclaim_account)
- *p++ = 'a';
- if (s->red_zone)
- *p++ = 'Z';
- if (s->sanity_checks)
- *p++ = 'F';
- if (s->store_user)
- *p++ = 'U';
- if (s->trace)
- *p++ = 'T';
-
- *p = 0;
- if (show_activity) {
- unsigned long total_alloc;
- unsigned long total_free;
-
- total_alloc = s->alloc_fastpath + s->alloc_slowpath;
- total_free = s->free_fastpath + s->free_slowpath;
-
- printf("%-21s %8ld %10ld %10ld %3ld %3ld %5ld %1d\n",
- s->name, s->objects,
- total_alloc, total_free,
- total_alloc ? (s->alloc_fastpath * 100 / total_alloc) : 0,
- total_free ? (s->free_fastpath * 100 / total_free) : 0,
- s->order_fallback, s->order);
- }
- else
- printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
- s->name, s->objects, s->object_size, size_str, dist_str,
- s->objs_per_slab, s->order,
- s->slabs ? (s->partial * 100) / s->slabs : 100,
- s->slabs ? (s->objects * s->object_size * 100) /
- (s->slabs * (page_size << s->order)) : 100,
- flags);
-}
-
-/*
- * Analyze debug options. Return false if something is amiss.
- */
-static int debug_opt_scan(char *opt)
-{
- if (!opt || !opt[0] || strcmp(opt, "-") == 0)
- return 1;
-
- if (strcasecmp(opt, "a") == 0) {
- sanity = 1;
- poison = 1;
- redzone = 1;
- tracking = 1;
- return 1;
- }
-
- for ( ; *opt; opt++)
- switch (*opt) {
- case 'F' : case 'f':
- if (sanity)
- return 0;
- sanity = 1;
- break;
- case 'P' : case 'p':
- if (poison)
- return 0;
- poison = 1;
- break;
-
- case 'Z' : case 'z':
- if (redzone)
- return 0;
- redzone = 1;
- break;
-
- case 'U' : case 'u':
- if (tracking)
- return 0;
- tracking = 1;
- break;
-
- case 'T' : case 't':
- if (tracing)
- return 0;
- tracing = 1;
- break;
- default:
- return 0;
- }
- return 1;
-}
-
-static int slab_empty(struct slabinfo *s)
-{
- if (s->objects > 0)
- return 0;
-
- /*
- * We may still have slabs even if there are no objects. Shrinking will
- * remove them.
- */
- if (s->slabs != 0)
- set_obj(s, "shrink", 1);
-
- return 1;
-}
-
-static void slab_debug(struct slabinfo *s)
-{
- if (strcmp(s->name, "*") == 0)
- return;
-
- if (sanity && !s->sanity_checks) {
- set_obj(s, "sanity", 1);
- }
- if (!sanity && s->sanity_checks) {
- if (slab_empty(s))
- set_obj(s, "sanity", 0);
- else
- fprintf(stderr, "%s not empty cannot disable sanity checks\n", s->name);
- }
- if (redzone && !s->red_zone) {
- if (slab_empty(s))
- set_obj(s, "red_zone", 1);
- else
- fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name);
- }
- if (!redzone && s->red_zone) {
- if (slab_empty(s))
- set_obj(s, "red_zone", 0);
- else
- fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name);
- }
- if (poison && !s->poison) {
- if (slab_empty(s))
- set_obj(s, "poison", 1);
- else
- fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name);
- }
- if (!poison && s->poison) {
- if (slab_empty(s))
- set_obj(s, "poison", 0);
- else
- fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name);
- }
- if (tracking && !s->store_user) {
- if (slab_empty(s))
- set_obj(s, "store_user", 1);
- else
- fprintf(stderr, "%s not empty cannot enable tracking\n", s->name);
- }
- if (!tracking && s->store_user) {
- if (slab_empty(s))
- set_obj(s, "store_user", 0);
- else
- fprintf(stderr, "%s not empty cannot disable tracking\n", s->name);
- }
- if (tracing && !s->trace) {
- if (slabs == 1)
- set_obj(s, "trace", 1);
- else
- fprintf(stderr, "%s can only enable trace for one slab at a time\n", s->name);
- }
- if (!tracing && s->trace)
- set_obj(s, "trace", 1);
-}
-
-static void totals(void)
-{
- struct slabinfo *s;
-
- int used_slabs = 0;
- char b1[20], b2[20], b3[20], b4[20];
- unsigned long long max = 1ULL << 63;
-
- /* Object size */
- unsigned long long min_objsize = max, max_objsize = 0, avg_objsize;
-
- /* Number of partial slabs in a slabcache */
- unsigned long long min_partial = max, max_partial = 0,
- avg_partial, total_partial = 0;
-
- /* Number of slabs in a slab cache */
- unsigned long long min_slabs = max, max_slabs = 0,
- avg_slabs, total_slabs = 0;
-
- /* Size of the whole slab */
- unsigned long long min_size = max, max_size = 0,
- avg_size, total_size = 0;
-
- /* Bytes used for object storage in a slab */
- unsigned long long min_used = max, max_used = 0,
- avg_used, total_used = 0;
-
- /* Waste: Bytes used for alignment and padding */
- unsigned long long min_waste = max, max_waste = 0,
- avg_waste, total_waste = 0;
- /* Number of objects in a slab */
- unsigned long long min_objects = max, max_objects = 0,
- avg_objects, total_objects = 0;
- /* Waste per object */
- unsigned long long min_objwaste = max,
- max_objwaste = 0, avg_objwaste,
- total_objwaste = 0;
-
- /* Memory per object */
- unsigned long long min_memobj = max,
- max_memobj = 0, avg_memobj,
- total_objsize = 0;
-
- /* Percentage of partial slabs per slab */
- unsigned long min_ppart = 100, max_ppart = 0,
- avg_ppart, total_ppart = 0;
-
- /* Number of objects in partial slabs */
- unsigned long min_partobj = max, max_partobj = 0,
- avg_partobj, total_partobj = 0;
-
- /* Percentage of partial objects of all objects in a slab */
- unsigned long min_ppartobj = 100, max_ppartobj = 0,
- avg_ppartobj, total_ppartobj = 0;
-
-
- for (s = slabinfo; s < slabinfo + slabs; s++) {
- unsigned long long size;
- unsigned long used;
- unsigned long long wasted;
- unsigned long long objwaste;
- unsigned long percentage_partial_slabs;
- unsigned long percentage_partial_objs;
-
- if (!s->slabs || !s->objects)
- continue;
-
- used_slabs++;
-
- size = slab_size(s);
- used = s->objects * s->object_size;
- wasted = size - used;
- objwaste = s->slab_size - s->object_size;
-
- percentage_partial_slabs = s->partial * 100 / s->slabs;
- if (percentage_partial_slabs > 100)
- percentage_partial_slabs = 100;
-
- percentage_partial_objs = s->objects_partial * 100
- / s->objects;
-
- if (percentage_partial_objs > 100)
- percentage_partial_objs = 100;
-
- if (s->object_size < min_objsize)
- min_objsize = s->object_size;
- if (s->partial < min_partial)
- min_partial = s->partial;
- if (s->slabs < min_slabs)
- min_slabs = s->slabs;
- if (size < min_size)
- min_size = size;
- if (wasted < min_waste)
- min_waste = wasted;
- if (objwaste < min_objwaste)
- min_objwaste = objwaste;
- if (s->objects < min_objects)
- min_objects = s->objects;
- if (used < min_used)
- min_used = used;
- if (s->objects_partial < min_partobj)
- min_partobj = s->objects_partial;
- if (percentage_partial_slabs < min_ppart)
- min_ppart = percentage_partial_slabs;
- if (percentage_partial_objs < min_ppartobj)
- min_ppartobj = percentage_partial_objs;
- if (s->slab_size < min_memobj)
- min_memobj = s->slab_size;
-
- if (s->object_size > max_objsize)
- max_objsize = s->object_size;
- if (s->partial > max_partial)
- max_partial = s->partial;
- if (s->slabs > max_slabs)
- max_slabs = s->slabs;
- if (size > max_size)
- max_size = size;
- if (wasted > max_waste)
- max_waste = wasted;
- if (objwaste > max_objwaste)
- max_objwaste = objwaste;
- if (s->objects > max_objects)
- max_objects = s->objects;
- if (used > max_used)
- max_used = used;
- if (s->objects_partial > max_partobj)
- max_partobj = s->objects_partial;
- if (percentage_partial_slabs > max_ppart)
- max_ppart = percentage_partial_slabs;
- if (percentage_partial_objs > max_ppartobj)
- max_ppartobj = percentage_partial_objs;
- if (s->slab_size > max_memobj)
- max_memobj = s->slab_size;
-
- total_partial += s->partial;
- total_slabs += s->slabs;
- total_size += size;
- total_waste += wasted;
-
- total_objects += s->objects;
- total_used += used;
- total_partobj += s->objects_partial;
- total_ppart += percentage_partial_slabs;
- total_ppartobj += percentage_partial_objs;
-
- total_objwaste += s->objects * objwaste;
- total_objsize += s->objects * s->slab_size;
- }
-
- if (!total_objects) {
- printf("No objects\n");
- return;
- }
- if (!used_slabs) {
- printf("No slabs\n");
- return;
- }
-
- /* Per slab averages */
- avg_partial = total_partial / used_slabs;
- avg_slabs = total_slabs / used_slabs;
- avg_size = total_size / used_slabs;
- avg_waste = total_waste / used_slabs;
-
- avg_objects = total_objects / used_slabs;
- avg_used = total_used / used_slabs;
- avg_partobj = total_partobj / used_slabs;
- avg_ppart = total_ppart / used_slabs;
- avg_ppartobj = total_ppartobj / used_slabs;
-
- /* Per object object sizes */
- avg_objsize = total_used / total_objects;
- avg_objwaste = total_objwaste / total_objects;
- avg_partobj = total_partobj * 100 / total_objects;
- avg_memobj = total_objsize / total_objects;
-
- printf("Slabcache Totals\n");
- printf("----------------\n");
- printf("Slabcaches : %3d Aliases : %3d->%-3d Active: %3d\n",
- slabs, aliases, alias_targets, used_slabs);
-
- store_size(b1, total_size);store_size(b2, total_waste);
- store_size(b3, total_waste * 100 / total_used);
- printf("Memory used: %6s # Loss : %6s MRatio:%6s%%\n", b1, b2, b3);
-
- store_size(b1, total_objects);store_size(b2, total_partobj);
- store_size(b3, total_partobj * 100 / total_objects);
- printf("# Objects : %6s # PartObj: %6s ORatio:%6s%%\n", b1, b2, b3);
-
- printf("\n");
- printf("Per Cache Average Min Max Total\n");
- printf("---------------------------------------------------------\n");
-
- store_size(b1, avg_objects);store_size(b2, min_objects);
- store_size(b3, max_objects);store_size(b4, total_objects);
- printf("#Objects %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_slabs);store_size(b2, min_slabs);
- store_size(b3, max_slabs);store_size(b4, total_slabs);
- printf("#Slabs %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_partial);store_size(b2, min_partial);
- store_size(b3, max_partial);store_size(b4, total_partial);
- printf("#PartSlab %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
- store_size(b1, avg_ppart);store_size(b2, min_ppart);
- store_size(b3, max_ppart);
- store_size(b4, total_partial * 100 / total_slabs);
- printf("%%PartSlab%10s%% %10s%% %10s%% %10s%%\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_partobj);store_size(b2, min_partobj);
- store_size(b3, max_partobj);
- store_size(b4, total_partobj);
- printf("PartObjs %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_ppartobj);store_size(b2, min_ppartobj);
- store_size(b3, max_ppartobj);
- store_size(b4, total_partobj * 100 / total_objects);
- printf("%% PartObj%10s%% %10s%% %10s%% %10s%%\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_size);store_size(b2, min_size);
- store_size(b3, max_size);store_size(b4, total_size);
- printf("Memory %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_used);store_size(b2, min_used);
- store_size(b3, max_used);store_size(b4, total_used);
- printf("Used %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_waste);store_size(b2, min_waste);
- store_size(b3, max_waste);store_size(b4, total_waste);
- printf("Loss %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- printf("\n");
- printf("Per Object Average Min Max\n");
- printf("---------------------------------------------\n");
-
- store_size(b1, avg_memobj);store_size(b2, min_memobj);
- store_size(b3, max_memobj);
- printf("Memory %10s %10s %10s\n",
- b1, b2, b3);
- store_size(b1, avg_objsize);store_size(b2, min_objsize);
- store_size(b3, max_objsize);
- printf("User %10s %10s %10s\n",
- b1, b2, b3);
-
- store_size(b1, avg_objwaste);store_size(b2, min_objwaste);
- store_size(b3, max_objwaste);
- printf("Loss %10s %10s %10s\n",
- b1, b2, b3);
-}
-
-static void sort_slabs(void)
-{
- struct slabinfo *s1,*s2;
-
- for (s1 = slabinfo; s1 < slabinfo + slabs; s1++) {
- for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) {
- int result;
-
- if (sort_size)
- result = slab_size(s1) < slab_size(s2);
- else if (sort_active)
- result = slab_activity(s1) < slab_activity(s2);
- else
- result = strcasecmp(s1->name, s2->name);
-
- if (show_inverted)
- result = -result;
-
- if (result > 0) {
- struct slabinfo t;
-
- memcpy(&t, s1, sizeof(struct slabinfo));
- memcpy(s1, s2, sizeof(struct slabinfo));
- memcpy(s2, &t, sizeof(struct slabinfo));
- }
- }
- }
-}
-
-static void sort_aliases(void)
-{
- struct aliasinfo *a1,*a2;
-
- for (a1 = aliasinfo; a1 < aliasinfo + aliases; a1++) {
- for (a2 = a1 + 1; a2 < aliasinfo + aliases; a2++) {
- char *n1, *n2;
-
- n1 = a1->name;
- n2 = a2->name;
- if (show_alias && !show_inverted) {
- n1 = a1->ref;
- n2 = a2->ref;
- }
- if (strcasecmp(n1, n2) > 0) {
- struct aliasinfo t;
-
- memcpy(&t, a1, sizeof(struct aliasinfo));
- memcpy(a1, a2, sizeof(struct aliasinfo));
- memcpy(a2, &t, sizeof(struct aliasinfo));
- }
- }
- }
-}
-
-static void link_slabs(void)
-{
- struct aliasinfo *a;
- struct slabinfo *s;
-
- for (a = aliasinfo; a < aliasinfo + aliases; a++) {
-
- for (s = slabinfo; s < slabinfo + slabs; s++)
- if (strcmp(a->ref, s->name) == 0) {
- a->slab = s;
- s->refs++;
- break;
- }
- if (s == slabinfo + slabs)
- fatal("Unresolved alias %s\n", a->ref);
- }
-}
-
-static void alias(void)
-{
- struct aliasinfo *a;
- char *active = NULL;
-
- sort_aliases();
- link_slabs();
-
- for(a = aliasinfo; a < aliasinfo + aliases; a++) {
-
- if (!show_single_ref && a->slab->refs == 1)
- continue;
-
- if (!show_inverted) {
- if (active) {
- if (strcmp(a->slab->name, active) == 0) {
- printf(" %s", a->name);
- continue;
- }
- }
- printf("\n%-12s <- %s", a->slab->name, a->name);
- active = a->slab->name;
- }
- else
- printf("%-20s -> %s\n", a->name, a->slab->name);
- }
- if (active)
- printf("\n");
-}
-
-
-static void rename_slabs(void)
-{
- struct slabinfo *s;
- struct aliasinfo *a;
-
- for (s = slabinfo; s < slabinfo + slabs; s++) {
- if (*s->name != ':')
- continue;
-
- if (s->refs > 1 && !show_first_alias)
- continue;
-
- a = find_one_alias(s);
-
- if (a)
- s->name = a->name;
- else {
- s->name = "*";
- actual_slabs--;
- }
- }
-}
-
-static int slab_mismatch(char *slab)
-{
- return regexec(&pattern, slab, 0, NULL, 0);
-}
-
-static void read_slab_dir(void)
-{
- DIR *dir;
- struct dirent *de;
- struct slabinfo *slab = slabinfo;
- struct aliasinfo *alias = aliasinfo;
- char *p;
- char *t;
- int count;
-
- if (chdir("/sys/kernel/slab") && chdir("/sys/slab"))
- fatal("SYSFS support for SLUB not active\n");
-
- dir = opendir(".");
- while ((de = readdir(dir))) {
- if (de->d_name[0] == '.' ||
- (de->d_name[0] != ':' && slab_mismatch(de->d_name)))
- continue;
- switch (de->d_type) {
- case DT_LNK:
- alias->name = strdup(de->d_name);
- count = readlink(de->d_name, buffer, sizeof(buffer));
-
- if (count < 0)
- fatal("Cannot read symlink %s\n", de->d_name);
-
- buffer[count] = 0;
- p = buffer + count;
- while (p > buffer && p[-1] != '/')
- p--;
- alias->ref = strdup(p);
- alias++;
- break;
- case DT_DIR:
- if (chdir(de->d_name))
- fatal("Unable to access slab %s\n", slab->name);
- slab->name = strdup(de->d_name);
- slab->alias = 0;
- slab->refs = 0;
- slab->aliases = get_obj("aliases");
- slab->align = get_obj("align");
- slab->cache_dma = get_obj("cache_dma");
- slab->cpu_slabs = get_obj("cpu_slabs");
- slab->destroy_by_rcu = get_obj("destroy_by_rcu");
- slab->hwcache_align = get_obj("hwcache_align");
- slab->object_size = get_obj("object_size");
- slab->objects = get_obj("objects");
- slab->objects_partial = get_obj("objects_partial");
- slab->objects_total = get_obj("objects_total");
- slab->objs_per_slab = get_obj("objs_per_slab");
- slab->order = get_obj("order");
- slab->partial = get_obj("partial");
- slab->partial = get_obj_and_str("partial", &t);
- decode_numa_list(slab->numa_partial, t);
- free(t);
- slab->poison = get_obj("poison");
- slab->reclaim_account = get_obj("reclaim_account");
- slab->red_zone = get_obj("red_zone");
- slab->sanity_checks = get_obj("sanity_checks");
- slab->slab_size = get_obj("slab_size");
- slab->slabs = get_obj_and_str("slabs", &t);
- decode_numa_list(slab->numa, t);
- free(t);
- slab->store_user = get_obj("store_user");
- slab->trace = get_obj("trace");
- slab->alloc_fastpath = get_obj("alloc_fastpath");
- slab->alloc_slowpath = get_obj("alloc_slowpath");
- slab->free_fastpath = get_obj("free_fastpath");
- slab->free_slowpath = get_obj("free_slowpath");
- slab->free_frozen= get_obj("free_frozen");
- slab->free_add_partial = get_obj("free_add_partial");
- slab->free_remove_partial = get_obj("free_remove_partial");
- slab->alloc_from_partial = get_obj("alloc_from_partial");
- slab->alloc_slab = get_obj("alloc_slab");
- slab->alloc_refill = get_obj("alloc_refill");
- slab->free_slab = get_obj("free_slab");
- slab->cpuslab_flush = get_obj("cpuslab_flush");
- slab->deactivate_full = get_obj("deactivate_full");
- slab->deactivate_empty = get_obj("deactivate_empty");
- slab->deactivate_to_head = get_obj("deactivate_to_head");
- slab->deactivate_to_tail = get_obj("deactivate_to_tail");
- slab->deactivate_remote_frees = get_obj("deactivate_remote_frees");
- slab->order_fallback = get_obj("order_fallback");
- chdir("..");
- if (slab->name[0] == ':')
- alias_targets++;
- slab++;
- break;
- default :
- fatal("Unknown file type %lx\n", de->d_type);
- }
- }
- closedir(dir);
- slabs = slab - slabinfo;
- actual_slabs = slabs;
- aliases = alias - aliasinfo;
- if (slabs > MAX_SLABS)
- fatal("Too many slabs\n");
- if (aliases > MAX_ALIASES)
- fatal("Too many aliases\n");
-}
-
-static void output_slabs(void)
-{
- struct slabinfo *slab;
-
- for (slab = slabinfo; slab < slabinfo + slabs; slab++) {
-
- if (slab->alias)
- continue;
-
-
- if (show_numa)
- slab_numa(slab, 0);
- else if (show_track)
- show_tracking(slab);
- else if (validate)
- slab_validate(slab);
- else if (shrink)
- slab_shrink(slab);
- else if (set_debug)
- slab_debug(slab);
- else if (show_ops)
- ops(slab);
- else if (show_slab)
- slabcache(slab);
- else if (show_report)
- report(slab);
- }
-}
-
-struct option opts[] = {
- { "aliases", 0, NULL, 'a' },
- { "activity", 0, NULL, 'A' },
- { "debug", 2, NULL, 'd' },
- { "display-activity", 0, NULL, 'D' },
- { "empty", 0, NULL, 'e' },
- { "first-alias", 0, NULL, 'f' },
- { "help", 0, NULL, 'h' },
- { "inverted", 0, NULL, 'i'},
- { "numa", 0, NULL, 'n' },
- { "ops", 0, NULL, 'o' },
- { "report", 0, NULL, 'r' },
- { "shrink", 0, NULL, 's' },
- { "slabs", 0, NULL, 'l' },
- { "track", 0, NULL, 't'},
- { "validate", 0, NULL, 'v' },
- { "zero", 0, NULL, 'z' },
- { "1ref", 0, NULL, '1'},
- { NULL, 0, NULL, 0 }
-};
-
-int main(int argc, char *argv[])
-{
- int c;
- int err;
- char *pattern_source;
-
- page_size = getpagesize();
-
- while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTS",
- opts, NULL)) != -1)
- switch (c) {
- case '1':
- show_single_ref = 1;
- break;
- case 'a':
- show_alias = 1;
- break;
- case 'A':
- sort_active = 1;
- break;
- case 'd':
- set_debug = 1;
- if (!debug_opt_scan(optarg))
- fatal("Invalid debug option '%s'\n", optarg);
- break;
- case 'D':
- show_activity = 1;
- break;
- case 'e':
- show_empty = 1;
- break;
- case 'f':
- show_first_alias = 1;
- break;
- case 'h':
- usage();
- return 0;
- case 'i':
- show_inverted = 1;
- break;
- case 'n':
- show_numa = 1;
- break;
- case 'o':
- show_ops = 1;
- break;
- case 'r':
- show_report = 1;
- break;
- case 's':
- shrink = 1;
- break;
- case 'l':
- show_slab = 1;
- break;
- case 't':
- show_track = 1;
- break;
- case 'v':
- validate = 1;
- break;
- case 'z':
- skip_zero = 0;
- break;
- case 'T':
- show_totals = 1;
- break;
- case 'S':
- sort_size = 1;
- break;
-
- default:
- fatal("%s: Invalid option '%c'\n", argv[0], optopt);
-
- }
-
- if (!show_slab && !show_alias && !show_track && !show_report
- && !validate && !shrink && !set_debug && !show_ops)
- show_slab = 1;
-
- if (argc > optind)
- pattern_source = argv[optind];
- else
- pattern_source = ".*";
-
- err = regcomp(&pattern, pattern_source, REG_ICASE|REG_NOSUB);
- if (err)
- fatal("%s: Invalid pattern '%s' code %d\n",
- argv[0], pattern_source, err);
- read_slab_dir();
- if (show_alias)
- alias();
- else
- if (show_totals)
- totals();
- else {
- link_slabs();
- rename_slabs();
- sort_slabs();
- output_slabs();
- }
- return 0;
-}
diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt
index b37300edf27..b0c6d1bbb43 100644
--- a/Documentation/vm/slub.txt
+++ b/Documentation/vm/slub.txt
@@ -17,7 +17,7 @@ data and perform operation on the slabs. By default slabinfo only lists
slabs that have data in them. See "slabinfo -h" for more options when
running the command. slabinfo can be compiled with
-gcc -o slabinfo Documentation/vm/slabinfo.c
+gcc -o slabinfo tools/vm/slabinfo.c
Some of the modes of operation of slabinfo require that slub debugging
be enabled on the command line. F.e. no tracking information will be
@@ -41,6 +41,7 @@ Possible debug options are
P Poisoning (object and padding)
U User tracking (free and alloc)
T Trace (please only use on single slabs)
+ A Toggle failslab filter mark for the cache
O Switch debugging off for caches that would have
caused higher minimum slab orders
- Switch all debugging off (useful if the kernel is
@@ -116,7 +117,7 @@ can be influenced by kernel parameters:
slub_min_objects=x (default 4)
slub_min_order=x (default 0)
-slub_max_order=x (default 1)
+slub_max_order=x (default 3 (PAGE_ALLOC_COSTLY_ORDER))
slub_min_objects allows to specify how many objects must at least fit
into one slab in order for the allocation order to be acceptable.
@@ -130,7 +131,10 @@ slub_min_objects.
slub_max_order specified the order at which slub_min_objects should no
longer be checked. This is useful to avoid SLUB trying to generate
super large order pages to fit slub_min_objects of a slab cache with
-large object sizes into one high order page.
+large object sizes into one high order page. Setting the command line
+parameter debug_guardpage_minorder=N (N > 0) forces slub_max_order
+to 0, which causes slabs to be allocated at the minimum possible
+order.
SLUB Debug output
-----------------
diff --git a/Documentation/vm/soft-dirty.txt b/Documentation/vm/soft-dirty.txt
new file mode 100644
index 00000000000..55684d11a1e
--- /dev/null
+++ b/Documentation/vm/soft-dirty.txt
@@ -0,0 +1,43 @@
+ SOFT-DIRTY PTEs
+
+ Soft-dirty is a bit on a PTE which helps to track which pages a task
+writes to. To do this tracking, one should
+
+ 1. Clear soft-dirty bits from the task's PTEs.
+
+ This is done by writing "4" into the /proc/PID/clear_refs file of the
+ task in question.
+
+ 2. Wait some time.
+
+ 3. Read soft-dirty bits from the PTEs.
+
+    This is done by reading from /proc/PID/pagemap. Bit 55 of each
+    64-bit entry is the soft-dirty bit. If it is set, the respective PTE was
+    written to since step 1 (see the sketch below).
+
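+ A minimal userspace sketch of these steps (illustrative only: the helper
+names are ours, error handling is mostly omitted, and the task is assumed
+to be inspecting its own mappings):
+
+	#include <fcntl.h>
+	#include <stdint.h>
+	#include <unistd.h>
+
+	/* Step 1: clear the soft-dirty bits of the calling task. */
+	static void clear_soft_dirty(void)
+	{
+		int fd = open("/proc/self/clear_refs", O_WRONLY);
+
+		write(fd, "4", 1);
+		close(fd);
+	}
+
+	/* Step 3: read bit 55 of the pagemap entry covering 'addr'. */
+	static int page_soft_dirty(void *addr)
+	{
+		uint64_t entry;
+		long psize = sysconf(_SC_PAGESIZE);
+		off_t off = ((uintptr_t)addr / psize) * sizeof(entry);
+		int fd = open("/proc/self/pagemap", O_RDONLY);
+		ssize_t n;
+
+		if (fd < 0)
+			return -1;
+		n = pread(fd, &entry, sizeof(entry), off);
+		close(fd);
+		if (n != (ssize_t)sizeof(entry))
+			return -1;
+		return (entry >> 55) & 1;
+	}
+
+ Step 2 is simply whatever the task does (or does not do) to the page of
+interest between the two calls.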
+
+ Internally, to do this tracking, the writable bit is cleared from PTEs
+when the soft-dirty bit is cleared. So, after this, when the task tries to
+modify a page at some virtual address the #PF occurs and the kernel sets
+the soft-dirty bit on the respective PTE.
+
+ Note that although the whole of the task's address space is marked as r/o
+after the soft-dirty bits are cleared, the #PFs that occur after that are
+processed quickly. This is because the pages are still mapped to physical
+memory, so all the kernel does is notice this fact and set both the writable
+and the soft-dirty bits on the PTE.
+
+ While in most cases tracking memory changes by #PFs is more than enough,
+there is still a scenario in which we can lose soft-dirty bits -- a task
+unmaps a previously mapped memory region and then maps a new one at exactly
+the same place. When unmap is called, the kernel internally clears PTE values,
+including soft-dirty bits. To notify a user space application about such
+memory region renewal the kernel always marks new memory regions (and
+expanded regions) as soft-dirty.
+
+ This feature is actively used by the checkpoint-restore project. You
+can find more details about it on http://criu.org
+
+
+-- Pavel Emelyanov, Apr 9, 2013
diff --git a/Documentation/vm/split_page_table_lock b/Documentation/vm/split_page_table_lock
new file mode 100644
index 00000000000..6dea4fd5c96
--- /dev/null
+++ b/Documentation/vm/split_page_table_lock
@@ -0,0 +1,94 @@
+Split page table lock
+=====================
+
+Originally, mm->page_table_lock spinlock protected all page tables of the
+mm_struct. But this approach leads to poor page fault scalability of
+multi-threaded applications due to high contention on the lock. To improve
+scalability, split page table lock was introduced.
+
+With split page table lock we have a separate per-table lock to serialize
+access to the table. At the moment we use the split lock for PTE and PMD
+tables. Access to higher level tables is protected by mm->page_table_lock.
+
+There are helpers to lock/unlock a table and other accessor functions (a
+usage sketch follows the list):
+ - pte_offset_map_lock()
+ maps pte and takes PTE table lock, returns pointer to the taken
+	maps the pte and takes the PTE table lock, returning the mapped pte;
+	the lock pointer is returned through the last argument;
+ unlocks and unmaps PTE table;
+ - pte_alloc_map_lock()
+	allocates the PTE table if needed and takes the lock; returns the
+	mapped pte, or NULL if the allocation failed;
+ - pte_lockptr()
+ returns pointer to PTE table lock;
+ - pmd_lock()
+ takes PMD table lock, returns pointer to taken lock;
+ - pmd_lockptr()
+ returns pointer to PMD table lock;
+
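+A minimal sketch of the usual pattern built on these helpers (kernel-internal
+code; it assumes the caller holds mmap_sem and that [addr, end) lies within
+the single PTE table mapped by 'pmd'):
+
+	static int count_present_ptes(struct mm_struct *mm, pmd_t *pmd,
+				      unsigned long addr, unsigned long end)
+	{
+		spinlock_t *ptl;
+		pte_t *pte;
+		int ret = 0;
+
+		/* map the PTE table and take its split lock in one go */
+		pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+		do {
+			if (pte_present(*pte))
+				ret++;
+		} while (pte++, addr += PAGE_SIZE, addr != end);
+		/* drop the lock and the temporary mapping together */
+		pte_unmap_unlock(pte - 1, ptl);
+
+		return ret;
+	}
+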
+Split page table lock for PTE tables is enabled at compile time if
+CONFIG_SPLIT_PTLOCK_CPUS (usually 4) is less than or equal to NR_CPUS.
+If split lock is disabled, all tables are guarded by mm->page_table_lock.
+
+Split page table lock for PMD tables is enabled if it's enabled for PTE
+tables and the architecture supports it (see below).
+
+Hugetlb and split page table lock
+---------------------------------
+
+Hugetlb can support several page sizes. We use split lock only for PMD
+level, but not for PUD.
+
+Hugetlb-specific helpers (a usage sketch follows the list):
+ - huge_pte_lock()
+ takes pmd split lock for PMD_SIZE page, mm->page_table_lock
+ otherwise;
+ - huge_pte_lockptr()
+ returns pointer to table lock;
+
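+A hedged sketch of reading a hugetlb entry under the proper lock ('ptep' is
+assumed to come from a hugetlb walk such as huge_pte_offset(); the helper
+name is illustrative):
+
+	static pte_t read_huge_pte(struct vm_area_struct *vma, pte_t *ptep)
+	{
+		spinlock_t *ptl;
+		pte_t entry;
+
+		/* PMD-sized pages use the split lock, larger sizes
+		 * fall back to mm->page_table_lock */
+		ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
+		entry = huge_ptep_get(ptep);
+		spin_unlock(ptl);
+
+		return entry;
+	}
+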
+Support of split page table lock by an architecture
+---------------------------------------------------
+
+There's no need to specially enable the PTE split page table lock:
+everything required is done by pgtable_page_ctor() and pgtable_page_dtor(),
+which must be called on PTE table allocation / freeing.
+
+Make sure the architecture doesn't use the slab allocator for page table
+allocation: slab uses page->slab_cache and page->first_page for its pages.
+These fields share storage with page->ptl.
+
+PMD split lock only makes sense if you have more than two page table
+levels.
+
+PMD split lock enabling requires pgtable_pmd_page_ctor() call on PMD table
+allocation and pgtable_pmd_page_dtor() on freeing.
+
+Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and
+pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing
+paths: e.g. X86_PAE preallocates a few PMDs in pgd_alloc().
+
+With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK.
+
+NOTE: pgtable_page_ctor() and pgtable_pmd_page_ctor() can fail -- such
+failures must be handled properly.
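+
+As an illustration only (not any particular architecture's real code, and
+assuming pgtable_t is struct page * as on x86), the PTE allocation path
+typically looks like:
+
+	pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
+	{
+		struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+
+		if (!page)
+			return NULL;
+		if (!pgtable_page_ctor(page)) {		/* sets up page->ptl */
+			__free_page(page);
+			return NULL;
+		}
+		return page;
+	}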
+
+page->ptl
+---------
+
+page->ptl is used to access the split page table lock, where 'page' is the
+struct page of the page containing the table. It shares storage with
+page->private (and a few other fields in the union).
+
+To avoid increasing the size of struct page and to get the best performance,
+we use a trick:
+ - if spinlock_t fits into long, we use page->ptl as the spinlock, so we
+   can avoid indirect access and save a cache line.
+ - if the size of spinlock_t is bigger than the size of long, we use page->ptl
+   as a pointer to spinlock_t and allocate it dynamically. This allows using
+   the split lock with DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC enabled, but costs
+   one more cache line for indirect access;
+
+The spinlock_t is allocated in pgtable_page_ctor() for PTE tables and in
+pgtable_pmd_page_ctor() for PMD tables.
+
+Please, never access page->ptl directly -- use an appropriate helper.
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
new file mode 100644
index 00000000000..6b31cfbe2a9
--- /dev/null
+++ b/Documentation/vm/transhuge.txt
@@ -0,0 +1,376 @@
+= Transparent Hugepage Support =
+
+== Objective ==
+
+Performance critical computing applications dealing with large memory
+working sets are already running on top of libhugetlbfs and in turn
+hugetlbfs. Transparent Hugepage Support is an alternative means of
+backing virtual memory with huge pages; it supports the automatic
+promotion and demotion of page sizes and avoids the shortcomings of
+hugetlbfs.
+
+Currently it only works for anonymous memory mappings but in the
+future it can expand over the pagecache layer starting with tmpfs.
+
+The reason applications are running faster is because of two
+factors. The first factor is almost completely irrelevant and not of
+significant interest, because it also has the downside of requiring
+larger clear-page and copy-page operations in page faults, which is a
+potentially negative effect. This first factor consists of taking a
+single page fault for each 2M virtual region touched by userland (thus
+reducing the enter/exit kernel frequency by a factor of 512). This
+only matters the first time the memory is accessed for the lifetime of
+a memory mapping. The second long lasting and much more important
+factor will affect all subsequent accesses to the memory for the whole
+runtime of the application. The second factor consists of two
+components: 1) the TLB miss will run faster (especially with
+virtualization using nested pagetables but almost always also on bare
+metal without virtualization) and 2) a single TLB entry will be
+mapping a much larger amount of virtual memory in turn reducing the
+number of TLB misses. With virtualization and nested pagetables the
+TLB can map a larger size only if both KVM and the Linux guest
+are using hugepages, but a significant speedup already happens if only
+one of the two is using hugepages, just because the TLB
+miss is going to run faster.
+
+== Design ==
+
+- "graceful fallback": mm components which don't have transparent
+ hugepage knowledge fall back to breaking a transparent hugepage and
+ working on the regular pages and their respective regular pmd/pte
+ mappings
+
+- if a hugepage allocation fails because of memory fragmentation,
+ regular pages should be gracefully allocated instead and mixed in
+ the same vma without any failure or significant delay and without
+ userland noticing
+
+- if some task quits and more hugepages become available (either
+ immediately in the buddy or through the VM), guest physical memory
+ backed by regular pages should be relocated on hugepages
+ automatically (with khugepaged)
+
+- it doesn't require memory reservation and in turn it uses hugepages
+ whenever possible (the only possible reservation here is kernelcore=
+  to avoid unmovable pages fragmenting all the memory, but such a tweak
+ is not specific to transparent hugepage support and it's a generic
+ feature that applies to all dynamic high order allocations in the
+ kernel)
+
+- this initial support only offers the feature in the anonymous memory
+ regions but it'd be ideal to move it to tmpfs and the pagecache
+ later
+
+Transparent Hugepage Support maximizes the usefulness of free memory
+if compared to the reservation approach of hugetlbfs by allowing all
+unused memory to be used as cache or other movable (or even unmovable)
+entities. It doesn't require reservation to prevent hugepage
+allocation failures from being noticeable from userland. It allows paging
+and all other advanced VM features to be available on the
+hugepages. It requires no modifications for applications to take
+advantage of it.
+
+Applications however can be further optimized to take advantage of
+this feature, like for example they've been optimized before to avoid
+a flood of mmap system calls for every malloc(4k). Optimizing userland
+is by far not mandatory and khugepaged already can take care of long
+lived page allocations even for hugepage unaware applications that
+deal with large amounts of memory.
+
+In certain cases when hugepages are enabled system wide, an application
+may end up allocating more memory resources. An application may mmap a
+large region but only touch 1 byte of it; in that case a 2M page might
+be allocated instead of a 4k page for no good reason. This is why it's
+possible to disable hugepages system-wide and to only have them inside
+MADV_HUGEPAGE madvise regions.
+
+Embedded systems should enable hugepages only inside madvise regions
+to eliminate any risk of wasting any precious byte of memory and to
+only run faster.
+
+Applications that get a lot of benefit from hugepages and that don't
+risk losing memory by using hugepages should use
+madvise(MADV_HUGEPAGE) on their critical mmapped regions.
+
+== sysfs ==
+
+Transparent Hugepage Support can be entirely disabled (mostly for
+debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to
+avoid the risk of consuming more memory resources) or enabled system
+wide. This can be achieved with one of:
+
+echo always >/sys/kernel/mm/transparent_hugepage/enabled
+echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
+echo never >/sys/kernel/mm/transparent_hugepage/enabled
+
+It's also possible to limit the VM's defrag effort to generate
+hugepages, when they're not immediately free, to madvise regions only,
+or to never try to defrag memory and simply fall back to regular pages
+unless hugepages are immediately available. Clearly if we spend CPU
+time to defrag memory, we would expect to gain even more by the fact
+we use hugepages later instead of regular pages. This isn't always
+guaranteed, but it may be more likely in case the allocation is for a
+MADV_HUGEPAGE region.
+
+echo always >/sys/kernel/mm/transparent_hugepage/defrag
+echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
+echo never >/sys/kernel/mm/transparent_hugepage/defrag
+
+By default the kernel tries to use the huge zero page on read page faults.
+It's possible to disable huge zero page by writing 0 or enable it
+back by writing 1:
+
+echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
+echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
+
+khugepaged will be automatically started when
+transparent_hugepage/enabled is set to "always" or "madvise", and it'll
+be automatically shutdown if it's set to "never".
+
+khugepaged usually runs at low frequency, so while one may not want to
+invoke defrag algorithms synchronously during the page faults, it
+should be worth invoking defrag at least in khugepaged. However it's
+also possible to disable defrag in khugepaged by writing 0 or enable
+defrag in khugepaged by writing 1:
+
+echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
+echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
+
+You can also control how many pages khugepaged should scan at each
+pass:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
+
+and how many milliseconds to wait in khugepaged between each pass (you
+can set this to 0 to run khugepaged at 100% utilization of one core):
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
+
+and how many milliseconds to wait in khugepaged if there's an hugepage
+allocation failure to throttle the next allocation attempt.
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
+
+The khugepaged progress can be seen in the number of pages collapsed:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
+
+for each pass:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
+
+== Boot parameter ==
+
+You can change the sysfs boot time defaults of Transparent Hugepage
+Support by passing the parameter "transparent_hugepage=always" or
+"transparent_hugepage=madvise" or "transparent_hugepage=never"
+(without "") to the kernel command line.
+
+== Need of application restart ==
+
+The transparent_hugepage/enabled values only affect future
+behavior. So to make them effective you need to restart any
+application that could have been using hugepages. This also applies to
+the regions registered in khugepaged.
+
+== Monitoring usage ==
+
+The number of transparent huge pages currently used by the system is
+available by reading the AnonHugePages field in /proc/meminfo. To
+identify what applications are using transparent huge pages, it is
+necessary to read /proc/PID/smaps and count the AnonHugePages fields
+for each mapping. Note that reading the smaps file is expensive and
+reading it frequently will incur overhead.
+
+There are a number of counters in /proc/vmstat that may be used to
+monitor how successfully the system is providing huge pages for use.
+
+thp_fault_alloc is incremented every time a huge page is successfully
+ allocated to handle a page fault. This applies to both the
+ first time a page is faulted and for COW faults.
+
+thp_collapse_alloc is incremented by khugepaged when it has found
+ a range of pages to collapse into one huge page and has
+ successfully allocated a new huge page to store the data.
+
+thp_fault_fallback is incremented if a page fault fails to allocate
+ a huge page and instead falls back to using small pages.
+
+thp_collapse_alloc_failed is incremented if khugepaged found a range
+ of pages that should be collapsed into one huge page but failed
+ the allocation.
+
+thp_split is incremented every time a huge page is split into base
+ pages. This can happen for a variety of reasons but a common
+ reason is that a huge page is old and is being reclaimed.
+
+thp_zero_page_alloc is incremented every time a huge zero page is
+	successfully allocated. It includes allocations which were
+	dropped due to a race with other allocations. Note, it doesn't count
+ every map of the huge zero page, only its allocation.
+
+thp_zero_page_alloc_failed is incremented if the kernel fails to allocate
+	the huge zero page and falls back to using small pages.
+
+As the system ages, allocating huge pages may be expensive as the
+system uses memory compaction to copy data around memory to free a
+huge page for use. There are some counters in /proc/vmstat to help
+monitor this overhead.
+
+compact_stall is incremented every time a process stalls to run
+ memory compaction so that a huge page is free for use.
+
+compact_success is incremented if the system compacted memory and
+ freed a huge page for use.
+
+compact_fail is incremented if the system tries to compact memory
+ but failed.
+
+compact_pages_moved is incremented each time a page is moved. If
+ this value is increasing rapidly, it implies that the system
+ is copying a lot of data to satisfy the huge page allocation.
+ It is possible that the cost of copying exceeds any savings
+ from reduced TLB misses.
+
+compact_pagemigrate_failed is incremented when the underlying mechanism
+ for moving a page failed.
+
+compact_blocks_moved is incremented each time memory compaction examines
+ a huge page aligned range of pages.
+
+It is possible to establish how long the stalls were using the function
+tracer to record how long was spent in __alloc_pages_nodemask and
+using the mm_page_alloc tracepoint to identify which allocations were
+for huge pages.
+
+== get_user_pages and follow_page ==
+
+get_user_pages and follow_page if run on a hugepage, will return the
+head or tail pages as usual (exactly as they would do on
+hugetlbfs). Most gup users will only care about the actual physical
+address of the page and its temporary pinning to release after the I/O
+is complete, so they won't ever notice the fact the page is huge. But
+if any driver is going to mangle over the page structure of the tail
+page (like for checking page->mapping or other bits that are relevant
+for the head page and not the tail page), it should be updated to jump
+to check head page instead (while serializing properly against
+split_huge_page() to keep the head and tail pages from disappearing from
+under it, see the futex code to see an example of that, hugetlbfs also
+needed special handling in futex code for similar reasons).
+
+NOTE: these aren't new constraints to the GUP API, and they match the
+same constraints that apply to hugetlbfs too, so any driver capable
+of handling GUP on hugetlbfs will also work fine on transparent
+hugepage backed mappings.
+
+In case you can't handle compound pages if they're returned by
+follow_page, the FOLL_SPLIT bit can be specified as parameter to
+follow_page, so that it will split the hugepages before returning
+them. Migration for example passes FOLL_SPLIT as parameter to
+follow_page because it's not hugepage aware and in fact it can't work
+at all on hugetlbfs (but it instead works fine on transparent
+hugepages thanks to FOLL_SPLIT). migration simply can't deal with
+hugepages being returned (as it's not only checking the pfn of the
+page and pinning it during the copy but it pretends to migrate the
+memory in regular page sizes and with regular pte/pmd mappings).
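+
+A hedged sketch of such a caller (mmap_sem is assumed to be held for read;
+the helper name is illustrative):
+
+	static struct page *get_base_page(struct vm_area_struct *vma,
+					  unsigned long address)
+	{
+		struct page *page;
+
+		/* FOLL_SPLIT makes follow_page split a THP before returning it */
+		page = follow_page(vma, address, FOLL_GET | FOLL_SPLIT);
+		if (IS_ERR_OR_NULL(page))
+			return NULL;
+		return page;	/* a regular base page; drop with put_page() */
+	}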
+
+== Optimizing the applications ==
+
+To be guaranteed that the kernel will map a 2M page immediately in any
+memory region, the mmap region has to be hugepage naturally
+aligned. posix_memalign() can provide that guarantee.
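+
+A hedged userspace sketch (it assumes 2M PMD-sized hugepages, as on x86-64,
+and ignores failure of the advice, which is only advisory anyway):
+
+	#include <stdlib.h>
+	#include <sys/mman.h>
+
+	#define HPAGE_SIZE	(2UL * 1024 * 1024)
+
+	static void *alloc_hugepage_region(size_t len)
+	{
+		void *p;
+
+		if (posix_memalign(&p, HPAGE_SIZE, len))
+			return NULL;
+		/* advisory: the kernel may still use 4k pages */
+		madvise(p, len, MADV_HUGEPAGE);
+		return p;
+	}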
+
+== Hugetlbfs ==
+
+You can use hugetlbfs on a kernel that has transparent hugepage
+support enabled just fine as always. No difference can be noted in
+hugetlbfs other than there will be less overall fragmentation. All
+usual features belonging to hugetlbfs are preserved and
+unaffected. libhugetlbfs will also work fine as usual.
+
+== Graceful fallback ==
+
+Code walking pagetables but unaware about huge pmds can simply call
+split_huge_page_pmd(vma, addr, pmd) where the pmd is the one returned by
+pmd_offset. It's trivial to make the code transparent hugepage aware
+by just grepping for "pmd_offset" and adding split_huge_page_pmd where
+missing after pmd_offset returns the pmd. Thanks to the graceful
+fallback design, with a one liner change, you can avoid writing
+hundreds if not thousands of lines of complex code to make your code
+hugepage aware.
+
+If you're not walking pagetables but you run into a physical hugepage
+that you can't handle natively in your code, you can split it by
+calling split_huge_page(page). This is what the Linux VM does before
+it tries to swapout the hugepage for example.
+
+Example to make mremap.c transparent hugepage aware with a one liner
+change:
+
+diff --git a/mm/mremap.c b/mm/mremap.c
+--- a/mm/mremap.c
++++ b/mm/mremap.c
+@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
+ return NULL;
+
+ pmd = pmd_offset(pud, addr);
++ split_huge_page_pmd(vma, addr, pmd);
+ if (pmd_none_or_clear_bad(pmd))
+ return NULL;
+
+== Locking in hugepage aware code ==
+
+We want as much code as possible hugepage aware, as calling
+split_huge_page() or split_huge_page_pmd() has a cost.
+
+To make pagetable walks huge pmd aware, all you need to do is to call
+pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
+mmap_sem in read (or write) mode to be sure an huge pmd cannot be
+created from under you by khugepaged (khugepaged collapse_huge_page
+takes the mmap_sem in write mode in addition to the anon_vma lock). If
+pmd_trans_huge returns false, you just fall back to the old code
+paths. If instead pmd_trans_huge returns true, you have to take the
+mm->page_table_lock and re-run pmd_trans_huge. Taking the
+page_table_lock will prevent the huge pmd from being converted into a
+regular pmd from under you (split_huge_page can run in parallel to the
+pagetable walk). If the second pmd_trans_huge returns false, you
+should just drop the page_table_lock and fallback to the old code as
+before. Otherwise you should run pmd_trans_splitting on the pmd. In
+case pmd_trans_splitting returns true, it means split_huge_page is
+already in the middle of splitting the page. So if pmd_trans_splitting
+returns true it's enough to drop the page_table_lock and call
+wait_split_huge_page and then fall back to the old code paths. You are
+guaranteed by the time wait_split_huge_page returns, the pmd isn't
+huge anymore. If pmd_trans_splitting returns false, you can proceed to
+process the huge pmd and the hugepage natively. Once finished you can
+drop the page_table_lock.
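+
+A hedged sketch of this protocol (process_huge_pmd() and process_ptes() are
+illustrative placeholders for the caller's own work; mmap_sem is assumed to
+be held):
+
+	static void walk_one_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+				 unsigned long addr)
+	{
+		struct mm_struct *mm = vma->vm_mm;
+
+		if (pmd_trans_huge(*pmd)) {
+			spin_lock(&mm->page_table_lock);
+			if (pmd_trans_huge(*pmd)) {	/* re-check under the lock */
+				if (pmd_trans_splitting(*pmd)) {
+					/* split_huge_page is running: wait,
+					 * then use the regular pte paths */
+					spin_unlock(&mm->page_table_lock);
+					wait_split_huge_page(vma->anon_vma, pmd);
+				} else {
+					process_huge_pmd(vma, pmd, addr);
+					spin_unlock(&mm->page_table_lock);
+					return;
+				}
+			} else {
+				spin_unlock(&mm->page_table_lock);
+			}
+		}
+		process_ptes(vma, pmd, addr);
+	}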
+
+== compound_lock, get_user_pages and put_page ==
+
+split_huge_page internally has to distribute the refcounts in the head
+page to the tail pages before clearing all PG_head/tail bits from the
+page structures. It can do that easily for refcounts taken by huge pmd
+mappings. But the GUP API as created by hugetlbfs (that returns head
+and tail pages if running get_user_pages on an address backed by any
+hugepage), requires the refcount to be accounted on the tail pages and
+not only in the head pages, if we want to be able to run
+split_huge_page while there are gup pins established on any tail
+page. Failure to be able to run split_huge_page if there's any gup pin
+on any tail page, would mean having to split all hugepages upfront in
+get_user_pages which is unacceptable as too many gup users are
+performance critical and they must work natively on hugepages like
+they work natively on hugetlbfs already (hugetlbfs is simpler because
+hugetlbfs pages cannot be split so there wouldn't be requirement of
+accounting the pins on the tail pages for hugetlbfs). If we didn't
+account the gup refcounts on the tail pages during gup, we wouldn't know
+anymore which tail page is pinned by gup and which is not while we run
+split_huge_page. But we still have to add the gup pin to the head page
+too, to know when we can free the compound page in case it's never
+split during its lifetime. That requires changing not just
+get_page, but put_page as well so that when put_page runs on a tail
+page (and only on a tail page) it will find its respective head page,
+and then it will decrease the head page refcount in addition to the
+tail page refcount. To obtain a head page reliably and to decrease its
+refcount without race conditions, put_page has to serialize against
+__split_huge_page_refcount using a special per-page lock called
+compound_lock.
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt
index 2d70d0d9510..744f82f86c5 100644
--- a/Documentation/vm/unevictable-lru.txt
+++ b/Documentation/vm/unevictable-lru.txt
@@ -84,8 +84,7 @@ indicate that the page is being managed on the unevictable list.
The PG_unevictable flag is analogous to, and mutually exclusive with, the
PG_active flag in that it indicates on which LRU list a page resides when
-PG_lru is set. The unevictable list is compile-time configurable based on the
-UNEVICTABLE_LRU Kconfig option.
+PG_lru is set.
The Unevictable LRU infrastructure maintains unevictable pages on an additional
LRU list for a few reasons:
@@ -198,12 +197,8 @@ the pages are also "rescued" from the unevictable list in the process of
freeing them.
page_evictable() also checks for mlocked pages by testing an additional page
-flag, PG_mlocked (as wrapped by PageMlocked()). If the page is NOT mlocked,
-and a non-NULL VMA is supplied, page_evictable() will check whether the VMA is
-VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and
-update the appropriate statistics if the vma is VM_LOCKED. This method allows
-efficient "culling" of pages in the fault path that are being faulted in to
-VM_LOCKED VMAs.
+flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is
+faulted into a VM_LOCKED VMA, or found in a VMA that is being mlocked.
VMSCAN'S HANDLING OF UNEVICTABLE PAGES
@@ -372,8 +367,8 @@ mlock_fixup() filters several classes of "special" VMAs:
mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to
allocate the huge pages and populate the ptes.
-3) VMAs with VM_DONTEXPAND or VM_RESERVED are generally userspace mappings of
- kernel pages, such as the VDSO page, relay channel pages, etc. These pages
+3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages,
+ such as the VDSO page, relay channel pages, etc. These pages
are inherently unevictable and are not managed on the LRU lists.
mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls
make_pages_present() to populate the ptes.
@@ -458,7 +453,7 @@ putback_lru_page() function to add migrated pages back to the LRU.
mmap(MAP_LOCKED) SYSTEM CALL HANDLING
-------------------------------------
-In addition the the mlock()/mlockall() system calls, an application can request
+In addition to the mlock()/mlockall() system calls, an application can request
that a region of memory be mlocked supplying the MAP_LOCKED flag to the mmap()
call. Furthermore, any mmap() call or brk() call that expands the heap by a
task that has previously called mlockall() with the MCL_FUTURE flag will result
@@ -539,7 +534,7 @@ different reverse map mechanisms.
process because mlocked pages are migratable. However, for reclaim, if
the page is mapped into a VM_LOCKED VMA, the scan stops.
- try_to_unmap_anon() attempts to acquire in read mode the mmap semphore of
+ try_to_unmap_anon() attempts to acquire in read mode the mmap semaphore of
the mm_struct to which the VMA belongs. If this is successful, it will
mlock the page via mlock_vma_page() - we wouldn't have gotten to
try_to_unmap_anon() if the page were already mlocked - and will return
@@ -620,11 +615,11 @@ all PTEs from the page. For this purpose, the unevictable/mlock infrastructure
introduced a variant of try_to_unmap() called try_to_munlock().
try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
-mapped file pages with an additional argument specifing unlock versus unmap
+mapped file pages with an additional argument specifying unlock versus unmap
processing. Again, these functions walk the respective reverse maps looking
for VM_LOCKED VMAs. When such a VMA is found for anonymous pages and file
pages mapped in linear VMAs, as in the try_to_unmap() case, the functions
-attempt to acquire the associated mmap semphore, mlock the page via
+attempt to acquire the associated mmap semaphore, mlock the page via
mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the
pre-clearing of the page's PG_mlocked done by munlock_vma_page.
@@ -642,7 +637,7 @@ with it - the usual fallback position.
Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
However, the scan can terminate when it encounters a VM_LOCKED VMA and can
-successfully acquire the VMA's mmap semphore for read and mlock the page.
+successfully acquire the VMA's mmap semaphore for read and mlock the page.
Although try_to_munlock() might be called a great many times when munlocking a
large region or tearing down a large address space that has been mlocked via
mlockall(), overall this is a fairly rare event.
@@ -652,7 +647,7 @@ PAGE RECLAIM IN shrink_*_list()
-------------------------------
shrink_active_list() culls any obviously unevictable pages - i.e.
-!page_evictable(page, NULL) - diverting these to the unevictable list.
+!page_evictable(page) - diverting these to the unevictable list.
However, shrink_active_list() only sees unevictable pages that made it onto the
active/inactive lru lists. Note that these pages do not have PageUnevictable
set - otherwise they would be on the unevictable list and shrink_active_list
diff --git a/Documentation/vm/zswap.txt b/Documentation/vm/zswap.txt
new file mode 100644
index 00000000000..00c3d31e797
--- /dev/null
+++ b/Documentation/vm/zswap.txt
@@ -0,0 +1,68 @@
+Overview:
+
+Zswap is a lightweight compressed cache for swap pages. It takes pages that are
+in the process of being swapped out and attempts to compress them into a
+dynamically allocated RAM-based memory pool. zswap basically trades CPU cycles
+for potentially reduced swap I/O. This trade-off can also result in a
+significant performance improvement if reads from the compressed cache are
+faster than reads from a swap device.
+
+NOTE: Zswap is a new feature as of v3.11 and interacts heavily with memory
+reclaim. This interaction has not been fully explored on the large set of
+potential configurations and workloads that exist. For this reason, zswap
+is a work in progress and should be considered experimental.
+
+Some potential benefits:
+* Desktop/laptop users with limited RAM capacities can mitigate the
+    performance impact of swapping.
+* Overcommitted guests that share a common I/O resource can
+    dramatically reduce their swap I/O pressure, avoiding heavy-handed I/O
+    throttling by the hypervisor. This allows more work to get done with less
+    impact to the guest workload and guests sharing the I/O subsystem.
+* Users with SSDs as swap devices can extend the life of the device by
+    drastically reducing life-shortening writes.
+
+Zswap evicts pages from compressed cache on an LRU basis to the backing swap
+device when the compressed pool reaches its size limit. This requirement had
+been identified in prior community discussions.
+
+To enable zswap, the "enabled" attribute must be set to 1 at boot time, e.g.
+zswap.enabled=1
+
+Design:
+
+Zswap receives pages for compression through the Frontswap API and is able to
+evict pages from its own compressed pool on an LRU basis and write them back to
+the backing swap device in the case that the compressed pool is full.
+
+Zswap makes use of zbud for managing the compressed memory pool. Each
+allocation in zbud is not directly accessible by address. Rather, a handle is
+returned by the allocation routine and that handle must be mapped before being
+accessed. The compressed memory pool grows on demand and shrinks as compressed
+pages are freed. The pool is not preallocated.
+
+When a swap page is passed from frontswap to zswap, zswap maintains a mapping
+of the swap entry, a combination of the swap type and swap offset, to the zbud
+handle that references that compressed swap page. This mapping is achieved
+with a red-black tree per swap type. The swap offset is the search key for the
+tree nodes.
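+
+A simplified sketch of that lookup (names are illustrative, not the actual
+zswap code):
+
+	struct zswap_entry {
+		struct rb_node rbnode;
+		pgoff_t offset;		/* swap offset: the search key */
+		unsigned long handle;	/* zbud handle of the compressed page */
+	};
+
+	static struct zswap_entry *zswap_rb_search(struct rb_root *root,
+						   pgoff_t offset)
+	{
+		struct rb_node *node = root->rb_node;
+		struct zswap_entry *entry;
+
+		while (node) {
+			entry = rb_entry(node, struct zswap_entry, rbnode);
+			if (offset < entry->offset)
+				node = node->rb_left;
+			else if (offset > entry->offset)
+				node = node->rb_right;
+			else
+				return entry;
+		}
+		return NULL;
+	}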
+
+During a page fault on a PTE that is a swap entry, frontswap calls the zswap
+load function to decompress the page into the page allocated by the page fault
+handler.
+
+Once there are no PTEs referencing a swap page stored in zswap (i.e. the count
+in the swap_map goes to 0) the swap code calls the zswap invalidate function,
+via frontswap, to free the compressed entry.
+
+Zswap seeks to be simple in its policies. Sysfs attributes allow for one user
+controlled policy:
+* max_pool_percent - The maximum percentage of memory that the compressed
+ pool can occupy.
+
+Zswap allows the compressor to be selected at kernel boot time by setting the
+"compressor" attribute. The default compressor is lzo, e.g.
+zswap.compressor=deflate
+
+A debugfs interface is provided for various statistics about pool size, number
+of pages stored, and various counters for the reasons pages are rejected.