Diffstat (limited to 'Documentation/vm')
-rw-r--r--  Documentation/vm/00-INDEX                   24
-rw-r--r--  Documentation/vm/Makefile                    8
-rw-r--r--  Documentation/vm/cleancache.txt             43
-rw-r--r--  Documentation/vm/frontswap.txt             278
-rw-r--r--  Documentation/vm/hugepage-mmap.c            91
-rw-r--r--  Documentation/vm/hugepage-shm.c             98
-rw-r--r--  Documentation/vm/hugetlbpage.txt            21
-rw-r--r--  Documentation/vm/hwpoison.txt                5
-rw-r--r--  Documentation/vm/ksm.txt                    15
-rw-r--r--  Documentation/vm/locking                   130
-rw-r--r--  Documentation/vm/map_hugetlb.c              77
-rw-r--r--  Documentation/vm/numa_memory_policy.txt      5
-rw-r--r--  Documentation/vm/overcommit-accounting      15
-rw-r--r--  Documentation/vm/page-types.c             1100
-rw-r--r--  Documentation/vm/pagemap.txt                11
-rw-r--r--  Documentation/vm/remap_file_pages.txt       28
-rw-r--r--  Documentation/vm/slub.txt                    2
-rw-r--r--  Documentation/vm/soft-dirty.txt             43
-rw-r--r--  Documentation/vm/split_page_table_lock      94
-rw-r--r--  Documentation/vm/transhuge.txt              85
-rw-r--r--  Documentation/vm/unevictable-lru.txt        24
-rw-r--r--  Documentation/vm/zswap.txt                  68
22 files changed, 693 insertions, 1572 deletions
diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 5481c8ba341..081c49777ab 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -4,33 +4,37 @@ active_mm.txt
- An explanation from Linus about tsk->active_mm vs tsk->mm.
balance
- various information on memory balancing.
-hugepage-mmap.c
- - Example app using huge page memory with the mmap system call.
-hugepage-shm.c
- - Example app using huge page memory with Sys V shared memory system calls.
+cleancache.txt
+ - Intro to cleancache and page-granularity victim cache.
+frontswap.txt
+ - Outline frontswap, part of the transcendent memory frontend.
+highmem.txt
+ - Outline of highmem and common issues.
hugetlbpage.txt
- a brief summary of hugetlbpage support in the Linux kernel.
hwpoison.txt
- explains what hwpoison is
ksm.txt
- how to use the Kernel Samepage Merging feature.
-locking
- - info on how locking and synchronization is done in the Linux vm code.
-map_hugetlb.c
- - an example program that uses the MAP_HUGETLB mmap flag.
numa
- information about NUMA specific code in the Linux vm.
numa_memory_policy.txt
- documentation of concepts and APIs of the 2.6 memory policy support.
overcommit-accounting
- description of the Linux kernels overcommit handling modes.
-page-types.c
- - Tool for querying page flags
page_migration
- description of page migration in NUMA systems.
pagemap.txt
- pagemap, from the userspace perspective
slub.txt
- a short users guide for SLUB.
+soft-dirty.txt
+ - short explanation for soft-dirty PTEs
+split_page_table_lock
+ - Separate per-table lock to improve scalability of the old page_table_lock.
+transhuge.txt
+ - Transparent Hugepage Support, alternative way of using hugepages.
unevictable-lru.txt
- Unevictable LRU infrastructure
+zswap.txt
+ - Intro to compressed cache for swap pages
diff --git a/Documentation/vm/Makefile b/Documentation/vm/Makefile
deleted file mode 100644
index 3fa4d066886..00000000000
--- a/Documentation/vm/Makefile
+++ /dev/null
@@ -1,8 +0,0 @@
-# kbuild trick to avoid linker error. Can be omitted if a module is built.
-obj- := dummy.o
-
-# List of programs to build
-hostprogs-y := page-types hugepage-mmap hugepage-shm map_hugetlb
-
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
diff --git a/Documentation/vm/cleancache.txt b/Documentation/vm/cleancache.txt
index 36c367c7308..142fbb0f325 100644
--- a/Documentation/vm/cleancache.txt
+++ b/Documentation/vm/cleancache.txt
@@ -46,10 +46,11 @@ a negative return value indicates failure. A "put_page" will copy a
the pool id, a file key, and a page index into the file. (The combination
of a pool id, a file key, and an index is sometimes called a "handle".)
A "get_page" will copy the page, if found, from cleancache into kernel memory.
-A "flush_page" will ensure the page no longer is present in cleancache;
-a "flush_inode" will flush all pages associated with the specified file;
-and, when a filesystem is unmounted, a "flush_fs" will flush all pages in
-all files specified by the given pool id and also surrender the pool id.
+An "invalidate_page" will ensure the page no longer is present in cleancache;
+an "invalidate_inode" will invalidate all pages associated with the specified
+file; and, when a filesystem is unmounted, an "invalidate_fs" will invalidate
+all pages in all files specified by the given pool id and also surrender
+the pool id.
An "init_shared_fs", like init_fs, obtains a pool id but tells cleancache
to treat the pool as shared using a 128-bit UUID as a key. On systems
@@ -62,12 +63,12 @@ of the kernel (e.g. by "tools" that control cleancache). Or a
cleancache implementation can simply disable shared_init by always
returning a negative value.
-If a get_page is successful on a non-shared pool, the page is flushed (thus
-making cleancache an "exclusive" cache). On a shared pool, the page
-is NOT flushed on a successful get_page so that it remains accessible to
+If a get_page is successful on a non-shared pool, the page is invalidated
+(thus making cleancache an "exclusive" cache). On a shared pool, the page
+is NOT invalidated on a successful get_page so that it remains accessible to
other sharers. The kernel is responsible for ensuring coherency between
cleancache (shared or not), the page cache, and the filesystem, using
-cleancache flush operations as required.
+cleancache invalidate operations as required.
Note that cleancache must enforce put-put-get coherency and get-get
coherency. For the former, if two puts are made to the same handle but
@@ -77,22 +78,22 @@ if a get for a given handle fails, subsequent gets for that handle will
never succeed unless preceded by a successful put with that handle.
Last, cleancache provides no SMP serialization guarantees; if two
-different Linux threads are simultaneously putting and flushing a page
+different Linux threads are simultaneously putting and invalidating a page
with the same handle, the results are indeterminate. Callers must
lock the page to ensure serial behavior.
CLEANCACHE PERFORMANCE METRICS
-Cleancache monitoring is done by sysfs files in the
-/sys/kernel/mm/cleancache directory. The effectiveness of cleancache
+If properly configured, monitoring of cleancache is done via debugfs in
+the /sys/kernel/debug/mm/cleancache directory. The effectiveness of cleancache
can be measured (across all filesystems) with:
succ_gets - number of gets that were successful
failed_gets - number of gets that failed
puts - number of puts attempted (all "succeed")
-flushes - number of flushes attempted
+invalidates - number of invalidates attempted
-A backend implementatation may provide additional metrics.
+A backend implementation may provide additional metrics.
FAQ
@@ -143,7 +144,7 @@ systems.
The core hooks for cleancache in VFS are in most cases a single line
and the minimum set are placed precisely where needed to maintain
-coherency (via cleancache_flush operations) between cleancache,
+coherency (via cleancache_invalidate operations) between cleancache,
the page cache, and disk. All hooks compile into nothingness if
cleancache is config'ed off and turn into a function-pointer-
compare-to-NULL if config'ed on but no backend claims the ops
@@ -184,15 +185,15 @@ or for real kernel-addressable RAM, it makes perfect sense for
transcendent memory.
4) Why is non-shared cleancache "exclusive"? And where is the
- page "flushed" after a "get"? (Minchan Kim)
+ page "invalidated" after a "get"? (Minchan Kim)
The main reason is to free up space in transcendent memory and
-to avoid unnecessary cleancache_flush calls. If you want inclusive,
+to avoid unnecessary cleancache_invalidate calls. If you want inclusive,
the page can be "put" immediately following the "get". If
put-after-get for inclusive becomes common, the interface could
-be easily extended to add a "get_no_flush" call.
+be easily extended to add a "get_no_invalidate" call.
-The flush is done by the cleancache backend implementation.
+The invalidate is done by the cleancache backend implementation.
5) What's the performance impact?
@@ -222,7 +223,7 @@ Some points for a filesystem to consider:
as tmpfs should not enable cleancache)
- To ensure coherency/correctness, the FS must ensure that all
file removal or truncation operations either go through VFS or
- add hooks to do the equivalent cleancache "flush" operations
+ add hooks to do the equivalent cleancache "invalidate" operations
- To ensure coherency/correctness, either inode numbers must
be unique across the lifetime of the on-disk file OR the
FS must provide an "encode_fh" function.
@@ -243,11 +244,11 @@ If cleancache would use the inode virtual address instead of
inode/filehandle, the pool id could be eliminated. But, this
won't work because cleancache retains pagecache data pages
persistently even when the inode has been pruned from the
-inode unused list, and only flushes the data page if the file
+inode unused list, and only invalidates the data page if the file
gets removed/truncated. So if cleancache used the inode kva,
there would be potential coherency issues if/when the inode
kva is reused for a different file. Alternately, if cleancache
-flushed the pages when the inode kva was freed, much of the value
+invalidated the pages when the inode kva was freed, much of the value
of cleancache would be lost because the cache of pages in cleanache
is potentially much larger than the kernel pagecache and is most
useful if the pages survive inode cache removal.
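The "exclusive cache" behaviour described in the text above is easy to miss in prose, so here is a toy, self-contained C sketch of it (all names are invented for illustration and none of this is kernel API): a one-slot "pool" where a successful get on a non-shared pool also drops the cached copy, so a second get for the same handle misses.

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

struct entry { int used; long key; unsigned char data[PAGE_SIZE]; };
struct pool  { int shared; struct entry slot; };    /* one-slot toy "pool" */

static void put_page(struct pool *p, long key, const void *src)
{
        p->slot.used = 1;
        p->slot.key = key;
        memcpy(p->slot.data, src, PAGE_SIZE);
}

static int get_page(struct pool *p, long key, void *dst)
{
        if (!p->slot.used || p->slot.key != key)
                return -1;              /* miss */
        memcpy(dst, p->slot.data, PAGE_SIZE);
        if (!p->shared)
                p->slot.used = 0;       /* exclusive: invalidate after a hit */
        return 0;
}

int main(void)
{
        unsigned char page[PAGE_SIZE] = { 1 };
        unsigned char copy[PAGE_SIZE];
        struct pool private_pool = { .shared = 0 };

        put_page(&private_pool, 7, page);
        printf("first get:  %d\n", get_page(&private_pool, 7, copy)); /* 0: hit   */
        printf("second get: %d\n", get_page(&private_pool, 7, copy)); /* -1: gone */
        return 0;
}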
diff --git a/Documentation/vm/frontswap.txt b/Documentation/vm/frontswap.txt
new file mode 100644
index 00000000000..c71a019be60
--- /dev/null
+++ b/Documentation/vm/frontswap.txt
@@ -0,0 +1,278 @@
+Frontswap provides a "transcendent memory" interface for swap pages.
+In some environments, dramatic performance savings may be obtained because
+swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.
+
+(Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends"
+and the only necessary changes to the core kernel for transcendent memory;
+all other supporting code -- the "backends" -- is implemented as drivers.
+See the LWN.net article "Transcendent memory in a nutshell" for a detailed
+overview of frontswap and related kernel parts:
+https://lwn.net/Articles/454795/ )
+
+Frontswap is so named because it can be thought of as the opposite of
+a "backing" store for a swap device. The storage is assumed to be
+a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
+to the requirements of transcendent memory (such as Xen's "tmem", or
+in-kernel compressed memory, aka "zcache", or future RAM-like devices);
+this pseudo-RAM device is not directly accessible or addressable by the
+kernel and is of unknown and possibly time-varying size. The driver
+links itself to frontswap by calling frontswap_register_ops to set the
+frontswap_ops funcs appropriately and the functions it provides must
+conform to certain policies as follows:
+
+An "init" prepares the device to receive frontswap pages associated
+with the specified swap device number (aka "type"). A "store" will
+copy the page to transcendent memory and associate it with the type and
+offset associated with the page. A "load" will copy the page, if found,
+from transcendent memory into kernel memory, but will NOT remove the page
+from transcendent memory. An "invalidate_page" will remove the page
+from transcendent memory and an "invalidate_area" will remove ALL pages
+associated with the swap type (e.g., like swapoff) and notify the "device"
+to refuse further stores with that swap type.
+
+Once a page is successfully stored, a matching load on the page will normally
+succeed. So when the kernel finds itself in a situation where it needs
+to swap out a page, it first attempts to use frontswap. If the store returns
+success, the data has been successfully saved to transcendent memory and
+a disk write and, if the data is later read back, a disk read are avoided.
+If a store returns failure, transcendent memory has rejected the data, and the
+page can be written to swap as usual.
+
+If a backend chooses, frontswap can be configured as a "writethrough
+cache" by calling frontswap_writethrough(). In this mode, the reduction
+in swap device writes is lost (and also a non-trivial performance advantage)
+in order to allow the backend to arbitrarily "reclaim" space used to
+store frontswap pages to more completely manage its memory usage.
+
+Note that if a page is stored and the page already exists in transcendent memory
+(a "duplicate" store), either the store succeeds and the data is overwritten,
+or the store fails AND the page is invalidated. This ensures stale data may
+never be obtained from frontswap.
+
+If properly configured, monitoring of frontswap is done via debugfs in
+the /sys/kernel/debug/frontswap directory. The effectiveness of
+frontswap can be measured (across all swap devices) with:
+
+failed_stores - how many store attempts have failed
+loads - how many loads were attempted (all should succeed)
+succ_stores - how many store attempts have succeeded
+invalidates - how many invalidates were attempted
+
+A backend implementation may provide additional metrics.
+
+FAQ
+
+1) Where's the value?
+
+When a workload starts swapping, performance falls through the floor.
+Frontswap significantly increases performance in many such workloads by
+providing a clean, dynamic interface to read and write swap pages to
+"transcendent memory" that is otherwise not directly addressable to the kernel.
+This interface is ideal when data is transformed to a different form
+and size (such as with compression) or secretly moved (as might be
+useful for write-balancing for some RAM-like devices). Swap pages (and
+evicted page-cache pages) are a great use for this kind of slower-than-RAM-
+but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and
+cleancache) interface to transcendent memory provides a nice way to read
+and write -- and indirectly "name" -- the pages.
+
+Frontswap -- and cleancache -- with a fairly small impact on the kernel,
+provides a huge amount of flexibility for more dynamic, flexible RAM
+utilization in various system configurations:
+
+In the single kernel case, aka "zcache", pages are compressed and
+stored in local memory, thus increasing the total anonymous pages
+that can be safely kept in RAM. Zcache essentially trades off CPU
+cycles used in compression/decompression for better memory utilization.
+Benchmarks have shown little or no impact when memory pressure is
+low while providing a significant performance improvement (25%+)
+on some workloads under high memory pressure.
+
+"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
+support for clustered systems. Frontswap pages are locally compressed
+as in zcache, but then "remotified" to another system's RAM. This
+allows RAM to be dynamically load-balanced back-and-forth as needed,
+i.e. when system A is overcommitted, it can swap to system B, and
+vice versa. RAMster can also be configured as a memory server so
+many servers in a cluster can swap, dynamically as needed, to a single
+server configured with a large amount of RAM... without pre-configuring
+how much of the RAM is available for each of the clients!
+
+In the virtual case, the whole point of virtualization is to statistically
+multiplex physical resources across the varying demands of multiple
+virtual machines. This is really hard to do with RAM and efforts to do
+it well with no kernel changes have essentially failed (except in some
+well-publicized special-case workloads).
+Specifically, the Xen Transcendent Memory backend allows otherwise
+"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
+virtual machines, but the pages can be compressed and deduplicated to
+optimize RAM utilization. And when guest OS's are induced to surrender
+underutilized RAM (e.g. with "selfballooning"), sudden unexpected
+memory pressure may result in swapping; frontswap allows those pages
+to be swapped to and from hypervisor RAM (if overall host system memory
+conditions allow), thus mitigating the potentially awful performance impact
+of unplanned swapping.
+
+A KVM implementation is underway and has been RFC'ed to lkml. And,
+using frontswap, investigation is also underway on the use of NVM as
+a memory extension technology.
+
+2) Sure there may be performance advantages in some situations, but
+ what's the space/time overhead of frontswap?
+
+If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
+nothingness and the only overhead is a few extra bytes per swapon'ed
+swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
+registers, there is one extra global variable compared to zero for
+every swap page read or written. If CONFIG_FRONTSWAP is enabled
+AND a frontswap backend registers AND the backend fails every "store"
+request (i.e. provides no memory despite claiming it might),
+CPU overhead is still negligible -- and since every frontswap fail
+precedes a swap page write-to-disk, the system is highly likely
+to be I/O bound and using a small fraction of a percent of a CPU
+will be irrelevant anyway.
+
+As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
+registers, one bit is allocated for every swap page for every swap
+device that is swapon'd. This is added to the EIGHT bits (which
+was sixteen until about 2.6.34) that the kernel already allocates
+for every swap page for every swap device that is swapon'd. (Hugh
+Dickins has observed that frontswap could probably steal one of
+the existing eight bits, but let's worry about that minor optimization
+later.) For very large swap disks (which are rare) on a standard
+4K pagesize, this is 1MB per 32GB swap.
+
+When swap pages are stored in transcendent memory instead of written
+out to disk, there is a side effect that this may create more memory
+pressure that can potentially outweigh the other advantages. A
+backend, such as zcache, must implement policies to carefully (but
+dynamically) manage memory limits to ensure this doesn't happen.
+
+3) OK, how about a quick overview of what this frontswap patch does
+ in terms that a kernel hacker can grok?
+
+Let's assume that a frontswap "backend" has registered during
+kernel initialization; this registration indicates that this
+frontswap backend has access to some "memory" that is not directly
+accessible by the kernel. Exactly how much memory it provides is
+entirely dynamic and random.
+
+Whenever a swap-device is swapon'd frontswap_init() is called,
+passing the swap device number (aka "type") as a parameter.
+This notifies frontswap to expect attempts to "store" swap pages
+associated with that number.
+
+Whenever the swap subsystem is readying a page to write to a swap
+device (c.f swap_writepage()), frontswap_store is called. Frontswap
+consults with the frontswap backend and if the backend says it does NOT
+have room, frontswap_store returns -1 and the kernel swaps the page
+to the swap device as normal. Note that the response from the frontswap
+backend is unpredictable to the kernel; it may choose to never accept a
+page, it could accept every ninth page, or it might accept every
+page. But if the backend does accept a page, the data from the page
+has already been copied and associated with the type and offset,
+and the backend guarantees the persistence of the data. In this case,
+frontswap sets a bit in the "frontswap_map" for the swap device
+corresponding to the page offset on the swap device to which it would
+otherwise have written the data.
+
+When the swap subsystem needs to swap-in a page (swap_readpage()),
+it first calls frontswap_load() which checks the frontswap_map to
+see if the page was earlier accepted by the frontswap backend. If
+it was, the page of data is filled from the frontswap backend and
+the swap-in is complete. If not, the normal swap-in code is
+executed to obtain the page of data from the real swap device.
+
+So every time the frontswap backend accepts a page, a swap device read
+and (potentially) a swap device write are replaced by a "frontswap backend
+store" and (possibly) a "frontswap backend loads", which are presumably much
+faster.
+
+4) Can't frontswap be configured as a "special" swap device that is
+ just higher priority than any real swap device (e.g. like zswap,
+ or maybe swap-over-nbd/NFS)?
+
+No. First, the existing swap subsystem doesn't allow for any kind of
+swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy,
+but this would require fairly drastic changes. Even if it were
+rewritten, the existing swap subsystem uses the block I/O layer which
+assumes a swap device is fixed size and any page in it is linearly
+addressable. Frontswap barely touches the existing swap subsystem,
+and works around the constraints of the block I/O subsystem to provide
+a great deal of flexibility and dynamicity.
+
+For example, the acceptance of any swap page by the frontswap backend is
+entirely unpredictable. This is critical to the definition of frontswap
+backends because it grants completely dynamic discretion to the
+backend. In zcache, one cannot know a priori how compressible a page is.
+"Poorly" compressible pages can be rejected, and "poorly" can itself be
+defined dynamically depending on current memory constraints.
+
+Further, frontswap is entirely synchronous whereas a real swap
+device is, by definition, asynchronous and uses block I/O. The
+block I/O layer is not only unnecessary, but may perform "optimizations"
+that are inappropriate for a RAM-oriented device including delaying
+the write of some pages for a significant amount of time. Synchrony is
+required to ensure the dynamicity of the backend and to avoid thorny race
+conditions that would unnecessarily and greatly complicate frontswap
+and/or the block I/O subsystem. That said, only the initial "store"
+and "load" operations need be synchronous. A separate asynchronous thread
+is free to manipulate the pages stored by frontswap. For example,
+the "remotification" thread in RAMster uses standard asynchronous
+kernel sockets to move compressed frontswap pages to a remote machine.
+Similarly, a KVM guest-side implementation could do in-guest compression
+and use "batched" hypercalls.
+
+In a virtualized environment, the dynamicity allows the hypervisor
+(or host OS) to do "intelligent overcommit". For example, it can
+choose to accept pages only until host-swapping might be imminent,
+then force guests to do their own swapping.
+
+There is a downside to the transcendent memory specifications for
+frontswap: Since any "store" might fail, there must always be a real
+slot on a real swap device to swap the page. Thus frontswap must be
+implemented as a "shadow" to every swapon'd device with the potential
+capability of holding every page that the swap device might have held
+and the possibility that it might hold no pages at all. This means
+that frontswap cannot contain more pages than the total of swapon'd
+swap devices. For example, if NO swap device is configured on some
+installation, frontswap is useless. Swapless portable devices
+can still use frontswap but a backend for such devices must configure
+some kind of "ghost" swap device and ensure that it is never used.
+
+5) Why this weird definition about "duplicate stores"? If a page
+ has been previously successfully stored, can't it always be
+ successfully overwritten?
+
+Nearly always it can, but no, sometimes it cannot. Consider an example
+where data is compressed and the original 4K page has been compressed
+to 1K. Now an attempt is made to overwrite the page with data that
+is non-compressible and so would take the entire 4K. But the backend
+has no more space. In this case, the store must be rejected. Whenever
+frontswap rejects a store that would overwrite, it also must invalidate
+the old data and ensure that it is no longer accessible. Since the
+swap subsystem then writes the new data to the real swap device,
+this is the correct course of action to ensure coherency.
+
+6) What is frontswap_shrink for?
+
+When the (non-frontswap) swap subsystem swaps out a page to a real
+swap device, that page is only taking up low-value pre-allocated disk
+space. But if frontswap has placed a page in transcendent memory, that
+page may be taking up valuable real estate. The frontswap_shrink
+routine allows code outside of the swap subsystem to force pages out
+of the memory managed by frontswap and back into kernel-addressable memory.
+For example, in RAMster, a "suction driver" thread will attempt
+to "repatriate" pages sent to a remote machine back to the local machine;
+this is driven using the frontswap_shrink mechanism when memory pressure
+subsides.
+
+7) Why does the frontswap patch create the new include file swapfile.h?
+
+The frontswap code depends on some swap-subsystem-internal data
+structures that have, over the years, moved back and forth between
+static and global. This seemed a reasonable compromise: Define
+them as global but declare them in a new include file that isn't
+included by the large number of source files that include swap.h.
+
+Dan Magenheimer, last updated April 9, 2012
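To make the five operations described in the new text concrete, here is a toy, self-contained C sketch of a backend (all names are invented; the real kernel frontswap_ops prototypes differ in detail): a tiny fixed table stands in for transcendent memory, so a store can always fail, a load copies data back without removing it, and the two invalidate operations drop one page or a whole swap type.

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096
#define SLOTS     8     /* deliberately tiny, so stores can run out of room */

struct slot {
        int used;
        unsigned type;          /* swap device number ("type") */
        unsigned long offset;   /* page offset within that device */
        unsigned char data[PAGE_SIZE];
};
static struct slot pool[SLOTS];

static struct slot *find(unsigned type, unsigned long offset)
{
        int i;

        for (i = 0; i < SLOTS; i++)
                if (pool[i].used && pool[i].type == type &&
                    pool[i].offset == offset)
                        return &pool[i];
        return NULL;
}

/* called at swapon time; nothing to set up in this toy */
static void backend_init(unsigned type) { (void)type; }

/* 0 on success; any failure means "write to the real swap device instead" */
static int backend_store(unsigned type, unsigned long offset, const void *page)
{
        struct slot *s = find(type, offset);    /* duplicate store overwrites */
        int i;

        for (i = 0; !s && i < SLOTS; i++)
                if (!pool[i].used)
                        s = &pool[i];
        if (!s)
                return -1;      /* "no room" is always a legal answer */
        s->used = 1;
        s->type = type;
        s->offset = offset;
        memcpy(s->data, page, PAGE_SIZE);
        return 0;
}

/* copy the page back if present; the copy stays in the backend */
static int backend_load(unsigned type, unsigned long offset, void *page)
{
        struct slot *s = find(type, offset);

        if (!s)
                return -1;
        memcpy(page, s->data, PAGE_SIZE);
        return 0;
}

static void backend_invalidate_page(unsigned type, unsigned long offset)
{
        struct slot *s = find(type, offset);

        if (s)
                s->used = 0;
}

/* e.g. at swapoff: drop every page belonging to this swap type */
static void backend_invalidate_area(unsigned type)
{
        int i;

        for (i = 0; i < SLOTS; i++)
                if (pool[i].type == type)
                        pool[i].used = 0;
}

int main(void)
{
        unsigned char out[PAGE_SIZE], in[PAGE_SIZE];

        memset(out, 0xab, sizeof(out));
        backend_init(0);
        if (backend_store(0, 42, out) == 0 && backend_load(0, 42, in) == 0)
                printf("round trip ok: %d\n", !memcmp(out, in, PAGE_SIZE));
        backend_invalidate_page(0, 42);
        backend_invalidate_area(0);
        return 0;
}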
diff --git a/Documentation/vm/hugepage-mmap.c b/Documentation/vm/hugepage-mmap.c
deleted file mode 100644
index db0dd9a33d5..00000000000
--- a/Documentation/vm/hugepage-mmap.c
+++ /dev/null
@@ -1,91 +0,0 @@
-/*
- * hugepage-mmap:
- *
- * Example of using huge page memory in a user application using the mmap
- * system call. Before running this application, make sure that the
- * administrator has mounted the hugetlbfs filesystem (on some directory
- * like /mnt) using the command mount -t hugetlbfs nodev /mnt. In this
- * example, the app is requesting memory of size 256MB that is backed by
- * huge pages.
- *
- * For the ia64 architecture, the Linux kernel reserves Region number 4 for
- * huge pages. That means that if one requires a fixed address, a huge page
- * aligned address starting with 0x800000... will be required. If a fixed
- * address is not required, the kernel will select an address in the proper
- * range.
- * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
- */
-
-#include <stdlib.h>
-#include <stdio.h>
-#include <unistd.h>
-#include <sys/mman.h>
-#include <fcntl.h>
-
-#define FILE_NAME "/mnt/hugepagefile"
-#define LENGTH (256UL*1024*1024)
-#define PROTECTION (PROT_READ | PROT_WRITE)
-
-/* Only ia64 requires this */
-#ifdef __ia64__
-#define ADDR (void *)(0x8000000000000000UL)
-#define FLAGS (MAP_SHARED | MAP_FIXED)
-#else
-#define ADDR (void *)(0x0UL)
-#define FLAGS (MAP_SHARED)
-#endif
-
-static void check_bytes(char *addr)
-{
- printf("First hex is %x\n", *((unsigned int *)addr));
-}
-
-static void write_bytes(char *addr)
-{
- unsigned long i;
-
- for (i = 0; i < LENGTH; i++)
- *(addr + i) = (char)i;
-}
-
-static void read_bytes(char *addr)
-{
- unsigned long i;
-
- check_bytes(addr);
- for (i = 0; i < LENGTH; i++)
- if (*(addr + i) != (char)i) {
- printf("Mismatch at %lu\n", i);
- break;
- }
-}
-
-int main(void)
-{
- void *addr;
- int fd;
-
- fd = open(FILE_NAME, O_CREAT | O_RDWR, 0755);
- if (fd < 0) {
- perror("Open failed");
- exit(1);
- }
-
- addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, fd, 0);
- if (addr == MAP_FAILED) {
- perror("mmap");
- unlink(FILE_NAME);
- exit(1);
- }
-
- printf("Returned address is %p\n", addr);
- check_bytes(addr);
- write_bytes(addr);
- read_bytes(addr);
-
- munmap(addr, LENGTH);
- close(fd);
- unlink(FILE_NAME);
-
- return 0;
-}
diff --git a/Documentation/vm/hugepage-shm.c b/Documentation/vm/hugepage-shm.c
deleted file mode 100644
index 07956d8592c..00000000000
--- a/Documentation/vm/hugepage-shm.c
+++ /dev/null
@@ -1,98 +0,0 @@
-/*
- * hugepage-shm:
- *
- * Example of using huge page memory in a user application using Sys V shared
- * memory system calls. In this example the app is requesting 256MB of
- * memory that is backed by huge pages. The application uses the flag
- * SHM_HUGETLB in the shmget system call to inform the kernel that it is
- * requesting huge pages.
- *
- * For the ia64 architecture, the Linux kernel reserves Region number 4 for
- * huge pages. That means that if one requires a fixed address, a huge page
- * aligned address starting with 0x800000... will be required. If a fixed
- * address is not required, the kernel will select an address in the proper
- * range.
- * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
- *
- * Note: The default shared memory limit is quite low on many kernels,
- * you may need to increase it via:
- *
- * echo 268435456 > /proc/sys/kernel/shmmax
- *
- * This will increase the maximum size per shared memory segment to 256MB.
- * The other limit that you will hit eventually is shmall which is the
- * total amount of shared memory in pages. To set it to 16GB on a system
- * with a 4kB pagesize do:
- *
- * echo 4194304 > /proc/sys/kernel/shmall
- */
-
-#include <stdlib.h>
-#include <stdio.h>
-#include <sys/types.h>
-#include <sys/ipc.h>
-#include <sys/shm.h>
-#include <sys/mman.h>
-
-#ifndef SHM_HUGETLB
-#define SHM_HUGETLB 04000
-#endif
-
-#define LENGTH (256UL*1024*1024)
-
-#define dprintf(x) printf(x)
-
-/* Only ia64 requires this */
-#ifdef __ia64__
-#define ADDR (void *)(0x8000000000000000UL)
-#define SHMAT_FLAGS (SHM_RND)
-#else
-#define ADDR (void *)(0x0UL)
-#define SHMAT_FLAGS (0)
-#endif
-
-int main(void)
-{
- int shmid;
- unsigned long i;
- char *shmaddr;
-
- if ((shmid = shmget(2, LENGTH,
- SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W)) < 0) {
- perror("shmget");
- exit(1);
- }
- printf("shmid: 0x%x\n", shmid);
-
- shmaddr = shmat(shmid, ADDR, SHMAT_FLAGS);
- if (shmaddr == (char *)-1) {
- perror("Shared memory attach failure");
- shmctl(shmid, IPC_RMID, NULL);
- exit(2);
- }
- printf("shmaddr: %p\n", shmaddr);
-
- dprintf("Starting the writes:\n");
- for (i = 0; i < LENGTH; i++) {
- shmaddr[i] = (char)(i);
- if (!(i % (1024 * 1024)))
- dprintf(".");
- }
- dprintf("\n");
-
- dprintf("Starting the Check...");
- for (i = 0; i < LENGTH; i++)
- if (shmaddr[i] != (char)i)
- printf("\nIndex %lu mismatched\n", i);
- dprintf("Done.\n");
-
- if (shmdt((const void *)shmaddr) != 0) {
- perror("Detach failure");
- shmctl(shmid, IPC_RMID, NULL);
- exit(3);
- }
-
- shmctl(shmid, IPC_RMID, NULL);
-
- return 0;
-}
diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt
index f8551b3879f..bdd4bb97fff 100644
--- a/Documentation/vm/hugetlbpage.txt
+++ b/Documentation/vm/hugetlbpage.txt
@@ -165,6 +165,7 @@ which function as described above for the default huge page-sized case.
Interaction of Task Memory Policy with Huge Page Allocation/Freeing
+===================================================================
Whether huge pages are allocated and freed via the /proc interface or
the /sysfs interface using the nr_hugepages_mempolicy attribute, the NUMA
@@ -229,6 +230,7 @@ resulting effect on persistent huge page allocation is as follows:
of huge pages over all on-lines nodes with memory.
Per Node Hugepages Attributes
+=============================
A subset of the contents of the root huge page control directory in sysfs,
described above, will be replicated under each the system device of each
@@ -258,6 +260,7 @@ applied, from which node the huge page allocation will be attempted.
Using Huge Pages
+================
If the user applications are going to request huge pages using mmap system
call, then it is required that system administrator mount a file system of
@@ -296,14 +299,16 @@ calls, though the mount of filesystem will be required for using mmap calls
without MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
map_hugetlb.c.
-*******************************************************************
+Examples
+========
-/*
- * hugepage-shm: see Documentation/vm/hugepage-shm.c
- */
+1) map_hugetlb: see tools/testing/selftests/vm/map_hugetlb.c
-*******************************************************************
+2) hugepage-shm: see tools/testing/selftests/vm/hugepage-shm.c
-/*
- * hugepage-mmap: see Documentation/vm/hugepage-mmap.c
- */
+3) hugepage-mmap: see tools/testing/selftests/vm/hugepage-mmap.c
+
+4) The libhugetlbfs (http://libhugetlbfs.sourceforge.net) library provides a
+ wide range of userspace tools to help with huge page usability, environment
+ setup, and control. Furthermore it provides useful test cases that should be
+ used when modifying code to ensure no regressions are introduced.
diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt
index 55006846660..6ae89a9edf2 100644
--- a/Documentation/vm/hwpoison.txt
+++ b/Documentation/vm/hwpoison.txt
@@ -84,6 +84,11 @@ PR_MCE_KILL
PR_MCE_KILL_EARLY: Early kill
PR_MCE_KILL_LATE: Late kill
PR_MCE_KILL_DEFAULT: Use system global default
+ Note that if you want to have a dedicated thread which handles
+ the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
+ call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise,
+ the SIGBUS is sent to the main thread.
+
PR_MCE_KILL_GET
return current mode
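A minimal userspace sketch of the dedicated-thread arrangement the new note describes (not part of this patch; error handling abbreviated): SIGBUS is blocked process-wide, and one thread opts in to early kill with prctl() and then waits for BUS_MCEERR_AO reports.

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>

#ifndef PR_MCE_KILL
#define PR_MCE_KILL             33
#define PR_MCE_KILL_SET         1
#define PR_MCE_KILL_EARLY       1
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO           5
#endif

static void *mce_thread(void *arg)
{
        sigset_t set;
        siginfo_t si;

        (void)arg;
        /* opt this thread in to early kill so AO errors are delivered here */
        prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);

        sigemptyset(&set);
        sigaddset(&set, SIGBUS);
        for (;;) {
                if (sigwaitinfo(&set, &si) < 0)
                        continue;
                if (si.si_code == BUS_MCEERR_AO)
                        printf("memory error reported at %p\n", si.si_addr);
                /* a real handler would discard or rebuild the affected data */
        }
        return NULL;
}

int main(void)
{
        sigset_t set;
        pthread_t tid;

        /* block SIGBUS everywhere so only the dedicated thread consumes it */
        sigemptyset(&set);
        sigaddset(&set, SIGBUS);
        pthread_sigmask(SIG_BLOCK, &set, NULL);

        pthread_create(&tid, NULL, mce_thread, NULL);
        pause();        /* the application's real work would go here */
        return 0;
}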
diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
index b392e496f81..f34a8ee6f86 100644
--- a/Documentation/vm/ksm.txt
+++ b/Documentation/vm/ksm.txt
@@ -58,6 +58,21 @@ sleep_millisecs - how many milliseconds ksmd should sleep before next scan
e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
Default: 20 (chosen for demonstration purposes)
+merge_across_nodes - specifies if pages from different numa nodes can be merged.
+ When set to 0, ksm merges only pages which physically
+ reside in the memory area of same NUMA node. That brings
+ lower latency to access of shared pages. Systems with more
+ nodes, at significant NUMA distances, are likely to benefit
+ from the lower latency of setting 0. Smaller systems, which
+ need to minimize memory usage, are likely to benefit from
+ the greater sharing of setting 1 (default). You may wish to
+ compare how your system performs under each setting, before
+ deciding on which to use. merge_across_nodes setting can be
+ changed only when there are no ksm shared pages in system:
+ set run 2 to unmerge pages first, then to 1 after changing
+ merge_across_nodes, to remerge according to the new setting.
+ Default: 1 (merging across nodes as in earlier releases)
+
run - set 0 to stop ksmd from running but keep merged pages,
set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
set 2 to stop ksmd and unmerge all pages currently merged,
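KSM only scans memory that an application has opted in with madvise(MADV_MERGEABLE), as described earlier in ksm.txt. A minimal sketch of such an opt-in, which gives ksmd identical pages to merge while the tunables above are in effect:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef MADV_MERGEABLE
#define MADV_MERGEABLE 12       /* value from asm-generic/mman-common.h */
#endif

#define LENGTH (64UL * 1024 * 1024)

int main(void)
{
        char *p = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }

        /* fill with identical pages so ksmd has duplicates to merge */
        memset(p, 0x5a, LENGTH);

        /* opt the range in to KSM; ksmd only scans advised areas */
        if (madvise(p, LENGTH, MADV_MERGEABLE) != 0) {
                perror("madvise(MADV_MERGEABLE)");
                exit(1);
        }

        /* with run=1, watch pages_shared/pages_sharing under
         * /sys/kernel/mm/ksm/ grow while this sleeps */
        sleep(60);

        munmap(p, LENGTH);
        return 0;
}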
diff --git a/Documentation/vm/locking b/Documentation/vm/locking
deleted file mode 100644
index f61228bd639..00000000000
--- a/Documentation/vm/locking
+++ /dev/null
@@ -1,130 +0,0 @@
-Started Oct 1999 by Kanoj Sarcar <kanojsarcar@yahoo.com>
-
-The intent of this file is to have an uptodate, running commentary
-from different people about how locking and synchronization is done
-in the Linux vm code.
-
-page_table_lock & mmap_sem
---------------------------------------
-
-Page stealers pick processes out of the process pool and scan for
-the best process to steal pages from. To guarantee the existence
-of the victim mm, a mm_count inc and a mmdrop are done in swap_out().
-Page stealers hold kernel_lock to protect against a bunch of races.
-The vma list of the victim mm is also scanned by the stealer,
-and the page_table_lock is used to preserve list sanity against the
-process adding/deleting to the list. This also guarantees existence
-of the vma. Vma existence is not guaranteed once try_to_swap_out()
-drops the page_table_lock. To guarantee the existence of the underlying
-file structure, a get_file is done before the swapout() method is
-invoked. The page passed into swapout() is guaranteed not to be reused
-for a different purpose because the page reference count due to being
-present in the user's pte is not released till after swapout() returns.
-
-Any code that modifies the vmlist, or the vm_start/vm_end/
-vm_flags:VM_LOCKED/vm_next of any vma *in the list* must prevent
-kswapd from looking at the chain.
-
-The rules are:
-1. To scan the vmlist (look but don't touch) you must hold the
- mmap_sem with read bias, i.e. down_read(&mm->mmap_sem)
-2. To modify the vmlist you need to hold the mmap_sem with
- read&write bias, i.e. down_write(&mm->mmap_sem) *AND*
- you need to take the page_table_lock.
-3. The swapper takes _just_ the page_table_lock, this is done
- because the mmap_sem can be an extremely long lived lock
- and the swapper just cannot sleep on that.
-4. The exception to this rule is expand_stack, which just
- takes the read lock and the page_table_lock, this is ok
- because it doesn't really modify fields anybody relies on.
-5. You must be able to guarantee that while holding page_table_lock
- or page_table_lock of mm A, you will not try to get either lock
- for mm B.
-
-The caveats are:
-1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
-The update of mmap_cache is racy (page stealer can race with other code
-that invokes find_vma with mmap_sem held), but that is okay, since it
-is a hint. This can be fixed, if desired, by having find_vma grab the
-page_table_lock.
-
-
-Code that add/delete elements from the vmlist chain are
-1. callers of insert_vm_struct
-2. callers of merge_segments
-3. callers of avl_remove
-
-Code that changes vm_start/vm_end/vm_flags:VM_LOCKED of vma's on
-the list:
-1. expand_stack
-2. mprotect
-3. mlock
-4. mremap
-
-It is advisable that changes to vm_start/vm_end be protected, although
-in some cases it is not really needed. Eg, vm_start is modified by
-expand_stack(), it is hard to come up with a destructive scenario without
-having the vmlist protection in this case.
-
-The page_table_lock nests with the inode i_mmap_mutex and the kmem cache
-c_spinlock spinlocks. This is okay, since the kmem code asks for pages after
-dropping c_spinlock. The page_table_lock also nests with pagecache_lock and
-pagemap_lru_lock spinlocks, and no code asks for memory with these locks
-held.
-
-The page_table_lock is grabbed while holding the kernel_lock spinning monitor.
-
-The page_table_lock is a spin lock.
-
-Note: PTL can also be used to guarantee that no new clones using the
-mm start up ... this is a loose form of stability on mm_users. For
-example, it is used in copy_mm to protect against a racing tlb_gather_mmu
-single address space optimization, so that the zap_page_range (from
-truncate) does not lose sending ipi's to cloned threads that might
-be spawned underneath it and go to user mode to drag in pte's into tlbs.
-
-swap_lock
---------------
-The swap devices are chained in priority order from the "swap_list" header.
-The "swap_list" is used for the round-robin swaphandle allocation strategy.
-The #free swaphandles is maintained in "nr_swap_pages". These two together
-are protected by the swap_lock.
-
-The swap_lock also protects all the device reference counts on the
-corresponding swaphandles, maintained in the "swap_map" array, and the
-"highest_bit" and "lowest_bit" fields.
-
-The swap_lock is a spinlock, and is never acquired from intr level.
-
-To prevent races between swap space deletion or async readahead swapins
-deciding whether a swap handle is being used, ie worthy of being read in
-from disk, and an unmap -> swap_free making the handle unused, the swap
-delete and readahead code grabs a temp reference on the swaphandle to
-prevent warning messages from swap_duplicate <- read_swap_cache_async.
-
-Swap cache locking
-------------------
-Pages are added into the swap cache with kernel_lock held, to make sure
-that multiple pages are not being added (and hence lost) by associating
-all of them with the same swaphandle.
-
-Pages are guaranteed not to be removed from the scache if the page is
-"shared": ie, other processes hold reference on the page or the associated
-swap handle. The only code that does not follow this rule is shrink_mmap,
-which deletes pages from the swap cache if no process has a reference on
-the page (multiple processes might have references on the corresponding
-swap handle though). lookup_swap_cache() races with shrink_mmap, when
-establishing a reference on a scache page, so, it must check whether the
-page it located is still in the swapcache, or shrink_mmap deleted it.
-(This race is due to the fact that shrink_mmap looks at the page ref
-count with pagecache_lock, but then drops pagecache_lock before deleting
-the page from the scache).
-
-do_wp_page and do_swap_page have MP races in them while trying to figure
-out whether a page is "shared", by looking at the page_count + swap_count.
-To preserve the sum of the counts, the page lock _must_ be acquired before
-calling is_page_shared (else processes might switch their swap_count refs
-to the page count refs, after the page count ref has been snapshotted).
-
-Swap device deletion code currently breaks all the scache assumptions,
-since it grabs neither mmap_sem nor page_table_lock.
diff --git a/Documentation/vm/map_hugetlb.c b/Documentation/vm/map_hugetlb.c
deleted file mode 100644
index eda1a6d3578..00000000000
--- a/Documentation/vm/map_hugetlb.c
+++ /dev/null
@@ -1,77 +0,0 @@
-/*
- * Example of using hugepage memory in a user application using the mmap
- * system call with MAP_HUGETLB flag. Before running this program make
- * sure the administrator has allocated enough default sized huge pages
- * to cover the 256 MB allocation.
- *
- * For ia64 architecture, Linux kernel reserves Region number 4 for hugepages.
- * That means the addresses starting with 0x800000... will need to be
- * specified. Specifying a fixed address is not required on ppc64, i386
- * or x86_64.
- */
-#include <stdlib.h>
-#include <stdio.h>
-#include <unistd.h>
-#include <sys/mman.h>
-#include <fcntl.h>
-
-#define LENGTH (256UL*1024*1024)
-#define PROTECTION (PROT_READ | PROT_WRITE)
-
-#ifndef MAP_HUGETLB
-#define MAP_HUGETLB 0x40000 /* arch specific */
-#endif
-
-/* Only ia64 requires this */
-#ifdef __ia64__
-#define ADDR (void *)(0x8000000000000000UL)
-#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_FIXED)
-#else
-#define ADDR (void *)(0x0UL)
-#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)
-#endif
-
-static void check_bytes(char *addr)
-{
- printf("First hex is %x\n", *((unsigned int *)addr));
-}
-
-static void write_bytes(char *addr)
-{
- unsigned long i;
-
- for (i = 0; i < LENGTH; i++)
- *(addr + i) = (char)i;
-}
-
-static void read_bytes(char *addr)
-{
- unsigned long i;
-
- check_bytes(addr);
- for (i = 0; i < LENGTH; i++)
- if (*(addr + i) != (char)i) {
- printf("Mismatch at %lu\n", i);
- break;
- }
-}
-
-int main(void)
-{
- void *addr;
-
- addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, 0, 0);
- if (addr == MAP_FAILED) {
- perror("mmap");
- exit(1);
- }
-
- printf("Returned address is %p\n", addr);
- check_bytes(addr);
- write_bytes(addr);
- read_bytes(addr);
-
- munmap(addr, LENGTH);
-
- return 0;
-}
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index 4e7da654342..badb0507608 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -174,7 +174,6 @@ Components of Memory Policies
allocation fails, the kernel will search other nodes, in order of
increasing distance from the preferred node based on information
provided by the platform firmware.
- containing the cpu where the allocation takes place.
Internally, the Preferred policy uses a single node--the
preferred_node member of struct mempolicy. When the internal
@@ -275,9 +274,9 @@ Components of Memory Policies
For example, consider a task that is attached to a cpuset with
mems 2-5 that sets an Interleave policy over the same set with
MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
- interleave now occurs over nodes 3,5-6. If the cpuset's mems
+ interleave now occurs over nodes 3,5-7. If the cpuset's mems
then change to 0,2-3,5, then the interleave occurs over nodes
- 0,3,5.
+ 0,2-3,5.
Thanks to the consistent remapping, applications preparing
nodemasks to specify memory policies using this flag should
diff --git a/Documentation/vm/overcommit-accounting b/Documentation/vm/overcommit-accounting
index 706d7ed9d8d..cbfaaa67411 100644
--- a/Documentation/vm/overcommit-accounting
+++ b/Documentation/vm/overcommit-accounting
@@ -8,19 +8,26 @@ The Linux kernel supports the following overcommit handling modes
default.
1 - Always overcommit. Appropriate for some scientific
- applications.
+ applications. Classic example is code using sparse arrays
+ and just relying on the virtual memory consisting almost
+ entirely of zero pages.
2 - Don't overcommit. The total address space commit
for the system is not permitted to exceed swap + a
- configurable percentage (default is 50) of physical RAM.
- Depending on the percentage you use, in most situations
+ configurable amount (default is 50%) of physical RAM.
+ Depending on the amount you use, in most situations
this means a process will not be killed while accessing
pages but will receive errors on memory allocation as
appropriate.
+ Useful for applications that want to guarantee their
+ memory allocations will be available in the future
+ without having to initialize every page.
+
The overcommit policy is set via the sysctl `vm.overcommit_memory'.
-The overcommit percentage is set via `vm.overcommit_ratio'.
+The overcommit amount can be set via `vm.overcommit_ratio' (percentage)
+or `vm.overcommit_kbytes' (absolute value).
The current overcommit limit and amount committed are viewable in
/proc/meminfo as CommitLimit and Committed_AS respectively.
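The mode-2 ceiling can be worked out by hand from the two sysctls mentioned above. A small sketch with made-up numbers (not read from a live system, and ignoring hugetlb reservations):

#include <stdio.h>

int main(void)
{
        unsigned long long ram_kb  = 8ULL * 1024 * 1024; /* 8 GB RAM,  made up */
        unsigned long long swap_kb = 2ULL * 1024 * 1024; /* 2 GB swap, made up */
        unsigned long long ratio   = 50;  /* vm.overcommit_ratio (percent)     */
        unsigned long long kbytes  = 0;   /* vm.overcommit_kbytes (0 = unused) */

        /* vm.overcommit_kbytes, when non-zero, replaces the percentage */
        unsigned long long commit_limit_kb = kbytes ?
                swap_kb + kbytes :
                swap_kb + ram_kb * ratio / 100;

        printf("mode 2 CommitLimit ~ %llu kB\n", commit_limit_kb);
        return 0;
}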
diff --git a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c
deleted file mode 100644
index 7445caa26d0..00000000000
--- a/Documentation/vm/page-types.c
+++ /dev/null
@@ -1,1100 +0,0 @@
-/*
- * page-types: Tool for querying page flags
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; version 2.
- *
- * This program is distributed in the hope that it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
- * more details.
- *
- * You should find a copy of v2 of the GNU General Public License somewhere on
- * your Linux system; if not, write to the Free Software Foundation, Inc., 59
- * Temple Place, Suite 330, Boston, MA 02111-1307 USA.
- *
- * Copyright (C) 2009 Intel corporation
- *
- * Authors: Wu Fengguang <fengguang.wu@intel.com>
- */
-
-#define _LARGEFILE64_SOURCE
-#include <stdio.h>
-#include <stdlib.h>
-#include <unistd.h>
-#include <stdint.h>
-#include <stdarg.h>
-#include <string.h>
-#include <getopt.h>
-#include <limits.h>
-#include <assert.h>
-#include <sys/types.h>
-#include <sys/errno.h>
-#include <sys/fcntl.h>
-#include <sys/mount.h>
-#include <sys/statfs.h>
-#include "../../include/linux/magic.h"
-
-
-#ifndef MAX_PATH
-# define MAX_PATH 256
-#endif
-
-#ifndef STR
-# define _STR(x) #x
-# define STR(x) _STR(x)
-#endif
-
-/*
- * pagemap kernel ABI bits
- */
-
-#define PM_ENTRY_BYTES sizeof(uint64_t)
-#define PM_STATUS_BITS 3
-#define PM_STATUS_OFFSET (64 - PM_STATUS_BITS)
-#define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
-#define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)
-#define PM_PSHIFT_BITS 6
-#define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
-#define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
-#define PM_PSHIFT(x) (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
-#define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1)
-#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
-
-#define PM_PRESENT PM_STATUS(4LL)
-#define PM_SWAP PM_STATUS(2LL)
-
-
-/*
- * kernel page flags
- */
-
-#define KPF_BYTES 8
-#define PROC_KPAGEFLAGS "/proc/kpageflags"
-
-/* copied from kpageflags_read() */
-#define KPF_LOCKED 0
-#define KPF_ERROR 1
-#define KPF_REFERENCED 2
-#define KPF_UPTODATE 3
-#define KPF_DIRTY 4
-#define KPF_LRU 5
-#define KPF_ACTIVE 6
-#define KPF_SLAB 7
-#define KPF_WRITEBACK 8
-#define KPF_RECLAIM 9
-#define KPF_BUDDY 10
-
-/* [11-20] new additions in 2.6.31 */
-#define KPF_MMAP 11
-#define KPF_ANON 12
-#define KPF_SWAPCACHE 13
-#define KPF_SWAPBACKED 14
-#define KPF_COMPOUND_HEAD 15
-#define KPF_COMPOUND_TAIL 16
-#define KPF_HUGE 17
-#define KPF_UNEVICTABLE 18
-#define KPF_HWPOISON 19
-#define KPF_NOPAGE 20
-#define KPF_KSM 21
-
-/* [32-] kernel hacking assistances */
-#define KPF_RESERVED 32
-#define KPF_MLOCKED 33
-#define KPF_MAPPEDTODISK 34
-#define KPF_PRIVATE 35
-#define KPF_PRIVATE_2 36
-#define KPF_OWNER_PRIVATE 37
-#define KPF_ARCH 38
-#define KPF_UNCACHED 39
-
-/* [48-] take some arbitrary free slots for expanding overloaded flags
- * not part of kernel API
- */
-#define KPF_READAHEAD 48
-#define KPF_SLOB_FREE 49
-#define KPF_SLUB_FROZEN 50
-#define KPF_SLUB_DEBUG 51
-
-#define KPF_ALL_BITS ((uint64_t)~0ULL)
-#define KPF_HACKERS_BITS (0xffffULL << 32)
-#define KPF_OVERLOADED_BITS (0xffffULL << 48)
-#define BIT(name) (1ULL << KPF_##name)
-#define BITS_COMPOUND (BIT(COMPOUND_HEAD) | BIT(COMPOUND_TAIL))
-
-static const char *page_flag_names[] = {
- [KPF_LOCKED] = "L:locked",
- [KPF_ERROR] = "E:error",
- [KPF_REFERENCED] = "R:referenced",
- [KPF_UPTODATE] = "U:uptodate",
- [KPF_DIRTY] = "D:dirty",
- [KPF_LRU] = "l:lru",
- [KPF_ACTIVE] = "A:active",
- [KPF_SLAB] = "S:slab",
- [KPF_WRITEBACK] = "W:writeback",
- [KPF_RECLAIM] = "I:reclaim",
- [KPF_BUDDY] = "B:buddy",
-
- [KPF_MMAP] = "M:mmap",
- [KPF_ANON] = "a:anonymous",
- [KPF_SWAPCACHE] = "s:swapcache",
- [KPF_SWAPBACKED] = "b:swapbacked",
- [KPF_COMPOUND_HEAD] = "H:compound_head",
- [KPF_COMPOUND_TAIL] = "T:compound_tail",
- [KPF_HUGE] = "G:huge",
- [KPF_UNEVICTABLE] = "u:unevictable",
- [KPF_HWPOISON] = "X:hwpoison",
- [KPF_NOPAGE] = "n:nopage",
- [KPF_KSM] = "x:ksm",
-
- [KPF_RESERVED] = "r:reserved",
- [KPF_MLOCKED] = "m:mlocked",
- [KPF_MAPPEDTODISK] = "d:mappedtodisk",
- [KPF_PRIVATE] = "P:private",
- [KPF_PRIVATE_2] = "p:private_2",
- [KPF_OWNER_PRIVATE] = "O:owner_private",
- [KPF_ARCH] = "h:arch",
- [KPF_UNCACHED] = "c:uncached",
-
- [KPF_READAHEAD] = "I:readahead",
- [KPF_SLOB_FREE] = "P:slob_free",
- [KPF_SLUB_FROZEN] = "A:slub_frozen",
- [KPF_SLUB_DEBUG] = "E:slub_debug",
-};
-
-
-static const char *debugfs_known_mountpoints[] = {
- "/sys/kernel/debug",
- "/debug",
- 0,
-};
-
-/*
- * data structures
- */
-
-static int opt_raw; /* for kernel developers */
-static int opt_list; /* list pages (in ranges) */
-static int opt_no_summary; /* don't show summary */
-static pid_t opt_pid; /* process to walk */
-
-#define MAX_ADDR_RANGES 1024
-static int nr_addr_ranges;
-static unsigned long opt_offset[MAX_ADDR_RANGES];
-static unsigned long opt_size[MAX_ADDR_RANGES];
-
-#define MAX_VMAS 10240
-static int nr_vmas;
-static unsigned long pg_start[MAX_VMAS];
-static unsigned long pg_end[MAX_VMAS];
-
-#define MAX_BIT_FILTERS 64
-static int nr_bit_filters;
-static uint64_t opt_mask[MAX_BIT_FILTERS];
-static uint64_t opt_bits[MAX_BIT_FILTERS];
-
-static int page_size;
-
-static int pagemap_fd;
-static int kpageflags_fd;
-
-static int opt_hwpoison;
-static int opt_unpoison;
-
-static char hwpoison_debug_fs[MAX_PATH+1];
-static int hwpoison_inject_fd;
-static int hwpoison_forget_fd;
-
-#define HASH_SHIFT 13
-#define HASH_SIZE (1 << HASH_SHIFT)
-#define HASH_MASK (HASH_SIZE - 1)
-#define HASH_KEY(flags) (flags & HASH_MASK)
-
-static unsigned long total_pages;
-static unsigned long nr_pages[HASH_SIZE];
-static uint64_t page_flags[HASH_SIZE];
-
-
-/*
- * helper functions
- */
-
-#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
-
-#define min_t(type, x, y) ({ \
- type __min1 = (x); \
- type __min2 = (y); \
- __min1 < __min2 ? __min1 : __min2; })
-
-#define max_t(type, x, y) ({ \
- type __max1 = (x); \
- type __max2 = (y); \
- __max1 > __max2 ? __max1 : __max2; })
-
-static unsigned long pages2mb(unsigned long pages)
-{
- return (pages * page_size) >> 20;
-}
-
-static void fatal(const char *x, ...)
-{
- va_list ap;
-
- va_start(ap, x);
- vfprintf(stderr, x, ap);
- va_end(ap);
- exit(EXIT_FAILURE);
-}
-
-static int checked_open(const char *pathname, int flags)
-{
- int fd = open(pathname, flags);
-
- if (fd < 0) {
- perror(pathname);
- exit(EXIT_FAILURE);
- }
-
- return fd;
-}
-
-/*
- * pagemap/kpageflags routines
- */
-
-static unsigned long do_u64_read(int fd, char *name,
- uint64_t *buf,
- unsigned long index,
- unsigned long count)
-{
- long bytes;
-
- if (index > ULONG_MAX / 8)
- fatal("index overflow: %lu\n", index);
-
- if (lseek(fd, index * 8, SEEK_SET) < 0) {
- perror(name);
- exit(EXIT_FAILURE);
- }
-
- bytes = read(fd, buf, count * 8);
- if (bytes < 0) {
- perror(name);
- exit(EXIT_FAILURE);
- }
- if (bytes % 8)
- fatal("partial read: %lu bytes\n", bytes);
-
- return bytes / 8;
-}
-
-static unsigned long kpageflags_read(uint64_t *buf,
- unsigned long index,
- unsigned long pages)
-{
- return do_u64_read(kpageflags_fd, PROC_KPAGEFLAGS, buf, index, pages);
-}
-
-static unsigned long pagemap_read(uint64_t *buf,
- unsigned long index,
- unsigned long pages)
-{
- return do_u64_read(pagemap_fd, "/proc/pid/pagemap", buf, index, pages);
-}
-
-static unsigned long pagemap_pfn(uint64_t val)
-{
- unsigned long pfn;
-
- if (val & PM_PRESENT)
- pfn = PM_PFRAME(val);
- else
- pfn = 0;
-
- return pfn;
-}
-
-
-/*
- * page flag names
- */
-
-static char *page_flag_name(uint64_t flags)
-{
- static char buf[65];
- int present;
- int i, j;
-
- for (i = 0, j = 0; i < ARRAY_SIZE(page_flag_names); i++) {
- present = (flags >> i) & 1;
- if (!page_flag_names[i]) {
- if (present)
- fatal("unknown flag bit %d\n", i);
- continue;
- }
- buf[j++] = present ? page_flag_names[i][0] : '_';
- }
-
- return buf;
-}
-
-static char *page_flag_longname(uint64_t flags)
-{
- static char buf[1024];
- int i, n;
-
- for (i = 0, n = 0; i < ARRAY_SIZE(page_flag_names); i++) {
- if (!page_flag_names[i])
- continue;
- if ((flags >> i) & 1)
- n += snprintf(buf + n, sizeof(buf) - n, "%s,",
- page_flag_names[i] + 2);
- }
- if (n)
- n--;
- buf[n] = '\0';
-
- return buf;
-}
-
-
-/*
- * page list and summary
- */
-
-static void show_page_range(unsigned long voffset,
- unsigned long offset, uint64_t flags)
-{
- static uint64_t flags0;
- static unsigned long voff;
- static unsigned long index;
- static unsigned long count;
-
- if (flags == flags0 && offset == index + count &&
- (!opt_pid || voffset == voff + count)) {
- count++;
- return;
- }
-
- if (count) {
- if (opt_pid)
- printf("%lx\t", voff);
- printf("%lx\t%lx\t%s\n",
- index, count, page_flag_name(flags0));
- }
-
- flags0 = flags;
- index = offset;
- voff = voffset;
- count = 1;
-}
-
-static void show_page(unsigned long voffset,
- unsigned long offset, uint64_t flags)
-{
- if (opt_pid)
- printf("%lx\t", voffset);
- printf("%lx\t%s\n", offset, page_flag_name(flags));
-}
-
-static void show_summary(void)
-{
- int i;
-
- printf(" flags\tpage-count MB"
- " symbolic-flags\t\t\tlong-symbolic-flags\n");
-
- for (i = 0; i < ARRAY_SIZE(nr_pages); i++) {
- if (nr_pages[i])
- printf("0x%016llx\t%10lu %8lu %s\t%s\n",
- (unsigned long long)page_flags[i],
- nr_pages[i],
- pages2mb(nr_pages[i]),
- page_flag_name(page_flags[i]),
- page_flag_longname(page_flags[i]));
- }
-
- printf(" total\t%10lu %8lu\n",
- total_pages, pages2mb(total_pages));
-}
-
-
-/*
- * page flag filters
- */
-
-static int bit_mask_ok(uint64_t flags)
-{
- int i;
-
- for (i = 0; i < nr_bit_filters; i++) {
- if (opt_bits[i] == KPF_ALL_BITS) {
- if ((flags & opt_mask[i]) == 0)
- return 0;
- } else {
- if ((flags & opt_mask[i]) != opt_bits[i])
- return 0;
- }
- }
-
- return 1;
-}
-
-static uint64_t expand_overloaded_flags(uint64_t flags)
-{
- /* SLOB/SLUB overload several page flags */
- if (flags & BIT(SLAB)) {
- if (flags & BIT(PRIVATE))
- flags ^= BIT(PRIVATE) | BIT(SLOB_FREE);
- if (flags & BIT(ACTIVE))
- flags ^= BIT(ACTIVE) | BIT(SLUB_FROZEN);
- if (flags & BIT(ERROR))
- flags ^= BIT(ERROR) | BIT(SLUB_DEBUG);
- }
-
- /* PG_reclaim is overloaded as PG_readahead in the read path */
- if ((flags & (BIT(RECLAIM) | BIT(WRITEBACK))) == BIT(RECLAIM))
- flags ^= BIT(RECLAIM) | BIT(READAHEAD);
-
- return flags;
-}
-
-static uint64_t well_known_flags(uint64_t flags)
-{
- /* hide flags intended only for kernel hacker */
- flags &= ~KPF_HACKERS_BITS;
-
- /* hide non-hugeTLB compound pages */
- if ((flags & BITS_COMPOUND) && !(flags & BIT(HUGE)))
- flags &= ~BITS_COMPOUND;
-
- return flags;
-}
-
-static uint64_t kpageflags_flags(uint64_t flags)
-{
- flags = expand_overloaded_flags(flags);
-
- if (!opt_raw)
- flags = well_known_flags(flags);
-
- return flags;
-}
-
-/* verify that a mountpoint is actually a debugfs instance */
-static int debugfs_valid_mountpoint(const char *debugfs)
-{
- struct statfs st_fs;
-
- if (statfs(debugfs, &st_fs) < 0)
- return -ENOENT;
- else if (st_fs.f_type != (long) DEBUGFS_MAGIC)
- return -ENOENT;
-
- return 0;
-}
-
-/* find the path to the mounted debugfs */
-static const char *debugfs_find_mountpoint(void)
-{
- const char **ptr;
- char type[100];
- FILE *fp;
-
- ptr = debugfs_known_mountpoints;
- while (*ptr) {
- if (debugfs_valid_mountpoint(*ptr) == 0) {
- strcpy(hwpoison_debug_fs, *ptr);
- return hwpoison_debug_fs;
- }
- ptr++;
- }
-
- /* give up and parse /proc/mounts */
- fp = fopen("/proc/mounts", "r");
- if (fp == NULL)
- perror("Can't open /proc/mounts for read");
-
- while (fscanf(fp, "%*s %"
- STR(MAX_PATH)
- "s %99s %*s %*d %*d\n",
- hwpoison_debug_fs, type) == 2) {
- if (strcmp(type, "debugfs") == 0)
- break;
- }
- fclose(fp);
-
- if (strcmp(type, "debugfs") != 0)
- return NULL;
-
- return hwpoison_debug_fs;
-}
-
-/* mount the debugfs somewhere if it's not mounted */
-
-static void debugfs_mount(void)
-{
- const char **ptr;
-
- /* see if it's already mounted */
- if (debugfs_find_mountpoint())
- return;
-
- ptr = debugfs_known_mountpoints;
- while (*ptr) {
- if (mount(NULL, *ptr, "debugfs", 0, NULL) == 0) {
- /* save the mountpoint */
- strcpy(hwpoison_debug_fs, *ptr);
- break;
- }
- ptr++;
- }
-
- if (*ptr == NULL) {
- perror("mount debugfs");
- exit(EXIT_FAILURE);
- }
-}
-
-/*
- * page actions
- */
-
-static void prepare_hwpoison_fd(void)
-{
- char buf[MAX_PATH + 1];
-
- debugfs_mount();
-
- if (opt_hwpoison && !hwpoison_inject_fd) {
- snprintf(buf, MAX_PATH, "%s/hwpoison/corrupt-pfn",
- hwpoison_debug_fs);
- hwpoison_inject_fd = checked_open(buf, O_WRONLY);
- }
-
- if (opt_unpoison && !hwpoison_forget_fd) {
- snprintf(buf, MAX_PATH, "%s/hwpoison/unpoison-pfn",
- hwpoison_debug_fs);
- hwpoison_forget_fd = checked_open(buf, O_WRONLY);
- }
-}
-
-static int hwpoison_page(unsigned long offset)
-{
- char buf[100];
- int len;
-
- len = sprintf(buf, "0x%lx\n", offset);
- len = write(hwpoison_inject_fd, buf, len);
- if (len < 0) {
- perror("hwpoison inject");
- return len;
- }
- return 0;
-}
-
-static int unpoison_page(unsigned long offset)
-{
- char buf[100];
- int len;
-
- len = sprintf(buf, "0x%lx\n", offset);
- len = write(hwpoison_forget_fd, buf, len);
- if (len < 0) {
- perror("hwpoison forget");
- return len;
- }
- return 0;
-}
-
-/*
- * page frame walker
- */
-
-static int hash_slot(uint64_t flags)
-{
- int k = HASH_KEY(flags);
- int i;
-
- /* Explicitly reserve slot 0 for flags 0: the following logic
- * cannot distinguish an unoccupied slot from slot (flags==0).
- */
- if (flags == 0)
- return 0;
-
- /* search through the remaining (HASH_SIZE-1) slots */
- for (i = 1; i < ARRAY_SIZE(page_flags); i++, k++) {
- if (!k || k >= ARRAY_SIZE(page_flags))
- k = 1;
- if (page_flags[k] == 0) {
- page_flags[k] = flags;
- return k;
- }
- if (page_flags[k] == flags)
- return k;
- }
-
- fatal("hash table full: bump up HASH_SHIFT?\n");
- exit(EXIT_FAILURE);
-}
-
-static void add_page(unsigned long voffset,
- unsigned long offset, uint64_t flags)
-{
- flags = kpageflags_flags(flags);
-
- if (!bit_mask_ok(flags))
- return;
-
- if (opt_hwpoison)
- hwpoison_page(offset);
- if (opt_unpoison)
- unpoison_page(offset);
-
- if (opt_list == 1)
- show_page_range(voffset, offset, flags);
- else if (opt_list == 2)
- show_page(voffset, offset, flags);
-
- nr_pages[hash_slot(flags)]++;
- total_pages++;
-}
-
-#define KPAGEFLAGS_BATCH (64 << 10) /* 64k pages */
-static void walk_pfn(unsigned long voffset,
- unsigned long index,
- unsigned long count)
-{
- uint64_t buf[KPAGEFLAGS_BATCH];
- unsigned long batch;
- long pages;
- unsigned long i;
-
- while (count) {
- batch = min_t(unsigned long, count, KPAGEFLAGS_BATCH);
- pages = kpageflags_read(buf, index, batch);
- if (pages == 0)
- break;
-
- for (i = 0; i < pages; i++)
- add_page(voffset + i, index + i, buf[i]);
-
- index += pages;
- count -= pages;
- }
-}
-
-#define PAGEMAP_BATCH (64 << 10)
-static void walk_vma(unsigned long index, unsigned long count)
-{
- uint64_t buf[PAGEMAP_BATCH];
- unsigned long batch;
- unsigned long pages;
- unsigned long pfn;
- unsigned long i;
-
- while (count) {
- batch = min_t(unsigned long, count, PAGEMAP_BATCH);
- pages = pagemap_read(buf, index, batch);
- if (pages == 0)
- break;
-
- for (i = 0; i < pages; i++) {
- pfn = pagemap_pfn(buf[i]);
- if (pfn)
- walk_pfn(index + i, pfn, 1);
- }
-
- index += pages;
- count -= pages;
- }
-}
-
-static void walk_task(unsigned long index, unsigned long count)
-{
- const unsigned long end = index + count;
- unsigned long start;
- int i = 0;
-
- while (index < end) {
-
- while (pg_end[i] <= index)
- if (++i >= nr_vmas)
- return;
- if (pg_start[i] >= end)
- return;
-
- start = max_t(unsigned long, pg_start[i], index);
- index = min_t(unsigned long, pg_end[i], end);
-
- assert(start < index);
- walk_vma(start, index - start);
- }
-}
-
-static void add_addr_range(unsigned long offset, unsigned long size)
-{
- if (nr_addr_ranges >= MAX_ADDR_RANGES)
- fatal("too many addr ranges\n");
-
- opt_offset[nr_addr_ranges] = offset;
- opt_size[nr_addr_ranges] = min_t(unsigned long, size, ULONG_MAX-offset);
- nr_addr_ranges++;
-}
-
-static void walk_addr_ranges(void)
-{
- int i;
-
- kpageflags_fd = checked_open(PROC_KPAGEFLAGS, O_RDONLY);
-
- if (!nr_addr_ranges)
- add_addr_range(0, ULONG_MAX);
-
- for (i = 0; i < nr_addr_ranges; i++)
- if (!opt_pid)
- walk_pfn(0, opt_offset[i], opt_size[i]);
- else
- walk_task(opt_offset[i], opt_size[i]);
-
- close(kpageflags_fd);
-}
-
-
-/*
- * user interface
- */
-
-static const char *page_flag_type(uint64_t flag)
-{
- if (flag & KPF_HACKERS_BITS)
- return "(r)";
- if (flag & KPF_OVERLOADED_BITS)
- return "(o)";
- return " ";
-}
-
-static void usage(void)
-{
- int i, j;
-
- printf(
-"page-types [options]\n"
-" -r|--raw Raw mode, for kernel developers\n"
-" -d|--describe flags Describe flags\n"
-" -a|--addr addr-spec Walk a range of pages\n"
-" -b|--bits bits-spec Walk pages with specified bits\n"
-" -p|--pid pid Walk process address space\n"
-#if 0 /* planned features */
-" -f|--file filename Walk file address space\n"
-#endif
-" -l|--list Show page details in ranges\n"
-" -L|--list-each Show page details one by one\n"
-" -N|--no-summary Don't show summary info\n"
-" -X|--hwpoison hwpoison pages\n"
-" -x|--unpoison unpoison pages\n"
-" -h|--help Show this usage message\n"
-"flags:\n"
-" 0x10 bitfield format, e.g.\n"
-" anon bit-name, e.g.\n"
-" 0x10,anon comma-separated list, e.g.\n"
-"addr-spec:\n"
-" N one page at offset N (unit: pages)\n"
-" N+M pages range from N to N+M-1\n"
-" N,M pages range from N to M-1\n"
-" N, pages range from N to end\n"
-" ,M pages range from 0 to M-1\n"
-"bits-spec:\n"
-" bit1,bit2 (flags & (bit1|bit2)) != 0\n"
-" bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1\n"
-" bit1,~bit2 (flags & (bit1|bit2)) == bit1\n"
-" =bit1,bit2 flags == (bit1|bit2)\n"
-"bit-names:\n"
- );
-
- for (i = 0, j = 0; i < ARRAY_SIZE(page_flag_names); i++) {
- if (!page_flag_names[i])
- continue;
- printf("%16s%s", page_flag_names[i] + 2,
- page_flag_type(1ULL << i));
- if (++j > 3) {
- j = 0;
- putchar('\n');
- }
- }
- printf("\n "
- "(r) raw mode bits (o) overloaded bits\n");
-}
-
-static unsigned long long parse_number(const char *str)
-{
- unsigned long long n;
-
- n = strtoll(str, NULL, 0);
-
- if (n == 0 && str[0] != '0')
- fatal("invalid name or number: %s\n", str);
-
- return n;
-}
-
-static void parse_pid(const char *str)
-{
- FILE *file;
- char buf[5000];
-
- opt_pid = parse_number(str);
-
- sprintf(buf, "/proc/%d/pagemap", opt_pid);
- pagemap_fd = checked_open(buf, O_RDONLY);
-
- sprintf(buf, "/proc/%d/maps", opt_pid);
- file = fopen(buf, "r");
- if (!file) {
- perror(buf);
- exit(EXIT_FAILURE);
- }
-
- while (fgets(buf, sizeof(buf), file) != NULL) {
- unsigned long vm_start;
- unsigned long vm_end;
- unsigned long long pgoff;
- int major, minor;
- char r, w, x, s;
- unsigned long ino;
- int n;
-
- n = sscanf(buf, "%lx-%lx %c%c%c%c %llx %x:%x %lu",
- &vm_start,
- &vm_end,
- &r, &w, &x, &s,
- &pgoff,
- &major, &minor,
- &ino);
- if (n < 10) {
- fprintf(stderr, "unexpected line: %s\n", buf);
- continue;
- }
- pg_start[nr_vmas] = vm_start / page_size;
- pg_end[nr_vmas] = vm_end / page_size;
- if (++nr_vmas >= MAX_VMAS) {
- fprintf(stderr, "too many VMAs\n");
- break;
- }
- }
- fclose(file);
-}
-
-static void parse_file(const char *name)
-{
-}
-
-static void parse_addr_range(const char *optarg)
-{
- unsigned long offset;
- unsigned long size;
- char *p;
-
- p = strchr(optarg, ',');
- if (!p)
- p = strchr(optarg, '+');
-
- if (p == optarg) {
- offset = 0;
- size = parse_number(p + 1);
- } else if (p) {
- offset = parse_number(optarg);
- if (p[1] == '\0')
- size = ULONG_MAX;
- else {
- size = parse_number(p + 1);
- if (*p == ',') {
- if (size < offset)
- fatal("invalid range: %lu,%lu\n",
- offset, size);
- size -= offset;
- }
- }
- } else {
- offset = parse_number(optarg);
- size = 1;
- }
-
- add_addr_range(offset, size);
-}
-
-static void add_bits_filter(uint64_t mask, uint64_t bits)
-{
- if (nr_bit_filters >= MAX_BIT_FILTERS)
- fatal("too much bit filters\n");
-
- opt_mask[nr_bit_filters] = mask;
- opt_bits[nr_bit_filters] = bits;
- nr_bit_filters++;
-}
-
-static uint64_t parse_flag_name(const char *str, int len)
-{
- int i;
-
- if (!*str || !len)
- return 0;
-
- if (len <= 8 && !strncmp(str, "compound", len))
- return BITS_COMPOUND;
-
- for (i = 0; i < ARRAY_SIZE(page_flag_names); i++) {
- if (!page_flag_names[i])
- continue;
- if (!strncmp(str, page_flag_names[i] + 2, len))
- return 1ULL << i;
- }
-
- return parse_number(str);
-}
-
-static uint64_t parse_flag_names(const char *str, int all)
-{
- const char *p = str;
- uint64_t flags = 0;
-
- while (1) {
- if (*p == ',' || *p == '=' || *p == '\0') {
- if ((*str != '~') || (*str == '~' && all && *++str))
- flags |= parse_flag_name(str, p - str);
- if (*p != ',')
- break;
- str = p + 1;
- }
- p++;
- }
-
- return flags;
-}
-
-static void parse_bits_mask(const char *optarg)
-{
- uint64_t mask;
- uint64_t bits;
- const char *p;
-
- p = strchr(optarg, '=');
- if (p == optarg) {
- mask = KPF_ALL_BITS;
- bits = parse_flag_names(p + 1, 0);
- } else if (p) {
- mask = parse_flag_names(optarg, 0);
- bits = parse_flag_names(p + 1, 0);
- } else if (strchr(optarg, '~')) {
- mask = parse_flag_names(optarg, 1);
- bits = parse_flag_names(optarg, 0);
- } else {
- mask = parse_flag_names(optarg, 0);
- bits = KPF_ALL_BITS;
- }
-
- add_bits_filter(mask, bits);
-}
-
-static void describe_flags(const char *optarg)
-{
- uint64_t flags = parse_flag_names(optarg, 0);
-
- printf("0x%016llx\t%s\t%s\n",
- (unsigned long long)flags,
- page_flag_name(flags),
- page_flag_longname(flags));
-}
-
-static const struct option opts[] = {
- { "raw" , 0, NULL, 'r' },
- { "pid" , 1, NULL, 'p' },
- { "file" , 1, NULL, 'f' },
- { "addr" , 1, NULL, 'a' },
- { "bits" , 1, NULL, 'b' },
- { "describe" , 1, NULL, 'd' },
- { "list" , 0, NULL, 'l' },
- { "list-each" , 0, NULL, 'L' },
- { "no-summary", 0, NULL, 'N' },
- { "hwpoison" , 0, NULL, 'X' },
- { "unpoison" , 0, NULL, 'x' },
- { "help" , 0, NULL, 'h' },
- { NULL , 0, NULL, 0 }
-};
-
-int main(int argc, char *argv[])
-{
- int c;
-
- page_size = getpagesize();
-
- while ((c = getopt_long(argc, argv,
- "rp:f:a:b:d:lLNXxh", opts, NULL)) != -1) {
- switch (c) {
- case 'r':
- opt_raw = 1;
- break;
- case 'p':
- parse_pid(optarg);
- break;
- case 'f':
- parse_file(optarg);
- break;
- case 'a':
- parse_addr_range(optarg);
- break;
- case 'b':
- parse_bits_mask(optarg);
- break;
- case 'd':
- describe_flags(optarg);
- exit(0);
- case 'l':
- opt_list = 1;
- break;
- case 'L':
- opt_list = 2;
- break;
- case 'N':
- opt_no_summary = 1;
- break;
- case 'X':
- opt_hwpoison = 1;
- prepare_hwpoison_fd();
- break;
- case 'x':
- opt_unpoison = 1;
- prepare_hwpoison_fd();
- break;
- case 'h':
- usage();
- exit(0);
- default:
- usage();
- exit(1);
- }
- }
-
- if (opt_list && opt_pid)
- printf("voffset\t");
- if (opt_list == 1)
- printf("offset\tlen\tflags\n");
- if (opt_list == 2)
- printf("offset\tflags\n");
-
- walk_addr_ranges();
-
- if (opt_list == 1)
- show_page_range(0, 0, 0); /* drain the buffer */
-
- if (opt_no_summary)
- return 0;
-
- if (opt_list)
- printf("\n\n");
-
- show_summary();
-
- return 0;
-}
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index df09b9650a8..5948e455c4d 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -15,8 +15,9 @@ There are three components to pagemap:
* Bits 0-54 page frame number (PFN) if present
* Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped
- * Bits 55-60 page shift (page size = 1<<page shift)
- * Bit 61 reserved for future use
+ * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
+ * Bits 56-60 zero
+ * Bit 61 page is file-page or shared-anon
* Bit 62 page swapped
* Bit 63 page present
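+
+As an illustration only (this sketch is not a tool shipped with the kernel),
+the entry for a single virtual address of the current process can be fetched
+and decoded like this:
+
+	#include <fcntl.h>
+	#include <stdint.h>
+	#include <stdio.h>
+	#include <stdlib.h>
+	#include <unistd.h>
+
+	int main(int argc, char **argv)
+	{
+		unsigned long vaddr;
+		long psize = sysconf(_SC_PAGESIZE);
+		uint64_t ent;
+		int fd;
+
+		if (argc < 2)
+			return 1;
+		vaddr = strtoul(argv[1], NULL, 0);
+		fd = open("/proc/self/pagemap", O_RDONLY);
+		/* entries are 64 bit, so seek to (vaddr / page size) * 8 */
+		if (fd < 0 || pread(fd, &ent, 8, (vaddr / psize) * 8) != 8)
+			return 1;
+		printf("present %d swapped %d file/shared-anon %d soft-dirty %d\n",
+		       (int)(ent >> 63 & 1), (int)(ent >> 62 & 1),
+		       (int)(ent >> 61 & 1), (int)(ent >> 55 & 1));
+		if (ent >> 63 & 1)	/* bits 0-54 hold the PFN if present */
+			printf("pfn 0x%llx\n",
+			       (unsigned long long)(ent & ((1ULL << 55) - 1)));
+		close(fd);
+		return 0;
+	}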
@@ -60,6 +61,7 @@ There are three components to pagemap:
19. HWPOISON
20. NOPAGE
21. KSM
+ 22. THP
Short descriptions to the page flags:
@@ -97,6 +99,9 @@ Short descriptions to the page flags:
21. KSM
identical memory pages dynamically shared between one or more processes
+22. THP
+ contiguous pages which construct transparent hugepages
+
[IO related page flags]
1. ERROR IO error occurred
3. UPTODATE page has up-to-date data
@@ -143,5 +148,5 @@ once.
Other notes:
Reading from any of the files will return -EINVAL if you are not starting
-the read on an 8-byte boundary (e.g., if you seeked an odd number of bytes
+the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
into the file), or if the size of the read is not a multiple of 8 bytes.
diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.txt
new file mode 100644
index 00000000000..560e4363a55
--- /dev/null
+++ b/Documentation/vm/remap_file_pages.txt
@@ -0,0 +1,28 @@
+The remap_file_pages() system call is used to create a nonlinear mapping,
+that is, a mapping in which the pages of the file are mapped into a
+nonsequential order in memory. The advantage of using remap_file_pages()
+over using repeated calls to mmap(2) is that the former approach does not
+require the kernel to create additional VMA (Virtual Memory Area) data
+structures.
+
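+As an illustration only (not part of the kernel tree; the temporary file
+path below is made up for the example), a nonlinear mapping can be set up
+roughly like this:
+
+	#define _GNU_SOURCE
+	#include <fcntl.h>
+	#include <sys/mman.h>
+	#include <unistd.h>
+
+	int main(void)
+	{
+		long psize = sysconf(_SC_PAGESIZE);
+		int fd = open("/tmp/nonlinear-demo", O_RDWR | O_CREAT, 0600);
+		char *p;
+
+		if (fd < 0 || ftruncate(fd, 4 * psize))
+			return 1;
+		p = mmap(NULL, 4 * psize, PROT_READ | PROT_WRITE,
+			 MAP_SHARED, fd, 0);
+		if (p == MAP_FAILED)
+			return 1;
+		/*
+		 * Put file page 3 at the start of the mapping: the pages of
+		 * the file are now mapped in nonsequential order, without a
+		 * second mmap() and without a second VMA.
+		 */
+		if (remap_file_pages(p, psize, 0, 3, 0))
+			return 1;
+		munmap(p, 4 * psize);
+		close(fd);
+		return 0;
+	}
+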
+Supporting nonlinear mappings requires a significant amount of non-trivial
+code in the kernel virtual memory subsystem, including hot paths. Also, to
+make nonlinear mappings work, the kernel needs a way to distinguish normal
+page table entries from entries with a file offset (pte_file). The kernel
+reserves a flag in the PTE for this purpose. PTE flags are a scarce
+resource, especially on some CPU architectures. It would be nice to free up
+the flag for other uses.
+
+Fortunately, there are not many users of remap_file_pages() in the wild.
+The only known user is one enterprise RDBMS implementation, which uses the
+syscall on 32-bit systems to map files bigger than can linearly fit into the
+32-bit virtual address space. This use-case is no longer critical since 64-bit
+systems are widely available.
+
+The plan is to deprecate the syscall and replace it with an emulation. The
+emulation will create new VMAs instead of nonlinear mappings. It will be
+slower for the rare users of remap_file_pages(), but the ABI is preserved.
+
+One side effect of the emulation (apart from performance) is that a user can
+hit the vm.max_map_count limit more easily due to the additional VMAs. See
+the comment for DEFAULT_MAX_MAP_COUNT for more details on the limit.
diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt
index 6752870c497..b0c6d1bbb43 100644
--- a/Documentation/vm/slub.txt
+++ b/Documentation/vm/slub.txt
@@ -17,7 +17,7 @@ data and perform operation on the slabs. By default slabinfo only lists
slabs that have data in them. See "slabinfo -h" for more options when
running the command. slabinfo can be compiled with
-gcc -o slabinfo tools/slub/slabinfo.c
+gcc -o slabinfo tools/vm/slabinfo.c
Some of the modes of operation of slabinfo require that slub debugging
be enabled on the command line. F.e. no tracking information will be
diff --git a/Documentation/vm/soft-dirty.txt b/Documentation/vm/soft-dirty.txt
new file mode 100644
index 00000000000..55684d11a1e
--- /dev/null
+++ b/Documentation/vm/soft-dirty.txt
@@ -0,0 +1,43 @@
+ SOFT-DIRTY PTEs
+
+ The soft-dirty bit is a PTE bit which helps to track which pages a task
+writes to. In order to do this tracking, one should
+
+ 1. Clear soft-dirty bits from the task's PTEs.
+
+ This is done by writing "4" into the /proc/PID/clear_refs file of the
+ task in question.
+
+ 2. Wait some time.
+
+ 3. Read soft-dirty bits from the PTEs.
+
+ This is done by reading from the /proc/PID/pagemap. The bit 55 of the
+ 64-bit qword is the soft-dirty one. If set, the respective PTE was
+ written to since step 1.
+
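+ The sketch below (an illustration only, not a tool shipped with the kernel)
+runs the three steps above against the current process for a single page:
+
+	#include <fcntl.h>
+	#include <stdint.h>
+	#include <stdio.h>
+	#include <unistd.h>
+
+	static int soft_dirty(int pm, unsigned long vaddr, long psize)
+	{
+		uint64_t ent = 0;
+
+		pread(pm, &ent, 8, (vaddr / psize) * 8);
+		return (int)(ent >> 55 & 1);	/* bit 55: soft-dirty */
+	}
+
+	int main(void)
+	{
+		long psize = sysconf(_SC_PAGESIZE);
+		static char page[1 << 16];	/* something to write to */
+		unsigned long vaddr = (unsigned long)page & ~(psize - 1);
+		int cr = open("/proc/self/clear_refs", O_WRONLY);
+		int pm = open("/proc/self/pagemap", O_RDONLY);
+
+		if (cr < 0 || pm < 0)
+			return 1;
+		write(cr, "4", 1);		/* step 1: clear soft-dirty */
+		printf("before write: %d\n", soft_dirty(pm, vaddr, psize));
+		page[0] = 1;			/* step 2: modify the page */
+		printf("after write:  %d\n", soft_dirty(pm, vaddr, psize));
+		return 0;
+	}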
+
+ Internally, to do this tracking, the writable bit is cleared from PTEs
+when the soft-dirty bit is cleared. So, after this, when the task tries to
+modify a page at some virtual address, a #PF occurs and the kernel sets
+the soft-dirty bit on the respective PTE.
+
+ Note that although all of the task's address space is marked read-only after
+the soft-dirty bits are cleared, the #PFs that occur after that are processed
+quickly. This is because the pages are still mapped to physical memory, so all
+the kernel has to do is detect this and set both the writable and soft-dirty
+bits on the PTE.
+
+ While in most cases tracking memory changes by #PFs is more than enough,
+there is still a scenario in which soft-dirty bits can be lost -- a task
+unmaps a previously mapped memory region and then maps a new one at exactly
+the same place. When unmap is called, the kernel internally clears the PTE
+values, including the soft-dirty bits. To notify a user space application
+about such memory region renewal, the kernel always marks new memory regions
+(and expanded regions) as soft-dirty.
+
+ This feature is actively used by the checkpoint-restore project. You
+can find more details about it on http://criu.org
+
+
+-- Pavel Emelyanov, Apr 9, 2013
diff --git a/Documentation/vm/split_page_table_lock b/Documentation/vm/split_page_table_lock
new file mode 100644
index 00000000000..6dea4fd5c96
--- /dev/null
+++ b/Documentation/vm/split_page_table_lock
@@ -0,0 +1,94 @@
+Split page table lock
+=====================
+
+Originally, mm->page_table_lock spinlock protected all page tables of the
+mm_struct. But this approach leads to poor page fault scalability of
+multi-threaded applications due to high contention on the lock. To improve
+scalability, split page table lock was introduced.
+
+With split page table lock we have separate per-table lock to serialize
+access to the table. At the moment we use split lock for PTE and PMD
+tables. Access to higher level tables is protected by mm->page_table_lock.
+
+There are helpers to lock/unlock a table and other accessor functions
+(a usage sketch follows this list):
+ - pte_offset_map_lock()
+ maps pte and takes PTE table lock, returns the mapped pte; the taken
+ lock is passed back through a pointer argument;
+ - pte_unmap_unlock()
+ unlocks and unmaps PTE table;
+ - pte_alloc_map_lock()
+ allocates PTE table if needed and takes the lock, returns the pte, or
+ NULL if allocation failed; the lock is passed back through a pointer
+ argument;
+ - pte_lockptr()
+ returns pointer to PTE table lock;
+ - pmd_lock()
+ takes PMD table lock, returns pointer to taken lock;
+ - pmd_lockptr()
+ returns pointer to PMD table lock;
+
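+A minimal usage sketch of the PTE helpers (pte_is_present() below is a
+made-up example function, not an in-tree one); the pte is returned and the
+lock is handed back through a pointer argument:
+
+	static int pte_is_present(struct mm_struct *mm, pmd_t *pmd,
+				  unsigned long addr)
+	{
+		spinlock_t *ptl;
+		pte_t *pte;
+		int ret;
+
+		/* maps the pte and takes the PTE table lock */
+		pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+		ret = pte_present(*pte);
+		/* unlocks and unmaps the PTE table */
+		pte_unmap_unlock(pte, ptl);
+		return ret;
+	}
+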
+Split page table lock for PTE tables is enabled compile-time if
+CONFIG_SPLIT_PTLOCK_CPUS (usually 4) is less than or equal to NR_CPUS.
+If split lock is disabled, all tables are guarded by mm->page_table_lock.
+
+Split page table lock for PMD tables is enabled if it's enabled for PTE
+tables and the architecture supports it (see below).
+
+Hugetlb and split page table lock
+---------------------------------
+
+Hugetlb can support several page sizes. We use split lock only at the PMD
+level, but not at the PUD level.
+
+Hugetlb-specific helpers:
+ - huge_pte_lock()
+ takes pmd split lock for PMD_SIZE page, mm->page_table_lock
+ otherwise;
+ - huge_pte_lockptr()
+ returns pointer to table lock;
+
+Support of split page table lock by an architecture
+---------------------------------------------------
+
+No special enabling is needed for the PTE split page table lock:
+everything required is done by pgtable_page_ctor() and pgtable_page_dtor(),
+which must be called on PTE table allocation / freeing.
+
+Make sure the architecture doesn't use the slab allocator for page table
+allocation: slab uses page->slab_cache and page->first_page for its pages.
+These fields share storage with page->ptl.
+
+PMD split lock only makes sense if you have more than two page table
+levels.
+
+PMD split lock enabling requires pgtable_pmd_page_ctor() call on PMD table
+allocation and pgtable_pmd_page_dtor() on freeing.
+
+Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and
+pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing
+paths: e.g. X86_PAE preallocates a few PMDs in pgd_alloc().
+
+With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK.
+
+NOTE: pgtable_page_ctor() and pgtable_pmd_page_ctor() can fail -- the
+failure must be handled properly.
+
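+As an illustration of handling the constructor failure (a sketch only, not
+any particular architecture's code), pmd_alloc_one() typically looks
+roughly like this:
+
+	pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+	{
+		struct page *page;
+
+		page = alloc_pages(GFP_KERNEL | __GFP_ZERO, 0);
+		if (!page)
+			return NULL;
+		if (!pgtable_pmd_page_ctor(page)) {
+			/* ctor failed: free the page, do not leak it */
+			__free_pages(page, 0);
+			return NULL;
+		}
+		return (pmd_t *)page_address(page);
+	}
+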
+page->ptl
+---------
+
+page->ptl is used to access the split page table lock, where 'page' is the
+struct page of the page containing the table. It shares storage with
+page->private (and a few other fields in the union).
+
+To avoid increasing the size of struct page and to get the best performance,
+we use a trick:
+ - if spinlock_t fits into long, we use page->ptl as the spinlock, so we
+ can avoid indirect access and save a cache line.
+ - if the size of spinlock_t is bigger than the size of long, we use page->ptl
+ as a pointer to spinlock_t and allocate it dynamically. This allows using
+ the split lock with DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC enabled, but costs
+ one more cache line for indirect access;
+
+The spinlock_t is allocated in pgtable_page_ctor() for PTE tables and in
+pgtable_pmd_page_ctor() for PMD tables.
+
+Please, never access page->ptl directly -- use an appropriate helper.
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 29bdf62aac0..6b31cfbe2a9 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -116,6 +116,13 @@ echo always >/sys/kernel/mm/transparent_hugepage/defrag
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
echo never >/sys/kernel/mm/transparent_hugepage/defrag
+By default the kernel tries to use the huge zero page on read page faults.
+It's possible to disable the huge zero page by writing 0 or enable it
+back by writing 1:
+
+echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
+echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
+
khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
be automatically shutdown if it's set to "never".
@@ -166,6 +173,76 @@ behavior. So to make them effective you need to restart any
application that could have been using hugepages. This also applies to
the regions registered in khugepaged.
+== Monitoring usage ==
+
+The number of transparent huge pages currently used by the system is
+available by reading the AnonHugePages field in /proc/meminfo. To
+identify what applications are using transparent huge pages, it is
+necessary to read /proc/PID/smaps and count the AnonHugePages fields
+for each mapping. Note that reading the smaps file is expensive and
+reading it frequently will incur overhead.
+
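+For example (an illustrative sketch, not a tool shipped with the kernel),
+the AnonHugePages fields of one process can be summed like this:
+
+	#include <stdio.h>
+
+	int main(int argc, char **argv)
+	{
+		char path[64], line[256];
+		unsigned long kb, total = 0;
+		FILE *f;
+
+		if (argc < 2)
+			return 1;
+		snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);
+		f = fopen(path, "r");
+		if (!f)
+			return 1;
+		while (fgets(line, sizeof(line), f))
+			if (sscanf(line, "AnonHugePages: %lu kB", &kb) == 1)
+				total += kb;
+		fclose(f);
+		printf("AnonHugePages total: %lu kB\n", total);
+		return 0;
+	}
+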
+There are a number of counters in /proc/vmstat that may be used to
+monitor how successfully the system is providing huge pages for use.
+
+thp_fault_alloc is incremented every time a huge page is successfully
+ allocated to handle a page fault. This applies to both the
+ first time a page is faulted and for COW faults.
+
+thp_collapse_alloc is incremented by khugepaged when it has found
+ a range of pages to collapse into one huge page and has
+ successfully allocated a new huge page to store the data.
+
+thp_fault_fallback is incremented if a page fault fails to allocate
+ a huge page and instead falls back to using small pages.
+
+thp_collapse_alloc_failed is incremented if khugepaged found a range
+ of pages that should be collapsed into one huge page but failed
+ the allocation.
+
+thp_split is incremented every time a huge page is split into base
+ pages. This can happen for a variety of reasons but a common
+ reason is that a huge page is old and is being reclaimed.
+
+thp_zero_page_alloc is incremented every time a huge zero page is
+ successfully allocated. It includes allocations which were
+ dropped due to a race with another allocation. Note, it doesn't count
+ every map of the huge zero page, only its allocation.
+
+thp_zero_page_alloc_failed is incremented if the kernel fails to allocate
+ a huge zero page and falls back to using small pages.
+
+As the system ages, allocating huge pages may be expensive as the
+system uses memory compaction to copy data around memory to free a
+huge page for use. There are some counters in /proc/vmstat to help
+monitor this overhead.
+
+compact_stall is incremented every time a process stalls to run
+ memory compaction so that a huge page is free for use.
+
+compact_success is incremented if the system compacted memory and
+ freed a huge page for use.
+
+compact_fail is incremented if the system tries to compact memory
+ but fails.
+
+compact_pages_moved is incremented each time a page is moved. If
+ this value is increasing rapidly, it implies that the system
+ is copying a lot of data to satisfy the huge page allocation.
+ It is possible that the cost of copying exceeds any savings
+ from reduced TLB misses.
+
+compact_pagemigrate_failed is incremented when the underlying mechanism
+ for moving a page failed.
+
+compact_blocks_moved is incremented each time memory compaction examines
+ a huge page aligned range of pages.
+
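+A quick way to watch all of these counters (again just an illustrative
+sketch) is to filter /proc/vmstat for the thp_ and compact_ prefixes:
+
+	#include <stdio.h>
+	#include <string.h>
+
+	int main(void)
+	{
+		char line[256];
+		FILE *f = fopen("/proc/vmstat", "r");
+
+		if (!f)
+			return 1;
+		while (fgets(line, sizeof(line), f))
+			if (!strncmp(line, "thp_", 4) ||
+			    !strncmp(line, "compact_", 8))
+				fputs(line, stdout);
+		fclose(f);
+		return 0;
+	}
+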
+It is possible to establish how long the stalls were by using the function
+tracer to record how long was spent in __alloc_pages_nodemask and
+using the mm_page_alloc tracepoint to identify which allocations were
+for huge pages.
+
== get_user_pages and follow_page ==
get_user_pages and follow_page if run on a hugepage, will return the
@@ -214,7 +291,7 @@ unaffected. libhugetlbfs will also work fine as usual.
== Graceful fallback ==
Code walking pagetables but unaware about huge pmds can simply call
-split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
+split_huge_page_pmd(vma, addr, pmd) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
by just grepping for "pmd_offset" and adding split_huge_page_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
@@ -237,7 +314,7 @@ diff --git a/mm/mremap.c b/mm/mremap.c
return NULL;
pmd = pmd_offset(pud, addr);
-+ split_huge_page_pmd(mm, pmd);
++ split_huge_page_pmd(vma, addr, pmd);
if (pmd_none_or_clear_bad(pmd))
return NULL;
@@ -283,13 +360,13 @@ on any tail page, would mean having to split all hugepages upfront in
get_user_pages which is unacceptable as too many gup users are
performance critical and they must work natively on hugepages like
they work natively on hugetlbfs already (hugetlbfs is simpler because
-hugetlbfs pages cannot be splitted so there wouldn't be requirement of
+hugetlbfs pages cannot be split so there wouldn't be requirement of
accounting the pins on the tail pages for hugetlbfs). If we wouldn't
account the gup refcounts on the tail pages during gup, we won't know
anymore which tail page is pinned by gup and which is not while we run
split_huge_page. But we still have to add the gup pin to the head page
too, to know when we can free the compound page in case it's never
-splitted during its lifetime. That requires changing not just
+split during its lifetime. That requires changing not just
get_page, but put_page as well so that when put_page runs on a tail
page (and only on a tail page) it will find its respective head page,
and then it will decrease the head page refcount in addition to the
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt
index 97bae3c576c..744f82f86c5 100644
--- a/Documentation/vm/unevictable-lru.txt
+++ b/Documentation/vm/unevictable-lru.txt
@@ -197,12 +197,8 @@ the pages are also "rescued" from the unevictable list in the process of
freeing them.
page_evictable() also checks for mlocked pages by testing an additional page
-flag, PG_mlocked (as wrapped by PageMlocked()). If the page is NOT mlocked,
-and a non-NULL VMA is supplied, page_evictable() will check whether the VMA is
-VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and
-update the appropriate statistics if the vma is VM_LOCKED. This method allows
-efficient "culling" of pages in the fault path that are being faulted in to
-VM_LOCKED VMAs.
+flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is
+faulted into a VM_LOCKED vma, or found in a vma that is being VM_LOCKED.
VMSCAN'S HANDLING OF UNEVICTABLE PAGES
@@ -371,8 +367,8 @@ mlock_fixup() filters several classes of "special" VMAs:
mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to
allocate the huge pages and populate the ptes.
-3) VMAs with VM_DONTEXPAND or VM_RESERVED are generally userspace mappings of
- kernel pages, such as the VDSO page, relay channel pages, etc. These pages
+3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages,
+ such as the VDSO page, relay channel pages, etc. These pages
are inherently unevictable and are not managed on the LRU lists.
mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls
make_pages_present() to populate the ptes.
@@ -457,7 +453,7 @@ putback_lru_page() function to add migrated pages back to the LRU.
mmap(MAP_LOCKED) SYSTEM CALL HANDLING
-------------------------------------
-In addition the the mlock()/mlockall() system calls, an application can request
+In addition to the mlock()/mlockall() system calls, an application can request
that a region of memory be mlocked supplying the MAP_LOCKED flag to the mmap()
call. Furthermore, any mmap() call or brk() call that expands the heap by a
task that has previously called mlockall() with the MCL_FUTURE flag will result
@@ -538,7 +534,7 @@ different reverse map mechanisms.
process because mlocked pages are migratable. However, for reclaim, if
the page is mapped into a VM_LOCKED VMA, the scan stops.
- try_to_unmap_anon() attempts to acquire in read mode the mmap semphore of
+ try_to_unmap_anon() attempts to acquire in read mode the mmap semaphore of
the mm_struct to which the VMA belongs. If this is successful, it will
mlock the page via mlock_vma_page() - we wouldn't have gotten to
try_to_unmap_anon() if the page were already mlocked - and will return
@@ -619,11 +615,11 @@ all PTEs from the page. For this purpose, the unevictable/mlock infrastructure
introduced a variant of try_to_unmap() called try_to_munlock().
try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
-mapped file pages with an additional argument specifing unlock versus unmap
+mapped file pages with an additional argument specifying unlock versus unmap
processing. Again, these functions walk the respective reverse maps looking
for VM_LOCKED VMAs. When such a VMA is found for anonymous pages and file
pages mapped in linear VMAs, as in the try_to_unmap() case, the functions
-attempt to acquire the associated mmap semphore, mlock the page via
+attempt to acquire the associated mmap semaphore, mlock the page via
mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the
pre-clearing of the page's PG_mlocked done by munlock_vma_page.
@@ -641,7 +637,7 @@ with it - the usual fallback position.
Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
However, the scan can terminate when it encounters a VM_LOCKED VMA and can
-successfully acquire the VMA's mmap semphore for read and mlock the page.
+successfully acquire the VMA's mmap semaphore for read and mlock the page.
Although try_to_munlock() might be called a great many times when munlocking a
large region or tearing down a large address space that has been mlocked via
mlockall(), overall this is a fairly rare event.
@@ -651,7 +647,7 @@ PAGE RECLAIM IN shrink_*_list()
-------------------------------
shrink_active_list() culls any obviously unevictable pages - i.e.
-!page_evictable(page, NULL) - diverting these to the unevictable list.
+!page_evictable(page) - diverting these to the unevictable list.
However, shrink_active_list() only sees unevictable pages that made it onto the
active/inactive lru lists. Note that these pages do not have PageUnevictable
set - otherwise they would be on the unevictable list and shrink_active_list
diff --git a/Documentation/vm/zswap.txt b/Documentation/vm/zswap.txt
new file mode 100644
index 00000000000..00c3d31e797
--- /dev/null
+++ b/Documentation/vm/zswap.txt
@@ -0,0 +1,68 @@
+Overview:
+
+Zswap is a lightweight compressed cache for swap pages. It takes pages that are
+in the process of being swapped out and attempts to compress them into a
+dynamically allocated RAM-based memory pool. zswap basically trades CPU cycles
+for potentially reduced swap I/O.  This trade-off can also result in a
+significant performance improvement if reads from the compressed cache are
+faster than reads from a swap device.
+
+NOTE: Zswap is a new feature as of v3.11 and interacts heavily with memory
+reclaim. This interaction has not been fully explored on the large set of
+potential configurations and workloads that exist. For this reason, zswap
+is a work in progress and should be considered experimental.
+
+Some potential benefits:
+* Desktop/laptop users with limited RAM capacities can mitigate the
+    performance impact of swapping.
+* Overcommitted guests that share a common I/O resource can
+    dramatically reduce their swap I/O pressure, avoiding heavy handed I/O
+    throttling by the hypervisor. This allows more work to get done with less
+    impact to the guest workload and guests sharing the I/O subsystem.
+* Users with SSDs as swap devices can extend the life of the device by
+    drastically reducing life-shortening writes.
+
+Zswap evicts pages from compressed cache on an LRU basis to the backing swap
+device when the compressed pool reaches its size limit. This requirement had
+been identified in prior community discussions.
+
+To enable zswap, the "enabled" attribute must be set to 1 at boot time, e.g.
+zswap.enabled=1
+
+Design:
+
+Zswap receives pages for compression through the Frontswap API and is able to
+evict pages from its own compressed pool on an LRU basis and write them back to
+the backing swap device in the case that the compressed pool is full.
+
+Zswap makes use of zbud for managing the compressed memory pool. Each
+allocation in zbud is not directly accessible by address. Rather, a handle is
+returned by the allocation routine and that handle must be mapped before being
+accessed. The compressed memory pool grows on demand and shrinks as compressed
+pages are freed. The pool is not preallocated.
+
+When a swap page is passed from frontswap to zswap, zswap maintains a mapping
+of the swap entry, a combination of the swap type and swap offset, to the zbud
+handle that references that compressed swap page. This mapping is achieved
+with a red-black tree per swap type. The swap offset is the search key for the
+tree nodes.
+
+During a page fault on a PTE that is a swap entry, frontswap calls the zswap
+load function to decompress the page into the page allocated by the page fault
+handler.
+
+Once there are no PTEs referencing a swap page stored in zswap (i.e. the count
+in the swap_map goes to 0) the swap code calls the zswap invalidate function,
+via frontswap, to free the compressed entry.
+
+Zswap seeks to be simple in its policies. Sysfs attributes allow for one user
+controlled policy:
+* max_pool_percent - The maximum percentage of memory that the compressed
+ pool can occupy.
+
+Zswap allows the compressor to be selected at kernel boot time by setting the
+"compressor" attribute. The default compressor is lzo, e.g.
+zswap.compressor=deflate
+
+A debugfs interface is provided for various statistics about pool size, number
+of pages stored, and counters for the reasons pages are rejected.
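+
+For example (an illustrative sketch only; it assumes debugfs is mounted at
+/sys/kernel/debug and that the zswap statistics live in a "zswap" directory
+there), the counters can be dumped like this:
+
+	#include <dirent.h>
+	#include <stdio.h>
+
+	int main(void)
+	{
+		const char *dir = "/sys/kernel/debug/zswap";
+		char path[256], val[64];
+		struct dirent *de;
+		DIR *d = opendir(dir);
+		FILE *f;
+
+		if (!d)
+			return 1;
+		while ((de = readdir(d)) != NULL) {
+			if (de->d_name[0] == '.')
+				continue;
+			snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
+			f = fopen(path, "r");
+			if (!f)
+				continue;
+			if (fgets(val, sizeof(val), f))
+				printf("%s: %s", de->d_name, val);
+			fclose(f);
+		}
+		closedir(d);
+		return 0;
+	}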