Diffstat (limited to 'Documentation/vm')
-rw-r--r--   Documentation/vm/00-INDEX                 |   4
-rw-r--r--   Documentation/vm/hwpoison.txt             |   5
-rw-r--r--   Documentation/vm/locking                  | 130
-rw-r--r--   Documentation/vm/numa_memory_policy.txt   |   5
-rw-r--r--   Documentation/vm/overcommit-accounting    |   7
-rw-r--r--   Documentation/vm/remap_file_pages.txt     |  28
-rw-r--r--   Documentation/vm/transhuge.txt            |   4
-rw-r--r--   Documentation/vm/unevictable-lru.txt      |   2
8 files changed, 44 insertions, 141 deletions
diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index a39d06680e1..081c49777ab 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -16,8 +16,6 @@ hwpoison.txt
   - explains what hwpoison is
 ksm.txt
   - how to use the Kernel Samepage Merging feature.
-locking
-  - info on how locking and synchronization is done in the Linux vm code.
 numa
   - information about NUMA specific code in the Linux vm.
 numa_memory_policy.txt
@@ -32,6 +30,8 @@ slub.txt
   - a short users guide for SLUB.
 soft-dirty.txt
   - short explanation for soft-dirty PTEs
+split_page_table_lock
+  - Separate per-table lock to improve scalability of the old page_table_lock.
 transhuge.txt
   - Transparent Hugepage Support, alternative way of using hugepages.
 unevictable-lru.txt
diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt
index 55006846660..6ae89a9edf2 100644
--- a/Documentation/vm/hwpoison.txt
+++ b/Documentation/vm/hwpoison.txt
@@ -84,6 +84,11 @@ PR_MCE_KILL
   PR_MCE_KILL_EARLY: Early kill
   PR_MCE_KILL_LATE: Late kill
   PR_MCE_KILL_DEFAULT: Use system global default
+  Note that if you want to have a dedicated thread which handles
+  the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
+  call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise,
+  the SIGBUS is sent to the main thread.
+
 PR_MCE_KILL_GET
   return current mode
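As a quick illustration of the prctl() usage described in the hwpoison note above: a minimal userspace sketch (my own, not drawn from the patch) of a process that designates one thread to receive SIGBUS(BUS_MCEERR_AO), assuming the PR_MCE_KILL* constants from <sys/prctl.h> and a purely hypothetical thread layout:

#define _GNU_SOURCE
#include <sys/prctl.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Process-wide SIGBUS handler.  Because the dedicated thread below sets
 * PR_MCE_KILL_EARLY for itself, action-optional (BUS_MCEERR_AO) poison
 * signals are routed to that thread rather than to the main thread. */
static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	(void)sig; (void)ctx;
	if (info->si_code == BUS_MCEERR_AO)
		write(STDERR_FILENO, "hwpoison: AO event\n", 19);
}

/* Hypothetical dedicated memory-failure thread. */
static void *hwpoison_thread(void *arg)
{
	(void)arg;
	/* "Early kill" mode for this thread only; this is the
	 * prctl(PR_MCE_KILL_EARLY) call the note above refers to. */
	if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0))
		perror("prctl(PR_MCE_KILL)");
	for (;;)
		pause();
	return NULL;
}

int main(void)
{
	struct sigaction sa = { .sa_sigaction = sigbus_handler,
	                        .sa_flags = SA_SIGINFO };
	pthread_t tid;

	sigaction(SIGBUS, &sa, NULL);
	pthread_create(&tid, NULL, hwpoison_thread, NULL);
	/* ... the rest of the application runs in other threads ... */
	pthread_join(tid, NULL);
	return 0;
}

The per-thread PR_MCE_KILL_EARLY setting is what makes the kernel deliver action-optional poison signals to the designated thread instead of the main thread.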
diff --git a/Documentation/vm/locking b/Documentation/vm/locking
deleted file mode 100644
index f61228bd639..00000000000
--- a/Documentation/vm/locking
+++ /dev/null
@@ -1,130 +0,0 @@
-Started Oct 1999 by Kanoj Sarcar <kanojsarcar@yahoo.com>
-
-The intent of this file is to have an uptodate, running commentary
-from different people about how locking and synchronization is done
-in the Linux vm code.
-
-page_table_lock & mmap_sem
---------------------------------------
-
-Page stealers pick processes out of the process pool and scan for
-the best process to steal pages from. To guarantee the existence
-of the victim mm, a mm_count inc and a mmdrop are done in swap_out().
-Page stealers hold kernel_lock to protect against a bunch of races.
-The vma list of the victim mm is also scanned by the stealer,
-and the page_table_lock is used to preserve list sanity against the
-process adding/deleting to the list. This also guarantees existence
-of the vma. Vma existence is not guaranteed once try_to_swap_out()
-drops the page_table_lock. To guarantee the existence of the underlying
-file structure, a get_file is done before the swapout() method is
-invoked. The page passed into swapout() is guaranteed not to be reused
-for a different purpose because the page reference count due to being
-present in the user's pte is not released till after swapout() returns.
-
-Any code that modifies the vmlist, or the vm_start/vm_end/
-vm_flags:VM_LOCKED/vm_next of any vma *in the list* must prevent
-kswapd from looking at the chain.
-
-The rules are:
-1. To scan the vmlist (look but don't touch) you must hold the
-   mmap_sem with read bias, i.e. down_read(&mm->mmap_sem)
-2. To modify the vmlist you need to hold the mmap_sem with
-   read&write bias, i.e. down_write(&mm->mmap_sem) *AND*
-   you need to take the page_table_lock.
-3. The swapper takes _just_ the page_table_lock, this is done
-   because the mmap_sem can be an extremely long lived lock
-   and the swapper just cannot sleep on that.
-4. The exception to this rule is expand_stack, which just
-   takes the read lock and the page_table_lock, this is ok
-   because it doesn't really modify fields anybody relies on.
-5. You must be able to guarantee that while holding page_table_lock
-   or page_table_lock of mm A, you will not try to get either lock
-   for mm B.
-
-The caveats are:
-1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
-The update of mmap_cache is racy (page stealer can race with other code
-that invokes find_vma with mmap_sem held), but that is okay, since it
-is a hint. This can be fixed, if desired, by having find_vma grab the
-page_table_lock.
-
-
-Code that add/delete elements from the vmlist chain are
-1. callers of insert_vm_struct
-2. callers of merge_segments
-3. callers of avl_remove
-
-Code that changes vm_start/vm_end/vm_flags:VM_LOCKED of vma's on
-the list:
-1. expand_stack
-2. mprotect
-3. mlock
-4. mremap
-
-It is advisable that changes to vm_start/vm_end be protected, although
-in some cases it is not really needed. Eg, vm_start is modified by
-expand_stack(), it is hard to come up with a destructive scenario without
-having the vmlist protection in this case.
-
-The page_table_lock nests with the inode i_mmap_mutex and the kmem cache
-c_spinlock spinlocks. This is okay, since the kmem code asks for pages after
-dropping c_spinlock. The page_table_lock also nests with pagecache_lock and
-pagemap_lru_lock spinlocks, and no code asks for memory with these locks
-held.
-
-The page_table_lock is grabbed while holding the kernel_lock spinning monitor.
-
-The page_table_lock is a spin lock.
-
-Note: PTL can also be used to guarantee that no new clones using the
-mm start up ... this is a loose form of stability on mm_users. For
-example, it is used in copy_mm to protect against a racing tlb_gather_mmu
-single address space optimization, so that the zap_page_range (from
-truncate) does not lose sending ipi's to cloned threads that might
-be spawned underneath it and go to user mode to drag in pte's into tlbs.
-
-swap_lock
---------------
-The swap devices are chained in priority order from the "swap_list" header.
-The "swap_list" is used for the round-robin swaphandle allocation strategy.
-The #free swaphandles is maintained in "nr_swap_pages". These two together
-are protected by the swap_lock.
-
-The swap_lock also protects all the device reference counts on the
-corresponding swaphandles, maintained in the "swap_map" array, and the
-"highest_bit" and "lowest_bit" fields.
-
-The swap_lock is a spinlock, and is never acquired from intr level.
-
-To prevent races between swap space deletion or async readahead swapins
-deciding whether a swap handle is being used, ie worthy of being read in
-from disk, and an unmap -> swap_free making the handle unused, the swap
-delete and readahead code grabs a temp reference on the swaphandle to
-prevent warning messages from swap_duplicate <- read_swap_cache_async.
-
-Swap cache locking
-------------------
-Pages are added into the swap cache with kernel_lock held, to make sure
-that multiple pages are not being added (and hence lost) by associating
-all of them with the same swaphandle.
-
-Pages are guaranteed not to be removed from the scache if the page is
-"shared": ie, other processes hold reference on the page or the associated
-swap handle. The only code that does not follow this rule is shrink_mmap,
-which deletes pages from the swap cache if no process has a reference on
-the page (multiple processes might have references on the corresponding
-swap handle though). lookup_swap_cache() races with shrink_mmap, when
-establishing a reference on a scache page, so, it must check whether the
-page it located is still in the swapcache, or shrink_mmap deleted it.
-(This race is due to the fact that shrink_mmap looks at the page ref
-count with pagecache_lock, but then drops pagecache_lock before deleting
-the page from the scache).
-
-do_wp_page and do_swap_page have MP races in them while trying to figure
-out whether a page is "shared", by looking at the page_count + swap_count.
-To preserve the sum of the counts, the page lock _must_ be acquired before
-calling is_page_shared (else processes might switch their swap_count refs
-to the page count refs, after the page count ref has been snapshotted).
-
-Swap device deletion code currently breaks all the scache assumptions,
-since it grabs neither mmap_sem nor page_table_lock.
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index 4e7da654342..badb0507608 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -174,7 +174,6 @@ Components of Memory Policies
     allocation fails, the kernel will search other nodes, in order of
     increasing distance from the preferred node based on
     information provided by the platform firmware.
-    containing the cpu where the allocation takes place.
 
     Internally, the Preferred policy uses a single node--the
     preferred_node member of struct mempolicy. When the internal
@@ -275,9 +274,9 @@ Components of Memory Policies
     For example, consider a task that is attached to a cpuset with
     mems 2-5 that sets an Interleave policy over the same set with
     MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
-    interleave now occurs over nodes 3,5-6. If the cpuset's mems
+    interleave now occurs over nodes 3,5-7. If the cpuset's mems
     then change to 0,2-3,5, then the interleave occurs over nodes
-    0,3,5.
+    0,2-3,5.
 
     Thanks to the consistent remapping, applications preparing
     nodemasks to specify memory policies using this flag should
diff --git a/Documentation/vm/overcommit-accounting b/Documentation/vm/overcommit-accounting
index 8eaa2fc4b8f..cbfaaa67411 100644
--- a/Documentation/vm/overcommit-accounting
+++ b/Documentation/vm/overcommit-accounting
@@ -14,8 +14,8 @@ The Linux kernel supports the following overcommit handling modes
 2   -   Don't overcommit. The total address space commit
         for the system is not permitted to exceed swap + a
-        configurable percentage (default is 50) of physical RAM.
-        Depending on the percentage you use, in most situations
+        configurable amount (default is 50%) of physical RAM.
+        Depending on the amount you use, in most situations
         this means a process will not be killed while accessing
         pages but will receive errors on memory allocation as
         appropriate.
@@ -26,7 +26,8 @@ The Linux kernel supports the following overcommit handling modes
 The overcommit policy is set via the sysctl `vm.overcommit_memory'.
 
-The overcommit percentage is set via `vm.overcommit_ratio'.
+The overcommit amount can be set via `vm.overcommit_ratio' (percentage)
+or `vm.overcommit_kbytes' (absolute value).
 
 The current overcommit limit and amount committed are viewable in
 /proc/meminfo as CommitLimit and Committed_AS respectively.
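The overcommit hunk above ends by noting that CommitLimit and Committed_AS are exported through /proc/meminfo. For illustration only, a minimal sketch that reads the two fields (both values are reported by the kernel in kB):

#include <stdio.h>
#include <string.h>

/* Print the overcommit limit and the committed address space
 * as reported in /proc/meminfo. */
int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f) {
		perror("/proc/meminfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "CommitLimit:", 12) ||
		    !strncmp(line, "Committed_AS:", 13))
			fputs(line, stdout);
	fclose(f);
	return 0;
}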
diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.txt
new file mode 100644
index 00000000000..560e4363a55
--- /dev/null
+++ b/Documentation/vm/remap_file_pages.txt
@@ -0,0 +1,28 @@
+The remap_file_pages() system call is used to create a nonlinear mapping,
+that is, a mapping in which the pages of the file are mapped into a
+nonsequential order in memory. The advantage of using remap_file_pages()
+over using repeated calls to mmap(2) is that the former approach does not
+require the kernel to create additional VMA (Virtual Memory Area) data
+structures.
+
+Supporting of nonlinear mapping requires significant amount of non-trivial
+code in kernel virtual memory subsystem including hot paths. Also to get
+nonlinear mapping work kernel need a way to distinguish normal page table
+entries from entries with file offset (pte_file). Kernel reserves flag in
+PTE for this purpose. PTE flags are scarce resource especially on some CPU
+architectures. It would be nice to free up the flag for other usage.
+
+Fortunately, there are not many users of remap_file_pages() in the wild.
+It's only known that one enterprise RDBMS implementation uses the syscall
+on 32-bit systems to map files bigger than can linearly fit into 32-bit
+virtual address space. This use-case is not critical anymore since 64-bit
+systems are widely available.
+
+The plan is to deprecate the syscall and replace it with an emulation.
+The emulation will create new VMAs instead of nonlinear mappings. It's
+going to work slower for rare users of remap_file_pages() but ABI is
+preserved.
+
+One side effect of emulation (apart from performance) is that user can hit
+vm.max_map_count limit more easily due to additional VMAs. See comment for
+DEFAULT_MAX_MAP_COUNT for more details on the limit.
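To make the nonlinear-mapping idea in the new document concrete, here is a minimal sketch of the legacy interface, an illustrative aside rather than part of the change. It assumes a hypothetical file named "data" of at least three pages; three pages are mapped linearly, then file page 2 is asked to appear at the start of the mapping:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	int fd = open("data", O_RDONLY);   /* hypothetical file, >= 3 pages */
	char *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* A single VMA covering three file pages, initially in linear order. */
	p = mmap(NULL, 3 * psz, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Rearrange within the same mapping: file page 2 now backs offset 0.
	 * On kernels that emulate the syscall this may create extra VMAs. */
	if (remap_file_pages(p, psz, 0, 2, 0))
		perror("remap_file_pages");
	/* p[0] now reads the byte stored at file offset 2 * page size. */
	munmap(p, 3 * psz);
	close(fd);
	return 0;
}

With the planned emulation, the same call sequence keeps working, but each rearranged range becomes a separate VMA, which is why vm.max_map_count can be hit sooner.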
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 4a63953a41f..6b31cfbe2a9 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -360,13 +360,13 @@ on any tail page, would mean having to split all hugepages upfront
 in get_user_pages which is unacceptable as too many gup users are
 performance critical and they must work natively on hugepages like
 they work natively on hugetlbfs already (hugetlbfs is simpler because
-hugetlbfs pages cannot be splitted so there wouldn't be requirement of
+hugetlbfs pages cannot be split so there wouldn't be requirement of
 accounting the pins on the tail pages for hugetlbfs). If we wouldn't
 account the gup refcounts on the tail pages during gup, we won't know
 anymore which tail page is pinned by gup and which is not while we run
 split_huge_page. But we still have to add the gup pin to the head
 page too, to know when we can free the compound page in case it's never
-splitted during its lifetime. That requires changing not just
+split during its lifetime. That requires changing not just
 get_page, but put_page as well so that when put_page runs on a tail
 page (and only on a tail page) it will find its respective head page,
 and then it will decrease the head page refcount in addition to the
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt
index a68db7692ee..744f82f86c5 100644
--- a/Documentation/vm/unevictable-lru.txt
+++ b/Documentation/vm/unevictable-lru.txt
@@ -453,7 +453,7 @@ putback_lru_page() function to add migrated pages back to the LRU.
 mmap(MAP_LOCKED) SYSTEM CALL HANDLING
 -------------------------------------
 
-In addition the the mlock()/mlockall() system calls, an application can request
+In addition the mlock()/mlockall() system calls, an application can request
 that a region of memory be mlocked supplying the MAP_LOCKED flag to the mmap()
 call. Furthermore, any mmap() call or brk() call that expands the heap by a
 task that has previously called mlockall() with the MCL_FUTURE flag will result
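For reference, a minimal sketch of the mmap(MAP_LOCKED) request discussed in the hunk above, assuming RLIMIT_MEMLOCK or CAP_IPC_LOCK permits locking the region:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	size_t len = 4 * 4096;
	/* MAP_LOCKED asks mmap() to mlock the region at map time.  Unlike a
	 * separate mlock() call, a failure to populate the range is not
	 * reported, so callers that cannot tolerate later major faults
	 * typically still follow up with mlock(). */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	               MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap(MAP_LOCKED)");
		return 1;
	}
	memset(p, 0, len);   /* touch the pages; they are already resident */
	munmap(p, len);
	return 0;
}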
