aboutsummaryrefslogtreecommitdiff
path: root/arch/x86/mm
AgeCommit message (Collapse)Author
2011-05-02x86-32, NUMA: use sparse_memory_present_with_active_regions()Tejun Heo
Instead of calling memory_present() for each region from NUMA init, call sparse_memory_present_with_active_regions() from paging_init() similarly to x86-64. For flat and numaq, this results in exactly the same memory_present() calls. For srat, if there are multiple memory chunks for a node, after this change, memory_present() will be called separately for each chunk instead of being called once to encompass the whole range, which doesn't cause any harm and actually is the better behavior. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>
2011-05-02x86, NUMA: Unify 32/64bit numa_cpu_node() implementationTejun Heo
Currently, the only meaningful user of apic->x86_32_numa_cpu_node() is NUMAQ which returns valid mapping only after CPU is initialized during SMP bringup; thus, the previous patch to set apicid -> node in setup_local_APIC() makes __apicid_to_node[] always contain the correct mapping whether custom apic->x86_32_numa_cpu_node() is used or not. So, there is no reason to keep separate 32bit implementation. We can always consult __apicid_to_node[]. Move 64bit implementation from numa_64.c to numa.c and remove 32bit implementation from numa_32.c. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>
2011-05-02x86-64, NUMA: simplify nodedata allocationTejun Heo
With top-down memblock allocation, the allocation range limits in ealry_node_mem() can be simplified - try node-local first, then any node but in any case don't allocate below DMA limit. Remove early_node_mem() and implement simplified allocation directly in setup_node_bootmem(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>
2011-05-02x86-64, NUMA: trivial cleanups for setup_node_bootmem()Tejun Heo
Make the following trivial changes in preparation for further updates. * nodeid -> nid, nid -> tnid * use nd_ prefix for nodedata related variables * remove start/end_pfn and use start/end directly Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>
2011-05-02x86-64, NUMA: Simplify hotadd memory handlingTejun Heo
The only special handling NUMA needs to do for hotadd memory is determining the node for the hotadd memory given the address of it and there's nothing specific to specific config method used. srat_64.c does somewhat elaborate error checking on ACPI_SRAT_MEM_HOT_PLUGGABLE regions, remembers them and implements memory_add_physaddr_to_nid() which determines the node for given hotadd address. This is almost completely redundant. All the information is already available to the generic NUMA code which already performs all the sanity checking and merging. All that's necessary is not using __initdata from numa_meminfo and providing a function which uses it to map address to node. Drop the specific implementation from srat_64.c and add generic memory_add_physaddr_to_nid() in numa_64.c, which is enabled if CONFIG_MEMORY_HOTPLUG is set. Other than dropping the code, srat_64.c doesn't need any change as it already calls numa_add_memblk() for hot pluggable regions which is enough. While at it, change CONFIG_MEMORY_HOTPLUG_SPARSE in srat_64.c to CONFIG_MEMORY_HOTPLUG, for NUMA on x86-64, the two are always the same. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>
2011-05-02Merge branch 'x86/urgent' into x86-mmTejun Heo
Merge reason: Pick up the following two fix commits. 2be19102b7: x86, NUMA: Fix empty memblk detection in numa_cleanup_meminfo() 765af22da8: x86-32, NUMA: Fix ACPI NUMA init broken by recent x86-64 change Scheduled NUMA init 32/64bit unification changes depend on these. Signed-off-by: Tejun Heo <tj@kernel.org>
2011-05-02Merge branch 'x86/numa' into x86-mmTejun Heo
Merge reason: Pick up x86-32 remap allocator cleanup changes - 14 commits, 3fe14ab541^..993ba1585c. 3fe14ab541: x86-32, numa: Fix failure condition check in alloc_remap() 993ba1585c: x86-32, numa: Update remap allocator comments Scheduled NUMA init 32/64bit unification changes depend on them. Signed-off-by: Tejun Heo <tj@kernel.org>
2011-05-01x86, NUMA: Fix empty memblk detection in numa_cleanup_meminfo()Yinghai Lu
numa_cleanup_meminfo() trims each memblk between low (0) and high (max_pfn) limits and discards empty ones. However, the emptiness detection incorrectly used equality test. If the start of a memblk is higher than max_pfn, it is empty but fails the equality test and doesn't get discarded. The condition triggers when max_pfn is lower than start of a NUMA node and results in memory misconfiguration - leading to WARN_ON()s and other funnies. The bug was discovered in devel branch where 32bit too uses this code path for NUMA init. If a node is above the addressing limit, max_pfn ends up lower than the node triggering this problem. The failure hasn't been observed on x86-64 but is still possible with broken hardware e820/NUMA info. As the fix is very low risk, it would be better to apply it even for 64bit. Fix it by using >= instead of ==. Signed-off-by: Yinghai Lu <yinghai@kernel.org> [ Extracted the actual fix from the original patch and rewrote patch description. ] Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20110501171204.GO29280@htj.dyndns.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-29ioremap: Delay sanity check until after a successful mappingTim Gardner
While tracking down the reason for an ioremap() failure I was distracted by the WARN_ONCE() in __ioremap_caller(). Performing a WARN_ONCE() sanity check before the mapping is successful seems pointless if the caller sends bad values. A case in point is when the BIOS provides erroneous screen_info values causing vesafb_probe() to request an outrageuous size. The WARN_ONCE is then wasted on bogosity. Move the warning to a point where the mapping has been successfully allocated. Addresses: http://bugs.launchpad.net/bugs/772042 Reviewed-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Link: http://lkml.kernel.org/r/4DB99D2E.9080106@canonical.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-21x86, numa: Fix cpu nodemasks for NUMA emulation and CONFIG_DEBUG_PER_CPU_MAPSDavid Rientjes
The cpu<->node mappings under CONFIG_DEBUG_PER_CPU_MAPS=y when NUMA emulation is enabled is currently broken because it does not iterate through every emulated node and bind cpus that have affinity to it. NUMA emulation should bind each cpu to every local node to accurately represent the true NUMA topology of the underlying machine. debug_cpumask_set_cpu() needs to be fixed at the same time so that the debugging information that it emits shows the new cpumask of the node being assigned when the cpu is being added or removed. It can now take responsibility of setting or clearing the cpu itself to remove the need for duplicate code. Also change its last parameter, "enable", to have the correct bool type since it can only be true or false. -v2: Fix the return statements, by Kosaki Motohiro Acked-and-Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andreas Herrmann <herrmann.der.user@googlemail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1104201918470.12634@chino.kir.corp.google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-06x86-32, numa: Update remap allocator commentsTejun Heo
Now that remap allocator is cleaned up, update comments such that they are in docbook function description format and reflect the actual implementation. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1301955840-7246-15-git-send-email-tj@kernel.org Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06x86-32, numa: Remove redundant node_remap_size[]Tejun Heo
Remap area size can be determined from node_remap_start_vaddr[] and node_remap_end_vaddr[] making node_remap_size[] redundant. Remove it. While at it, make resume_map_numa_kva() use @nr_pages for number of pages instead of @size. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1301955840-7246-14-git-send-email-tj@kernel.org Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06x86-32, numa: Remove now useless node_remap_offset[]Tejun Heo
With lowmem address reservation moved into init_alloc_remap(), node_remap_offset[] is no longer useful. Remove it and related offset handling code. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1301955840-7246-13-git-send-email-tj@kernel.org Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06x86-32, numa: Make pgdat allocation use alloc_remap()Tejun Heo
pgdat allocation is handled differnetly from other remap allocations - it's reserved during initialization. There's no reason to handle this any differnetly. Remap allocator is initialized for every node and if init failed the allocation will fail and pgdat allocation can fall back to generic code like anyone else. Remove special init-time pgdat reservation and make allocate_pgdat() use alloc_remap() like everyone else. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1301955840-7246-12-git-send-email-tj@kernel.org Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06x86-32, numa: Move remapping for remap allocator into init_alloc_remap()Tejun Heo
There's no reason to perform the actual remapping separately. Collapse remap_numa_kva() into init_alloc_remap() and, while at it, make it less verbose. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1301955840-7246-11-git-send-email-tj@kernel.org Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06x86-32, numa: Move lowmem address space reservation to init_alloc_remap()Tejun Heo
Remap alloc init is done in the following stages. 1. init_alloc_remap() calculates how much memory is necessary for each node and reserves node local memory. 2. initmem_init() collects how much each node needs and reserves a single contiguous lowmem area which can contain all. 3. init_remap_allocator() initializes allocator parameters from the determined lowmem address and per-node offsets. 4. Actual remap happens. There is no reason for the lowmem remap area to be reserved as a single contiguous area at one go. They don't interact with each other and the memblock allocator will put them side-by-side anyway. This patch breaks up the single lowmem address reservation and put per-node lowmem address reservation into init_alloc_remap() and initializes allocator parameters directly in the function as all the addresses are determined there. This merges steps 2 and 3 into 1. While at it, remove now largely irrelevant comments in init_alloc_remap(). This change causes the following behavior changes. * Remap lowmem areas are allocated in smaller per-node chunks. * Remap lowmem area reservation failure fail future remap allocations instead of panicking. * Remap allocator initialization is less verbose. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1301955840-7246-10-git-send-email-tj@kernel.org Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06x86-32, numa: Make init_alloc_remap() less panickyTejun Heo
Remap allocator failure isn't fatal. The callers are required to fall back to regular early memory allocation mechanisms on failure anyway, so there's no reason to panic on remap init failure. Whining and returning are enough. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1301955840-7246-9-git-send-email-tj@kernel.org Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06x86-32, numa: Calculate remap size in common codeTejun Heo
Only pgdat and memmap use remap area and there isn't much benefit in allowing per-node override. In addition, the use of node_remap_size[] is confusing in that it contains number of bytes before remap initialization and then number of pages afterwards. Move remap size calculation for memap from specific NUMA config implementations to init_alloc_remap() and make node_remap_size[] static. The only behavior difference is that, before this patch, numaq_32 didn't consider max_pfn when calculating the memmap size but it's enforced after this patch, which is the right thing to do. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1301955840-7246-8-git-send-email-tj@kernel.org Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06x86-32, numa: Make @size in init_aloc_remap() represent bytesTejun Heo
@size variable in init_alloc_remap() is confusing in that it starts as number of bytes as its name implies and then becomes number of pages. Make it consistently represent bytes. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1301955840-7246-7-git-send-email-tj@kernel.org Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06x86-32, numa: Rename @node_kva to @node_pa in init_alloc_remap()Tejun Heo
init_alloc_remap() is about to do more and using _kva suffix for physical address becomes confusing because the function will be handling both physical and virtual addresses. Rename @node_kva to @node_pa. This is trivial rename and doesn't cause any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1301955840-7246-6-git-send-email-tj@kernel.org Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06x86-32, numa: Reorganize calculate_numa_remap_page()Tejun Heo
Separate the outer node walking loop and per-node logic from calculate_numa_remap_pages(). The outer loop is collapsed into initmem_init() and the per-node logic is moved into a new function - init_alloc_remap(). The new function name is confusing with the existing init_remap_allocator() and the behavior is the function isn't very clean either at this point, but this is to prepare for further cleanups and it will become prettier. This function doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1301955840-7246-5-git-send-email-tj@kernel.org Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06x86-32, numa: Remove redundant top-down alloc code from remap initializationTejun Heo
memblock_find_in_range() now does top-down allocation by default, so there's no reason for its callers to explicitly implement it by gradually lowering the start address. Remove redundant top-down allocation logic from init_meminit() and calculate_numa_remap_pages(). Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1301955840-7246-4-git-send-email-tj@kernel.org Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06x86-32, numa: Align pgdat size while initializing alloc_remapTejun Heo
When pgdat is reserved in init_remap_allocator(), PAGE_SIZE aligned size will be used. Match the size alignment in initialization to avoid allocation failure down the road. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1301955840-7246-3-git-send-email-tj@kernel.org Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06x86-32, numa: Fix failure condition check in alloc_remap()Tejun Heo
node_remap_{start|end}_vaddr[] describe [start, end) ranges; however, alloc_remap() incorrectly failed when the current allocation + size equaled the end but it should fail only when it goes over. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1301955840-7246-2-git-send-email-tj@kernel.org Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-04x86-32, NUMA: Fix ACPI NUMA init broken by recent x86-64 changeTejun Heo
Commit d8fc3afc49 (x86, NUMA: Move *_numa_init() invocations into initmem_init()) moved acpi_numa_init() call into NUMA initmem_init() but forgot to update 32bit NUMA init breaking ACPI NUMA configuration for 32bit. acpi_numa_init() call was later moved again to srat_64.c. Match it by adding the call to get_memcfg_from_srat() in srat_32.c. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: H. Peter Anvin <hpa@linux.intel.com> LKML-Reference: <20110404100645.GE1420@mtj.dyndns.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-04x86-64, NUMA: Remove unused variableFlorian Mickler
In case !CONFIG_ACPI_NUMA and !CONFIG_AMD_NUMA gcc emits a warning about the unused variable ret. As that variable is in fact not needed I choose to remove it. Signed-off-by: Florian Mickler <florian@mickler.org> LKML-Reference: <1301843624-22364-1-git-send-email-florian@mickler.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2011-04-01x86-64, NUMA: Remove custom phys_to_nid() implementationTejun Heo
phys_to_nid() maps physical address to NUMA node id. This is implemented by building perfect hash in compute_hash_shift() during initialization. However, with SPARSE memory model, the nid is encoded in page flags. The perfect hash implementation was for DISCONTIG memory model which got removed years ago by b263295dbf (x86: 64-bit, make sparsemem vmemmap the only memory model). So, the perfect hash ends up being used only during initialization when the core SPARSE code already provides perfectly acceptable generic early_pfn_to_nid() implementation. Drop phys_to_nid() and use the generic ealry_pfn_to_nid() instead. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de>
2011-03-23Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: deal with races in /proc/*/{syscall,stack,personality} proc: enable writing to /proc/pid/mem proc: make check_mem_permission() return an mm_struct on success proc: hold cred_guard_mutex in check_mem_permission() proc: disable mem_write after exec mm: implement access_remote_vm mm: factor out main logic of access_process_vm mm: use mm_struct to resolve gate vma's in __get_user_pages mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm mm: arch: make in_gate_area take an mm_struct instead of a task_struct mm: arch: make get_gate_vma take an mm_struct instead of a task_struct x86: mark associated mm when running a task in 32 bit compatibility mode x86: add context tag to mark mm when running a task in 32-bit compatibility mode auxv: require the target to be tracable (or yourself) close race in /proc/*/environ report errors in /proc/*/*map* sanely pagemap: close races with suid execve make sessionid permissions in /proc/*/task/* match those in /proc/* fix leaks in path_lookupat() Fix up trivial conflicts in fs/proc/base.c
2011-03-23mm: arch: rename in_gate_area_no_task to in_gate_area_no_mmStephen Wilson
Now that gate vma's are referenced with respect to a particular mm and not a particular task it only makes sense to propagate the change to this predicate as well. Signed-off-by: Stephen Wilson <wilsons@start.ca> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23mm: arch: make in_gate_area take an mm_struct instead of a task_structStephen Wilson
Morally, the question of whether an address lies in a gate vma should be asked with respect to an mm, not a particular task. Moreover, dropping the dependency on task_struct will help make existing and future operations on mm's more flexible and convenient. Signed-off-by: Stephen Wilson <wilsons@start.ca> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23mm: arch: make get_gate_vma take an mm_struct instead of a task_structStephen Wilson
Morally, the presence of a gate vma is more an attribute of a particular mm than a particular task. Moreover, dropping the dependency on task_struct will help make both existing and future operations on mm's more flexible and convenient. Signed-off-by: Stephen Wilson <wilsons@start.ca> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-22Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: xen: update mask_rw_pte after kernel page tables init changes xen: set max_pfn_mapped to the last pfn mapped x86: Cleanup highmap after brk is concluded Fix up trivial onflict (added header file includes) in arch/x86/mm/init_64.c
2011-03-19x86: Cleanup highmap after brk is concludedYinghai Lu
Now cleanup_highmap actually is in two steps: one is early in head64.c and only clears above _end; a second one is in init_memory_mapping() and tries to clean from _brk_end to _end. It should check if those boundaries are PMD_SIZE aligned but currently does not. Also init_memory_mapping() is called several times for numa or memory hotplug, so we really should not handle initial kernel mappings there. This patch moves cleanup_highmap() down after _brk_end is settled so we can do everything in one step. Also we honor max_pfn_mapped in the implementation of cleanup_highmap. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> LKML-Reference: <alpine.DEB.2.00.1103171739050.3382@kaball-desktop> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-03-18Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Flush TLB if PGD entry is changed in i386 PAE mode x86, dumpstack: Correct stack dump info when frame pointer is available x86: Clean up csum-copy_64.S a bit x86: Fix common misspellings x86: Fix misspelling and align params x86: Use PentiumPro-optimized partial_csum() on VIA C7
2011-03-18x86: Flush TLB if PGD entry is changed in i386 PAE modeShaohua Li
According to intel CPU manual, every time PGD entry is changed in i386 PAE mode, we need do a full TLB flush. Current code follows this and there is comment for this too in the code. But current code misses the multi-threaded case. A changed page table might be used by several CPUs, every such CPU should flush TLB. Usually this isn't a problem, because we prepopulate all PGD entries at process fork. But when the process does munmap and follows new mmap, this issue will be triggered. When it happens, some CPUs keep doing page faults: http://marc.info/?l=linux-kernel&m=129915020508238&w=2 Reported-by: Yasunori Goto<y-goto@jp.fujitsu.com> Tested-by: Yasunori Goto<y-goto@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Shaohua Li<shaohua.li@intel.com> Cc: Mallick Asit K <asit.k.mallick@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm <linux-mm@kvack.org> Cc: stable <stable@kernel.org> LKML-Reference: <1300246649.2337.95.camel@sli10-conroe> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-18x86: Fix common misspellingsLucas De Marchi
They were generated by 'codespell' and then manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: trivial@kernel.org LKML-Reference: <1300389856-1099-3-git-send-email-lucas.demarchi@profusion.mobi> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-16Merge branch 'driver-core-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (50 commits) printk: do not mangle valid userspace syslog prefixes efivars: Add Documentation efivars: Expose efivars functionality to external drivers. efivars: Parameterize operations. efivars: Split out variable registration efivars: parameterize efivars efivars: Make efivars bin_attributes dynamic efivars: move efivars globals into struct efivars drivers:misc: ti-st: fix debugging code kref: Fix typo in kref documentation UIO: add PRUSS UIO driver support Fix spelling mistakes in Documentation/zh_CN/SubmittingPatches firmware: Fix unaligned memory accesses in dmi-sysfs firmware: Add documentation for /sys/firmware/dmi firmware: Expose DMI type 15 System Event Log firmware: Break out system_event_log in dmi-sysfs firmware: Basic dmi-sysfs support firmware: Add DMI entry types to the headers Driver core: convert platform_{get,set}_drvdata to static inline functions Translate linux-2.6/Documentation/magic-number.txt into Chinese ...
2011-03-15x86, tlb, UV: Do small micro-optimization for native_flush_tlb_others()Xiao Guangrong
native_flush_tlb_others() is called from: flush_tlb_current_task() flush_tlb_mm() flush_tlb_page() All these functions disable preemption explicitly, so we can use smp_processor_id() instead of get_cpu() and put_cpu(). Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Cc: Cliff Wickman <cpw@sgi.com> LKML-Reference: <4D7EC791.4040003@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-15Merge commit 'v2.6.38' into x86/mmIngo Molnar
Conflicts: arch/x86/mm/numa_64.c Merge reason: Resolve the conflict, update the branch to .38. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-12x86-64, NUMA: Don't call numa_set_distanc() for all possible node ↵Tejun Heo
combinations during emulation The distance transforming in numa_emulation() used to call numa_set_distance() for all MAX_NUMNODES * MAX_NUMNODES node combinations regardless of which are enabled. As numa_set_distance() ignores all out-of-bound distance settings, this doesn't cause any problem other than looping unnecessarily many times during boot. However, as MAX_NUMNODES * MAX_NUMNODES can be pretty high, update the code such that it iterates through only the enabled combinations. Yinghai Lu identified the issue and provided an initial patch to address the issue; however, the patch was incorrect in that it didn't build emulated distance table when there's no physical distance table and unnecessarily complex. http://thread.gmane.org/gmane.linux.kernel/1107986/focus=1107988 Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Yinghai Lu <yinghai@kernel.org>
2011-03-10x86/mm: Fix pgd_lock deadlockAndrea Arcangeli
It's forbidden to take the page_table_lock with the irq disabled or if there's contention the IPIs (for tlb flushes) sent with the page_table_lock held will never run leading to a deadlock. Nobody takes the pgd_lock from irq context so the _irqsave can be removed. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@kernel.org> LKML-Reference: <201102162345.p1GNjMjm021738@imap1.linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-10x86/mm: Handle mm_fault_error() in kernel spaceAndrey Vagin
mm_fault_error() should not execute oom-killer, if page fault occurs in kernel space. E.g. in copy_from_user()/copy_to_user(). This would happen if we find ourselves in OOM on a copy_to_user(), or a copy_from_user() which faults. Without this patch, the kernels hangs up in copy_from_user(), because OOM killer sends SIG_KILL to current process, but it can't handle a signal while in syscall, then the kernel returns to copy_from_user(), reexcute current command and provokes page_fault again. With this patch the kernel return -EFAULT from copy_from_user(). The code, which checks that page fault occurred in kernel space, has been copied from do_sigbus(). This situation is handled by the same way on powerpc, xtensa, tile, ... Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@kernel.org> LKML-Reference: <201103092322.p29NMNPH001682@imap1.linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-04x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation()Tejun Heo
Undetermined entries in emu_nid_to_phys[] are filled with zero assuming that physical node 0 is always online; however, this might not be true depending on hardware configuration. Find a physical node which is actually online and use it instead. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: David Rientjes <rientjes@google.com> LKML-Reference: <alpine.DEB.2.00.1103020628210.31626@chino.kir.corp.google.com>
2011-03-04x86, numa: Fix numa_emulation code with memory-less node0Yinghai Lu
This crash happens on a system that does not have RAM on node0. When numa_emulation is compiled in, and: 1. we boot the system without numa=fake... 2. or we boot the system with numa=fake=128 to make emulation fail we will get: [ 0.076025] ------------[ cut here ]------------ [ 0.080004] kernel BUG at arch/x86/mm/numa_64.c:788! [ 0.080004] invalid opcode: 0000 [#1] SMP [...] need to use early_cpu_to_node() directly, because cpu_to_apicid and apicid_to_node will return node0 that is not onlined. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> LKML-Reference: <4D6ECF72.5010308@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-04x86-64, NUMA: Clean up initmem_init()David Rientjes
This patch cleans initmem_init() so that it is more readable and doesn't use an unnecessary array of function pointers to convolute the flow of the code. It also makes it obvious that dummy_numa_init() will always succeed (and documents that requirement) so that the existing BUG() is never actually reached. No functional change. -tj: Updated comment for dummy_numa_init() slightly. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2011-03-04x86-64, NUMA: Fix numa_emulation code with node0 without RAMYinghai Lu
On one system that does not have RAM on node0. When numa_emulation is compiled in, and 1. boot system without numa=fake... 2. or boot system with numa=fake=128 to make emulation fail will get: [ 0.092026] ------------[ cut here ]------------ [ 0.096005] kernel BUG at arch/x86/mm/numa_emulation.c:439! [ 0.096005] invalid opcode: 0000 [#1] SMP [ 0.096005] last sysfs file: [ 0.096005] CPU 0 [ 0.096005] Modules linked in: [ 0.096005] [ 0.096005] Pid: 0, comm: swapper Not tainted 2.6.38-rc6-tip-yh-03869-gcb0491d-dirty #684 Sun Microsystems Sun Fire X4240/Sun Fire X4240 [ 0.096005] RIP: 0010:[<ffffffff81cdc65b>] [<ffffffff81cdc65b>] numa_add_cpu+0x56/0xcf [ 0.096005] RSP: 0000:ffffffff82437ed8 EFLAGS: 00010246 ... [ 0.096005] Call Trace: [ 0.096005] [<ffffffff81cd7931>] identify_cpu+0x2d7/0x2df [ 0.096005] [<ffffffff827e54fa>] identify_boot_cpu+0x10/0x30 [ 0.096005] [<ffffffff827e5704>] check_bugs+0x9/0x2d [ 0.096005] [<ffffffff827dceda>] start_kernel+0x3d7/0x3f1 [ 0.096005] [<ffffffff827dc2cc>] x86_64_start_reservations+0x9c/0xa0 [ 0.096005] [<ffffffff827dc4ad>] x86_64_start_kernel+0x1dd/0x1e8 [ 0.096005] Code: 74 06 48 8d 04 90 eb 0f 48 c7 c0 30 d9 00 00 48 03 04 d5 90 0f 60 82 8b 00 83 f8 ff 74 0d 0f a3 05 8b 7e 92 00 19 d2 85 d2 75 02 <0f> 0b 48 98 be 00 01 00 00 48 c7 c7 e0 44 60 82 44 8b 2c 85 e0 [ 0.096005] RIP [<ffffffff81cdc65b>] numa_add_cpu+0x56/0xcf [ 0.096005] RSP <ffffffff82437ed8> [ 0.096026] ---[ end trace a7919e7f17c0a725 ]--- We need to use early_cpu_to_node() directly, because numa_cpu_node() will return node0 that is not onlined. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2011-03-04x86-64, NUMA: Revert NUMA affine page table allocationTejun Heo
This patch reverts NUMA affine page table allocation added by commit 1411e0ec31 (x86-64, numa: Put pgtable to local node memory). The commit made an undocumented change where the kernel linear mapping strictly follows intersection of e820 memory map and NUMA configuration. If the physical memory configuration has holes or NUMA nodes are not properly aligned, this leads to using unnecessarily smaller mapping size which leads to increased TLB pressure. For details, http://thread.gmane.org/gmane.linux.kernel/1104672 Patches to fix the problem have been proposed but the underlying code needs more cleanup and the approach itself seems a bit heavy handed and it has been determined to revert the feature for now and come back to it in the next developement cycle. http://thread.gmane.org/gmane.linux.kernel/1105959 As init_memory_mapping_high() callsites have been consolidated since the commit, reverting is done manually. Also, the RED-PEN comment in arch/x86/mm/init.c is not restored as the problem no longer exists with memblock based top-down early memory allocation. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de>
2011-03-02x86-64, NUMA: Better explain numa_distance handlingTejun Heo
Handling of out-of-bounds distances and allocation failure can use better documentation. Add it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Acked-by: David Rientjes <rientjes@google.com>
2011-03-02x86-64, NUMA: Fix distance table handlingYinghai Lu
NUMA distance table handling has the following problems. * numa_reset_distance() uses numa_distance * sizeof(numa_distance[0]) as the table size when it should be using the square of numa_distance. * The same size miscalculation when allocation space for phys_dist in numa_emulation(). * In numa_emulation(), phys_dist must be reserved; otherwise, the new emulated distance table may overlap it. Fix them and, while at it, take numa_distance_cnt resetting in numa_reset_distance() out of the if block to simplify the code a bit. David Rientjes reported incorrect handling of distance table during emulation. -tj: Edited out numa_alloc_distance() related changes which weren't necessary and rewrote patch description. -v2: Ingo was unhappy with 80-column limit induced linebreaks. Let lines run over 80-column. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Reported-by: David Rientjes <rientjes@google.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: David Rientjes <rientjes@google.com>
2011-02-25x86-64, NUMA: Fix size of numa_distance arrayDavid Rientjes
numa_distance should be sized like the SLIT, an NxN matrix where N is the highest node id + 1. This patch fixes the calculation to avoid overflowing the array on the subsequent iteration. -tj: The original patch used last index to calculate size. Yinghai pointed out it should be incremented so it is the number of elements instead of the last index to calculate the size of the table. Updated accordingly. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>