aboutsummaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2012-09-12mm: hugetlbfs: correctly populate shared pmdMichal Hocko
commit eb48c071464757414538c68a6033c8f8c15196f8 upstream. Each page mapped in a process's address space must be correctly accounted for in _mapcount. Normally the rules for this are straightforward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman: kernel BUG at mm/filemap.c:135! invalid opcode: 0000 [#1] SMP CPU 22 Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 RIP: 0010:[<ffffffff8112cfed>] [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170 Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20) Call Trace: delete_from_page_cache+0x40/0x80 truncate_hugepages+0x115/0x1f0 hugetlbfs_evict_inode+0x18/0x30 evict+0x9f/0x1b0 iput_final+0xe3/0x1e0 iput+0x3e/0x50 d_kill+0xf8/0x110 dput+0xe2/0x1b0 __fput+0x162/0x240 During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram: parent | ------------>pmd src_pte----------> data page ^ other--------->pmd--------------------| ^ child-----------| dst_pte For this situation to occur, it must be possible for Parent and Other to have faulted and failed to share page tables with each other. This is possible due to the following style of race. PROC A PROC B copy_hugetlb_page_range copy_hugetlb_page_range src_pte == huge_pte_offset src_pte == huge_pte_offset !src_pte so no sharing !src_pte so no sharing (time passes) hugetlb_fault hugetlb_fault huge_pte_alloc huge_pte_alloc huge_pmd_share huge_pmd_share LOCK(i_mmap_mutex) find nothing, no sharing UNLOCK(i_mmap_mutex) LOCK(i_mmap_mutex) find nothing, no sharing UNLOCK(i_mmap_mutex) pmd_alloc pmd_alloc LOCK(instantiation_mutex) fault UNLOCK(instantiation_mutex) LOCK(instantiation_mutex) fault UNLOCK(instantiation_mutex) These two processes are not poing to the same data page but are not sharing page tables because the opportunity was missed. When either process later forks, the src_pte == dst pte is potentially insufficient. As the check falls through, the wrong PTE information is copied in (harmless but wrong) and the mapcount is bumped for a page mapped by a shared page table leading to the BUG_ON. This patch addresses the issue by moving pmd_alloc into huge_pmd_share which guarantees that the shared pud is populated in the same critical section as pmd. This also means that huge_pte_offset test in huge_pmd_share is serialized correctly now which in turn means that the success of the sharing will be higher as the racing tasks see the pud and pmd populated together. Race identified and changelog written mostly by Mel Gorman. {akpm@linux-foundation.org: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style] Reported-by: Larry Woodman <lwoodman@redhat.com> Tested-by: Larry Woodman <lwoodman@redhat.com> Reviewed-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-09-12Redefine ATOMIC_INIT and ATOMIC64_INIT to drop the castsMel Gorman
commit 67a806d9499353fabd5b5ff07337f3aa88a1c3ba upstream. The following build error occurred during an alpha build: net/core/sock.c:274:36: error: initializer element is not constant Dave Anglin says: > Here is the line in sock.i: > > struct static_key memalloc_socks = ((struct static_key) { .enabled = > ((atomic_t) { (0) }) }); The above line contains two compound literals. It also uses a designated initializer to initialize the field enabled. A compound literal is not a constant expression. The location of the above statement isn't fully clear, but if a compound literal occurs outside the body of a function, the initializer list must consist of constant expressions. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Michael Cree <mcree@orcon.net.nz> Acked-by: Matt Turner <mattst88@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-09-12alpha: Don't export SOCK_NONBLOCK to user space.Michael Cree
commit a2fa3ccd7b43665fe14cb562761a6c3d26a1d13f upstream. Currently we export SOCK_NONBLOCK to user space but that conflicts with the definition from glibc leading to compilation errors in user programs (e.g. see Debian bug #658460). The generic socket.h restricts the definition of SOCK_NONBLOCK to the kernel, as does the MIPS specific socket.h, so let's do the same on Alpha. Signed-off-by: Michael Cree <mcree@orcon.net.nz> Acked-by: Matt Turner <mattst88@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-09-12ARM: 7489/1: errata: fix workaround for erratum #720789 on UP systemsWill Deacon
commit 730a8128cd8978467eb1cf546b11014acb57d433 upstream. Commit 5a783cbc4836 ("ARM: 7478/1: errata: extend workaround for erratum #720789") added workarounds for erratum #720789 to the range TLB invalidation functions with the observation that the erratum only affects SMP platforms. However, when running an SMP_ON_UP kernel on a uniprocessor platform we must take care to preserve the ASID as the workaround is not required. This patch ensures that we don't set the ASID to 0 when flushing the TLB on such a system, preserving the original behaviour with the workaround disabled. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-09-12ARM: 7488/1: mm: use 5 bits for swapfile type encodingWill Deacon
commit f5f2025ef3e2cdb593707cbf87378761f17befbe upstream. Page migration encodes the pfn in the offset field of a swp_entry_t. For LPAE, we support physical addresses of up to 36 bits (due to sparsemem limitations with the size of page flags), requiring 24 bits to represent a pfn. A further 3 bits are used to encode a swp_entry into a pte, leaving 5 bits for the type field. Furthermore, the core code defines MAX_SWAPFILES_SHIFT as 5, so the additional type bit does not get used. This patch reduces the width of the type field to 5 bits, allowing us to create up to 31 swapfiles of 64GB each. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-09-12ARM: 7483/1: vfp: only advertise VFPv4 in hwcaps if CONFIG_VFPv3 is enabledWill Deacon
commit 3d9fb0038a9b02febb01efc79a4a5d97f1822a90 upstream. VFPv4 support depends on the VFPv3 context save/restore code, so only advertise support in the hwcaps if the kernel can actually handle it. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-19s390/compat: fix mmap compat system callsHeiko Carstens
commit e85871218513c54f7dfdb6009043cb638f2fecbe upstream. The native 31 bit and the compat behaviour for the mmap system calls differ: In native 31 bit mode the passed in address for the mmap system call will be unmodified passed to sys_mmap_pgoff(). In compat mode however the passed in address will be modified with compat_ptr() which masks out the most significant bit. The result is that in native 31 bit mode each mmap request (with MAP_FIXED) will fail where the most significat bit is set, while in compat mode it may succeed. This odd behaviour was introduced with d3815898 "[S390] mmap: add missing compat_ptr conversion to both mmap compat syscalls". To restore a consistent behaviour accross native and compat mode this patch functionally reverts the above mentioned commit. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-19s390/compat: fix compat wrappers for process_vm system callsHeiko Carstens
commit 82aabdb6f1eb61e0034ec23901480f5dd23db7c4 upstream. The compat wrappers incorrectly called the non compat versions of the system process_vm system calls. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-19Input: eeti_ts: pass gpio value instead of IRQArnd Bergmann
commit 4eef6cbfcc03b294d9d334368a851b35b496ce53 upstream. The EETI touchscreen asserts its IRQ line as soon as it has data in its internal buffers. The line is automatically deasserted once all data has been read via I2C. Hence, the driver has to monitor the GPIO line and cannot simply rely on the interrupt handler reception. In the current implementation of the driver, irq_to_gpio() is used to determine the GPIO number from the i2c_client's IRQ value. As irq_to_gpio() is not available on all platforms, this patch changes this and makes the driver ignore the passed in IRQ. Instead, a GPIO is added to the platform_data struct and gpio_to_irq is used to derive the IRQ from that GPIO. If this fails, bail out. The driver is only able to work in environments where the touchscreen GPIO can be mapped to an IRQ. Without this patch, building raumfeld_defconfig results in: drivers/input/touchscreen/eeti_ts.c: In function 'eeti_ts_irq_active': drivers/input/touchscreen/eeti_ts.c:65:2: error: implicit declaration of function 'irq_to_gpio' [-Werror=implicit-function-declaration] Signed-off-by: Daniel Mack <zonque@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Sven Neumann <s.neumann@raumfeld.com> Cc: linux-input@vger.kernel.org Cc: Haojian Zhuang <haojian.zhuang@gmail.com> [bwh: Backported to 3.2: raumfeld_controller_i2c_board_info.irq was initialised using gpio_to_irq(), but this doesn't seem to matter] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-19ARM: mxs: Remove MMAP_MIN_ADDR setting from mxs_defconfigMarek Vasut
commit 3bed491c8d28329e34f8a31e3fe64d03f3a350f1 upstream. The CONFIG_DEFAULT_MMAP_MIN_ADDR was set to 65536 in mxs_defconfig, this caused severe breakage of userland applications since the upper limit for ARM is 32768. By default CONFIG_DEFAULT_MMAP_MIN_ADDR is set to 4096 and can also be changed via /proc/sys/vm/mmap_min_addr if needed. Quoting Russell King [1]: "4096 is also fine for ARM too. There's not much point in having defconfigs change it - that would just be pure noise in the config files." the CONFIG_DEFAULT_MMAP_MIN_ADDR can be removed from the defconfig altogether. This problem was introduced by commit cde7c41 (ARM: configs: add defconfig for mach-mxs). [1] http://marc.info/?l=linux-arm-kernel&m=134401593807820&w=2 Signed-off-by: Marek Vasut <marex@denx.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: Wolfgang Denk <wd@denx.de> Signed-off-by: Shawn Guo <shawn.guo@linaro.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-19KVM: VMX: Advertise CPU_BASED_RDPMC_EXITING for nested guestsStefan Bader
Based on commit fee84b079d5ddee2247b5c1f53162c330c622902 upstream. Intercept RDPMC and forward it to the PMU emulation code. Newer vmx support will only allow to load the kvm_intel module if RDPMC_EXITING is supported. Even without the actual support this part of the change is required on 3.2 hosts. BugLink: http://bugs.launchpad.net/bugs/1031090 Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10m68k: Make sys_atomic_cmpxchg_32 work on classic m68kAndreas Schwab
commit 9e2760d18b3cf179534bbc27692c84879c61b97c upstream. User space access must always go through uaccess accessors, since on classic m68k user space and kernel space are completely separate. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Tested-by: Thorsten Glaser <tg@debian.org> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10ARM: Fix undefined instruction exception handlingRussell King
commit 15ac49b65024f55c4371a53214879a9c77c4fbf9 upstream. While trying to get a v3.5 kernel booted on the cubox, I noticed that VFP does not work correctly with VFP bounce handling. This is because of the confusion over 16-bit vs 32-bit instructions, and where PC is supposed to point to. The rule is that FP handlers are entered with regs->ARM_pc pointing at the _next_ instruction to be executed. However, if the exception is not handled, regs->ARM_pc points at the faulting instruction. This is easy for ARM mode, because we know that the next instruction and previous instructions are separated by four bytes. This is not true of Thumb2 though. Since all FP instructions are 32-bit in Thumb2, it makes things easy. We just need to select the appropriate adjustment. Do this by moving the adjustment out of do_undefinstr() into the assembly code, as only the assembly code knows whether it's dealing with a 32-bit or 16-bit instruction. Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10ARM: 7478/1: errata: extend workaround for erratum #720789Will Deacon
commit 5a783cbc48367cfc7b65afc75430953dfe60098f upstream. Commit cdf357f1 ("ARM: 6299/1: errata: TLBIASIDIS and TLBIMVAIS operations can broadcast a faulty ASID") replaced by-ASID TLB flushing operations with all-ASID variants to workaround A9 erratum #720789. This patch extends the workaround to include the tlb_range operations, which were overlooked by the original patch. Tested-by: Steve Capper <steve.capper@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10ARM: 7477/1: vfp: Always save VFP state in vfp_pm_suspend on UPColin Cross
commit 24b35521b8ddf088531258f06f681bb7b227bf47 upstream. vfp_pm_suspend should save the VFP state in suspend after any lazy context switch. If it only saves when the VFP is enabled, the state can get lost when, on a UP system: Thread 1 uses the VFP Context switch occurs to thread 2, VFP is disabled but the VFP context is not saved Thread 2 initiates suspend vfp_pm_suspend is called with the VFP disabled, and the unsaved VFP context of Thread 1 in the registers Modify vfp_pm_suspend to save the VFP context whenever vfp_current_hw_state is not NULL. Includes a fix from Ido Yariv <ido@wizery.com>, who pointed out that on SMP systems, the state pointer can be pointing to a freed task struct if a task exited on another cpu, fixed by using #ifndef CONFIG_SMP in the new if clause. Cc: Barry Song <bs14@csr.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ido Yariv <ido@wizery.com> Cc: Daniel Drake <dsd@laptop.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10ARM: 7476/1: vfp: only clear vfp state for current cpu in vfp_pm_suspendColin Cross
commit a84b895a2348f0dbff31b71ddf954f70a6cde368 upstream. vfp_pm_suspend runs on each cpu, only clear the hardware state pointer for the current cpu. Prevents a possible crash if one cpu clears the hw state pointer when another cpu has already checked if it is valid. Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10ARM: 7467/1: mutex: use generic xchg-based implementation for ARMv6+Will Deacon
commit a76d7bd96d65fa5119adba97e1b58d95f2e78829 upstream. The open-coded mutex implementation for ARMv6+ cores suffers from a severe lack of barriers, so in the uncontended case we don't actually protect any accesses performed during the critical section. Furthermore, the code is largely a duplication of the ARMv6+ atomic_dec code but optimised to remove a branch instruction, as the mutex fastpath was previously inlined. Now that this is executed out-of-line, we can reuse the atomic access code for the locking (in fact, we use the xchg code as this produces shorter critical sections). This patch uses the generic xchg based implementation for mutexes on ARMv6+, which introduces barriers to the lock/unlock operations and also has the benefit of removing a fair amount of inline assembly code. Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Nicolas Pitre <nico@linaro.org> Reported-by: Shan Kang <kangshan0910@gmail.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10ARM: 7466/1: disable interrupt before spinning endlesslyShawn Guo
commit 98bd8b96b26db3399a48202318dca4aaa2515355 upstream. The CPU will endlessly spin at the end of machine_halt and machine_restart calls. However, this will lead to a soft lockup warning after about 20 seconds, if CONFIG_LOCKUP_DETECTOR is enabled, as system timer is still alive. Disable interrupt before going to spin endlessly, so that the lockup warning will never be seen. Reported-by: Marek Vasut <marex@denx.de> Signed-off-by: Shawn Guo <shawn.guo@linaro.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10s390/mm: fix fault handling for page table walk caseHeiko Carstens
commit 008c2e8f247f0a8db1e8e26139da12f3a3abcda0 upstream. Make sure the kernel does not incorrectly create a SIGBUS signal during user space accesses: For user space accesses in the switched addressing mode case the kernel may walk page tables and access user address space via the kernel mapping. If a page table entry is invalid the function __handle_fault() gets called in order to emulate a page fault and trigger all the usual actions like paging in a missing page etc. by calling handle_mm_fault(). If handle_mm_fault() returns with an error fixup handling is necessary. For the switched addressing mode case all errors need to be mapped to -EFAULT, so that the calling uaccess function can return -EFAULT to user space. Unfortunately the __handle_fault() incorrectly calls do_sigbus() if VM_FAULT_SIGBUS is set. This however should only happen if a page fault was triggered by a user space instruction. For kernel mode uaccesses the correct action is to only return -EFAULT. So user space may incorrectly see SIGBUS signals because of this bug. For current machines this would only be possible for the switched addressing mode case in conjunction with futex operations. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [bwh: Backported to 3.2: do_exception() and do_sigbus() parameters differ] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10Redefine ATOMIC_INIT and ATOMIC64_INIT to drop the castsTony Luck
commit a119365586b0130dfea06457f584953e0ff6481d upstream. The following build error occured during a ia64 build with swap-over-NFS patches applied. net/core/sock.c:274:36: error: initializer element is not constant net/core/sock.c:274:36: error: (near initialization for 'memalloc_socks') net/core/sock.c:274:36: error: initializer element is not constant This is identical to a parisc build error. Fengguang Wu, Mel Gorman and James Bottomley did all the legwork to track the root cause of the problem. This fix and entire commit log is shamelessly copied from them with one extra detail to change a dubious runtime use of ATOMIC_INIT() to atomic_set() in drivers/char/mspec.c Dave Anglin says: > Here is the line in sock.i: > > struct static_key memalloc_socks = ((struct static_key) { .enabled = > ((atomic_t) { (0) }) }); The above line contains two compound literals. It also uses a designated initializer to initialize the field enabled. A compound literal is not a constant expression. The location of the above statement isn't fully clear, but if a compound literal occurs outside the body of a function, the initializer list must consist of constant expressions. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10s390/mm: downgrade page table after fork of a 31 bit processMartin Schwidefsky
commit 0f6f281b731d20bfe75c13f85d33f3f05b440222 upstream. The downgrade of the 4 level page table created by init_new_context is currently done only in start_thread31. If a 31 bit process forks the new mm uses a 4 level page table, including the task size of 2<<42 that goes along with it. This is incorrect as now a 31 bit process can map memory beyond 2GB. Define arch_dup_mmap to do the downgrade after fork. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10x86, nops: Missing break resulting in incorrect selection on IntelAlan Cox
commit d6250a3f12edb3a86db9598ffeca3de8b4a219e9 upstream. The Intel case falls through into the generic case which then changes the values. For cases like the P6 it doesn't do the right thing so this seems to be a screwup. Signed-off-by: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/n/tip-lww2uirad4skzjlmrm0vru8o@git.kernel.org Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10m68k: Correct the Atari ALLOWINT definitionMikael Pettersson
commit c663600584a596b5e66258cc10716fb781a5c2c9 upstream. Booting a 3.2, 3.3, or 3.4-rc4 kernel on an Atari using the `nfeth' ethernet device triggers a WARN_ONCE() in generic irq handling code on the first irq for that device: WARNING: at kernel/irq/handle.c:146 handle_irq_event_percpu+0x134/0x142() irq 3 handler nfeth_interrupt+0x0/0x194 enabled interrupts Modules linked in: Call Trace: [<000299b2>] warn_slowpath_common+0x48/0x6a [<000299c0>] warn_slowpath_common+0x56/0x6a [<00029a4c>] warn_slowpath_fmt+0x2a/0x32 [<0005b34c>] handle_irq_event_percpu+0x134/0x142 [<0005b34c>] handle_irq_event_percpu+0x134/0x142 [<0000a584>] nfeth_interrupt+0x0/0x194 [<001ba0a8>] schedule_preempt_disabled+0x0/0xc [<0005b37a>] handle_irq_event+0x20/0x2c [<0005add4>] generic_handle_irq+0x2c/0x3a [<00002ab6>] do_IRQ+0x20/0x32 [<0000289e>] auto_irqhandler_fixup+0x4/0x6 [<00003144>] cpu_idle+0x22/0x2e [<001b8a78>] printk+0x0/0x18 [<0024d112>] start_kernel+0x37a/0x386 [<0003021d>] __do_proc_dointvec+0xb1/0x366 [<0003021d>] __do_proc_dointvec+0xb1/0x366 [<0024c31e>] _sinittext+0x31e/0x9c0 After invoking the irq's handler the kernel sees !irqs_disabled() and concludes that the handler erroneously enabled interrupts. However, debugging shows that !irqs_disabled() is true even before the handler is invoked, which indicates a problem in the platform code rather than the specific driver. The warning does not occur in 3.1 or older kernels. It turns out that the ALLOWINT definition for Atari is incorrect. The Atari definition of ALLOWINT is ~0x400, the stated purpose of that is to avoid taking HSYNC interrupts. irqs_disabled() returns true if the 3-bit ipl & 4 is non-zero. The nfeth interrupt runs at ipl 3 (it's autovector 3), but 3 & 4 is zero so irqs_disabled() is false, and the warning above is generated. When interrupts are explicitly disabled, ipl is set to 7. When they are enabled, ipl is masked with ALLOWINT. On Atari this will result in ipl = 3, which blocks interrupts at ipl 3 and below. So how come nfeth interrupts at ipl 3 are received at all? That's because ipl is reset to 2 by Atari-specific code in default_idle(), again with the stated purpose of blocking HSYNC interrupts. This discrepancy means that ipl 3 can remain blocked for longer than intended. Both default_idle() and falcon_hblhandler() identify HSYNC with ipl 2, and the "Atari ST/.../F030 Hardware Register Listing" agrees, but ALLOWINT is defined as if HSYNC was ipl 3. [As an experiment I modified default_idle() to reset ipl to 3, and as expected that resulted in all nfeth interrupts being blocked.] The fix is simple: define ALLOWINT as ~0x500 instead. This makes arch_local_irq_enable() consistent with default_idle(), and prevents the !irqs_disabled() problems for ipl 3 interrupts. Tested on Atari running in an Aranym VM. Signed-off-by: Mikael Pettersson <mikpe@it.uu.se> Tested-by: Michael Schmitz <schmitzmic@googlemail.com> (on Falcon/CT60) [Geert Uytterhoeven: This version applies to v3.2..v3.4.] Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10xen: mark local pages as FOREIGN in the m2p_overrideStefano Stabellini
commit b9e0d95c041ca2d7ad297ee37c2e9cfab67a188f upstream. When the frontend and the backend reside on the same domain, even if we add pages to the m2p_override, these pages will never be returned by mfn_to_pfn because the check "get_phys_to_machine(pfn) != mfn" will always fail, so the pfn of the frontend will be returned instead (resulting in a deadlock because the frontend pages are already locked). INFO: task qemu-system-i38:1085 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. qemu-system-i38 D ffff8800cfc137c0 0 1085 1 0x00000000 ffff8800c47ed898 0000000000000282 ffff8800be4596b0 00000000000137c0 ffff8800c47edfd8 ffff8800c47ec010 00000000000137c0 00000000000137c0 ffff8800c47edfd8 00000000000137c0 ffffffff82213020 ffff8800be4596b0 Call Trace: [<ffffffff81101ee0>] ? __lock_page+0x70/0x70 [<ffffffff81a0fdd9>] schedule+0x29/0x70 [<ffffffff81a0fe80>] io_schedule+0x60/0x80 [<ffffffff81101eee>] sleep_on_page+0xe/0x20 [<ffffffff81a0e1ca>] __wait_on_bit_lock+0x5a/0xc0 [<ffffffff81101ed7>] __lock_page+0x67/0x70 [<ffffffff8106f750>] ? autoremove_wake_function+0x40/0x40 [<ffffffff811867e6>] ? bio_add_page+0x36/0x40 [<ffffffff8110b692>] set_page_dirty_lock+0x52/0x60 [<ffffffff81186021>] bio_set_pages_dirty+0x51/0x70 [<ffffffff8118c6b4>] do_blockdev_direct_IO+0xb24/0xeb0 [<ffffffff811e71a0>] ? ext3_get_blocks_handle+0xe00/0xe00 [<ffffffff8118ca95>] __blockdev_direct_IO+0x55/0x60 [<ffffffff811e71a0>] ? ext3_get_blocks_handle+0xe00/0xe00 [<ffffffff811e91c8>] ext3_direct_IO+0xf8/0x390 [<ffffffff811e71a0>] ? ext3_get_blocks_handle+0xe00/0xe00 [<ffffffff81004b60>] ? xen_mc_flush+0xb0/0x1b0 [<ffffffff81104027>] generic_file_aio_read+0x737/0x780 [<ffffffff813bedeb>] ? gnttab_map_refs+0x15b/0x1e0 [<ffffffff811038f0>] ? find_get_pages+0x150/0x150 [<ffffffff8119736c>] aio_rw_vect_retry+0x7c/0x1d0 [<ffffffff811972f0>] ? lookup_ioctx+0x90/0x90 [<ffffffff81198856>] aio_run_iocb+0x66/0x1a0 [<ffffffff811998b8>] do_io_submit+0x708/0xb90 [<ffffffff81199d50>] sys_io_submit+0x10/0x20 [<ffffffff81a18d69>] system_call_fastpath+0x16/0x1b The explanation is in the comment within the code: We need to do this because the pages shared by the frontend (xen-blkfront) can be already locked (lock_page, called by do_read_cache_page); when the userspace backend tries to use them with direct_IO, mfn_to_pfn returns the pfn of the frontend, so do_blockdev_direct_IO is going to try to lock the same pages again resulting in a deadlock. A simplified call graph looks like this: pygrub QEMU ----------------------------------------------- do_read_cache_page io_submit | | lock_page ext3_direct_IO | bio_add_page | lock_page Internally the xen-blkback uses m2p_add_override to swizzle (temporarily) a 'struct page' to have a different MFN (so that it can point to another guest). It also can easily find out whether another pfn corresponding to the mfn exists in the m2p, and can set the FOREIGN bit in the p2m, making sure that mfn_to_pfn returns the pfn of the backend. This allows the backend to perform direct_IO on these pages, but as a side effect prevents the frontend from using get_user_pages_fast on them while they are being shared with the backend. Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-04x86: Simplify code by removing a !SMP #ifdefs from 'struct cpuinfo_x86'Kevin Winchester
commit 141168c36cdee3ff23d9c7700b0edc47cb65479f and commit 3f806e50981825fa56a7f1938f24c0680816be45 upstream. Several fields in struct cpuinfo_x86 were not defined for the !SMP case, likely to save space. However, those fields still have some meaning for UP, and keeping them allows some #ifdef removal from other files. The additional size of the UP kernel from this change is not significant enough to worry about keeping up the distinction: text data bss dec hex filename 4737168 506459 972040 6215667 5ed7f3 vmlinux.o.before 4737444 506459 972040 6215943 5ed907 vmlinux.o.after for a difference of 276 bytes for an example UP config. If someone wants those 276 bytes back badly then it should be implemented in a cleaner way. Signed-off-by: Kevin Winchester <kjwinchester@gmail.com> Cc: Steffen Persvold <sp@numascale.com> Link: http://lkml.kernel.org/r/1324428742-12498-1-git-send-email-kjwinchester@gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-02s390/idle: fix sequence handling vs cpu hotplugHeiko Carstens
commit 0008204ffe85d23382d6fd0f971f3f0fbe70bae2 upstream. The s390 idle accounting code uses a sequence counter which gets used when the per cpu idle statistics get updated and read. One assumption on read access is that only when the sequence counter is even and did not change while reading all values the result is valid. On cpu hotplug however the per cpu data structure gets initialized via a cpu hotplug notifier on CPU_ONLINE. CPU_ONLINE however is too late, since the onlined cpu is already running and might access the per cpu data. Worst case is that the data structure gets initialized while an idle thread is updating its idle statistics. This will result in an uneven sequence counter after an update. As a result user space tools like top, which access /proc/stat in order to get idle stats, will busy loop waiting for the sequence counter to become even again, which will never happen until the queried cpu will update its idle statistics again. And even then the sequence counter will only have an even value for a couple of cpu cycles. Fix this by moving the initialization of the per cpu idle statistics to cpu_init(). I prefer that solution in favor of changing the notifier to CPU_UP_PREPARE, which would be a different solution to the problem. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-02powerpc: Add "memory" attribute for mfmsr()Tiejun Chen
commit b416c9a10baae6a177b4f9ee858b8d309542fbef upstream. Add "memory" attribute in inline assembly language as a compiler barrier to make sure 4.6.x GCC don't reorder mfmsr(). Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-02x86, microcode: Sanitize per-cpu microcode reloading interfaceBorislav Petkov
commit c9fc3f778a6a215ace14ee556067c73982b6d40f upstream. Microcode reloading in a per-core manner is a very bad idea for both major x86 vendors. And the thing is, we have such interface with which we can end up with different microcode versions applied on different cores of an otherwise homogeneous wrt (family,model,stepping) system. So turn off the possibility of doing that per core and allow it only system-wide. This is a minimal fix which we'd like to see in stable too thus the more-or-less arbitrary decision to allow system-wide reloading only on the BSP: $ echo 1 > /sys/devices/system/cpu/cpu0/microcode/reload ... and disable the interface on the other cores: $ echo 1 > /sys/devices/system/cpu/cpu23/microcode/reload -bash: echo: write error: Invalid argument Also, allowing the reload only from one CPU (the BSP in that case) doesn't allow the reload procedure to degenerate into an O(n^2) deal when triggering reloads from all /sys/devices/system/cpu/cpuX/microcode/reload sysfs nodes simultaneously. A more generic fix will follow. Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Link: http://lkml.kernel.org/r/1340280437-7718-2-git-send-email-bp@amd64.org Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-02x86, microcode: microcode_core.c simple_strtoul cleanupShuah Khan
commit e826abd523913f63eb03b59746ffb16153c53dc4 upstream. Change reload_for_cpu() in kernel/microcode_core.c to call kstrtoul() instead of calling obsoleted simple_strtoul(). Signed-off-by: Shuah Khan <shuahkhan@gmail.com> Reviewed-by: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/1336324264.2897.9.camel@lorien2 Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-02x86: Fix boot on Twinhead H12YAlan Cox
commit 80b3e557371205566a71e569fbfcce5b11f92dbe upstream. Despite lots of investigation into why this is needed we don't know or have an elegant cure. The only answer found on this laptop is to mark a problem region as used so that Linux doesn't put anything there. Currently all the users add reserve= command lines and anyone not knowing this needs to find the magic page that documents it. Automate it instead. Signed-off-by: Alan Cox <alan@linux.intel.com> Tested-and-bugfixed-by: Arne Fitzenreiter <arne@fitzenreiter.de> Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=10231 Link: http://lkml.kernel.org/r/20120515174347.5109.94551.stgit@bluebook Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-02powerpc/ftrace: Fix assembly trampoline register usageroger blofeld
commit fd5a42980e1cf327b7240adf5e7b51ea41c23437 upstream. Just like the module loader, ftrace needs to be updated to use r12 instead of r11 with newer gcc's. Signed-off-by: Roger Blofeld <blofeldus@yahoo.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25ACPI, x86: fix Dell M6600 ACPI reboot regression via DMIZhang Rui
commit 76eb9a30db4bc8fd172f9155247264b5f2686d7b upstream. Dell Precision M6600 is known to require PCI reboot, so add it to the reboot blacklist in pci_reboot_dmi_table[]. https://bugzilla.kernel.org/show_bug.cgi?id=42749 cc: x86@kernel.org Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25ACPI: Add a quirk for "AMILO PRO V2030" to ignore the timer overridingFeng Tang
commit b939c2acf1dc42b08407ef5174f2e8d6f43dd5ea upstream. commit f6b54f083cc66cf9b11d2120d8df3c2ad4e0836d upstream. This is the 2nd part of fix for kernel bugzilla 40002: "IRQ 0 assigned to VGA" https://bugzilla.kernel.org/show_bug.cgi?id=40002 The root cause is the buggy FW, whose ACPI tables assign the GSI 16 to 2 irqs 0 and 16(VGA), and the VGA is the right owner of GSI 16. So add a quirk to ignore the irq0 overriding GSI 16 for the FUJITSU SIEMENS AMILO PRO V2030 platform will solve this issue. Reported-and-tested-by: Szymon Kowalczyk <fazerxlo@o2.pl> Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25ACPI: Remove one board specific WARN when ignoring timer overridingFeng Tang
commit 5752cdb805ff89942d99d12118e2844e7db34df8 upstream. commit 7f68b4c2e158019c2ec494b5cfbd9c83b4e5b253 upstream. Current WARN msg is only for the ati_ixp4x0 board, while this function is used by mulitple platforms. So this one board specific warning is not appropriate any more. Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25ACPI: Make acpi_skip_timer_override cover all source_irq==0 casesFeng Tang
commit ae10ccdc3093486f8c2369d227583f9d79f628e5 upstream. Currently when acpi_skip_timer_override is set, it only cover the (source_irq == 0 && global_irq == 2) cases. While there is also platform which need use this option and its global_irq is not 2. This patch will extend acpi_skip_timer_override to cover all timer overriding cases as long as the source irq is 0. This is the first part of a fix to kernel bug bugzilla 40002: "IRQ 0 assigned to VGA" https://bugzilla.kernel.org/show_bug.cgi?id=40002 Reported-and-tested-by: Szymon Kowalczyk <fazerxlo@o2.pl> Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25powerpc: Fix wrong divisor in usecs_to_cputimeAndreas Schwab
commit 9f5072d4f63f28d30d343573830ac6c85fc0deff upstream. Commit d57af9b (taskstats: use real microsecond granularity for CPU times) renamed msecs_to_cputime to usecs_to_cputime, but failed to update all numbers on the way. This causes nonsensical cpu idle/iowait values to be displayed in /proc/stat (the only user of usecs_to_cputime so far). This also renames __cputime_msec_factor to __cputime_usec_factor, adapting its value and using it directly in cputime_to_usecs instead of doing two multiplications. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Acked-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25MIPS: Properly align the .data..init_task section.David Daney
commit 7b1c0d26a8e272787f0f9fcc5f3e8531df3b3409 upstream. Improper alignment can lead to unbootable systems and/or random crashes. [ralf@linux-mips.org: This is a lond standing bug since 6eb10bc9e2deab06630261cd05c4cb1e9a60e980 (kernel.org) rsp. c422a10917f75fd19fa7fe070aaaa23e384dae6f (lmo) [MIPS: Clean up linker script using new linker script macros.] so dates back to 2.6.32.] Signed-off-by: David Daney <david.daney@cavium.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/3881/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25ARM: SAMSUNG: fix race in s3c_adc_start for ADCTodd Poynor
commit 8265981bb439f3ecc5356fb877a6c2a6636ac88a upstream. Checking for adc->ts_pend already claimed should be done with the lock held. Signed-off-by: Todd Poynor <toddpoynor@google.com> Acked-by: Ben Dooks <ben-linux@fluff.org> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-12ARM: fix rcu stalls on SMP platformsRussell King
commit 7deabca0acfe02b8e18f59a4c95676012f49a304 upstream. We can stall RCU processing on SMP platforms if a CPU sits in its idle loop for a long time. This happens because we don't call irq_enter() and irq_exit() around generic_smp_call_function_interrupt() and friends. Add the necessary calls, and remove the one from within ipi_timer(), so that they're all in a common place. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-12powerpc/kvm: sldi should be sldMichael Neuling
commit 2f584a146a2965b82fce89b8d2f95dc5cfe468d0 upstream. Since we are taking a registers, this should never have been an sldi. Talking to paulus offline, this is the correct fix. Was introduced by: commit 19ccb76a1938ab364a412253daec64613acbf3df Author: Paul Mackerras <paulus@samba.org> Date: Sat Jul 23 17:42:46 2011 +1000 Talking to paulus, this shouldn't be a literal. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-12powerpc/xmon: Use cpumask iterator to avoid warningAnton Blanchard
commit bc1d7702910c7c7e88eb60b58429dbfe293683ce upstream. We have a bug report where the kernel hits a warning in the cpumask code: WARNING: at include/linux/cpumask.h:107 Which is: WARN_ON_ONCE(cpu >= nr_cpumask_bits); The backtrace is: cpu_cmd cmds xmon_core xmon die xmon is iterating through 0 to NR_CPUS. I'm not sure why we are still open coding this but iterating above nr_cpu_ids is definitely a bug. This patch iterates through all possible cpus, in case we issue a system reset and CPUs in an offline state call in. Perhaps the old code was trying to handle CPUs that were in the partition but were never started (eg kexec into a kernel with an nr_cpus= boot option). They are going to die way before we get into xmon since we haven't set any kernel state up for them. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-04x86, cpufeature: Rename X86_FEATURE_DTS to X86_FEATURE_DTHERMH. Peter Anvin
commit 4ad33411308596f2f918603509729922a1ec4411 upstream. It makes sense to label "Digital Thermal Sensor" as "DTS", but unfortunately the string "dts" was already used for "Debug Store", and /proc/cpuinfo is a user space ABI. Therefore, rename this to "dtherm". This conflict went into mainline via the hwmon tree without any x86 maintainer ack, and without any kind of hint in the subject. a4659053 x86/hwmon: fix initialization of coretemp Reported-by: Jean Delvare <khali@linux-fr.org> Link: http://lkml.kernel.org/r/4FE34BCB.5050305@linux.intel.com Cc: Jan Beulich <JBeulich@suse.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> [bwh: Backported to 3.2: drop the coretemp device table change] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-04ARM: SAMSUNG: Fix for S3C2412 EBI memory mappingJose Miguel Goncalves
commit 3dca938656c7b0ff6b0717a5dde0f5f45e592be5 upstream. While upgrading the kernel on a S3C2412 based board I've noted that it was impossible to boot the board with a 2.6.32 or upper kernel. I've tracked down the problem to the EBI virtual memory mapping that is in conflict with the IO mapping definition in arch/arm/mach-s3c24xx/s3c2412.c. Signed-off-by: Jose Miguel Goncalves <jose.goncalves@inov.pt> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-04ARM: SAMSUNG: Should check for IS_ERR(clk) instead of NULLJonghwan Choi
commit a5d8f4765f0e92ef027492a8cb979c5b8d45f2c3 upstream. On the error condition clk_get() returns ERR_PTR(). Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-04thp: avoid atomic64_read in pmd_read_atomic for 32bit PAEAndrea Arcangeli
commit e4eed03fd06578571c01d4f1478c874bb432c815 upstream. In the x86 32bit PAE CONFIG_TRANSPARENT_HUGEPAGE=y case while holding the mmap_sem for reading, cmpxchg8b cannot be used to read pmd contents under Xen. So instead of dealing only with "consistent" pmdvals in pmd_none_or_trans_huge_or_clear_bad() (which would be conceptually simpler) we let pmd_none_or_trans_huge_or_clear_bad() deal with pmdvals where the low 32bit and high 32bit could be inconsistent (to avoid having to use cmpxchg8b). The only guarantee we get from pmd_read_atomic is that if the low part of the pmd was found null, the high part will be null too (so the pmd will be considered unstable). And if the low part of the pmd is found "stable" later, then it means the whole pmd was read atomically (because after a pmd is stable, neither MADV_DONTNEED nor page faults can alter it anymore, and we read the high part after the low part). In the 32bit PAE x86 case, it is enough to read the low part of the pmdval atomically to declare the pmd as "stable" and that's true for THP and no THP, furthermore in the THP case we also have a barrier() that will prevent any inconsistent pmdvals to be cached by a later re-read of the *pmd. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Jonathan Nieder <jrnieder@gmail.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Tested-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-04mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race conditionAndrea Arcangeli
commit 26c191788f18129af0eb32a358cdaea0c7479626 upstream. When holding the mmap_sem for reading, pmd_offset_map_lock should only run on a pmd_t that has been read atomically from the pmdp pointer, otherwise we may read only half of it leading to this crash. PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic" #0 [f06a9dd8] crash_kexec at c049b5ec #1 [f06a9e2c] oops_end at c083d1c2 #2 [f06a9e40] no_context at c0433ded #3 [f06a9e64] bad_area_nosemaphore at c043401a #4 [f06a9e6c] __do_page_fault at c0434493 #5 [f06a9eec] do_page_fault at c083eb45 #6 [f06a9f04] error_code (via page_fault) at c083c5d5 EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP: 00000000 DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0 CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246 #7 [f06a9f38] _spin_lock at c083bc14 #8 [f06a9f44] sys_mincore at c0507b7d #9 [f06a9fb0] system_call at c083becd start len EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00 SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033 CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286 This should be a longstanding bug affecting x86 32bit PAE without THP. Only archs with 64bit large pmd_t and 32bit unsigned long should be affected. With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad() would partly hide the bug when the pmd transition from none to stable, by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is enabled a new set of problem arises by the fact could then transition freely in any of the none, pmd_trans_huge or pmd_trans_stable states. So making the barrier in pmd_none_or_trans_huge_or_clear_bad() unconditional isn't good idea and it would be a flakey solution. This should be fully fixed by introducing a pmd_read_atomic that reads the pmd in order with THP disabled, or by reading the pmd atomically with cmpxchg8b with THP enabled. Luckily this new race condition only triggers in the places that must already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix is localized there but this bug is not related to THP. NOTE: this can trigger on x86 32bit systems with PAE enabled with more than 4G of ram, otherwise the high part of the pmd will never risk to be truncated because it would be zero at all times, in turn so hiding the SMP race. This bug was discovered and fully debugged by Ulrich, quote: ---- [..] pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and eax. 496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd) 497 { 498 /* depend on compiler for an atomic pmd read */ 499 pmd_t pmdval = *pmd; // edi = pmd pointer 0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi ... // edx = PTE page table high address 0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx ... // eax = PTE page table low address 0xc0507a8e <sys_mincore+574>: mov (%edi),%eax [..] Please note that the PMD is not read atomically. These are two "mov" instructions where the high order bits of the PMD entry are fetched first. Hence, the above machine code is prone to the following race. - The PMD entry {high|low} is 0x0000000000000000. The "mov" at 0xc0507a84 loads 0x00000000 into edx. - A page fault (on another CPU) sneaks in between the two "mov" instructions and instantiates the PMD. - The PMD entry {high|low} is now 0x00000003fda38067. The "mov" at 0xc0507a8e loads 0xfda38067 into eax. ---- Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-06-19ARM i.MX imx21ads: Fix overlapping static i/o mappingsJaccon Bastiaansen
commit 350ab15bb2ffe7103bc6bf6c634f3c5b286eaf2a upstream. The statically defined I/O memory regions for the i.MX21 on chip peripherals and the on board I/O peripherals of the i.MX21ADS board overlap. This results in a kernel crash during startup. This is fixed by reducing the memory range for the on board I/O peripherals to the actually required range. Signed-off-by: Jaccon Bastiaansen <jaccon.bastiaansen@gmail.com> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-06-19ARM: imx6: exit coherency when shutting down a cpuShawn Guo
commit 602bf40971d7f9a1ec0b7ba2b7e6427849828651 upstream. There is a system hang issue on imx6q which can easily be seen with running a cpu hotplug stress testing (hotplug secondary cores from user space via sysfs interface for thousands iterations). It turns out that the issue is caused by coherency of the cpu that is being shut down. When shutting down a cpu, we need to have the cpu exit coherency to prevent it from receiving cache, TLB, or BTB maintenance operations broadcast by other CPUs in the cluster. Copy cpu_enter_lowpower() and cpu_leave_lowpower() from mach-vexpress to have coherency properly handled in platform_cpu_die(), thus fix the issue. Signed-off-by: Shawn Guo <shawn.guo@linaro.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-06-19ARM i.MX53: Fix PLL4 base addressSascha Hauer
commit cdd781ab1906d039c2a93078385645d2d5af8491 upstream. MX53_DPLL4_BASE accidently returned the base address of PLL3. Fix this. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> [bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-06-19xen/setup: filter APERFMPERF cpuid feature outAndre Przywara
commit 5e626254206a709c6e937f3dda69bf26c7344f6f upstream. Xen PV kernels allow access to the APERF/MPERF registers to read the effective frequency. Access to the MSRs is however redirected to the currently scheduled physical CPU, making consecutive read and compares unreliable. In addition each rdmsr traps into the hypervisor. So to avoid bogus readouts and expensive traps, disable the kernel internal feature flag for APERF/MPERF if running under Xen. This will a) remove the aperfmperf flag from /proc/cpuinfo b) not mislead the power scheduler (arch/x86/kernel/cpu/sched.c) to use the feature to improve scheduling (by default disabled) c) not mislead the cpufreq driver to use the MSRs This does not cover userland programs which access the MSRs via the device file interface, but this will be addressed separately. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>