aboutsummaryrefslogtreecommitdiff
path: root/arch/x86/include
AgeCommit message (Collapse)Author
2013-08-13mm: save soft-dirty bits on swapped pagesCyrill Gorcunov
Andy Lutomirski reported that if a page with _PAGE_SOFT_DIRTY bit set get swapped out, the bit is getting lost and no longer available when pte read back. To resolve this we introduce _PTE_SWP_SOFT_DIRTY bit which is saved in pte entry for the page being swapped out. When such page is to be read back from a swap cache we check for bit presence and if it's there we clear it and restore the former _PAGE_SOFT_DIRTY bit back. One of the problem was to find a place in pte entry where we can save the _PTE_SWP_SOFT_DIRTY bit while page is in swap. The _PAGE_PSE was chosen for that, it doesn't intersect with swap entry format stored in pte. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13sched: fix the theoretical signal_wake_up() vs schedule() raceOleg Nesterov
This is only theoretical, but after try_to_wake_up(p) was changed to check p->state under p->pi_lock the code like __set_current_state(TASK_INTERRUPTIBLE); schedule(); can miss a signal. This is the special case of wait-for-condition, it relies on try_to_wake_up/schedule interaction and thus it does not need mb() between __set_current_state() and if(signal_pending). However, this __set_current_state() can move into the critical section protected by rq->lock, now that try_to_wake_up() takes another lock we need to ensure that it can't be reordered with "if (signal_pending(current))" check inside that section. The patch is actually one-liner, it simply adds smp_wmb() before spin_lock_irq(rq->lock). This is what try_to_wake_up() already does by the same reason. We turn this wmb() into the new helper, smp_mb__before_spinlock(), for better documentation and to allow the architectures to change the default implementation. While at it, kill smp_mb__after_lock(), it has no callers. Perhaps we can also add smp_mb__before/after_spinunlock() for prepare_to_wait(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13KVM: x86: Update symbolic exit codesJan Kiszka
Add decoding for INVEPT and reorder the list according to the reason numbers. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-12Merge tag 'please-pull-mce-f-bit' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/ras Pull MCE-uncorrected-error fix from Tony Luck: "Bit 12 may or may not be set in MCi_STATUS.MCACOD when an uncorrected error is reported. Ignore it when checking error signatures." Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-12x86, microcode, AMD: Fix early microcode loadingTorsten Kaiser
load_microcode_amd() (and the helper it is using) should not have an cpu parameter. The microcode loading does not depend on the CPU wrt the patches loaded since they will end up in a global list for all CPUs anyway. The change from cpu to x86family in load_microcode_amd() now allows to drop the code messing with cpu_data(cpu) from collect_cpu_info_amd_early(), which is wrong anyway because at that point the per-cpu cpu_info is not yet setup (These values would later be overwritten by smp_store_boot_cpu_info() / smp_store_cpu_info()). Fold the rest of collect_cpu_info_amd_early() into load_ucode_amd_ap(), because its only used at one place and without the cpuinfo_x86 accesses it was not much left. Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com> [ Fengguang: build fix ] Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> [ Boris: adapt it to current tree. ] Signed-off-by: Borislav Petkov <bp@suse.de>
2013-08-12Merge branch 'x86/mce' into x86/rasIngo Molnar
Pursue a single RAS/MCE topic branch on x86. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-12PCI: use weak functions for MSI arch-specific functionsThomas Petazzoni
Until now, the MSI architecture-specific functions could be overloaded using a fairly complex set of #define and compile-time conditionals. In order to prepare for the introduction of the msi_chip infrastructure, it is desirable to switch all those functions to use the 'weak' mechanism. This commit converts all the architectures that were overidding those MSI functions to use the new strategy. Note that we keep two separate, non-weak, functions default_teardown_msi_irqs() and default_restore_msi_irqs() for the default behavior of the arch_teardown_msi_irqs() and arch_restore_msi_irqs(), as the default behavior is needed by x86 PCI code. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Tested-by: Daniel Price <daniel.price@gmail.com> Tested-by: Thierry Reding <thierry.reding@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: linuxppc-dev@lists.ozlabs.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: linux-s390@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: x86@kernel.org Cc: Russell King <linux@arm.linux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: linux-ia64@vger.kernel.org Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: David S. Miller <davem@davemloft.net> Cc: sparclinux@vger.kernel.org Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Jason Cooper <jason@lakedaemon.net>
2013-08-09x86: Don't clear olpc_ofw_header when sentinel is detectedDaniel Drake
OpenFirmware wasn't quite following the protocol described in boot.txt and the kernel has detected this through use of the sentinel value in boot_params. OFW does zero out almost all of the stuff that it should do, but not the sentinel. This causes the kernel to clear olpc_ofw_header, which breaks x86 OLPC support. OpenFirmware has now been fixed. However, it would be nice if we could maintain Linux compatibility with old firmware versions. To do that, we just have to avoid zeroing out olpc_ofw_header. OFW does not write to any other parts of the header that are being zapped by the sentinel-detection code, and all users of olpc_ofw_header are somewhat protected through checking for the OLPC_OFW_SIG magic value before using it. So this should not cause any problems for anyone. Signed-off-by: Daniel Drake <dsd@laptop.org> Link: http://lkml.kernel.org/r/20130809221420.618E6FAB03@dev.laptop.org Acked-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: <stable@vger.kernel.org> # v3.9+
2013-08-09xen: Support 64-bit PV guest receiving NMIsKonrad Rzeszutek Wilk
This is based on a patch that Zhenzhong Duan had sent - which was missing some of the remaining pieces. The kernel has the logic to handle Xen-type-exceptions using the paravirt interface in the assembler code (see PARAVIRT_ADJUST_EXCEPTION_FRAME - pv_irq_ops.adjust_exception_frame and and INTERRUPT_RETURN - pv_cpu_ops.iret). That means the nmi handler (and other exception handlers) use the hypervisor iret. The other changes that would be neccessary for this would be to translate the NMI_VECTOR to one of the entries on the ipi_vector and make xen_send_IPI_mask_allbutself use different events. Fortunately for us commit 1db01b4903639fcfaec213701a494fe3fb2c490b (xen: Clean up apic ipi interface) implemented this and we piggyback on the cleanup such that the apic IPI interface will pass the right vector value for NMI. With this patch we can trigger NMIs within a PV guest (only tested x86_64). For this to work with normal PV guests (not initial domain) we need the domain to be able to use the APIC ops - they are already implemented to use the Xen event channels. For that to be turned on in a PV domU we need to remove the masking of X86_FEATURE_APIC. Incidentally that means kgdb will also now work within a PV guest without using the 'nokgdbroundup' workaround. Note that the 32-bit version is different and this patch does not enable that. CC: Lisa Nguyen <lisa@xenapiadmin.com> CC: Ben Guthro <benjamin.guthro@citrix.com> CC: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [v1: Fixed up per David Vrabel comments] Reviewed-by: Ben Guthro <benjamin.guthro@citrix.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com>
2013-08-09kvm uapi: Add KICK_CPU and PV_UNHALT definition to uapiRaghavendra K T
These are needed by both guest and host. Originally-from: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1376058122-8248-13-git-send-email-raghavendra.kt@linux.vnet.ibm.com Acked-by: Gleb Natapov <gleb@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-09x86, ticketlock: Add slowpath logicJeremy Fitzhardinge
Maintain a flag in the LSB of the ticket lock tail which indicates whether anyone is in the lock slowpath and may need kicking when the current holder unlocks. The flags are set when the first locker enters the slowpath, and cleared when unlocking to an empty queue (ie, no contention). In the specific implementation of lock_spinning(), make sure to set the slowpath flags on the lock just before blocking. We must do this before the last-chance pickup test to prevent a deadlock with the unlocker: Unlocker Locker test for lock pickup -> fail unlock test slowpath -> false set slowpath flags block Whereas this works in any ordering: Unlocker Locker set slowpath flags test for lock pickup -> fail block unlock test slowpath -> true, kick If the unlocker finds that the lock has the slowpath flag set but it is actually uncontended (ie, head == tail, so nobody is waiting), then it clears the slowpath flag. The unlock code uses a locked add to update the head counter. This also acts as a full memory barrier so that its safe to subsequently read back the slowflag state, knowing that the updated lock is visible to the other CPUs. If it were an unlocked add, then the flag read may just be forwarded from the store buffer before it was visible to the other CPUs, which could result in a deadlock. Unfortunately this means we need to do a locked instruction when unlocking with PV ticketlocks. However, if PV ticketlocks are not enabled, then the old non-locked "add" is the only unlocking code. Note: this code relies on gcc making sure that unlikely() code is out of line of the fastpath, which only happens when OPTIMIZE_SIZE=n. If it doesn't the generated code isn't too bad, but its definitely suboptimal. Thanks to Srivatsa Vaddagiri for providing a bugfix to the original version of this change, which has been folded in. Thanks to Stephan Diestelhorst for commenting on some code which relied on an inaccurate reading of the x86 memory ordering rules. Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org> Link: http://lkml.kernel.org/r/1376058122-8248-11-git-send-email-raghavendra.kt@linux.vnet.ibm.com Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Stephan Diestelhorst <stephan.diestelhorst@amd.com> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-09x86, pvticketlock: When paravirtualizing ticket locks, increment by 2Jeremy Fitzhardinge
Increment ticket head/tails by 2 rather than 1 to leave the LSB free to store a "is in slowpath state" bit. This halves the number of possible CPUs for a given ticket size, but this shouldn't matter in practice - kernels built for 32k+ CPU systems are probably specially built for the hardware rather than a generic distro kernel. Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org> Link: http://lkml.kernel.org/r/1376058122-8248-9-git-send-email-raghavendra.kt@linux.vnet.ibm.com Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: Attilio Rao <attilio.rao@citrix.com> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-09x86, pvticketlock: Use callee-save for lock_spinningJeremy Fitzhardinge
Although the lock_spinning calls in the spinlock code are on the uncommon path, their presence can cause the compiler to generate many more register save/restores in the function pre/postamble, which is in the fast path. To avoid this, convert it to using the pvops callee-save calling convention, which defers all the save/restores until the actual function is called, keeping the fastpath clean. Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org> Link: http://lkml.kernel.org/r/1376058122-8248-8-git-send-email-raghavendra.kt@linux.vnet.ibm.com Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: Attilio Rao <attilio.rao@citrix.com> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-09x86, ticketlock: Collapse a layer of functionsJeremy Fitzhardinge
Now that the paravirtualization layer doesn't exist at the spinlock level any more, we can collapse the __ticket_ functions into the arch_ functions. Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org> Link: http://lkml.kernel.org/r/1376058122-8248-4-git-send-email-raghavendra.kt@linux.vnet.ibm.com Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: Attilio Rao <attilio.rao@citrix.com> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-09x86, spinlock: Replace pv spinlocks with pv ticketlocksJeremy Fitzhardinge
Rather than outright replacing the entire spinlock implementation in order to paravirtualize it, keep the ticket lock implementation but add a couple of pvops hooks on the slow patch (long spin on lock, unlocking a contended lock). Ticket locks have a number of nice properties, but they also have some surprising behaviours in virtual environments. They enforce a strict FIFO ordering on cpus trying to take a lock; however, if the hypervisor scheduler does not schedule the cpus in the correct order, the system can waste a huge amount of time spinning until the next cpu can take the lock. (See Thomas Friebel's talk "Prevent Guests from Spinning Around" http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.) To address this, we add two hooks: - __ticket_spin_lock which is called after the cpu has been spinning on the lock for a significant number of iterations but has failed to take the lock (presumably because the cpu holding the lock has been descheduled). The lock_spinning pvop is expected to block the cpu until it has been kicked by the current lock holder. - __ticket_spin_unlock, which on releasing a contended lock (there are more cpus with tail tickets), it looks to see if the next cpu is blocked and wakes it if so. When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub functions causes all the extra code to go away. Results: ======= setup: 32 core machine with 32 vcpu KVM guest (HT off) with 8GB RAM base = 3.11-rc patched = base + pvspinlock V12 +-----------------+----------------+--------+ dbench (Throughput in MB/sec. Higher is better) +-----------------+----------------+--------+ | base (stdev %)|patched(stdev%) | %gain | +-----------------+----------------+--------+ | 15035.3 (0.3) |15150.0 (0.6) | 0.8 | | 1470.0 (2.2) | 1713.7 (1.9) | 16.6 | | 848.6 (4.3) | 967.8 (4.3) | 14.0 | | 652.9 (3.5) | 685.3 (3.7) | 5.0 | +-----------------+----------------+--------+ pvspinlock shows benefits for overcommit ratio > 1 for PLE enabled cases, and undercommits results are flat Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org> Link: http://lkml.kernel.org/r/1376058122-8248-2-git-send-email-raghavendra.kt@linux.vnet.ibm.com Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: Attilio Rao <attilio.rao@citrix.com> [ Raghavendra: Changed SPIN_THRESHOLD, fixed redefinition of arch_spinlock_t] Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-07x86, relocs: Move ELF relocation handling to CKees Cook
Moves the relocation handling into C, after decompression. This requires that the decompressed size is passed to the decompression routine as well so that relocations can be found. Only kernels that need relocation support will use the code (currently just x86_32), but this is laying the ground work for 64-bit using it in support of KASLR. Based on work by Neill Clift and Michael Davidson. Signed-off-by: Kees Cook <keescook@chromium.org> Link: http://lkml.kernel.org/r/20130708161517.GA4832@www.outflux.net Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-07nEPT: Nested INVEPTNadav Har'El
If we let L1 use EPT, we should probably also support the INVEPT instruction. In our current nested EPT implementation, when L1 changes its EPT table for L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in the course of this modification already calls INVEPT. But if last level of shadow page is unsync not all L1's changes to EPT12 are intercepted, which means roots need to be synced when L1 calls INVEPT. Global INVEPT should not be different since roots are synced by kvm_mmu_load() each time EPTP02 changes. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07nEPT: Add nEPT violation/misconfigration supportYang Zhang
Inject nEPT fault to L1 guest. This patch is original from Xinhao. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-06x86/jump-label: Use best default nops for inital jump label callsSteven Rostedt
As specified by H. Peter Anvin, the best nops for x86 without knowing the running computer is: 32bit: 0x3e, 0x8d, 0x74, 0x26, 0x00 also known as GENERIC_NOP5_ATOMIC 64bit: 0x0f, 0x1f, 0x44, 0x00, 0x00 also known as P6_NOP5_ATOMIC Currently the default nop that is used by jump label is: 0xe9 0x00 0x00 0x00 0x00 Which is really a 5byte jump to the next position. It's better to use a real nop than a jmp. Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Jason Baron <jbaron@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-08-06x86, asmlinkage, vdso: Mark vdso variables __visibleAndi Kleen
Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1375740170-7446-17-git-send-email-andi@firstfloor.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-06x86, asmlinkage: Make 64bit checksum functions visibleAndi Kleen
They are implemented in assembler. Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1375740170-7446-14-git-send-email-andi@firstfloor.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-06x86, asmlinkage, paravirt: Add __visible/asmlinkage to xen paravirt opsAndi Kleen
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1375740170-7446-13-git-send-email-andi@firstfloor.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-06x86, asmlinkage: Make several variables used from assembler/linker script ↵Andi Kleen
visible Plus one function, load_gs_index(). Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1375740170-7446-10-git-send-email-andi@firstfloor.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-06x86, asmlinkage: Make kprobes code visible and fix assembler codeAndi Kleen
- Make all the external assembler template symbols __visible - Move the templates inline assembler code into a top level assembler statement, not inside a function. This avoids it being optimized away or cloned. Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1375740170-7446-8-git-send-email-andi@firstfloor.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-06x86, asmlinkage: Make various syscalls asmlinkageAndi Kleen
FWIW I suspect sys_rt_sigreturn/sys_sigreturn should use standard SYSCALL wrappers. But I didn't do that change in this patch. Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1375740170-7446-7-git-send-email-andi@firstfloor.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-06x86, asmlinkage: Make 32bit/64bit __switch_to visibleAndi Kleen
This function is called from inline assembler, so has to be visible. Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1375740170-7446-6-git-send-email-andi@firstfloor.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-06x86, asmlinkage: Make _*_start_kernel visibleAndi Kleen
Obviously these functions have to be visible, otherwise the whole kernel could be optimized away. Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1375740170-7446-5-git-send-email-andi@firstfloor.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-06x86, asmlinkage: Make all interrupt handlers asmlinkage / __visibleAndi Kleen
These handlers are all referenced from assembler stubs, so need to be visible. The handlers without arguments become asmlinkage, the others __visible to not force regparms(0) on x86-32. I put it all into a single patch, please let me know if you want it it split up. Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1375740170-7446-4-git-send-email-andi@firstfloor.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-06x86, asmlinkage: Change dotraplinkage into __visible on 32bitAndi Kleen
Mark 32bit dotraplinkage functions as __visible for LTO. 64bit already is using asmlinkage which includes it. v2: Clean up (M.Marek) Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1375740170-7446-3-git-send-email-andi@firstfloor.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-06x86: Fix sys_call_table type in asm/syscall.hAndi Kleen
Make the sys_call_table type defined in asm/syscall.h match the definition in syscall_64.c v2: include asm/syscall.h in syscall_64.c too. I left uml alone because it doesn't have an syscall.h on its own and including the native one leads to other errors. Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1375740170-7446-2-git-send-email-andi@firstfloor.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Richard Weinberger <richard@nod.at>
2013-08-05x86/mce: Pay no attention to 'F' bit in MCACOD when parsing 'UC' errorsTony Luck
The 0x1000 bit of the MCACOD field of machine check MCi_STATUS registers is only defined for corrected errors (where it means that hardware may be filtering errors see SDM section 15.9.2.1). For uncorrected errors it may, or may not be set - so we should mask it out when checking for the architecturaly defined recoverable error signatures (see SDM 15.9.3.1 and 15.9.3.2) Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-08-05x86: Correctly detect hypervisorJason Wang
We try to handle the hypervisor compatibility mode by detecting hypervisor through a specific order. This is not robust, since hypervisors may implement each others features. This patch tries to handle this situation by always choosing the last one in the CPUID leaves. This is done by letting .detect() return a priority instead of true/false and just re-using the CPUID leaf where the signature were found as the priority (or 1 if it was found by DMI). Then we can just pick hypervisor who has the highest priority. Other sophisticated detection method could also be implemented on top. Suggested by H. Peter Anvin and Paolo Bonzini. Acked-by: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Doug Covelli <dcovelli@vmware.com> Cc: Borislav Petkov <bp@suse.de> Cc: Dan Hecht <dhecht@vmware.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Link: http://lkml.kernel.org/r/1374742475-2485-4-git-send-email-jasowang@redhat.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-05x86, kvm: Switch to use hypervisor_cpuid_base()Jason Wang
Switch to use hypervisor_cpuid_base() to detect KVM. Cc: Gleb Natapov <gleb@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Link: http://lkml.kernel.org/r/1374742475-2485-3-git-send-email-jasowang@redhat.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-05xen: Switch to use hypervisor_cpuid_base()Jason Wang
Switch to use hypervisor_cpuid_base() to detect Xen. Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Link: http://lkml.kernel.org/r/1374742475-2485-2-git-send-email-jasowang@redhat.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-05x86: Introduce hypervisor_cpuid_base()Jason Wang
This patch introduce hypervisor_cpuid_base() which loop test the hypervisor existence function until the signature match and check the number of leaves if required. This could be used by Xen/KVM guest to detect the existence of hypervisor. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Link: http://lkml.kernel.org/r/1374742475-2485-1-git-send-email-jasowang@redhat.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-02x86: sysfb: move EFI quirks from efifb to sysfbDavid Herrmann
The EFI FB quirks from efifb.c are useful for simple-framebuffer devices as well. Apply them by default so we can convert efifb.c to use efi-framebuffer platform devices. Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Link: http://lkml.kernel.org/r/1375445127-15480-5-git-send-email-dh.herrmann@gmail.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-02x86: provide platform-devices for boot-framebuffersDavid Herrmann
The current situation regarding boot-framebuffers (VGA, VESA/VBE, EFI) on x86 causes troubles when loading multiple fbdev drivers. The global "struct screen_info" does not provide any state-tracking about which drivers use the FBs. request_mem_region() theoretically works, but unfortunately vesafb/efifb ignore it due to quirks for broken boards. Avoid this by creating a platform framebuffer devices with a pointer to the "struct screen_info" as platform-data. Drivers can now create platform-drivers and the driver-core will refuse multiple drivers being active simultaneously. We keep the screen_info available for backwards-compatibility. Drivers can be converted in follow-up patches. Different devices are created for VGA/VESA/EFI FBs to allow multiple drivers to be loaded on distro kernels. We create: - "vesa-framebuffer" for VBE/VESA graphics FBs - "efi-framebuffer" for EFI FBs - "platform-framebuffer" for everything else This allows to load vesafb, efifb and others simultaneously and each picks up only the supported FB types. Apart from platform-framebuffer devices, this also introduces a compatibility option for "simple-framebuffer" drivers which recently got introduced for OF based systems. If CONFIG_X86_SYSFB is selected, we try to match the screen_info against a simple-framebuffer supported format. If we succeed, we create a "simple-framebuffer" device instead of a platform-framebuffer. This allows to reuse the simplefb.c driver across architectures and also to introduce a SimpleDRM driver. There is no need to have vesafb.c, efifb.c, simplefb.c and more just to have architecture specific quirks in their setup-routines. Instead, we now move the architecture specific quirks into x86-setup and provide a generic simple-framebuffer. For backwards-compatibility (if strange formats are used), we still allow vesafb/efifb to be loaded simultaneously and pick up all remaining devices. Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Link: http://lkml.kernel.org/r/1375445127-15480-4-git-send-email-dh.herrmann@gmail.com Tested-by: Stephen Warren <swarren@nvidia.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-01sched/x86: Optimize switch_mm() for multi-threaded workloadsRik van Riel
Dick Fowles, Don Zickus and Joe Mario have been working on improvements to perf, and noticed heavy cache line contention on the mm_cpumask, running linpack on a 60 core / 120 thread system. The cause turned out to be unnecessary atomic accesses to the mm_cpumask. When in lazy TLB mode, the CPU is only removed from the mm_cpumask if there is a TLB flush event. Most of the time, no such TLB flush happens, and the kernel skips the TLB reload. It can also skip the atomic memory set & test. Here is a summary of Joe's test results: * The __schedule function dropped from 24% of all program cycles down to 5.5%. * The cacheline contention/hotness for accesses to that bitmask went from being the 1st/2nd hottest - down to the 84th hottest (0.3% of all shared misses which is now quite cold) * The average load latency for the bit-test-n-set instruction in __schedule dropped from 10k-15k cycles down to an average of 600 cycles. * The linpack program results improved from 133 GFlops to 144 GFlops. Peak GFlops rose from 133 to 153. Reported-by: Don Zickus <dzickus@redhat.com> Reported-by: Joe Mario <jmario@redhat.com> Tested-by: Joe Mario <jmario@redhat.com> Signed-off-by: Rik van Riel <riel@redhat.com> Reviewed-by: Paul Turner <pjt@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20130731221421.616d3d20@annuminas.surriel.com [ Made the comments consistent around the modified code. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-07-29x86 / cpu topology: remove the stale macro arch_provides_topology_pointersHanjun Guo
Macro arch_provides_topology_pointers is pointless now, remove it. Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-29KVM: x86: rename EMULATE_DO_MMIOPaolo Bonzini
The next patch will reuse it for other userspace exits than MMIO, namely debug events. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-26cpufreq: Remove unused APERF/MPERF supportStratos Karafotis
The target frequency calculation method in the ondemand governor has changed and it is now independent of the measured average frequency. Consequently, the APERF/MPERF support in cpufreq is not used any more, so drop it. [rjw: Changelog] Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-07-23perf/x86: Add ability to calculate TSC from perf sample timestampsAdrian Hunter
For modern CPUs, perf clock is directly related to TSC. TSC can be calculated from perf clock and vice versa using a simple calculation. Two of the three componenets of that calculation are already exported in struct perf_event_mmap_page. This patch exports the third. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: http://lkml.kernel.org/r/1372425741-1676-3-git-send-email-adrian.hunter@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-07-23kprobes/x86: Call out into INT3 handler directly instead of using notifierJiri Kosina
In fd4363fff3d96 ("x86: Introduce int3 (breakpoint)-based instruction patching"), the mechanism that was introduced for notifying alternatives code from int3 exception handler that and exception occured was die_notifier. This is however problematic, as early code might be using jump labels even before the notifier registration has been performed, which will then lead to an oops due to unhandled exception. One of such occurences has been encountered by Fengguang: int3: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.11.0-rc1-01429-g04bf576 #8 task: ffff88000da1b040 ti: ffff88000da1c000 task.ti: ffff88000da1c000 RIP: 0010:[<ffffffff811098cc>] [<ffffffff811098cc>] ttwu_do_wakeup+0x28/0x225 RSP: 0000:ffff88000dd03f10 EFLAGS: 00000006 RAX: 0000000000000000 RBX: ffff88000dd12940 RCX: ffffffff81769c40 RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000001 RBP: ffff88000dd03f28 R08: ffffffff8176a8c0 R09: 0000000000000002 R10: ffffffff810ff484 R11: ffff88000dd129e8 R12: ffff88000dbc90c0 R13: ffff88000dbc90c0 R14: ffff88000da1dfd8 R15: ffff88000da1dfd8 FS: 0000000000000000(0000) GS:ffff88000dd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000ffffffff CR3: 0000000001c88000 CR4: 00000000000006e0 Stack: ffff88000dd12940 ffff88000dbc90c0 ffff88000da1dfd8 ffff88000dd03f48 ffffffff81109e2b ffff88000dd12940 0000000000000000 ffff88000dd03f68 ffffffff81109e9e 0000000000000000 0000000000012940 ffff88000dd03f98 Call Trace: <IRQ> [<ffffffff81109e2b>] ttwu_do_activate.constprop.56+0x6d/0x79 [<ffffffff81109e9e>] sched_ttwu_pending+0x67/0x84 [<ffffffff8110c845>] scheduler_ipi+0x15a/0x2b0 [<ffffffff8104dfb4>] smp_reschedule_interrupt+0x38/0x41 [<ffffffff8173bf5d>] reschedule_interrupt+0x6d/0x80 <EOI> [<ffffffff810ff484>] ? __atomic_notifier_call_chain+0x5/0xc1 [<ffffffff8105cc30>] ? native_safe_halt+0xd/0x16 [<ffffffff81015f10>] default_idle+0x147/0x282 [<ffffffff81017026>] arch_cpu_idle+0x3d/0x5d [<ffffffff81127d6a>] cpu_idle_loop+0x46d/0x5db [<ffffffff81127f5c>] cpu_startup_entry+0x84/0x84 [<ffffffff8104f4f8>] start_secondary+0x3c8/0x3d5 [...] Fix this by directly calling poke_int3_handler() from the int3 exception handler (analogically to what ftrace has been doing already), instead of relying on notifier, registration of which might not have yet been finalized by the time of the first trap. Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1307231007490.14024@pobox.suse.cz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-07-19perf, kvm: Support the in_tx/in_tx_cp modifiers in KVM arch perfmon emulation v5Andi Kleen
[KVM maintainers: The underlying support for this is in perf/core now. So please merge this patch into the KVM tree.] This is not arch perfmon, but older CPUs will just ignore it. This makes it possible to do at least some TSX measurements from a KVM guest v2: Various fixes to address review feedback v3: Ignore the bits when no CPUID. No #GP. Force raw events with TSX bits. v4: Use reserved bits for #GP v5: Remove obsolete argument Acked-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-19kprobes/x86: Remove unused text_poke_smp() and text_poke_smp_batch() functionsMasami Hiramatsu
Since introducing the text_poke_bp() for all text_poke_smp*() callers, text_poke_smp*() are now unused. This patch basically reverts: 3d55cc8a058e ("x86: Add text_poke_smp for SMP cross modifying code") 7deb18dcf047 ("x86: Introduce text_poke_smp_batch() for batch-code modifying") and related commits. This patch also fixes a Kconfig dependency issue on STOP_MACHINE in the case of CONFIG_SMP && !CONFIG_MODULE_UNLOAD. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jason Baron <jbaron@akamai.com> Cc: yrl.pp-manager.tt@hitachi.com Cc: Borislav Petkov <bpetkov@suse.de> Link: http://lkml.kernel.org/r/20130718114753.26675.18714.stgit@mhiramat-M0-7522 Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-07-19Merge branch 'x86/jumplabel' into perf/coreIngo Molnar
Upcoming kprobes patches rely on the int3 code-patching machinery introduced by: fd4363fff3d9 x86: Introduce int3 (breakpoint)-based instruction patching Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-07-18remove sched notifier for cross-cpu migrationsMarcelo Tosatti
Linux as a guest on KVM hypervisor, the only user of the pvclock vsyscall interface, does not require notification on task migration because: 1. cpu ID number maps 1:1 to per-CPU pvclock time info. 2. per-CPU pvclock time info is updated if the underlying CPU changes. 3. that version is increased whenever underlying CPU changes. Which is sufficient to guarantee nanoseconds counter is calculated properly. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-07-16x86: Introduce int3 (breakpoint)-based instruction patchingJiri Kosina
Introduce a method for run-time instruction patching on a live SMP kernel based on int3 breakpoint, completely avoiding the need for stop_machine(). The way this is achieved: - add a int3 trap to the address that will be patched - sync cores - update all but the first byte of the patched range - sync cores - replace the first byte (int3) by the first byte of replacing opcode - sync cores According to http://lkml.indiana.edu/hypermail/linux/kernel/1001.1/01530.html synchronization after replacing "all but first" instructions should not be necessary (on Intel hardware), as the syncing after the subsequent patching of the first byte provides enough safety. But there's not only Intel HW out there, and we'd rather be on a safe side. If any CPU instruction execution would collide with the patching, it'd be trapped by the int3 breakpoint and redirected to the provided "handler" (which would typically mean just skipping over the patched region, acting as "nop" has been there, in case we are doing nop -> jump and jump -> nop transitions). Ftrace has been using this very technique since 08d636b ("ftrace/x86: Have arch x86_64 use breakpoints instead of stop machine") for ages already, and jump labels are another obvious potential user of this. Based on activities of Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> a few years ago. Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1307121102440.29788@pobox.suse.cz Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-07-16x86, bitops: Change bitops to be native operand sizeH. Peter Anvin
Change the bitops operation to be naturally "long", i.e. 63 bits on the 64-bit kernel. Additional bugs are likely to crop up in the future. We already have bugs which machines with > 16 TiB of memory in a single node, as can happen if memory is interleaved. The x86 bitop operations take a signed index, so using an unsigned type is not an option. Jim Kukunas measured the effect of this patch on kernel size: it adds 2779 bytes to the allyesconfig kernel. Some of that probably could be elided by replacing the inline functions with macros which select the 32-bit type if the index is a 32-bit value, something like: In that case we could also use "Jr" constraints for the 64-bit version. However, this would more than double the amount of code for a relatively small gain. Note that we can't use ilog2() for _BITOPS_LONG_SHIFT, as that causes a recursive header inclusion problem. The change to constant_test_bit() should both generate better code and give correct result for negative bit indicies. As previously written the compiler had to generate extra code to create the proper wrong result for negative values. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Jim Kukunas <james.t.kukunas@intel.com> Link: http://lkml.kernel.org/n/tip-z61ofiwe90xeyb461o72h8ya@git.kernel.org
2013-07-14x86: delete __cpuinit usage from all x86 filesPaul Gortmaker
The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes. After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h. Note that some harmless section mismatch warnings may result, since notify_cpu_starting() and cpu_up() are arch independent (kernel/cpu.c) are flagged as __cpuinit -- so if we remove the __cpuinit from arch specific callers, we will also get section mismatch warnings. As an intermediate step, we intend to turn the linux/init.h cpuinit content into no-ops as early as possible, since that will get rid of these warnings. In any case, they are temporary and harmless. This removes all the arch/x86 uses of the __cpuinit macros from all C files. x86 only had the one __CPUINIT used in assembly files, and it wasn't paired off with a .previous or a __FINIT, so we can delete it directly w/o any corresponding additional change there. [1] https://lkml.org/lkml/2013/5/20/589 Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>