path: root/arch/x86
Age  Commit message (Collapse)  Author
2014-10-30  x86/intel/quark: Switch off CR4.PGE so TLB flush uses CR3 instead  (Bryan O'Donoghue)
commit ee1b5b165c0a2f04d2107e634e51f05d0eb107de upstream. Quark X1000 advertises PGE via the standard CPUID method, and PGE bits exist in Quark X1000's PTEs. In order to flush an individual PTE, however, it is necessary to reload CR3 irrespective of the PTE.PGE bit. See Quark Core_DevMan_001.pdf, section 6.4.11. This bug was fixed in Galileo kernels; unfixed vanilla kernels are expected to crash and burn on this platform. Signed-off-by: Bryan O'Donoghue <pure.logic@nexus-software.ie> Cc: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/1411514784-14885-1-git-send-email-pure.logic@nexus-software.ie Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
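A minimal illustrative sketch of the fallback this fix relies on (not the patch itself; the patch clears the PGE feature bit so the generic TLB code takes this path): with CR4.PGE clear, writing CR3 back to itself flushes the non-global TLB entries, so even a single-PTE flush can be satisfied by a CR3 reload.

/* Illustrative only: flush the TLB by reloading CR3, which suffices
 * on hardware such as Quark X1000 once CR4.PGE is switched off. */
static inline void flush_tlb_via_cr3_reload(void)
{
        unsigned long cr3;

        asm volatile("mov %%cr3, %0" : "=r" (cr3));
        asm volatile("mov %0, %%cr3" : : "r" (cr3) : "memory");
}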
2014-10-30  x86,kvm,vmx: Preserve CR4 across VM entry  (Andy Lutomirski)
commit d974baa398f34393db76be45f7d4d04fbdbb4a0a upstream. CR4 isn't constant; at least the TSD and PCE bits can vary. TBH, treating CR0 and CR3 as constant scares me a bit, too, but it looks like it's correct. This adds a branch and a read from cr4 to each vm entry. Because it is extremely likely that consecutive entries into the same vcpu will have the same host cr4 value, this fixes up the vmcs instead of restoring cr4 after the fact. A subsequent patch will add a kernel-wide cr4 shadow, reducing the overhead in the common case to just two memory reads and a branch. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Gleb Natapov <gleb@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
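A rough sketch of the approach the message describes (the cached-field name is an assumption for illustration; vmcs_writel() and HOST_CR4 are the standard VMX accessors): re-read the host CR4 before each entry and rewrite the VMCS host-state field only when it differs from the value written on the previous entry for this vcpu.

/* Sketch, assuming a per-vcpu cache of the last host CR4 written. */
static void refresh_vmcs_host_cr4(struct vcpu_vmx *vmx)
{
        unsigned long cr4 = read_cr4();         /* current host CR4 */

        if (unlikely(cr4 != vmx->host_vmcs_cr4)) {
                vmcs_writel(HOST_CR4, cr4);     /* update VMCS host state */
                vmx->host_vmcs_cr4 = cr4;
        }
}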
2014-10-30  KVM: do not bias the generation number in kvm_current_mmio_generation  (Paolo Bonzini)
commit 00f034a12fdd81210d58116326d92780aac5c238 upstream. The next patch will give a meaning (a la seqcount) to the low bit of the generation number. Ensure that it matches between kvm->memslots->generation and kvm_current_mmio_generation(). Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30  kvm: fix potentially corrupt mmio cache  (David Matlack)
commit ee3d1570b58677885b4552bce8217fda7b226a68 upstream. vcpu exits and memslot mutations can run concurrently as long as the vcpu does not acquire the slots mutex. Thus it is theoretically possible for memslots to change underneath a vcpu that is handling an exit. If we increment the memslot generation number again after synchronize_srcu_expedited(), vcpus can safely cache memslot generation without maintaining a single rcu_dereference through an entire vm exit. And much of the x86/kvm code does not maintain a single rcu_dereference of the current memslots during each exit. We can prevent the following case:
   vcpu (CPU 0)                                | thread (CPU 1)
   --------------------------------------------+--------------------------
   1  vm exit                                  |
   2  srcu_read_unlock(&kvm->srcu)             |
   3  decide to cache something based on       |
      old memslots                             |
   4                                           | change memslots
                                               | (increments generation)
   5                                           | synchronize_srcu(&kvm->srcu);
   6  retrieve generation # from new memslots  |
   7  tag cache with new memslot generation    |
   8  srcu_read_unlock(&kvm->srcu)             |
   ...                                         |
   <action based on cache occurs even          |
    though the caching decision was based      |
    on the old memslots>                       |
   ...                                         |
   <action *continues* to occur until next     |
    memslot generation change, which may       |
    be never>                                  |
By incrementing the generation after synchronizing with kvm->srcu readers, we ensure that the generation retrieved in (6) will become invalid soon after (8). Keeping the existing increment is not strictly necessary, but we do keep it and just move it for consistency from update_memslots to install_new_memslots. It invalidates old cached MMIOs immediately, instead of having to wait for the end of synchronize_srcu_expedited, which makes the code more clearly correct in case CPU 1 is preempted right after synchronize_srcu() returns. To avoid halving the generation space in SPTEs, always presume that the low bit of the generation is zero when reconstructing a generation number out of an SPTE. This effectively disables MMIO caching in SPTEs during the call to synchronize_srcu_expedited. Using the low bit this way is somewhat like a seqcount---where the protected thing is a cache, and instead of retrying we can simply punt if we observe the low bit to be 1. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
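The low-bit scheme can be shown with a self-contained sketch (the names and layout are illustrative, not the kernel's): a generation reconstructed from an SPTE always has the low bit forced to zero, so while a memslot update is in flight (current generation odd) no cached value can ever match and MMIO caching is effectively bypassed.

#include <stdbool.h>
#include <stdint.h>

/* Generation as reconstructed from an SPTE: low bit presumed zero. */
static uint64_t spte_generation(uint64_t memslots_generation)
{
        return memslots_generation & ~1ULL;
}

/* Seqcount-like check: while the update-in-flight bit is set the
 * current generation is odd, an even cached value can never equal it,
 * and the lookup simply punts (treats it as a miss). */
static bool mmio_cache_hit(uint64_t cached_gen, uint64_t current_gen)
{
        return cached_gen == current_gen;
}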
2014-10-30  kvm: x86: fix stale mmio cache bug  (David Matlack)
commit 56f17dd3fbc44adcdbc3340fe3988ddb833a47a7 upstream. The following events can lead to an incorrect KVM_EXIT_MMIO bubbling up to userspace: (1) Guest accesses gpa X without a memory slot. The gfn is cached in struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets the SPTE write-execute-noread so that future accesses cause EPT_MISCONFIGs. (2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION covering the page just accessed. (3) Guest attempts to read or write to gpa X again. On Intel, this generates an EPT_MISCONFIG. The memory slot generation number that was incremented in (2) would normally take care of this but we fast path mmio faults through quickly_check_mmio_pf(), which only checks the per-vcpu mmio cache. Since we hit the cache, KVM passes a KVM_EXIT_MMIO up to userspace. This patch fixes the issue by using the memslot generation number to validate the mmio cache. Signed-off-by: David Matlack <dmatlack@google.com> [xiaoguangrong: adjust the code to make it simpler for stable-tree fix.] Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Tested-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
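A sketch of the validation this fix adds (struct and field names are assumptions for illustration): the fast path compares the generation recorded when the cache entry was filled against the current memslot generation and treats any mismatch as a miss, so the slow path re-walks the memslots after step (2).

#include <stdbool.h>
#include <stdint.h>

struct mmio_cache {
        uint64_t gfn;           /* guest frame cached as MMIO */
        uint64_t generation;    /* memslot generation at fill time */
};

static bool mmio_cache_check(const struct mmio_cache *c,
                             uint64_t gfn, uint64_t cur_gen)
{
        if (c->generation != cur_gen)
                return false;   /* memslots changed: entry is stale */
        return c->gfn == gfn;
}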
2014-10-09  vgaarb: Don't default exclusively to first video device with mem+io  (Bruno Prémont)
commit 86fd887b7fe350819dae5b55e7fef05b511c8656 upstream. Commit 20cde694027e ("x86, ia64: Move EFI_FB vga_default_device() initialization to pci_vga_fixup()") moved boot video device detection from efifb to the x86 and ia64 pci/fixup.c. For dual-GPU Apple computers the above change represents a regression, as the code in efifb forcefully overrode vga_default_device while the merged code does not (vgaarb happens prior to PCI fixup). To improve on the initial device selection by vgaarb (it cannot know whether a PCI device not behind bridges sees/decodes legacy VGA I/O or not), move the screen_info based check from pci_video_fixup() to vgaarb's init function and use it to refine/override the decision taken while adding the individual PCI VGA devices. This way the PCI fixup has no reason to adjust vga_default_device anymore but can depend on its value for flagging a shadowed VBIOS. This has the nice benefit of removing duplicated code, but does introduce a #if defined() block in vgaarb, since not all architectures have screen_info and the build would fail without it. Link: https://bugzilla.kernel.org/show_bug.cgi?id=84461 Reported-and-Tested-By: Andreas Noever <andreas.noever@gmail.com> Signed-off-by: Bruno Prémont <bonbons@linux-vserver.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> CC: Matthew Garrett <matthew.garrett@nebula.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-09  x86, ia64: Move EFI_FB vga_default_device() initialization to pci_vga_fixup()  (Bruno Prémont)
commit 20cde694027e7477cc532833e38ab9fcaa83fb64 upstream. Commit b4aa0163056b ("efifb: Implement vga_default_device() (v2)") added the efifb vga_default_device() so EFI systems that do not load a shadow VBIOS or set up VGA get a proper value for the boot_vga PCI sysfs attribute on the corresponding PCI device. Xorg doesn't detect devices when boot_vga=0, e.g., on some EFI systems such as MacBookAir2,1: Xorg detects the GPU and finds the DRI device but then bails out with "no devices detected". Note: when vga_default_device() is set, the boot_vga PCI sysfs attribute reflects its state; when unset, this attribute is 1 whenever the IORESOURCE_ROM_SHADOW flag is set. With the introduction of sysfb/simplefb/simpledrm, efifb is getting obsolete, and having native drivers for the GPU also makes selecting sysfb/efifb optional. Remove the efifb implementation of vga_default_device() and initialize vgaarb's vga_default_device() with the PCI GPU that matches the boot screen_info in pci_fixup_video(). [bhelgaas: remove unused "dev" in efifb_setup()] Fixes: b4aa0163056b ("efifb: Implement vga_default_device() (v2)") Tested-by: Anibal Francisco Martinez Cortina <linuxkid.zeuz@gmail.com> Signed-off-by: Bruno Prémont <bonbons@linux-vserver.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Matthew Garrett <matthew.garrett@nebula.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05  sched: Fix unreleased llc_shared_mask bit during CPU hotplug  (Wanpeng Li)
commit 03bd4e1f7265548832a76e7919a81f3137c44fd1 upstream. The following bug can be triggered by hot adding and removing a large number of xen domain0's vcpus repeatedly:
   BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
   IP: [..] find_busiest_group
   PGD 5a9d5067 PUD 13067 PMD 0
   Oops: 0000 [#3] SMP
   [...]
   Call Trace:
    load_balance
    ? _raw_spin_unlock_irqrestore
    idle_balance
    __schedule
    schedule
    schedule_timeout
    ? lock_timer_base
    schedule_timeout_uninterruptible
    msleep
    lock_device_hotplug_sysfs
    online_store
    dev_attr_store
    sysfs_write_file
    vfs_write
    SyS_write
    system_call_fastpath
The last-level cache shared mask is built during CPU up, and the build_sched_domain() routine takes advantage of it to set up the sched domain CPU topology. However, llc_shared_mask is not released during CPU disable, which leads to an invalid sched domain CPU topology. This patch fixes it by releasing the llc_shared_mask correctly during CPU disable. Yasuaki also reported that this can happen on real hardware: https://lkml.org/lkml/2014/7/22/1018 His case is here:
==
Here is an example on my system. My system has 4 sockets and each socket has 15 cores and HT is enabled. In this case, each core of the sockets is numbered as follows:
              | CPU#
    Socket#0  | 0-14 , 60-74
    Socket#1  | 15-29, 75-89
    Socket#2  | 30-44, 90-104
    Socket#3  | 45-59, 105-119
Then llc_shared_mask of CPU#30 has 0x3fff80000001fffc0000000. It means that the last level cache of Socket#2 is shared with CPU#30-44 and 90-104. When hot-removing socket#2 and #3, each core of the sockets is numbered as follows:
              | CPU#
    Socket#0  | 0-14 , 60-74
    Socket#1  | 15-29, 75-89
But llc_shared_mask is not cleared. So llc_shared_mask of CPU#30 remains having 0x3fff80000001fffc0000000. After that, when hot-adding socket#2 and #3, each core of the sockets is numbered as follows:
              | CPU#
    Socket#0  | 0-14 , 60-74
    Socket#1  | 15-29, 75-89
    Socket#2  | 30-59
    Socket#3  | 90-119
Then llc_shared_mask of CPU#30 becomes 0x3fff8000fffffffc0000000. It means that the last level cache of Socket#2 is shared with CPU#30-59 and 90-104. So the mask has the wrong value. Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Tested-by: Linn Crosetto <linn@hp.com> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Toshi Kani <toshi.kani@hp.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Steven Rostedt <srostedt@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1411547885-48165-1-git-send-email-wanpeng.li@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
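A sketch of the cleanup the fix performs on the CPU-offline path (not the exact hunk and the helper name is invented, but cpu_llc_shared_mask() and the cpumask API are the standard kernel interfaces): drop the outgoing CPU from every sibling's LLC-shared mask and then clear its own, mirroring what is already done for the core/thread sibling masks.

static void cleanup_llc_shared_mask(int cpu)
{
        int sibling;

        /* Remove this CPU from every LLC sibling's shared mask... */
        for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
                cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));

        /* ...and forget this CPU's own view of the shared LLC. */
        cpumask_clear(cpu_llc_shared_mask(cpu));
}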
2014-10-05  x86/kaslr: Avoid the setup_data area when picking location  (Kees Cook)
commit 0cacbfbeb5077b63d5d3cf6df88b14ac12ad584b upstream. The KASLR location-choosing logic needs to avoid the setup_data list memory areas as well. Without this, it would be possible to have the ASLR position stomp on the memory, ultimately causing the boot to fail. Signed-off-by: Kees Cook <keescook@chromium.org> Tested-by: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: Pavel Machek <pavel@ucw.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140911161931.GA12001@www.outflux.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
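A sketch of what avoiding the setup_data list involves (illustrative only; struct setup_data follows the x86 boot protocol, while mem_vector and the avoid array stand in for the KASLR helper's own bookkeeping): walk the singly-linked setup_data chain from boot_params and record each node as a region the randomized kernel placement must not overlap.

struct setup_data {
        __u64 next;     /* physical address of the next node, 0 terminates */
        __u32 type;
        __u32 len;
        __u8  data[];
};

static void mem_avoid_setup_data(struct boot_params *params,
                                 struct mem_vector *avoid, int *n)
{
        struct setup_data *sd;

        sd = (void *)(unsigned long)params->hdr.setup_data;
        while (sd) {
                avoid[*n].start = (unsigned long)sd;
                avoid[*n].size  = sizeof(*sd) + sd->len;
                (*n)++;
                sd = (void *)(unsigned long)sd->next;
        }
}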
2014-10-05  x86 early_ioremap: Increase FIX_BTMAPS_SLOTS to 8  (Dave Young)
commit 3eddc69ffeba092d288c386646bfa5ec0fce25fd upstream. A 3.16 kernel fails to boot with earlyprintk=efi; it keeps scrolling at the bottom line of the screen. Bisected, the first bad commit is below:
   commit 86dfc6f339886559d80ee0d4bd20fe5ee90450f0
   Author: Lv Zheng <lv.zheng@intel.com>
   Date:   Fri Apr 4 12:38:57 2014 +0800
       ACPICA: Tables: Fix table checksums verification before installation.
I did some debugging by enabling both serial and efi earlyprintk. It seems early_ioremap fails in the scroll-up function due to no free slot; see the dmesg output below:
   WARNING: CPU: 0 PID: 0 at mm/early_ioremap.c:116 __early_ioremap+0x90/0x1c4()
   __early_ioremap(ed00c800, 00000c80) not found slot
   Modules linked in:
   CPU: 0 PID: 0 Comm: swapper Not tainted 3.17.0-rc1+ #204
   Hardware name: Hewlett-Packard HP Z420 Workstation/1589, BIOS J61 v03.15 05/09/2013
   Call Trace:
    dump_stack+0x4e/0x7a
    warn_slowpath_common+0x75/0x8e
    ? __early_ioremap+0x90/0x1c4
    warn_slowpath_fmt+0x47/0x49
    __early_ioremap+0x90/0x1c4
    ? sprintf+0x46/0x48
    early_ioremap+0x13/0x15
    early_efi_map+0x24/0x26
    early_efi_scroll_up+0x6d/0xc0
    early_efi_write+0x1b0/0x214
    call_console_drivers.constprop.21+0x73/0x7e
    console_unlock+0x151/0x3b2
    ? vprintk_emit+0x49f/0x532
    vprintk_emit+0x521/0x532
    ? console_unlock+0x383/0x3b2
    printk+0x4f/0x51
    acpi_os_vprintf+0x2b/0x2d
    acpi_os_printf+0x43/0x45
    acpi_info+0x5c/0x63
    ? __acpi_map_table+0x13/0x18
    ? acpi_os_map_iomem+0x21/0x147
    acpi_tb_print_table_header+0x177/0x186
    acpi_tb_install_table_with_override+0x4b/0x62
    acpi_tb_install_standard_table+0xd9/0x215
    ? early_ioremap+0x13/0x15
    ? __acpi_map_table+0x13/0x18
    acpi_tb_parse_root_table+0x16e/0x1b4
    acpi_initialize_tables+0x57/0x59
    acpi_table_init+0x50/0xce
    acpi_boot_table_init+0x1e/0x85
    setup_arch+0x9b7/0xcc4
    start_kernel+0x94/0x42d
    ? early_idt_handlers+0x120/0x120
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0xf3/0x100
Quoting a reply from Lv Zheng about the early_ioremap slot usage in this case:
"""
In early_efi_scroll_up(), 2 mapping entries will be used for the src/dst screen buffer. In drivers/acpi/acpica/tbutils.c, we've improved the early table loading code in acpi_tb_parse_root_table(). We now need 2 mapping entries:
1. One mapping entry is used for RSDT table mapping. Each RSDT entry contains an address for another ACPI table.
2. For each entry in RSDP, we need another mapping entry to map the table to perform necessary check/override before installing it.
When acpi_tb_parse_root_table() prints something through the EFI earlyprintk console, we'll have 4 mapping entries used. The current 4-slot setting of early_ioremap() seems to be too small for such a use case.
"""
Thus increase the slots to 8 in this patch to fix the issue; boot-time mappings become 512 pages with this patch. Signed-off-by: Dave Young <dyoung@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
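The change itself amounts to bumping a fixmap constant; a sketch of the arithmetic (constant names as used by the x86 fixmap code of this era, values shown for illustration): with NR_FIX_BTMAPS at 64 and 8 slots, the boot-time mapping area covers 64 * 8 = 512 pages, matching the "512 pages" figure above.

#define NR_FIX_BTMAPS           64
#define FIX_BTMAPS_SLOTS        8       /* raised from 4 */
#define TOTAL_FIX_BTMAPS        (NR_FIX_BTMAPS * FIX_BTMAPS_SLOTS)  /* 512 pages */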
2014-10-05  x86/xen: don't copy bogus duplicate entries into kernel page tables  (Stefan Bader)
commit 0b5a50635fc916cf46e3de0b819a61fc3f17e7ee upstream. When RANDOMIZE_BASE (KASLR) is enabled, or the sum of all loaded modules exceeds 512 MiB, then loading modules fails with a warning (and hence a vmalloc allocation failure) because the PTEs for the newly-allocated vmalloc address space are not zero. WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128 vmap_page_range_noflush+0x2a1/0x360() This is caused by xen_setup_kernel_pagetables() copying level2_kernel_pgt into level2_fixmap_pgt, overwriting many non-present entries. Without KASLR, the normal kernel image size only covers the first half of level2_kernel_pgt and module space starts after that:
   L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[  0..255]->kernel
                                                      [256..511]->module
                               [511]->level2_fixmap_pgt[  0..505]->module
This allows 512 MiB of module vmalloc space to be used before having to use the corrupted level2_fixmap_pgt entries. With KASLR enabled, the kernel image uses the full PUD range of 1G and module space starts in the level2_fixmap_pgt. So basically:
   L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel
                               [511]->level2_fixmap_pgt[0..505]->module
And now no module vmalloc space can be used without using the corrupt level2_fixmap_pgt entries. Fix this by properly converting the level2_fixmap_pgt entries to MFNs, and setting level1_fixmap_pgt as read-only. A number of comments were also using the wrong L3 offset for level2_kernel_pgt; these have been corrected. Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05  x86,mm: fix pte_special versus pte_numa  (Hugh Dickins)
commit b38af4721f59d0b564468f623b3e52a638195015 upstream. Sasha Levin has shown oopses on ffffea0003480048 and ffffea0003480008 at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels: where zap_pte_range() checks page->mapping to see if PageAnon(page). Those addresses fit struct pages for pfns d2001 and d2000, and in each dump a register or a stack slot showed d2001730 or d2000730: pte flags 0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has a hole between cfffffff and 100000000, which would need special access. Commit c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL pte no longer passes the pte_special() test, so zap_pte_range() goes on to try to access a non-existent struct page. Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE) to complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE). A hint that this was a problem was that c46a7c817e66 added pte_numa() test to vm_normal_page(), and moved its is_zero_pfn() test from slow to fast path: This was papering over a pte_special() snag when the zero page was encountered during zap. This patch reverts vm_normal_page() to how it was before, relying on pte_special(). It still appears that this patch may be incomplete: aren't there other places which need to be handling PROTNONE along with PRESENT? For example, pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA, but on a PROT_NONE area, that would make it pte_special(). This is side-stepped by the fact that NUMA hinting faults skipped PROT_NONE VMAs and there are no grounds where a NUMA hinting fault on a PROT_NONE VMA would be interesting. Fixes: c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
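A self-contained sketch of the refined predicates described above (the bit positions are illustrative stand-ins for the real x86 flag definitions): pte_special() requires SPECIAL together with PRESENT or PROTNONE, while a NUMA pte has the SPECIAL-reusing bit with neither PRESENT nor PROTNONE, so the two can no longer be confused.

#include <stdint.h>

#define _PAGE_PRESENT   (1ULL << 0)
#define _PAGE_PROTNONE  (1ULL << 8)     /* illustrative positions */
#define _PAGE_SPECIAL   (1ULL << 9)

static inline int pte_special_flags(uint64_t flags)
{
        /* SPECIAL with PRESENT or PROTNONE */
        return (flags & _PAGE_SPECIAL) &&
               (flags & (_PAGE_PRESENT | _PAGE_PROTNONE));
}

static inline int pte_numa_flags(uint64_t flags)
{
        /* SPECIAL-reusing NUMA encoding: neither PRESENT nor PROTNONE */
        return (flags & _PAGE_SPECIAL) &&
               !(flags & (_PAGE_PRESENT | _PAGE_PROTNONE));
}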
2014-09-05  x86/xen: resume timer irqs early  (David Vrabel)
commit 8d5999df35314607c38fbd6bdd709e25c3a4eeab upstream. If the timer irqs are resumed during device resume it is possible in certain circumstances for the resume to hang early on, before device interrupts are resumed. For an Ubuntu 14.04 PVHVM guest this would occur in ~0.5% of resume attempts. It is not entirely clear what is occurring at the point of the hang, but I think a task necessary for the resume calls schedule_timeout(), waiting for a timer interrupt (which never arrives). This failure may require specific tasks to be running on the other VCPUs to trigger (processes are not frozen during a suspend/resume if PREEMPT is disabled). Add IRQF_EARLY_RESUME to the timer interrupts so they are resumed in syscore_resume(). Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05  x86/xen: use vmap() to map grant table pages in PVH guests  (David Vrabel)
commit 7d951f3ccb0308c95bf76d5eef9886dea35a7013 upstream. Commit b7dd0e350e0b (x86/xen: safely map and unmap grant frames when in atomic context) causes PVH guests to crash in arch_gnttab_map_shared() when they attempt to map the pages for the grant table. This use of a PV-specific function during the PVH grant table setup is non-obvious and not needed; the standard vmap() function does the right thing. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reported-by: Mukesh Rathor <mukesh.rathor@oracle.com> Tested-by: Mukesh Rathor <mukesh.rathor@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05  x86/efi: Enforce CONFIG_RELOCATABLE for EFI boot stub  (Matt Fleming)
commit 7b2a583afb4ab894f78bc0f8bd136e96b6499a7e upstream. Without CONFIG_RELOCATABLE the early boot code will decompress the kernel to LOAD_PHYSICAL_ADDR. While this may have been fine in the BIOS days, that isn't going to fly with UEFI since parts of the firmware code/data may be located at LOAD_PHYSICAL_ADDR. Straying outside of the bounds of the regions we've explicitly requested from the firmware will cause all sorts of trouble. Bruno reports that his machine resets while trying to decompress the kernel image. We already go to great pains to ensure the kernel is loaded into a suitably aligned buffer, it's just that the address isn't necessarily LOAD_PHYSICAL_ADDR, because we can't guarantee that address isn't in-use by the firmware. Explicitly enforce CONFIG_RELOCATABLE for the EFI boot stub, so that we can load the kernel at any address with the correct alignment. Reported-by: Bruno Prémont <bonbons@linux-vserver.org> Tested-by: Bruno Prémont <bonbons@linux-vserver.org> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05  x86: MCE: Add raw_lock conversion again  (Thomas Gleixner)
commit ed5c41d30ef2ce578fd6b6e2f7ec23f2a58b1eba upstream. Commit ea431643d6c3 ("x86/mce: Fix CMCI preemption bugs") breaks RT by the completely unrelated conversion of the cmci_discover_lock to a regular (non raw) spinlock. This lock was annotated in commit 59d958d2c7de ("locking, x86: mce: Annotate cmci_discover_lock as raw") with a proper explanation why. The argument for converting the lock back to a regular spinlock was: - it does percpu ops without disabling preemption. Preemption is not disabled due to the mistaken use of a raw spinlock. Which is complete nonsense. The raw_spinlock is disabling preemption in the same way as a regular spinlock. In mainline spinlock maps to raw_spinlock, in RT spinlock becomes a "sleeping" lock. raw_spinlock has on RT exactly the same semantics as in mainline. And because this lock is taken in non preemptible context it must be raw on RT. Undo the locking brainfart. Reported-by: Clark Williams <williams@redhat.com> Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05  x86_64/vsyscall: Fix warn_bad_vsyscall log output  (Andy Lutomirski)
commit 53b884ac3745353de220d92ef792515c3ae692f0 upstream. This commit in Linux 3.6: commit c767a54ba0657e52e6edaa97cbe0b0a8bf1c1655 Author: Joe Perches <joe@perches.com> Date: Mon May 21 19:50:07 2012 -0700 x86/debug: Add KERN_<LEVEL> to bare printks, convert printks to pr_<level> caused warn_bad_vsyscall to output garbage in the middle of the line. Revert the bad part of it. The printk in question isn't actually bare; the level is "%s". The bug this fixes is purely cosmetic; backports are optional. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/03eac1f24110bbe496ecc12a4df467e0d88466d4.1406330947.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05  x86: don't exclude low BIOS area when allocating address space for non-PCI cards  (Christoph Schulz)
commit cbace46a9710a480cae51e4611697df5de41713e upstream. Commit 30919b0bf356 ("x86: avoid low BIOS area when allocating address space") moved the test for resource allocations that fall within the first 1MB of address space from the PCI-specific path to a generic path, such that all resource allocations will avoid this area. However, this breaks ISA cards which need to allocate a memory region within the first 1MB. An example is the i82365 PCMCIA controller and derivatives like the Ricoh RF5C296/396 which map part of the PCMCIA socket memory address space into the first 1MB of system memory address space. They do not work anymore as no usable memory region exists due to this change:
   Intel ISA PCIC probe: Ricoh RF5C296/396 ISA-to-PCMCIA at port 0x3e0 ofs 0x00, 2 sockets
     host opts [0]: none
     host opts [1]: none
     ISA irqs (scanned) = 3,4,5,9,10 status change on irq 10
   pcmcia_socket pcmcia_socket1: pccard: PCMCIA card inserted into slot 1
   pcmcia_socket pcmcia_socket0: cs: IO port probe 0xc00-0xcff: excluding 0xcf8-0xcff
   pcmcia_socket pcmcia_socket0: cs: IO port probe 0xa00-0xaff: clean.
   pcmcia_socket pcmcia_socket0: cs: IO port probe 0x100-0x3ff: excluding 0x170-0x177 0x1f0-0x1f7 0x2f8-0x2ff 0x370-0x37f 0x3c0-0x3e7 0x3f0-0x3ff
   pcmcia_socket pcmcia_socket0: cs: memory probe 0x0a0000-0x0affff: excluding 0xa0000-0xaffff
   pcmcia_socket pcmcia_socket0: cs: memory probe 0x0b0000-0x0bffff: excluding 0xb0000-0xbffff
   pcmcia_socket pcmcia_socket0: cs: memory probe 0x0c0000-0x0cffff: excluding 0xc0000-0xcbfff
   pcmcia_socket pcmcia_socket0: cs: memory probe 0x0d0000-0x0dffff: clean.
   pcmcia_socket pcmcia_socket0: cs: memory probe 0x0e0000-0x0effff: clean.
   pcmcia_socket pcmcia_socket0: cs: memory probe 0x60000000-0x60ffffff: clean.
   pcmcia_socket pcmcia_socket0: cs: memory probe 0xa0000000-0xa0ffffff: clean.
   pcmcia_socket pcmcia_socket1: cs: IO port probe 0xc00-0xcff: excluding 0xcf8-0xcff
   pcmcia_socket pcmcia_socket1: cs: IO port probe 0xa00-0xaff: clean.
   pcmcia_socket pcmcia_socket1: cs: IO port probe 0x100-0x3ff: excluding 0x170-0x177 0x1f0-0x1f7 0x2f8-0x2ff 0x370-0x37f 0x3c0-0x3e7 0x3f0-0x3ff
   pcmcia_socket pcmcia_socket1: cs: memory probe 0x0a0000-0x0affff: excluding 0xa0000-0xaffff
   pcmcia_socket pcmcia_socket1: cs: memory probe 0x0b0000-0x0bffff: excluding 0xb0000-0xbffff
   pcmcia_socket pcmcia_socket1: cs: memory probe 0x0c0000-0x0cffff: excluding 0xc0000-0xcbfff
   pcmcia_socket pcmcia_socket1: cs: memory probe 0x0d0000-0x0dffff: clean.
   pcmcia_socket pcmcia_socket1: cs: memory probe 0x0e0000-0x0effff: clean.
   pcmcia_socket pcmcia_socket1: cs: memory probe 0x60000000-0x60ffffff: clean.
   pcmcia_socket pcmcia_socket1: cs: memory probe 0xa0000000-0xa0ffffff: clean.
   pcmcia_socket pcmcia_socket1: cs: memory probe 0x0cc000-0x0effff: excluding 0xe0000-0xeffff
   pcmcia_socket pcmcia_socket1: cs: unable to map card memory!
If filtering out the first 1MB is reverted, everything works as expected. Tested-by: Robert Resch <fli4l@robert.reschpara.de> Signed-off-by: Christoph Schulz <develop@kristov.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05  Revert "KVM: x86: Increase the number of fixed MTRR regs to 10"  (Paolo Bonzini)
commit 0d234daf7e0a3290a3a20c8087eefbd6335a5bd4 upstream. This reverts commit 682367c494869008eb89ef733f196e99415ae862, which causes 32-bit SMP Windows 7 guests to panic. SeaBIOS has a limit on the number of MTRRs that it can handle, and this patch exceeded the limit. Better revert it. Thanks to Nadav Amit for debugging the cause. Reported-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05  KVM: nVMX: fix "acknowledge interrupt on exit" when APICv is in use  (Wanpeng Li)
commit 56cc2406d68c0f09505c389e276f27a99f495cbd upstream. After commit 77b0f5d (KVM: nVMX: Ack and write vector info to intr_info if L1 asks us to), "Acknowledge interrupt on exit" behavior can be emulated. To do so, KVM will ask the APIC for the interrupt vector during a nested vmexit if VM_EXIT_ACK_INTR_ON_EXIT is set. With APICv, kvm_get_apic_interrupt would return -1 and give the following WARNING:
   Call Trace:
    [<ffffffff81493563>] dump_stack+0x49/0x5e
    [<ffffffff8103f0eb>] warn_slowpath_common+0x7c/0x96
    [<ffffffffa059709a>] ? nested_vmx_vmexit+0xa4/0x233 [kvm_intel]
    [<ffffffff8103f11a>] warn_slowpath_null+0x15/0x17
    [<ffffffffa059709a>] nested_vmx_vmexit+0xa4/0x233 [kvm_intel]
    [<ffffffffa0594295>] ? nested_vmx_exit_handled+0x6a/0x39e [kvm_intel]
    [<ffffffffa0537931>] ? kvm_apic_has_interrupt+0x80/0xd5 [kvm]
    [<ffffffffa05972ec>] vmx_check_nested_events+0xc3/0xd3 [kvm_intel]
    [<ffffffffa051ebe9>] inject_pending_event+0xd0/0x16e [kvm]
    [<ffffffffa051efa0>] vcpu_enter_guest+0x319/0x704 [kvm]
To fix this, we cannot rely on the processor's virtual interrupt delivery, because "acknowledge interrupt on exit" must only update the virtual ISR/PPR/IRR registers (and SVI, which is just a cache of the virtual ISR) but it should not deliver the interrupt through the IDT. Thus, KVM has to deliver the interrupt "by hand", similar to the treatment of EOI in commit fc57ac2c9ca8 (KVM: lapic: sync highest ISR to hardware apic on EOI, 2014-05-14). The patch modifies kvm_cpu_get_interrupt to always acknowledge an interrupt; there are only two callers, and the other is not affected because it is never reached with kvm_apic_vid_enabled() == true. Then it modifies apic_set_isr and apic_clear_irr to update SVI and RVI in addition to the registers. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: "Zhang, Yang Z" <yang.z.zhang@intel.com> Tested-by: Liu, RongrongX <rongrongx.liu@intel.com> Tested-by: Felipe Reyes <freyes@suse.com> Fixes: 77b0f5d67ff2781f36831cba79674c3e97bd7acf Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05  KVM: x86: Inter-privilege level ret emulation is not implemented  (Nadav Amit)
commit 9e8919ae793f4edfaa29694a70f71a515ae9942a upstream. Return an unhandleable error on an inter-privilege level ret instruction. This is because the current emulation does not check the privilege level correctly when loading the CS, and does not pop RSP/SS as needed. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-01  Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 fix from Peter Anvin: "A single fix to not invoke the espfix code on Xen PV, as it turns out to oops the guest when invoked after all. This patch leaves some amount of dead code, in particular unnecessary initialization of the espfix stacks when they won't be used, but in the interest of keeping the patch minimal that cleanup can wait for the next cycle"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86_64/entry/xen: Do not invoke espfix64 on Xen
2014-07-30  Merge tag 'stable/for-linus-3.16-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip  (Linus Torvalds)
Pull Xen fix from David Vrabel: "Fix BUG when trying to expand the grant table. This seems to occur often during boot with Ubuntu 14.04 PV guests"
* tag 'stable/for-linus-3.16-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  x86/xen: safely map and unmap grant frames when in atomic context
2014-07-30  x86/xen: safely map and unmap grant frames when in atomic context  (David Vrabel)
arch_gnttab_map_frames() and arch_gnttab_unmap_frames() are called in atomic context but were calling alloc_vm_area(), which might sleep. Also, if a driver attempts to allocate a grant ref from an interrupt and the table needs expanding, then the CPU may already be in lazy MMU mode and apply_to_page_range() will BUG when it tries to re-enable lazy MMU mode. These two functions are only used in PV guests. Introduce arch_gnttab_init() to allocate the virtual address space in advance. Avoid the use of apply_to_page_range() by saving and using the array of PTE addresses from the alloc_vm_area() call (which ensures that the required page tables are pre-allocated). Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-07-28  x86_64/entry/xen: Do not invoke espfix64 on Xen  (Andy Lutomirski)
This moves the espfix64 logic into native_iret. To make this work, it gets rid of the native patch for INTERRUPT_RETURN: INTERRUPT_RETURN on native kernels is now 'jmp native_iret'. This changes the 16-bit SS behavior on Xen from OOPSing to leaking some bits of the Xen hypervisor's RSP (I think). [ hpa: this is a nonzero cost on native, but probably not enough to measure. Xen needs to fix this in their own code, probably doing something equivalent to espfix64. ] Signed-off-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/7b8f1d8ef6597cb16ae004a43c56980a7de3cf94.1406129132.git.luto@amacapital.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: <stable@vger.kernel.org>
2014-07-27  Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull perf fixes from Thomas Gleixner: "A bunch of fixes for perf and kprobes:
  - revert a commit that caused a perf group regression
  - silence dmesg spam
  - fix kprobe probing errors on ia64 and ppc64
  - filter kprobe faults from userspace
  - lockdep fix for perf exit path
  - prevent perf #GP in KVM guest
  - correct perf event and filters"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kprobes: Fix "Failed to find blacklist" probing errors on ia64 and ppc64
  kprobes/x86: Don't try to resolve kprobe faults from userspace
  perf/x86/intel: Avoid spamming kernel log for BTS buffer failure
  perf/x86/intel: Protect LBR and extra_regs against KVM lying
  perf: Fix lockdep warning on process exit
  perf/x86/intel/uncore: Fix SNB-EP/IVT Cbox filter mappings
  perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge
  perf: Revert ("perf: Always destroy groups on exit")
2014-07-27  Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 fixes from Peter Anvin: "A couple of crash fixes, plus a fix that on 32 bits would cause a missing -ENOSYS for nonexistent system calls"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, cpu: Fix cache topology for early P4-SMT
  x86_32, entry: Store badsys error code in %eax
  x86, MCE: Robustify mcheck_init_device
2014-07-24  x86: Merge tag 'ras_urgent' into x86/urgent  (H. Peter Anvin)
Promote one fix for 3.16. This fix was necessary after 9c15a24b038f ("x86/mce: Improve mcheck_init_device() error handling") went in. What this patch did was, among other things, check the return value of misc_register and exit early if it encountered an error; the original code sloppily didn't do that. However, cef12ee52b05 ("xen/mce: Add mcelog support for Xen platform") made it so that xen's init routine xen_late_init_mcelog runs first. This was needed for the xen mcelog device, which is supposed to be independent from the baremetal one. Initially it was reported that misc_register() fails often on xen and that's why it needed fixing. However, it is *supposed* to fail by design when running in dom0, so that the xen mcelog device file gets registered first. And *then* you need the notifier *not* unregistered on the error path so that the timer does get deleted properly in the CPU hotplug notifier. Btw, this fix is needed also on baremetal in the unlikely event that misc_register(&mce_chrdev_device) fails there too. I was unsure whether to rush it in now and decided to delay it to 3.17. However, xen people wanted it promoted as it breaks xen when doing cpu hotplug there. So, after a bit of simmering in tip/master for initial smoke testing, let's move it to 3.16. It fixes a semi-regression which got introduced in 3.16, so no need for stable tagging. tip/x86/ras contains that exact same commit but we can't remove it there as it is not the last one. It won't cause any merge issues, as I confirmed locally, but I should state here the special situation of this one fix explicitly anyway. Thanks. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-23  x86, cpu: Fix cache topology for early P4-SMT  (Peter Zijlstra)
P4 systems with cpuid level < 4 can have SMT, but the cache topology description available (cpuid2) does not include SMP information. Now we know that SMT shares all cache levels, and therefore we can mark all available cache levels as shared. We do this by setting cpu_llc_id to ->phys_proc_id, since that's the same for each SMT thread. We can do this unconditionally since if there's no SMT it's still true: the one CPU shares cache with only itself. This fixes a problem where such CPUs report an incorrect LLC CPU mask. This in turn fixes a crash in the scheduler where the topology was built wrong; it assumes the LLC mask includes at least the SMT CPUs. Cc: Josh Boyer <jwboyer@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Bruno Wolff III <bruno@wolff.to> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140722133514.GM12054@laptop.lan Signed-off-by: H. Peter Anvin <hpa@zytor.com>
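A sketch of the fallback described above (the helper name is invented for illustration; per_cpu(), cpu_llc_id and struct cpuinfo_x86 are the usual kernel facilities): since phys_proc_id is identical for all SMT siblings in a package, using it as the LLC id marks every sibling as sharing the last-level cache.

static void p4_smt_llc_fallback(struct cpuinfo_x86 *c, unsigned int cpu)
{
        /* No cache-sharing info from CPUID: assume the whole package
         * (i.e. all SMT siblings) shares the last-level cache. */
        per_cpu(cpu_llc_id, cpu) = c->phys_proc_id;
}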
2014-07-22  x86_32, entry: Store badsys error code in %eax  (Sven Wegener)
Commit 554086d ("x86_32, entry: Do syscall exit work on badsys (CVE-2014-4508)") introduced a regression in the x86_32 syscall entry code, resulting in syscall() not returning proper errors for undefined syscalls on CPUs supporting the sysenter feature. The following code:
   > int result = syscall(666);
   > printf("result=%d errno=%d error=%s\n", result, errno, strerror(errno));
results in:
   > result=666 errno=0 error=Success
Obviously, the syscall return value is the called syscall number, but it should have been an ENOSYS error. When run under ptrace it behaves correctly, which makes it hard to debug in the wild:
   > result=-1 errno=38 error=Function not implemented
The %eax register is the return value register. For debugging via ptrace the syscall entry code stores the complete register context on the stack. The badsys handlers only store the ENOSYS error code in the ptrace register set and do not set %eax like a regular syscall handler would. The old resume_userspace call chain contains code that clobbers %eax and it restores %eax from the ptrace registers afterwards. The same goes for the ptrace-enabled call chain. When ptrace is not used, the syscall return value is the passed-in syscall number from the untouched %eax register. Use %eax as the return value register in syscall_badsys and sysenter_badsys, like a real syscall handler does, and have the caller push the value onto the stack for ptrace access. Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Link: http://lkml.kernel.org/r/alpine.LNX.2.11.1407221022380.31021@titan.int.lan.stealer.net Reviewed-and-tested-by: Andy Lutomirski <luto@amacapital.net> Cc: <stable@vger.kernel.org> # If 554086d is backported Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-07-21  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds)
Pull kvm fixes from Paolo Bonzini: "These are mostly PPC changes for 3.16-new things. However, there is an x86 change too and it is a regression from 3.14. As it only affects nested virtualization and there were other changes in this area in 3.16, I am not nominating it for 3.15-stable"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: Check for nested events if there is an injectable interrupt
  KVM: PPC: RTAS: Do byte swaps explicitly
  KVM: PPC: Book3S PR: Fix ABIv2 on LE
  KVM: PPC: Assembly functions exported to modules need _GLOBAL_TOC()
  PPC: Add _GLOBAL_TOC for 32bit
  KVM: PPC: BOOK3S: HV: Use base page size when comparing against slb value
  KVM: PPC: Book3E: Unlock mmu_lock when setting caching atttribute
2014-07-21  x86, MCE: Robustify mcheck_init_device  (Borislav Petkov)
BorisO reports that misc_register() fails often on xen. The current code unregisters the CPU hotplug notifier in that case. If then a CPU is offlined and onlined back again, we end up with a second timer running on that CPU, leading to soft lockups and system hangs. So let's leave the hotcpu notifier always registered - even if mce_device_create failed for some cores and never unreg it so that we can deal with the timer handling accordingly. Reported-and-Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: http://lkml.kernel.org/r/1403274493-1371-1-git-send-email-boris.ostrovsky@oracle.com Signed-off-by: Borislav Petkov <bp@suse.de>
2014-07-19  Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull locking fixes from Thomas Gleixner: "The locking department delivers:
  - A rather large and intrusive bundle of fixes to address serious performance regressions introduced by the new rwsem / mcs technology. Simpler solutions have been discussed, but they would have been ugly bandaids with more risk than doing the right thing.
  - Make the rwsem spin on owner technology opt-in for architectures and enable it only on the known to work ones.
  - A few fixes to the lockdep userspace library"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Add CONFIG_RWSEM_SPIN_ON_OWNER
  locking/mutex: Disable optimistic spinning on some architectures
  locking/rwsem: Reduce the size of struct rw_semaphore
  locking/rwsem: Rename 'activity' to 'count'
  locking/spinlocks/mcs: Micro-optimize osq_unlock()
  locking/spinlocks/mcs: Introduce and use init macro and function for osq locks
  locking/spinlocks/mcs: Convert osq lock to atomic_t to reduce overhead
  locking/spinlocks/mcs: Rename optimistic_spin_queue() to optimistic_spin_node()
  locking/rwsem: Allow conservative optimistic spinning when readers have lock
  tools/liblockdep: Account for bitfield changes in lockdeps lock_acquire
  tools/liblockdep: Remove debug print left over from development
  tools/liblockdep: Fix comparison of a boolean value with a value of 2
2014-07-18  Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 fixes from Peter Anvin: "A couple of key fixes and a few less critical ones. The main ones are:
  - add a .bss section to the PE/COFF headers when building with EFI stub
  - invoke the correct paravirt magic when building the espfix page tables
Unfortunately both of these areas also have at least one additional fix each still in the pipeline, but which are not yet ready to push"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Remove unused variable "polling"
  x86/espfix/xen: Fix allocation of pages for paravirt page tables
  x86/efi: Include a .bss section within the PE/COFF headers
  efi: fdt: Do not report an error during boot if UEFI is not available
  efi/arm64: efistub: remove local copy of linux_banner
2014-07-16  Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull scheduler fixes from Ingo Molnar: "A cpufreq lockup fix and a compiler warning fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix compiler warnings
  x86, tsc: Fix cpufreq lockup
2014-07-16  Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull perf fixes from Ingo Molnar: "Tooling fixes and an Intel PMU driver fixlet"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Do not allow optimized switch for non-cloned events
  perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
  perf symbols: Get kernel start address by symbol name
  perf tools: Fix segfault in cumulative.callchain report
2014-07-16  locking/mutex: Disable optimistic spinning on some architectures  (Peter Zijlstra)
The optimistic spin code assumes regular stores and cmpxchg() play nice; this is found to not be true for at least: parisc, sparc32, tile32, metag-lock1, arc-!llsc and hexagon. There is further wreckage, but this in particular seemed easy to trigger, so blacklist this. Opt in for known good archs. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Reported-by: Mikulas Patocka <mpatocka@redhat.com> Cc: David Miller <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Jason Low <jason.low2@hp.com> Cc: Waiman Long <waiman.long@hp.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: John David Anglin <dave.anglin@bell.net> Cc: James Hogan <james.hogan@imgtec.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: stable@vger.kernel.org Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: sparclinux@vger.kernel.org Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16  kprobes/x86: Don't try to resolve kprobe faults from userspace  (Andy Lutomirski)
This commit: commit 6f6343f53d133bae516caf3d254bce37d8774625 Author: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Date: Thu Apr 17 17:17:33 2014 +0900 kprobes/x86: Call exception handlers directly from do_int3/do_debug appears to have inadvertently dropped a check that the int3 came from kernel mode. Trying to dereference addr when addr is user-controlled is completely bogus. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Link: http://lkml.kernel.org/r/c4e339882c121aa76254f2adde3fcbdf502faec2.1405099506.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16  perf/x86/intel: Avoid spamming kernel log for BTS buffer failure  (David Rientjes)
It's unnecessary to excessively spam the kernel log anytime the BTS buffer cannot be allocated, so make this allocation __GFP_NOWARN. The user probably will want to at least find some artifact that the allocation has failed in the past, probably due to fragmentation because of its large size, when it's not allocated at bootstrap. Thus, add a WARN_ONCE() so something is left behind for them to understand why perf commands that require PEBS are not working properly. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1406301600460.26302@chino.kir.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16  perf/x86/intel: Protect LBR and extra_regs against KVM lying  (Kan Liang)
With -cpu host, KVM reports LBR and extra_regs support if the host has support. When the guest perf driver tries to access LBR or extra_regs MSRs, it #GPs on all MSR accesses, since KVM doesn't handle LBR and extra_regs support. So check the related MSR access rights once at initialization time to avoid the erroneous accesses at runtime. For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y (for the host kernel), and CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for the guest kernel). Start the guest with -cpu host. Run perf record with --branch-any or --branch-filter in the guest to trigger the LBR #GP. Run perf stat with offcore events (e.g. LLC-loads/LLC-load-misses ...) in the guest to trigger the offcore_rsp #GP. Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com> Cc: Mark Davies <junk@eslaf.co.uk> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Cc: Yan, Zheng <zheng.z.yan@intel.com> Link: http://lkml.kernel.org/r/1405365957-20202-1-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
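A sketch of the kind of boot-time probe this implies (illustrative, not the driver's exact helper; rdmsrl_safe()/wrmsrl_safe() are the standard fault-tolerant MSR accessors): write a modified value, read it back, and only trust the MSR if the access does not fault and the write actually sticks, otherwise disable the feature instead of #GP'ing at runtime.

static bool msr_really_works(unsigned int msr)
{
        u64 orig, readback;
        bool works;

        if (rdmsrl_safe(msr, &orig))
                return false;                   /* read faulted */
        if (wrmsrl_safe(msr, orig ^ 0x3ULL))
                return false;                   /* write faulted */
        rdmsrl_safe(msr, &readback);
        works = (readback == (orig ^ 0x3ULL));  /* did the write stick? */
        wrmsrl_safe(msr, orig);                 /* restore the original value */
        return works;
}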
2014-07-16  perf/x86/intel/uncore: Fix SNB-EP/IVT Cbox filter mappings  (Stephane Eranian)
This patch fixes the SNB-EP and IVT Cbox filter mapping table. The table controls which filters are supported by which events. There were several mistakes in those tables causing some filters to be ignored, such as NID on TOR_INSERTS. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: zheng.z.yan@intel.com Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140630144624.GA2604@quad Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16  perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge  (Vince Weaver)
This was discussed back in February: https://lkml.org/lkml/2014/2/18/956 But I never saw a patch come out of it. On IvyBridge we share the SandyBridge cache event tables, but the dTLB-load-miss event is not compatible. Patch it up after the fact to the proper DTLB_LOAD_MISSES.DEMAND_LD_MISS_CAUSES_A_WALK Signed-off-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1407141528200.17214@vincent-weaver-1.umelst.maine.edu Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16  x86: Remove unused variable "polling"  (Paul Bolle)
Compile tested. "polling" is unused since commit f80c5b39b80a ("sched/idle, x86: Switch from TS_POLLING to TIF_POLLING_NRFLAG"). Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jkosina@suse.cz> Link: http://lkml.kernel.org/r/1404138749.2978.6.camel@x41 Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-14  x86/espfix/xen: Fix allocation of pages for paravirt page tables  (Boris Ostrovsky)
init_espfix_ap() is currently off by one level when informing hypervisor that allocated pages will be used for ministacks' page tables. The most immediate effect of this on a PV guest is that if 'stack_page = __get_free_page()' returns a non-zeroed-out page the hypervisor will refuse to use it for a page table (which it shouldn't be anyway). This will result in warnings by both Xen and Linux. More importantly, a subsequent write to that page (again, by a PV guest) is likely to result in fatal page fault. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: http://lkml.kernel.org/r/1404926298-5565-1-git-send-email-boris.ostrovsky@oracle.com Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: <stable@vger.kernel.org>
2014-07-14  Merge tag 'efi-urgent' into x86/urgent  (H. Peter Anvin)
* Remove a duplicate copy of linux_banner from the arm64 EFI stub which, apart from reducing code duplication also stops the arm64 stub being rebuilt every time make is invoked - Ard Biesheuvel
* Fix the EFI fdt code to not report a boot error if UEFI is unavailable since booting without UEFI parameters is a valid use case for non-UEFI platforms - Catalin Marinas
* Include a .bss section in the EFI boot stub PE/COFF headers to fix a memory corruption bug - Michael Brown
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-11  Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 fixes from Peter Anvin: "A couple of further build fixes for the VDSO code. This is turning into a bit of a headache, and Andy has already come up with a more ultimate cleanup, but most likely that is 3.17 material"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86-32, vdso: Fix vDSO build error due to missing align_vdso_addr()
  x86-64, vdso: Fix vDSO build breakage due to empty .rela.dyn
2014-07-10  x86-32, vdso: Fix vDSO build error due to missing align_vdso_addr()  (Jan Beulich)
Relying on static functions used just once to get inlined (and subsequently have dead code paths eliminated) is wrong: Compilers are free to decide whether they do this, regardless of optimization level. With this not happening for vdso_addr() (observed with gcc 4.1.x), an unresolved reference to align_vdso_addr() causes the build to fail. [ hpa: vdso_addr() is never actually used on x86-32, as calculate_addr in map_vdso() is always false. It ought to be possible to clean this up further, but this fixes the immediate problem. ] Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/53B5863B02000078000204D5@mail.emea.novell.com Acked-by: Andy Lutomirski <luto@amacapital.net> Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Tested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-10  x86-64, vdso: Fix vDSO build breakage due to empty .rela.dyn  (Jan Beulich)
Certain ld versions (observed with 2.20.0) put an empty .rela.dyn section into shared object files, breaking the assumption on the number of sections to be copied to the final output. Simply discard any empty SHT_REL and SHT_RELA sections to address this. Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/53B5861E02000078000204D1@mail.emea.novell.com Acked-by: Andy Lutomirski <luto@amacapital.net> Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Tested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-10  Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6  (Linus Torvalds)
Pull crypto fixes from Herbert Xu: "This push fixes an error in sha512_ssse3 that leads to incorrect output as well as a memory leak in caam_jr when the module is unloaded"
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: caam - fix memleak in caam_jr module
  crypto: sha512_ssse3 - fix byte count to bit count conversion
2014-07-10  x86/efi: Include a .bss section within the PE/COFF headers  (Michael Brown)
The PE/COFF headers currently describe only the initialised-data portions of the image, and result in no space being allocated for the uninitialised-data portions. Consequently, the EFI boot stub will end up overwriting unexpected areas of memory, with unpredictable results. Fix by including a .bss section in the PE/COFF headers (functionally equivalent to the init_size field in the bzImage header). Signed-off-by: Michael Brown <mbrown@fensystems.co.uk> Cc: Thomas Bächler <thomas@archlinux.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: <stable@vger.kernel.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com>