aboutsummaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2010-09-20x86: hpet: Work around hardware stupidityThomas Gleixner
commit 54ff7e595d763d894104d421b103a89f7becf47c upstream. This more or less reverts commits 08be979 (x86: Force HPET readback_cmp for all ATI chipsets) and 30a564be (x86, hpet: Restrict read back to affected ATI chipsets) to the status of commit 8da854c (x86, hpet: Erratum workaround for read after write of HPET comparator). The delta to commit 8da854c is mostly comments and the change from WARN_ONCE to printk_once as we know the call path of this function already. This needs really in depth explanation: First of all the HPET design is a complete failure. Having a counter compare register which generates an interrupt on matching values forces the software to do at least one superfluous readback of the counter register. While it is nice in theory to program "absolute" time events it is practically useless because the timer runs at some absurd frequency which can never be matched to real world units. So we are forced to calculate a relative delta and this forces a readout of the actual counter value, adding the delta and programming the compare register. When the delta is small enough we run into the danger that we program a compare value which is already in the past. Due to the compare for equal nature of HPET we need to read back the counter value after writing the compare rehgister (btw. this is necessary for absolute timeouts as well) to make sure that we did not miss the timer event. We try to work around that by setting the minimum delta to a value which is larger than the theoretical time which elapses between the counter readout and the compare register write, but that's only true in theory. A NMI or SMI which hits between the readout and the write can easily push us beyond that limit. This would result in waiting for the next HPET timer interrupt until the 32bit wraparound of the counter happens which takes about 306 seconds. So we designed the next event function to look like: match = read_cnt() + delta; write_compare_ref(match); return read_cnt() < match ? 0 : -ETIME; At some point we got into trouble with certain ATI chipsets. Even the above "safe" procedure failed. The reason was that the write to the compare register was delayed probably for performance reasons. The theory was that they wanted to avoid the synchronization of the write with the HPET clock, which is understandable. So the write does not hit the compare register directly instead it goes to some intermediate register which is copied to the real compare register in sync with the HPET clock. That opens another window for hitting the dreaded "wait for a wraparound" problem. To work around that "optimization" we added a read back of the compare register which either enforced the update of the just written value or just delayed the readout of the counter enough to avoid the issue. We unfortunately never got any affirmative info from ATI/AMD about this. One thing is sure, that we nuked the performance "optimization" that way completely and I'm pretty sure that the result is worse than before some HW folks came up with those. Just for paranoia reasons I added a check whether the read back compare register value was the same as the value we wrote right before. That paranoia check triggered a couple of years after it was added on an Intel ICH9 chipset. Venki added a workaround (commit 8da854c) which was reading the compare register twice when the first check failed. We considered this to be a penalty in general and restricted the readback (thus the wasted CPU cycles) to the known to be affected ATI chipsets. This turned out to be a utterly wrong decision. 2.6.35 testers experienced massive problems and finally one of them bisected it down to commit 30a564be which spured some further investigation. Finally we got confirmation that the write to the compare register can be delayed by up to two HPET clock cycles which explains the problems nicely. All we can do about this is to go back to Venki's initial workaround in a slightly modified version. Just for the record I need to say, that all of this could have been avoided if hardware designers and of course the HPET committee would have thought about the consequences for a split second. It's out of my comprehension why designing a working timer is so hard. There are two ways to achieve it: 1) Use a counter wrap around aware compare_reg <= counter_reg implementation instead of the easy compare_reg == counter_reg Downsides: - It needs more silicon. - It needs a readout of the counter to apply a relative timeout. This is necessary as the counter does not run in any useful (and adjustable) frequency and there is no guarantee that the counter which is used for timer events is the same which is used for reading the actual time (and therefor for calculating the delta) Upsides: - None 2) Use a simple down counter for relative timer events Downsides: - Absolute timeouts are not possible, which is not a problem at all in the context of an OS and the expected max. latencies/jitter (also see Downsides of #1) Upsides: - It needs less or equal silicon. - It works ALWAYS - It is way faster than a compare register based solution (One write versus one write plus at least one and up to four reads) I would not be so grumpy about all of this, if I would not have been ignored for many years when pointing out these flaws to various hardware folks. I really hate timers (at least those which seem to be designed by janitors). Though finally we got a reasonable explanation plus a solution and I want to thank all the folks involved in chasing it down and providing valuable input to this. Bisected-by: Nix <nix@esperi.org.uk> Reported-by: Artur Skawina <art.08.09@gmail.com> Reported-by: Damien Wyart <damien.wyart@free.fr> Reported-by: John Drescher <drescherjm@gmail.com> Cc: Venkatesh Pallipadi <venki@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Borislav Petkov <borislav.petkov@amd.com> Cc: stable@kernel.org Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-09-20x86-64, compat: Retruncate rax after ia32 syscall entry tracingRoland McGrath
commit eefdca043e8391dcd719711716492063030b55ac upstream. In commit d4d6715, we reopened an old hole for a 64-bit ptracer touching a 32-bit tracee in system call entry. A %rax value set via ptrace at the entry tracing stop gets used whole as a 32-bit syscall number, while we only check the low 32 bits for validity. Fix it by truncating %rax back to 32 bits after syscall_trace_enter, in addition to testing the full 64 bits as has already been added. Reported-by: Ben Hawkes <hawkes@sota.gen.nz> Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-09-20compat: Make compat_alloc_user_space() incorporate the access_ok()H. Peter Anvin
commit c41d68a513c71e35a14f66d71782d27a79a81ea6 upstream. compat_alloc_user_space() expects the caller to independently call access_ok() to verify the returned area. A missing call could introduce problems on some architectures. This patch incorporates the access_ok() check into compat_alloc_user_space() and also adds a sanity check on the length. The existing compat_alloc_user_space() implementations are renamed arch_compat_alloc_user_space() and are used as part of the implementation of the new global function. This patch assumes NULL will cause __get_user()/__put_user() to either fail or access userspace on all architectures. This should be followed by checking the return value of compat_access_user_space() for NULL in the callers, at which time the access_ok() in the callers can also be removed. Reported-by: Ben Hawkes <hawkes@sota.gen.nz> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Chris Metcalf <cmetcalf@tilera.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: James Bottomley <jejb@parisc-linux.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-09-20x86-64, compat: Test %rax for the syscall number, not %eaxH. Peter Anvin
commit 36d001c70d8a0144ac1d038f6876c484849a74de upstream. On 64 bits, we always, by necessity, jump through the system call table via %rax. For 32-bit system calls, in theory the system call number is stored in %eax, and the code was testing %eax for a valid system call number. At one point we loaded the stored value back from the stack to enforce zero-extension, but that was removed in checkin d4d67150165df8bf1cc05e532f6efca96f907cab. An actual 32-bit process will not be able to introduce a non-zero-extended number, but it can happen via ptrace. Instead of re-introducing the zero-extension, test what we are actually going to use, i.e. %rax. This only adds a handful of REX prefixes to the code. Reported-by: Ben Hawkes <hawkes@sota.gen.nz> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Roland McGrath <roland@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-09-20x86, tsc: Fix a preemption leak in restore_sched_clock_state()Peter Zijlstra
commit 55496c896b8a695140045099d4e0175cf09d4eae upstream. Doh, a real life genuine preemption leak.. This caused a suspend failure. Reported-bisected-and-tested-by-the-invaluable: Jeff Chua <jeff.chua.linux@gmail.com> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Nico Schottelius <nico-linux-20100709@schottelius.org> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Florian Pritz <flo@xssn.at> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Len Brown <lenb@kernel.org> sleep states LKML-Reference: <1284150773.402.122.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-09-20oprofile, x86: fix init_sysfs() function stubRobert Richter
commit 269f45c25028c75fe10e6d9be86e7202ab461fbc upstream. The use of the return value of init_sysfs() with commit 10f0412 oprofile, x86: fix init_sysfs error handling discovered the following build error for !CONFIG_PM: .../linux/arch/x86/oprofile/nmi_int.c: In function ‘op_nmi_init’: .../linux/arch/x86/oprofile/nmi_int.c:784: error: expected expression before ‘do’ make[2]: *** [arch/x86/oprofile/nmi_int.o] Error 1 make[1]: *** [arch/x86/oprofile] Error 2 This patch fixes this. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-09-20oprofile, x86: fix init_sysfs error handlingRobert Richter
commit 10f0412f57f2a76a90eff4376f59cbb0a39e4e18 upstream. On failure init_sysfs() might not properly free resources. The error code of the function is not checked. And, when reinitializing the exit function might be called twice. This patch fixes all this. Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-09-20PCI: MSI: Restore read_msi_msg_desc(); add get_cached_msi_msg_desc()Ben Hutchings
commit 30da55242818a8ca08583188ebcbaccd283ad4d9 upstream. commit 2ca1af9aa3285c6a5f103ed31ad09f7399fc65d7 "PCI: MSI: Remove unsafe and unnecessary hardware access" changed read_msi_msg_desc() to return the last MSI message written instead of reading it from the device, since it may be called while the device is in a reduced power state. However, the pSeries platform code really does need to read messages from the device, since they are initially written by firmware. Therefore: - Restore the previous behaviour of read_msi_msg_desc() - Add new functions get_cached_msi_msg{,_desc}() which return the last MSI message written - Use the new functions where appropriate Acked-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-09-20perf, x86, Pentium4: Clear the P4_CCCR_FORCE_OVF flagLin Ming
commit 8d330919927ea31fa083b5a80084dc991da813a0 upstream. If on Pentium4 CPUs the FORCE_OVF flag is set then an NMI happens on every event, which can generate a flood of NMIs. Clear it. Reported-by: Vince Weaver <vweaver1@eecs.utk.edu> Signed-off-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-09-20x86, tsc, sched: Recompute cyc2ns_offset's during resume from sleep statesSuresh Siddha
commit cd7240c0b900eb6d690ccee088a6c9b46dae815a upstream. TSC's get reset after suspend/resume (even on cpu's with invariant TSC which runs at a constant rate across ACPI P-, C- and T-states). And in some systems BIOS seem to reinit TSC to arbitrary large value (still sync'd across cpu's) during resume. This leads to a scenario of scheduler rq->clock (sched_clock_cpu()) less than rq->age_stamp (introduced in 2.6.32). This leads to a big value returned by scale_rt_power() and the resulting big group power set by the update_group_power() is causing improper load balancing between busy and idle cpu's after suspend/resume. This resulted in multi-threaded workloads (like kernel-compilation) go slower after suspend/resume cycle on core i5 laptops. Fix this by recomputing cyc2ns_offset's during resume, so that sched_clock() continues from the point where it was left off during suspend. Reported-by: Florian Pritz <flo@xssn.at> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1282262618.2675.24.camel@sbsiddha-MOBL3.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26oprofile: add support for Intel processor model 30Josh Hunt
commit a7c55cbee0c1bae9bf5a15a08300e91d88706e45 upstream. Newer Intel processors identifying themselves as model 30 are not recognized by oprofile. <cpuinfo snippet> model : 30 model name : Intel(R) Xeon(R) CPU X3470 @ 2.93GHz </cpuinfo snippet> Running oprofile on these machines gives the following: + opcontrol --init + opcontrol --list-events oprofile: available events for CPU type "Intel Architectural Perfmon" See Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3B (Document 253669) Chapter 18 for architectural perfmon events This is a limited set of fallback events because oprofile doesn't know your CPU CPU_CLK_UNHALTED: (counter: all) Clock cycles when not halted (min count: 6000) INST_RETIRED: (counter: all) number of instructions retired (min count: 6000) LLC_MISSES: (counter: all) Last level cache demand requests from this core that missed the LLC (min count: 6000) Unit masks (default 0x41) ---------- 0x41: No unit mask LLC_REFS: (counter: all) Last level cache demand requests from this core (min count: 6000) Unit masks (default 0x4f) ---------- 0x4f: No unit mask BR_MISS_PRED_RETIRED: (counter: all) number of mispredicted branches retired (precise) (min count: 500) + opcontrol --shutdown Tested using oprofile 0.9.6. Signed-off-by: Josh Hunt <johunt@akamai.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26Oprofile: Change CPUIDS from decimal to hex, and add some commentsJohn Villalovos
commit 45c34e05c4e3d36e7c44e790241ea11a1d90d54e upstream. Back when the patch was submitted for "Add Xeon 7500 series support to oprofile", Robert Richter had asked for a followon patch that converted all the CPU ID values to hex. I have done that here for the "i386/core_i7" and "i386/atom" class processors in the ppro_init() function and also added some comments on where to find documentation on the Intel processors. Signed-off-by: John L. Villalovos <john.l.villalovos@intel.com> Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26x86, asm: Clean up and simplify set_64bit()H. Peter Anvin
commit 69309a05907546fb686b251d4ab041c26afe1e1d upstream. Clean up and simplify set_64bit(). This code is quite old (1.3.11) and contains a fair bit of auxilliary machinery that current versions of gcc handle just fine automatically. Worse, the auxilliary machinery can actually cause an unnecessary spill to memory. Furthermore, the loading of the old value inside the loop in the 32-bit case is unnecessary: if the value doesn't match, the CMPXCHG8B instruction will already have loaded the "new previous" value for us. Clean up the comment, too, and remove page references to obsolete versions of the Intel SDM. Signed-off-by: H. Peter Anvin <hpa@zytor.com> LKML-Reference: <tip-*@vger.kernel.org> Tested-by: Mark Stanovich <mrktimber@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26x86, apic: Fix apic=debug boot crashDaniel Kiper
commit 05e407603e527f9d808dd3866d3a17c2ce4dfcc5 upstream. Fix a boot crash when apic=debug is used and the APIC is not properly initialized. This issue appears during Xen Dom0 kernel boot but the fix is generic and the crash could occur on real hardware as well. Signed-off-by: Daniel Kiper <dkiper@net-space.pl> Cc: xen-devel@lists.xensource.com Cc: konrad.wilk@oracle.com Cc: jeremy@goop.org LKML-Reference: <20100819224616.GB9967@router-fw-old.local.net-space.pl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26perf, x86: Fix Intel-nhm PMU programming errata workaroundZhang, Yanmin
commit 351af0725e5222e35741011d1ea62215c1ed06db upstream. Fix the Errata AAK100/AAP53/BD53 workaround, the officialy documented workaround we implemented in: 11164cd: perf, x86: Add Nehelem PMU programming errata workaround doesn't actually work fully and causes a stuck PMU state under load and non-functioning perf profiling. A functional workaround was found by trial & error. Affects all Nehalem-class Intel PMUs. Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1281073148.2125.63.camel@ymzhang.sh.intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26x86, hotplug: Serialize CPU hotplug to avoid bringup concurrency issuesBorislav Petkov
commit d7c53c9e822a4fefa13a0cae76f3190bfd0d5c11 upstream. When testing cpu hotplug code on 32-bit we kept hitting the "CPU%d: Stuck ??" message due to multiple cores concurrently accessing the cpu_callin_mask, among others. Since these codepaths are not protected from concurrent access due to the fact that there's no sane reason for making an already complex code unnecessarily more complex - we hit the issue only when insanely switching cores off- and online - serialize hotplugging cores on the sysfs level and be done with it. [ v2.1: fix !HOTPLUG_CPU build ] Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> LKML-Reference: <20100819181029.GC17171@aftab> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26x86, apic: Map the local apic when parsing the MP table.Eric W. Biederman
commit 5989cd6a1cbf86587edcc856791f960978087311 upstream. This fixes a regression in 2.6.35 from 2.6.34, that is present for select models of Intel cpus when people are using an MP table. The commit cf7500c0ea133d66f8449d86392d83f840102632 "x86, ioapic: In mpparse use mp_register_ioapic" started calling mp_register_ioapic from MP_ioapic_info. An extremely simple change that was obviously correct. Unfortunately mp_register_ioapic did just a little more than the previous hand crafted code and so we gained this call path. The problem call path is: MP_ioapic_info() mp_register_ioapic() io_apic_unique_id() io_apic_get_unique_id() get_physical_broadcast() modern_apic() lapic_get_version() apic_read(APIC_LVR) Which turned out to be a problem because the local apic was not mapped, at that point, unlike the similar point in the ACPI parsing code. This problem is fixed by mapping the local apic when parsing the mptable as soon as we reasonably can. Looking at the number of places we setup the fixmap for the local apic, I see some serious simplification opportunities. For the moment except for not duplicating the setting up of the fixmap in init_apic_mappings, I have not acted on them. The regression from 2.6.34 is tracked in bug https://bugzilla.kernel.org/show_bug.cgi?id=16173 Reported-by: David Hill <hilld@binarystorm.net> Reported-by: Tvrtko Ursulin <tvrtko.ursulin@sophos.com> Tested-by: Tvrtko Ursulin <tvrtko.ursulin@sophos.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> LKML-Reference: <m1eiee86jg.fsf_-_@fess.ebiederm.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26perf, x86: P4 PMU -- update nmi irq statistics and unmask lvt entry properlyCyrill Gorcunov
commit 1c250d709fdc8aa5bf42d90be99428a01a256a55 upstream. In case if last active performance counter is not overflowed at moment of NMI being triggered by another counter, the irq statistics may miss an update stage. As a more serious consequence -- apic quirk may not be triggered so apic lvt entry stay masked. Tested-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Stephane Eranian <eranian@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20100805150917.GA6311@lenovo> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26x86, asm: Refactor atomic64_386_32.S to support old binutils and be cleanerLuca Barbieri
commit 30246557a06bb20618bed906a06d1e1e0faa8bb4 upstream. The old code didn't work on binutils 2.12 because setting a symbol to a register apparently requires a fairly recent version. This commit refactors the code to use the C preprocessor instead, and in the process makes the whole code a bit easier to understand. The object code produced is unchanged as expected. This fixes kernel bugzilla 16506. Reported-by: Dieter Stussy <kd6lvw+software@kd6lvw.ampr.org> Signed-off-by: Luca Barbieri <luca@luca-barbieri.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> LKML-Reference: <tip-*@git.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-20vmware: fix build error in vmware.cGreg Kroah-Hartman
This fixes a build error reported in vmware.c due to commit 9f242dc10e0c3c1eb32d8c83c18650a35fd7f80d Reported-by: Sven Joachim <svenjoac@gmx.de> Cc: Alok Kataria <akataria@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13x86: don't send SIGBUS for kernel page faultsLinus Torvalds
commit 96054569190bdec375fe824e48ca1f4e3b53dd36 upstream. It's wrong for several reasons, but the most direct one is that the fault may be for the stack accesses to set up a previous SIGBUS. When we have a kernel exception, the kernel exception handler does all the fixups, not some user-level signal handler. Even apart from the nested SIGBUS issue, it's also wrong to give out kernel fault addresses in the signal handler info block, or to send a SIGBUS when a system call already returns EFAULT. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13x86, mtrr: Use stop machine context to rendezvous all the cpu'sSuresh Siddha
commit 68f202e4e87cfab4439568bf397fcc5c7cf8d729 upstream. Use the stop machine context rather than IPI's to rendezvous all the cpus for MTRR initialization that happens during cpu bringup or for MTRR modifications during runtime. This avoids deadlock scenario (reported by Prarit) like: cpu A holds a read_lock (tasklist_lock for example) with irqs enabled cpu B waits for the same lock with irqs disabled using write_lock_irq cpu C doing set_mtrr() (during AP bringup for example), which will try to rendezvous all the cpus using IPI's This will result in C and A come to the rendezvous point and waiting for B. B is stuck forever waiting for the lock and thus not reaching the rendezvous point. Using stop cpu (run in the process context of per cpu based keventd) to do this rendezvous, avoids this deadlock scenario. Also make sure all the cpu's are in the rendezvous handler before we proceed with the local_irq_save() on each cpu. This lock step disabling irqs on all the cpus will avoid other deadlock scenarios (for example involving with the blocking smp_call_function's etc). [ This problem is very old. Marking -stable only for 2.6.35 as the stop_one_cpu_nowait() API is present only in 2.6.35. Any older kernel interested in this fix need to do some more work in backporting this patch. ] Reported-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> LKML-Reference: <1280515602.2682.10.camel@sbsiddha-MOBL3.sc.intel.com> Acked-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13x86: Add memory modify constraints to xchg() and cmpxchg()H. Peter Anvin
commit 113fc5a6e8c2288619ff7e8187a6f556b7e0d372 upstream. xchg() and cmpxchg() modify their memory operands, not merely read them. For some versions of gcc the "memory" clobber has apparently dealt with the situation, but not for all. Originally-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: Glauber Costa <glommer@redhat.com> Cc: Avi Kivity <avi@redhat.com> Cc: Peter Palfrader <peter@palfrader.org> Cc: Greg KH <gregkh@suse.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Zachary Amsden <zamsden@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> LKML-Reference: <4C4F7277.8050306@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13x86/PCI: use host bridge _CRS info on ASRock ALiveSATA2-GLANBjorn Helgaas
commit 2491762cfb475dbdfa3db11ebea6de49f58b7fac upstream. This DMI quirk turns on "pci=use_crs" for the ALiveSATA2-GLAN because amd_bus.c doesn't handle this system correctly. The system has a single HyperTransport I/O chain, but has two PCI host bridges to buses 00 and 80. amd_bus.c learns the MMIO range associated with buses 00-ff and that this range is routed to the HT chain hosted at node 0, link 0: bus: [00, ff] on node 0 link 0 bus: 00 index 1 [mem 0x80000000-0xfcffffffff] This includes the address space for both bus 00 and bus 80, and amd_bus.c assumes it's all routed to bus 00. We find device 80:01.0, which BIOS left in the middle of that space, but we don't find a bridge from bus 00 to bus 80, so we conclude that 80:01.0 is unreachable from bus 00, and we move it from the original, working, address to something outside the bus 00 aperture, which does not work: pci 0000:80:01.0: reg 10: [mem 0xfebfc000-0xfebfffff 64bit] pci 0000:80:01.0: BAR 0: assigned [mem 0xfd00000000-0xfd00003fff 64bit] The BIOS told us everything we need to know to handle this correctly, so we're better off if we just pay attention, which lets us leave the 80:01.0 device at the original, working, address: ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-7f]) pci_root PNP0A03:00: host bridge window [mem 0x80000000-0xff37ffff] ACPI: PCI Root Bridge [PCI1] (domain 0000 [bus 80-ff]) pci_root PNP0A08:00: host bridge window [mem 0xfebfc000-0xfebfffff] This was a regression between 2.6.33 and 2.6.34. In 2.6.33, amd_bus.c was used only when we found multiple HT chains. 3e3da00c01d050, which enabled amd_bus.c even on systems with a single HT chain, caused this failure. This quirk was written by Graham. If we ever enable "pci=use_crs" for machines from 2006 or earlir, this quirk should be removed. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=16007 Reported-by: Graham Ramsey <ramsey.graham@ntlworld.com> Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13x86, kmmio/mmiotrace: Fix double free of kmmio_fault_pagesMarcin Slusarz
commit 8b8f79b927b6b302bb65fb8c56e7a19be5fbdbef upstream. After every iounmap mmiotrace has to free kmmio_fault_pages, but it can't do it directly, so it defers freeing by RCU. It usually works, but when mmiotraced code calls ioremap-iounmap multiple times without sleeping between (so RCU won't kick in and start freeing) it can be given the same virtual address, so at every iounmap mmiotrace will schedule the same pages for release. Obviously it will explode on second free. Fix it by marking kmmio_fault_pages which are scheduled for release and not adding them second time. Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Tested-by: Marcin Kocielnicki <koriakin@0x04.net> Tested-by: Shinpei KATO <shinpei@il.is.s.u-tokyo.ac.jp> Acked-by: Pekka Paalanen <pq@iki.fi> Cc: Stuart Bennett <stuart@freedesktop.org> Cc: Marcin Kocielnicki <koriakin@0x04.net> Cc: nouveau@lists.freedesktop.org LKML-Reference: <20100613215654.GA3829@joi.lan> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13x86, vmware: Preset lpj values when on VMware.Alok Kataria
commit 9f242dc10e0c3c1eb32d8c83c18650a35fd7f80d upstream. When running on VMware's platform, we have seen situations where the AP's try to calibrate the lpj values and fail to get good calibration runs becasue of timing issues. As a result delays don't work correctly on all cpus. The solutions is to set preset_lpj value based on the current tsc frequency value. This is similar to what KVM does as well. Signed-off-by: Alok N Kataria <akataria@vmware.com> LKML-Reference: <1280790637.14933.29.camel@ank32.eng.vmware.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-10xen: drop xen_sched_clock in favour of using plain wallclock timeJeremy Fitzhardinge
commit 8a22b9996b001c88f2bfb54c6de6a05fc39e177a upstream. xen_sched_clock only counts unstolen time. In principle this should be useful to the Linux scheduler so that it knows how much time a process actually consumed. But in practice this doesn't work very well as the scheduler expects the sched_clock time to be synchronized between cpus. It also uses sched_clock to measure the time a task spends sleeping, in which case "unstolen time" isn't meaningful. So just use plain xen_clocksource_read to return wallclock nanoseconds for sched_clock. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-28x86,kgdb: Fix hw breakpoint regressionJason Wessel
HW breakpoints events stopped working correctly with kgdb as a result of commit: 018cbffe6819f6f8db20a0a3acd9bab9bfd667e4 (Merge commit 'v2.6.33' into perf/core). The regression occurred because the behavior changed for setting NOTIFY_STOP as the return value to the die notifier if the breakpoint was known to the HW breakpoint API. Because kgdb is using the HW breakpoint API to register HW breakpoints slots, it must also now implement the overflow_handler call back else kgdb does not get to see the events from the die notifier. The kgdb_ll_trap function will be changed to be general purpose code which can allow an easy way to implement the hw_breakpoint API overflow call back. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Dongdong Deng <dongdong.deng@windriver.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-07-26Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Do not try to disable hpet if it hasn't been initialized before x86, i8259: Only register sysdev if we have a real 8259 PIC
2010-07-26Merge branch 'fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq: [CPUFREQ] powernow-k8: Limit Pstate transition latency check [CPUFREQ] Fix PCC driver error path [CPUFREQ] fix double freeing in error path of pcc-cpufreq [CPUFREQ] pcc driver should check for pcch method before calling _OSC [CPUFREQ] fix memory leak in cpufreq_add_dev [CPUFREQ] revert "[CPUFREQ] remove rwsem lock from CPUFREQ_GOV_STOP call (second call site)"
2010-07-26[CPUFREQ] powernow-k8: Limit Pstate transition latency checkBorislav Petkov
The Pstate transition latency check was added for broken F10h BIOSen which wrongly contain a value of 0 for transition and bus master latency. Fam11h and later, however, (will) have similar transition latency so extend that behavior for them too. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Signed-off-by: Dave Jones <davej@redhat.com>
2010-07-26[CPUFREQ] Fix PCC driver error pathMatthew Garrett
The PCC cpufreq driver unmaps the mailbox address range if any CPUs fail to initialise, but doesn't do anything to remove the registered CPUs from the cpufreq core resulting in failures further down the line. We're better off simply returning a failure - the cpufreq core will unregister us cleanly if we end up with no successfully registered CPUs. Tidy up the failure path and also add a sanity check to ensure that the firmware gives us a realistic frequency - the core deals badly with that being set to 0. Signed-off-by: Matthew Garrett <mjg@redhat.com> Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com> Signed-off-by: Dave Jones <davej@redhat.com>
2010-07-26[CPUFREQ] fix double freeing in error path of pcc-cpufreqDaniel J Blueman
Prevent double freeing on error path. Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com> Signed-off-by: Dave Jones <davej@redhat.com>
2010-07-26[CPUFREQ] pcc driver should check for pcch method before calling _OSCMatthew Garrett
The pcc specification documents an _OSC method that's incompatible with the one defined as part of the ACPI spec. This shouldn't be a problem as both are supposed to be guarded with a UUID. Unfortunately approximately nobody (including HP, who wrote this spec) properly check the UUID on entry to the _OSC call. Right now this could result in surprising behaviour if the pcc driver performs an _OSC call on a machine that doesn't implement the pcc specification. Check whether the PCCH method exists first in order to reduce this probability. Signed-off-by: Matthew Garrett <mjg@redhat.com> Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com> Signed-off-by: Dave Jones <davej@redhat.com>
2010-07-26Merge branch 'kvm-updates/2.6.35' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
* 'kvm-updates/2.6.35' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: Use kmalloc() instead of vmalloc() for KVM_[GS]ET_MSR KVM: MMU: fix conflict access permissions in direct sp
2010-07-24Merge branch 'bugzilla-16396' into releaseLen Brown
2010-07-24ACPI / Sleep: Allow the NVS saving to be skipped during suspend to RAMRafael J. Wysocki
Commit 2a6b69765ad794389f2fc3e14a0afa1a995221c2 (ACPI: Store NVS state even when entering suspend to RAM) caused the ACPI suspend code save the NVS area during suspend and restore it during resume unconditionally, although it is known that some systems need to use acpi_sleep=s4_nonvs for hibernation to work. To allow the affected systems to avoid saving and restoring the NVS area during suspend to RAM and resume, introduce kernel command line option acpi_sleep=nonvs and make acpi_sleep=s4_nonvs work as its alias temporarily (add acpi_sleep=s4_nonvs to the feature removal file). Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16396 . Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reported-and-tested-by: tomas m <tmezzadra@gmail.com> Signed-off-by: Len Brown <len.brown@intel.com>
2010-07-23x86: Do not try to disable hpet if it hasn't been initialized beforeStefano Stabellini
hpet_disable is called unconditionally on machine reboot if hpet support is compiled in the kernel. hpet_disable only checks if the machine is hpet capable but doesn't make sure that hpet has been initialized. [ tglx: Made it a one liner and removed the redundant hpet_address check ] Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Venkatesh Pallipadi <venki@google.com> LKML-Reference: <alpine.DEB.2.00.1007211726240.22235@kaball-desktop> Cc: stable@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-23KVM: Use kmalloc() instead of vmalloc() for KVM_[GS]ET_MSRAvi Kivity
We don't need more than a page, and vmalloc() is slower (much slower recently due to a regression). Signed-off-by: Avi Kivity <avi@redhat.com>
2010-07-23KVM: MMU: fix conflict access permissions in direct spXiao Guangrong
In no-direct mapping, we mark sp is 'direct' when we mapping the guest's larger page, but its access is encoded form upper page-struct entire not include the last mapping, it will cause access conflict. For example, have this mapping: [W] / PDE1 -> |---| P[W] | | LPA \ PDE2 -> |---| [R] P have two children, PDE1 and PDE2, both PDE1 and PDE2 mapping the same lage page(LPA). The P's access is WR, PDE1's access is WR, PDE2's access is RO(just consider read-write permissions here) When guest access PDE1, we will create a direct sp for LPA, the sp's access is from P, is W, then we will mark the ptes is W in this sp. Then, guest access PDE2, we will find LPA's shadow page, is the same as PDE's, and mark the ptes is RO. So, if guest access PDE1, the incorrect #PF is occured. Fixed by encode the last mapping access into direct shadow page Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2010-07-22Merge branch 'bugzilla-15886' into releaseLen Brown
2010-07-22ACPI: skip checking BM_STS if the BIOS doesn't ask for itLen Brown
It turns out that there is a bit in the _CST for Intel FFH C3 that tells the OS if we should be checking BM_STS or not. Linux has been unconditionally checking BM_STS. If the chip-set is configured to enable BM_STS, it can retard or completely prevent entry into deep C-states -- as illustrated by turbostat: http://userweb.kernel.org/~lenb/acpi/utils/pmtools/turbostat/ ref: Intel Processor Vendor-Specific ACPI Interface Specification table 4 "_CST FFH GAS Field Encoding" Bit 1: Set to 1 if OSPM should use Bus Master avoidance for this C-state https://bugzilla.kernel.org/show_bug.cgi?id=15886 Signed-off-by: Len Brown <len.brown@intel.com>
2010-07-21x86: auditsyscall: fix fastpath return value after rescheduleRoland McGrath
In the CONFIG_AUDITSYSCALL fast-path for x86 64-bit system calls, we can pass a bad return value and/or error indication for the system call to audit_syscall_exit(). This happens when TIF_NEED_RESCHED was set as the system call returned, so we went out to schedule() and came back to the exit-audit fast-path. The fix is to reload the user return value register from the pt_regs before using it for audit_syscall_exit(). Both the 32-bit kernel's fast path and the 64-bit kernel's 32-bit system call fast paths work slightly differently, so that they always leave the fast path entirely to reschedule and don't return there, so they don't have the analogous bugs. Reported-by: Alexander Viro <aviro@redhat.com> Signed-off-by: Roland McGrath <roland@redhat.com>
2010-07-21Fix up trivial spelling errors ('taht' -> 'that')Linus Torvalds
Pointed out by Lucas who found the new one in a comment in setup_percpu.c. And then I fixed the others that I grepped for. Reported-by: Lucas <canolucas@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-20x86, numa: fix boot without RAM on node0 againYinghai Lu
Commit e534c7c5f8d6 ("numa: x86_64: use generic percpu var numa_node_id() implementation") broke numa systems that don't have ram on node0 when MEMORY_HOTPLUG is enabled, because cpu_up() will call cpu_to_node() before per_cpu(numa_node) is setup for APs. When Node0 doesn't have RAM, on x86, cpus already round it to nearest node with RAM in x86_cpu_to_node_map. and per_cpu(numa_node) is not set up until in c_init for APs. When later cpu_up() calling cpu_to_node() will get 0 again, and make it online even there is no RAM on node0. so later all APs can not booted up, and later will have panic. [ 1.611101] On node 0 totalpages: 0 ......... [ 2.608558] On node 0 totalpages: 0 [ 2.612065] Brought up 1 CPUs [ 2.615199] Total of 1 processors activated (3990.31 BogoMIPS). ... 93.225341] calling loop_init+0x0/0x1a4 @ 1 [ 93.229314] PERCPU: allocation failed, size=80 align=8, failed to populate [ 93.246539] Pid: 1, comm: swapper Tainted: G W 2.6.35-rc4-tip-yh-04371-gd64e6c4-dirty #354 [ 93.264621] Call Trace: [ 93.266533] [<ffffffff81125e43>] pcpu_alloc+0x83a/0x8e7 [ 93.270710] [<ffffffff81125f15>] __alloc_percpu+0x10/0x12 [ 93.285849] [<ffffffff8140786c>] alloc_disk_node+0x94/0x16d [ 93.291811] [<ffffffff81407956>] alloc_disk+0x11/0x13 [ 93.306157] [<ffffffff81503e51>] loop_alloc+0xa7/0x180 [ 93.310538] [<ffffffff8277ef48>] loop_init+0x9b/0x1a4 [ 93.324909] [<ffffffff8277eead>] ? loop_init+0x0/0x1a4 [ 93.329650] [<ffffffff810001f2>] do_one_initcall+0x57/0x136 [ 93.345197] [<ffffffff827486d0>] kernel_init+0x184/0x20e [ 93.348146] [<ffffffff81034954>] kernel_thread_helper+0x4/0x10 [ 93.365194] [<ffffffff81c7cc3c>] ? restore_args+0x0/0x30 [ 93.369305] [<ffffffff8274854c>] ? kernel_init+0x0/0x20e [ 93.386011] [<ffffffff81034950>] ? kernel_thread_helper+0x0/0x10 [ 93.392047] loop: out of memory ... Try to assign per_cpu(numa_node) early [akpm@linux-foundation.org: tidy up code comment] Signed-off-by: Yinghai <yinghai@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Tejun Heo <tj@kernel.org> Cc: Denys Vlasenko <vda.linux@googlemail.com> Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-20x86, i8259: Only register sysdev if we have a real 8259 PICAdam Lackorzynski
My platform makes use of the null_legacy_pic choice and oopses when doing a shutdown as the shutdown code goes through all the registered sysdevs and calls their shutdown method which in my case poke on a non-existing i8259. Imho the i8259 specific sysdev should only be registered if the i8259 is actually there. Do not register the sysdev function when the null_legacy_pic is used so that the i8259 resume, suspend and shutdown functions are not called. Signed-off-by: Adam Lackorzynski <adam@os.inf.tu-dresden.de> LKML-Reference: <201007202218.o6KMIJ3m020955@imap1.linux-foundation.org> Cc: Jacob Pan <jacob.jun.pan@intel.com> Cc: <stable@kernel.org> 2.6.34 Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2010-07-19mm: add context argument to shrinker callbackDave Chinner
The current shrinker implementation requires the registered callback to have global state to work from. This makes it difficult to shrink caches that are not global (e.g. per-filesystem caches). Pass the shrinker structure to the callback so that users can embed the shrinker structure in the context the shrinker needs to operate on and get back to it in the callback via container_of(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-07-19Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, pci, mrst: Add extra sanity check in walking the PCI extended cap chain x86: Fix x2apic preenabled system with kexec x86: Force HPET readback_cmp for all ATI chipsets
2010-07-18Merge branch 'x86/kprobes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland * 'x86/kprobes' of git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland: x86: kprobes: fix swapped segment registers in kretprobe
2010-07-18x86: kprobes: fix swapped segment registers in kretprobeRoland McGrath
In commit f007ea26, the order of the %es and %ds segment registers got accidentally swapped, so synthesized 'struct pt_regs' frames have the two values inverted. It's almost sure that these values never matter, and that they also never differ. But wrong is wrong. Signed-off-by: Roland McGrath <roland@redhat.com>