linux/arch/x86/kernel/cpu/mcheck, branch v3.7

Merge tag 'please-pull-tangchen' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/urgent

2012-11-13T18:01:01Z

Pull MCE fix from Tony Luck: "Fix problem in CMCI rediscovery code that was illegally migrating worker threads to other cpus." Signed-off-by: Ingo Molnar

x86/mce: Do not change worker's running cpu in cmci_rediscover().

2012-10-30T21:38:12Z

cmci_rediscover() used set_cpus_allowed_ptr() to change the current process's running cpu, and migrate itself to the dest cpu. But worker processes are not allowed to be migrated. If current is a worker, the worker will be migrated to another cpu, but the corresponding worker_pool is still on the original cpu. In this case, the following BUG_ON in try_to_wake_up_local() will be triggered: BUG_ON(rq != this_rq()); This will cause the kernel panic. The call trace is like the following: [ 6155.451107] ------------[ cut here ]------------ [ 6155.452019] kernel BUG at kernel/sched/core.c:1654! ...... [ 6155.452019] RIP: 0010:[] [] try_to_wake_up_local+0x115/0x130 ...... [ 6155.452019] Call Trace: [ 6155.452019] [] __schedule+0x764/0x880 [ 6155.452019] [] schedule+0x29/0x70 [ 6155.452019] [] schedule_timeout+0x235/0x2d0 [ 6155.452019] [] ? mark_held_locks+0x8d/0x140 [ 6155.452019] [] ? __lock_release+0x133/0x1a0 [ 6155.452019] [] ? _raw_spin_unlock_irq+0x30/0x50 [ 6155.452019] [] ? trace_hardirqs_on_caller+0x105/0x190 [ 6155.452019] [] wait_for_common+0x12b/0x180 [ 6155.452019] [] ? try_to_wake_up+0x2f0/0x2f0 [ 6155.452019] [] wait_for_completion+0x1d/0x20 [ 6155.452019] [] stop_one_cpu+0x8a/0xc0 [ 6155.452019] [] ? __migrate_task+0x1a0/0x1a0 [ 6155.452019] [] ? complete+0x28/0x60 [ 6155.452019] [] set_cpus_allowed_ptr+0x128/0x130 [ 6155.452019] [] cmci_rediscover+0xf5/0x140 [ 6155.452019] [] mce_cpu_callback+0x18d/0x19d [ 6155.452019] [] notifier_call_chain+0x67/0x150 [ 6155.452019] [] __raw_notifier_call_chain+0xe/0x10 [ 6155.452019] [] __cpu_notify+0x20/0x40 [ 6155.452019] [] cpu_notify_nofail+0x15/0x30 [ 6155.452019] [] _cpu_down+0x262/0x2e0 [ 6155.452019] [] cpu_down+0x36/0x50 [ 6155.452019] [] acpi_processor_remove+0x50/0x11e [ 6155.452019] [] acpi_device_remove+0x90/0xb2 [ 6155.452019] [] __device_release_driver+0x7c/0xf0 [ 6155.452019] [] device_release_driver+0x2f/0x50 [ 6155.452019] [] acpi_bus_remove+0x32/0x6d [ 6155.452019] [] acpi_bus_trim+0x87/0xee [ 6155.452019] [] acpi_bus_hot_remove_device+0x88/0x16b [ 6155.452019] [] acpi_os_execute_deferred+0x27/0x34 [ 6155.452019] [] process_one_work+0x219/0x680 [ 6155.452019] [] ? process_one_work+0x1b8/0x680 [ 6155.452019] [] ? acpi_os_wait_events_complete+0x23/0x23 [ 6155.452019] [] worker_thread+0x12e/0x320 [ 6155.452019] [] ? manage_workers+0x110/0x110 [ 6155.452019] [] kthread+0xc6/0xd0 [ 6155.452019] [] kernel_thread_helper+0x4/0x10 [ 6155.452019] [] ? retint_restore_args+0x13/0x13 [ 6155.452019] [] ? __init_kthread_worker+0x70/0x70 [ 6155.452019] [] ? gs_change+0x13/0x13 This patch removes the set_cpus_allowed_ptr() call, and put the cmci rediscover jobs onto all the other cpus using system_wq. This could bring some delay for the jobs. Signed-off-by: Tang Chen Signed-off-by: Miao Xie Signed-off-by: Tony Luck

x86, AMD: Change Boris' email address

2012-10-30T09:05:50Z

Move to private email and put in maintained status. Signed-off-by: Borislav Petkov Link: http://lkml.kernel.org/r/1351532410-4887-1-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar

Merge commit '5bc66170dc486556a1e36fd384463536573f4b82' into x86/urgent

2012-10-19T14:55:09Z

From Borislav Petkov : Below is a RAS fix which reverts the addition of a sysfs attribute which we agreed is not needed, post-factum. And this should go in now because that sysfs attribute is going to end up in 3.7 otherwise and thus exposed to userspace; removing it then would be a lot harder. This is done as a merge rather than a simple patch/cherry-pick since the baseline for this patch was not in the previous x86/urgent. Signed-off-by: H. Peter Anvin

x86, MCE: Remove bios_cmci_threshold sysfs attribute

2012-10-19T13:22:29Z

450cc201038f3 ("x86/mce: Provide boot argument to honour bios-set CMCI threshold") added the bios_cmci_threshold sysfs attribute which was supposed to communicate to userspace tools that BIOS CMCI threshold has been honoured. However, this info is not of any importance to userspace - it should rather get the actual error count it has been thresholded already from MCi_STATUS[38:52]. So drop this before it becomes a used interface (good thing we caught this early in 3.7-rc1, right after the merge window closed). Cc: Naveen N. Rao Acked-by: Tony Luck Link: http://lkml.kernel.org/r/20121017105940.GA14590@x1.osrc.amd.com Signed-off-by: Borislav Petkov

x86, amd, mce: Avoid NULL pointer reference on CPU northbridge lookup

2012-10-17T18:25:32Z

When booting on a federated multi-server system (NumaScale), the processor Northbridge lookup returns NULL; add guards to prevent this causing an oops. On those systems, the northbridge is accessed through MMIO and the "normal" northbridge enumeration in amd_nb.c doesn't work since we're generating the northbridge ID from the initial APIC ID and the last is not unique on those systems. Long story short, we end up without northbridge descriptors. Signed-off-by: Daniel J Blueman Cc: stable@vger.kernel.org # 3.6 Link: http://lkml.kernel.org/r/1349073725-14093-1-git-send-email-daniel@numascale-asia.com [ Boris: beef up commit message ] Signed-off-by: Borislav Petkov Signed-off-by: H. Peter Anvin

x86/mce: Provide boot argument to honour bios-set CMCI threshold

2012-09-27T17:08:00Z

The ACPI spec doesn't provide for a way for the bios to pass down recommended thresholds to the OS on a _per-bank_ basis. This patch adds a new boot option, which if passed, tells Linux to use CMCI thresholds set by the bios. As fail-safe, we initialize threshold to 1 if some banks have not been initialized by the bios and warn the user. Signed-off-by: Naveen N. Rao Signed-off-by: Tony Luck

Merge tag 'v3.6-rc6' into x86/mce

2012-09-19T15:01:25Z

Merge Linux v3.6-rc6, to refresh this tree. Signed-off-by: Ingo Molnar

x86/mce: Add CMCI poll mode

2012-08-09T18:44:51Z

On Intel systems corrected machine check interrupts (CMCI) may be sent to multiple logical processors; possibly to all processors on the affected socket (SDM Volume 3B "15.5.1 CMCI Local APIC Interface"). This means that a persistent error (such as a stuck bit in ECC memory) may cause a storm of interrupts that greatly hinders or prevents forward progress (probably on many processors). To solve this we keep track of the rate at which each processor sees CMCI. If we exceed a threshold, we disable CMCI delivery and switch to polling the machine check banks. If the storm subsides (none of the affected processors see any more errors for a complete poll interval) we re-enable CMCI. [Tony: Added console messages when storm begins/ends and increased storm threshold from 5 to 15 so we have a few more logged entries before we disable interrupts and start dropping reports] Signed-off-by: Chen Gong Signed-off-by: Thomas Gleixner Tested-by: Chen Gong Signed-off-by: Tony Luck

x86/mce: Make cmci_discover() quiet

2012-08-09T17:59:21Z

cmci_discover() works out which machine check banks support CMCI, and which of those are shared by multiple logical processors. It uses this information to ensure that exactly one cpu is designated the owner of each bank so that when interrupts are broadcast to multiple cpus, only one of them will look in a shared bank to log the error and clear the bank. At boot time cmci_discover() performs this task silently. But during certain cpu hotplug operations it prints out a set of summary lines like this: CPU 35 MCA banks CMCI:0 CMCI:1 CMCI:3 CMCI:5 CMCI:6 CMCI:7 CMCI:8 CMCI:9 CMCI:10 CMCI:11 CPU 1 MCA banks CMCI:0 CMCI:1 CMCI:3 CPU 39 MCA banks CMCI:0 CMCI:1 CMCI:3 CPU 38 MCA banks CMCI:0 CMCI:1 CMCI:3 CPU 32 MCA banks CMCI:0 CMCI:1 CMCI:3 CPU 37 MCA banks CMCI:0 CMCI:1 CMCI:3 CPU 36 MCA banks CMCI:0 CMCI:1 CMCI:3 CPU 34 MCA banks CMCI:0 CMCI:1 CMCI:3 The value of these messages seems very low. A user might painstakingly cross-check against the data sheet for a processor to ensure that all CMCI supported banks are correctly reported, but this seems improbable. If users really wanted to do this, we should print the information at boot time too. Remove the messages. Signed-off-by: Tony Luck