<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/arch/x86/kernel/cpu/mcheck, branch v3.7</title>
<subtitle>Linux kernel source tree</subtitle>
<id>https://git.amat.us/linux/atom/arch/x86/kernel/cpu/mcheck?h=v3.7</id>
<link rel='self' href='https://git.amat.us/linux/atom/arch/x86/kernel/cpu/mcheck?h=v3.7'/>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/'/>
<updated>2012-11-13T18:01:01Z</updated>
<entry>
<title>Merge tag 'please-pull-tangchen' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/urgent</title>
<updated>2012-11-13T18:01:01Z</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2012-11-13T18:01:01Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=226f69a4b7ed1da248d653aa1535bad61ba5fa49'/>
<id>urn:sha1:226f69a4b7ed1da248d653aa1535bad61ba5fa49</id>
<content type='text'>
Pull MCE fix from Tony Luck:

   "Fix problem in CMCI rediscovery code that was illegally
    migrating worker threads to other cpus."

Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>x86/mce: Do not change worker's running cpu in cmci_rediscover().</title>
<updated>2012-10-30T21:38:12Z</updated>
<author>
<name>Tang Chen</name>
<email>tangchen@cn.fujitsu.com</email>
</author>
<published>2012-10-29T03:01:50Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=85b97637bb40a9f486459dd254598759af9c3d50'/>
<id>urn:sha1:85b97637bb40a9f486459dd254598759af9c3d50</id>
<content type='text'>
cmci_rediscover() used set_cpus_allowed_ptr() to change the current process's
running cpu and migrate itself to the destination cpu. But worker processes
must not be migrated. If current is a worker, it is migrated to another cpu
while the corresponding worker_pool stays on the original cpu.

In this case, the following BUG_ON in try_to_wake_up_local() will be triggered:
BUG_ON(rq != this_rq());

This causes a kernel panic. The call trace looks like this:

[ 6155.451107] ------------[ cut here ]------------
[ 6155.452019] kernel BUG at kernel/sched/core.c:1654!
......
[ 6155.452019] RIP: 0010:[&lt;ffffffff810add15&gt;]  [&lt;ffffffff810add15&gt;] try_to_wake_up_local+0x115/0x130
......
[ 6155.452019] Call Trace:
[ 6155.452019]  [&lt;ffffffff8166fc14&gt;] __schedule+0x764/0x880
[ 6155.452019]  [&lt;ffffffff81670059&gt;] schedule+0x29/0x70
[ 6155.452019]  [&lt;ffffffff8166de65&gt;] schedule_timeout+0x235/0x2d0
[ 6155.452019]  [&lt;ffffffff810db57d&gt;] ? mark_held_locks+0x8d/0x140
[ 6155.452019]  [&lt;ffffffff810dd463&gt;] ? __lock_release+0x133/0x1a0
[ 6155.452019]  [&lt;ffffffff81671c50&gt;] ? _raw_spin_unlock_irq+0x30/0x50
[ 6155.452019]  [&lt;ffffffff810db8f5&gt;] ? trace_hardirqs_on_caller+0x105/0x190
[ 6155.452019]  [&lt;ffffffff8166fefb&gt;] wait_for_common+0x12b/0x180
[ 6155.452019]  [&lt;ffffffff810b0b30&gt;] ? try_to_wake_up+0x2f0/0x2f0
[ 6155.452019]  [&lt;ffffffff8167002d&gt;] wait_for_completion+0x1d/0x20
[ 6155.452019]  [&lt;ffffffff8110008a&gt;] stop_one_cpu+0x8a/0xc0
[ 6155.452019]  [&lt;ffffffff810abd40&gt;] ? __migrate_task+0x1a0/0x1a0
[ 6155.452019]  [&lt;ffffffff810a6ab8&gt;] ? complete+0x28/0x60
[ 6155.452019]  [&lt;ffffffff810b0fd8&gt;] set_cpus_allowed_ptr+0x128/0x130
[ 6155.452019]  [&lt;ffffffff81036785&gt;] cmci_rediscover+0xf5/0x140
[ 6155.452019]  [&lt;ffffffff816643c0&gt;] mce_cpu_callback+0x18d/0x19d
[ 6155.452019]  [&lt;ffffffff81676187&gt;] notifier_call_chain+0x67/0x150
[ 6155.452019]  [&lt;ffffffff810a03de&gt;] __raw_notifier_call_chain+0xe/0x10
[ 6155.452019]  [&lt;ffffffff81070470&gt;] __cpu_notify+0x20/0x40
[ 6155.452019]  [&lt;ffffffff810704a5&gt;] cpu_notify_nofail+0x15/0x30
[ 6155.452019]  [&lt;ffffffff81655182&gt;] _cpu_down+0x262/0x2e0
[ 6155.452019]  [&lt;ffffffff81655236&gt;] cpu_down+0x36/0x50
[ 6155.452019]  [&lt;ffffffff813d3eaa&gt;] acpi_processor_remove+0x50/0x11e
[ 6155.452019]  [&lt;ffffffff813a6978&gt;] acpi_device_remove+0x90/0xb2
[ 6155.452019]  [&lt;ffffffff8143cbec&gt;] __device_release_driver+0x7c/0xf0
[ 6155.452019]  [&lt;ffffffff8143cd6f&gt;] device_release_driver+0x2f/0x50
[ 6155.452019]  [&lt;ffffffff813a7870&gt;] acpi_bus_remove+0x32/0x6d
[ 6155.452019]  [&lt;ffffffff813a7932&gt;] acpi_bus_trim+0x87/0xee
[ 6155.452019]  [&lt;ffffffff813a7a21&gt;] acpi_bus_hot_remove_device+0x88/0x16b
[ 6155.452019]  [&lt;ffffffff813a33ee&gt;] acpi_os_execute_deferred+0x27/0x34
[ 6155.452019]  [&lt;ffffffff81090589&gt;] process_one_work+0x219/0x680
[ 6155.452019]  [&lt;ffffffff81090528&gt;] ? process_one_work+0x1b8/0x680
[ 6155.452019]  [&lt;ffffffff813a33c7&gt;] ? acpi_os_wait_events_complete+0x23/0x23
[ 6155.452019]  [&lt;ffffffff810923be&gt;] worker_thread+0x12e/0x320
[ 6155.452019]  [&lt;ffffffff81092290&gt;] ? manage_workers+0x110/0x110
[ 6155.452019]  [&lt;ffffffff81098396&gt;] kthread+0xc6/0xd0
[ 6155.452019]  [&lt;ffffffff8167c4c4&gt;] kernel_thread_helper+0x4/0x10
[ 6155.452019]  [&lt;ffffffff81671f30&gt;] ? retint_restore_args+0x13/0x13
[ 6155.452019]  [&lt;ffffffff810982d0&gt;] ? __init_kthread_worker+0x70/0x70
[ 6155.452019]  [&lt;ffffffff8167c4c0&gt;] ? gs_change+0x13/0x13

This patch removes the set_cpus_allowed_ptr() call and puts the cmci rediscover
jobs onto all the other cpus using system_wq. This may delay the jobs
somewhat.

Signed-off-by: Tang Chen &lt;tangchen@cn.fujitsu.com&gt;
Signed-off-by: Miao Xie &lt;miaox@cn.fujitsu.com&gt;
Signed-off-by: Tony Luck &lt;tony.luck@intel.com&gt;
</content>
</entry>
<entry>
<title>x86, AMD: Change Boris' email address</title>
<updated>2012-10-30T09:05:50Z</updated>
<author>
<name>Borislav Petkov</name>
<email>bp@alien8.de</email>
</author>
<published>2012-10-29T17:40:08Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=e6d41e8c697e07832efa4a85bf23438bc4c4e1b2'/>
<id>urn:sha1:e6d41e8c697e07832efa4a85bf23438bc4c4e1b2</id>
<content type='text'>
Move to private email and put in maintained status.

Signed-off-by: Borislav Petkov &lt;bp@alien8.de&gt;
Link: http://lkml.kernel.org/r/1351532410-4887-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge commit '5bc66170dc486556a1e36fd384463536573f4b82' into x86/urgent</title>
<updated>2012-10-19T14:55:09Z</updated>
<author>
<name>H. Peter Anvin</name>
<email>hpa@linux.intel.com</email>
</author>
<published>2012-10-19T14:54:24Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=4533d86270d7986e00594495dde9a109d6be27ae'/>
<id>urn:sha1:4533d86270d7986e00594495dde9a109d6be27ae</id>
<content type='text'>
From Borislav Petkov &lt;bp@amd64.org&gt;:

Below is a RAS fix which reverts the addition of a sysfs attribute
which we agreed is not needed, post-factum. And this should go in now
because that sysfs attribute is going to end up in 3.7 otherwise and
thus exposed to userspace; removing it then would be a lot harder.

This is done as a merge rather than a simple patch/cherry-pick since
the baseline for this patch was not in the previous x86/urgent.

Signed-off-by: H. Peter Anvin &lt;hpa@linux.intel.com&gt;
</content>
</entry>
<entry>
<title>x86, MCE: Remove bios_cmci_threshold sysfs attribute</title>
<updated>2012-10-19T13:22:29Z</updated>
<author>
<name>Borislav Petkov</name>
<email>borislav.petkov@amd.com</email>
</author>
<published>2012-10-18T13:10:56Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=5bc66170dc486556a1e36fd384463536573f4b82'/>
<id>urn:sha1:5bc66170dc486556a1e36fd384463536573f4b82</id>
<content type='text'>
450cc201038f3 ("x86/mce: Provide boot argument to honour bios-set CMCI
threshold") added the bios_cmci_threshold sysfs attribute which was
supposed to communicate to userspace tools that BIOS CMCI threshold has
been honoured.

However, this info is not of any importance to userspace - it should
rather get the actual, already thresholded error count from
MCi_STATUS[38:52].

So drop this before it becomes a used interface (good thing we caught
this early in 3.7-rc1, right after the merge window closed).

Cc: Naveen N. Rao &lt;naveen.n.rao@linux.vnet.ibm.com&gt;
Acked-by: Tony Luck &lt;tony.luck@intel.com&gt;
Link: http://lkml.kernel.org/r/20121017105940.GA14590@x1.osrc.amd.com
Signed-off-by: Borislav Petkov &lt;borislav.petkov@amd.com&gt;
</content>
</entry>
<entry>
<title>x86, amd, mce: Avoid NULL pointer reference on CPU northbridge lookup</title>
<updated>2012-10-17T18:25:32Z</updated>
<author>
<name>Daniel J Blueman</name>
<email>daniel@numascale-asia.com</email>
</author>
<published>2012-10-01T06:42:05Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=21c5e50e15b1abd797e62f18fd7f90b9cc004cbd'/>
<id>urn:sha1:21c5e50e15b1abd797e62f18fd7f90b9cc004cbd</id>
<content type='text'>
When booting on a federated multi-server system (NumaScale), the
processor Northbridge lookup returns NULL; add guards to prevent this
causing an oops.

On those systems, the northbridge is accessed through MMIO and the
"normal" northbridge enumeration in amd_nb.c doesn't work since we're
generating the northbridge ID from the initial APIC ID and the latter
is not unique on those systems. Long story short, we end up without
northbridge descriptors.

Signed-off-by: Daniel J Blueman &lt;daniel@numascale-asia.com&gt;
Cc: stable@vger.kernel.org # 3.6
Link: http://lkml.kernel.org/r/1349073725-14093-1-git-send-email-daniel@numascale-asia.com
[ Boris: beef up commit message ]
Signed-off-by: Borislav Petkov &lt;borislav.petkov@amd.com&gt;
Signed-off-by: H. Peter Anvin &lt;hpa@linux.intel.com&gt;
</content>
</entry>
<entry>
<title>x86/mce: Provide boot argument to honour bios-set CMCI threshold</title>
<updated>2012-09-27T17:08:00Z</updated>
<author>
<name>Naveen N. Rao</name>
<email>naveen.n.rao@linux.vnet.ibm.com</email>
</author>
<published>2012-09-27T17:08:00Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=450cc201038f31bd496e1b3a44a49790b8827a06'/>
<id>urn:sha1:450cc201038f31bd496e1b3a44a49790b8827a06</id>
<content type='text'>
The ACPI spec doesn't provide a way for the bios to pass down
recommended thresholds to the OS on a _per-bank_ basis. This patch adds
a new boot option which, if passed, tells Linux to use the CMCI
thresholds set by the bios.

As a fail-safe, if some banks have not been initialized by the bios, we
initialize their threshold to 1 and warn the user.
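
A minimal sketch of that fail-safe, not the code in this patch. The
helper and struct names are hypothetical, and using 1 as the threshold
when the boot option is not passed is an assumption:

```c
/* Hypothetical names; illustrates the fallback only. */
#define FALLBACK_THRESHOLD 1

struct threshold_choice {
    int threshold;  /* threshold to program into the bank */
    int warn;       /* nonzero if the user should be warned */
};

/*
 * honour_bios: nonzero if the boot option was passed.
 * bios_value:  threshold the bios left in the bank, 0 if uninitialized.
 */
struct threshold_choice choose_threshold(int honour_bios, int bios_value)
{
    struct threshold_choice c = { FALLBACK_THRESHOLD, 0 };

    if (honour_bios == 0)
        return c;           /* option not passed: assumed default */
    if (bios_value == 0) {
        c.warn = 1;         /* bios left this bank uninitialized */
        return c;
    }
    c.threshold = bios_value;
    return c;
}
```

With the option passed, a bank left at 0 by the bios gets threshold 1
and sets warn, while a bank set to 7 keeps 7.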

Signed-off-by: Naveen N. Rao &lt;naveen.n.rao@linux.vnet.ibm.com&gt;
Signed-off-by: Tony Luck &lt;tony.luck@intel.com&gt;
</content>
</entry>
<entry>
<title>Merge tag 'v3.6-rc6' into x86/mce</title>
<updated>2012-09-19T15:01:25Z</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2012-09-19T15:01:25Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=f1f652447694f92beff8a534d94b36ea441c939a'/>
<id>urn:sha1:f1f652447694f92beff8a534d94b36ea441c939a</id>
<content type='text'>
Merge Linux v3.6-rc6, to refresh this tree.

Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>x86/mce: Add CMCI poll mode</title>
<updated>2012-08-09T18:44:51Z</updated>
<author>
<name>Chen Gong</name>
<email>gong.chen@linux.intel.com</email>
</author>
<published>2012-08-09T18:44:51Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=55babd8f41f122f5f4c7cebf520c766c983282c6'/>
<id>urn:sha1:55babd8f41f122f5f4c7cebf520c766c983282c6</id>
<content type='text'>
On Intel systems corrected machine check interrupts (CMCI) may be sent to
multiple logical processors; possibly to all processors on the affected
socket (SDM Volume 3B "15.5.1 CMCI Local APIC Interface").  This means
that a persistent error (such as a stuck bit in ECC memory) may cause
a storm of interrupts that greatly hinders or prevents forward progress
(probably on many processors).

To solve this we keep track of the rate at which each processor sees
CMCI. If we exceed a threshold, we disable CMCI delivery and switch to
polling the machine check banks. If the storm subsides (none of the
affected processors see any more errors for a complete poll interval) we
re-enable CMCI.
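
As a rough illustration of the scheme above (not the code in this
patch), the state machine for one cpu could be sketched as follows. All
names are hypothetical, STORM_WINDOW is an assumed value, and the check
that all affected processors are quiet is reduced to a single
errors_found count:

```c
/* Hypothetical names throughout; a sketch, not the kernel's code. */
enum cmci_mode { CMCI_INTERRUPT, CMCI_POLL };

struct cmci_storm_state {
    enum cmci_mode mode;
    long window_start;  /* tick at which the counting window began */
    int count;          /* CMCIs seen in the current window */
};

#define STORM_THRESHOLD 15  /* threshold mentioned in the note below */
#define STORM_WINDOW 100    /* assumed window length, in ticks */

/* Called on each CMCI; returns the mode after this interrupt. */
enum cmci_mode cmci_seen(struct cmci_storm_state *s, long now)
{
    if (s->mode == CMCI_POLL)
        return s->mode;                 /* interrupts already disabled */
    if (now - s->window_start >= STORM_WINDOW) {
        s->window_start = now;          /* start a fresh window */
        s->count = 0;
    }
    s->count++;
    if (s->count > STORM_THRESHOLD)
        s->mode = CMCI_POLL;            /* storm: switch to polling */
    return s->mode;
}

/* Called after each poll interval with the number of errors it found. */
enum cmci_mode cmci_poll_done(struct cmci_storm_state *s, int errors_found,
                              long now)
{
    if (s->mode == CMCI_POLL) {
        if (errors_found == 0) {
            s->mode = CMCI_INTERRUPT;   /* storm subsided: re-enable */
            s->window_start = now;
            s->count = 0;
        }
    }
    return s->mode;
}
```

In this sketch the sixteenth CMCI inside one window switches the cpu to
polling, and the first poll interval that finds no errors switches it
back.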

[Tony: Added console messages when storm begins/ends and increased storm
threshold from 5 to 15 so we have a few more logged entries before we
disable interrupts and start dropping reports]

Signed-off-by: Chen Gong &lt;gong.chen@linux.intel.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Tested-by: Chen Gong &lt;gong.chen@linux.intel.com&gt;
Signed-off-by: Tony Luck &lt;tony.luck@intel.com&gt;
</content>
</entry>
<entry>
<title>x86/mce: Make cmci_discover() quiet</title>
<updated>2012-08-09T17:59:21Z</updated>
<author>
<name>Tony Luck</name>
<email>tony.luck@intel.com</email>
</author>
<published>2012-08-09T17:59:21Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=4670a300a2169e1e922593c5d35b0cdaee112901'/>
<id>urn:sha1:4670a300a2169e1e922593c5d35b0cdaee112901</id>
<content type='text'>
cmci_discover() works out which machine check banks support CMCI, and
which of those are shared by multiple logical processors. It uses this
information to ensure that exactly one cpu is designated the owner of
each bank so that when interrupts are broadcast to multiple cpus, only one
of them will look in a shared bank to log the error and clear the bank.
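
The ownership rule can be pictured with a toy model, which is not how
cmci_discover() itself is written. NUM_BANKS, NO_OWNER and claim_banks()
are hypothetical, and every bank in the array is assumed to be shared by
all cpus that call the function:

```c
/* Toy model with hypothetical names; every bank here is shared. */
#define NUM_BANKS 4
#define NO_OWNER (-1)

/*
 * Claims each bank that has no owner yet on behalf of cpu.
 * Returns the number of banks cpu owns afterwards.
 */
int claim_banks(int owner[NUM_BANKS], int cpu)
{
    int bank;
    int owned = 0;

    for (bank = 0; bank != NUM_BANKS; bank++) {
        if (owner[bank] == NO_OWNER)
            owner[bank] = cpu;      /* first cpu to ask becomes owner */
        if (owner[bank] == cpu)
            owned++;
    }
    return owned;
}
```

If cpu 0 runs this first it owns all four banks, and a later call for
cpu 1 claims none, so each shared bank ends up with exactly one owner.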

At boot time cmci_discover() performs this task silently. But during
certain cpu hotplug operations it prints out a set of summary lines
like this:

CPU 35 MCA banks CMCI:0 CMCI:1 CMCI:3 CMCI:5 CMCI:6 CMCI:7 CMCI:8 CMCI:9 CMCI:10 CMCI:11
CPU 1 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 39 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 38 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 32 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 37 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 36 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 34 MCA banks CMCI:0 CMCI:1 CMCI:3

The value of these messages seems very low. A user might painstakingly
cross-check against the data sheet for a processor to ensure that all
CMCI supported banks are correctly reported, but this seems improbable.
If users really wanted to do this, we should print the information at
boot time too.

Remove the messages.

Signed-off-by: Tony Luck &lt;tony.luck@intel.com&gt;
</content>
</entry>
</feed>
