path: root/drivers/edac
Age  Commit message  Author
2010-05-10  i7core_edac: Probe on Xeons eariler  Keith Mannthey
On the Xeon 55XX series CPUs the PCI devices are not exposed via ACPI, so we must explicitly probe them to make them usable as Linux PCI devices. This moves the detection of this state to before pci_register_driver is called. Its previous position was not working on my systems: the driver would complain about not finding a specific device. This patch allows the driver to load on my systems.
Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core: Use registered memories per processor  Mauro Carvalho Chehab
Instead of assuming that the entire machine has either registered or unregistered memories, decide this per CPU socket. While here, fix a bug in i7core_mce_output_error(), where we were using m->cpu directly as if it represented a socket. Instead, the proper socket_id is given by cpu_data[m->cpu].phys_proc_id.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Use Device 3 function 2 to report errors with RDIMM's  Mauro Carvalho Chehab
Nehalem and later chipsets provide a special device that holds counters of corrected memory errors detected on registered DIMMs. This device is only seen if registered memories are plugged in. After this patch, on a machine fully equipped with RDIMMs, the driver will use Device 3 function 2 to count corrected errors instead of relying on mcelog. For unregistered DIMMs, it keeps the old behavior, counting errors via mcelog. This patch was developed together with Keith Mannthey <kmannth@us.ibm.com>.
Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Fix ecc enable shift  Keith Mannthey
From: Keith Mannthey <kmannth@us.ibm.com>
Simple correction to a shift value: ECC_ENABLED is bit 4 of MC_STATUS (Dev 3, Fun 0, offset 0x4c). This correctly identifies the ECC state of the machine.
Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
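The bit test described in this message can be sketched in plain C. This is only an illustration built from the commit text (ECC_ENABLED is bit 4 of MC_STATUS); the macro and function names are hypothetical and are not taken from the driver source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit position of ECC_ENABLED inside MC_STATUS (Dev 3, Fun 0, offset 0x4c),
 * as stated in the commit message. The macro name is hypothetical. */
#define MC_STATUS_ECC_ENABLED_SHIFT 4

/* Hypothetical helper: returns true when the ECC_ENABLED bit is set in a
 * raw 32-bit MC_STATUS value that the caller has already read. */
static bool mc_status_ecc_enabled(uint32_t mc_status)
{
    return ((mc_status >> MC_STATUS_ECC_ENABLED_SHIFT) & 1u) != 0;
}
```

With a wrong shift value, a register such as 0x10 (only bit 4 set) would be reported as "ECC disabled", which is the class of bug the commit describes.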
2010-05-10  i7core_edac: Print an error message if pci register fails  Mauro Carvalho Chehab
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: CodingSyle fixes/cleanups  Mauro Carvalho Chehab
No functional changes.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: fix error injection  Mauro Carvalho Chehab
There were two stupid error injection bugs introduced by wrong cut-and-paste: one at the socket store, and another at the error inject register. The latter was causing the code to not work at all. While here, add debug messages that show which registers are being set during error injection.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: fix error codes for sysfs error injection interface  Mauro Carvalho Chehab
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: some fixes at error injection code  Mauro Carvalho Chehab
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Some cleanups at displayed info  Mauro Carvalho Chehab
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core: remove some uneeded noisy debug messages  Mauro Carvalho Chehab
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core: add socket info at the debug msg  Mauro Carvalho Chehab
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core: better document i7core_get_active_channels()  Mauro Carvalho Chehab
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core: fix get_devices routine for Xeon55xx  Mauro Carvalho Chehab
i7core_get_devices() was prepared to get just the first device found of each type. Due to that, on Xeon 55xx, only socket 1 was retrieved. Rework i7core_get_devices() to clean it up and to properly support Xeon 55xx. While here, fix a small typo.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core: enrich error information based on memory transaction type  Mauro Carvalho Chehab
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core: check if the memory error is fatal or non-fatal  Mauro Carvalho Chehab
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core: fix probing on Xeon55xx  Mauro Carvalho Chehab
Xeon55xx fails to probe with this error message:
    EDAC DEBUG: in drivers/edac/i7core_edac.c, line at 1660: MC: drivers/edac/i7core_edac.c: i7core_init()
    EDAC i7core: Device not found: dev 00:00.0 PCI ID 8086:2c41
    i7core_edac: probe of 0000:00:14.0 failed with error -22
This is due to the fact that, on Xeon35xx (and i7core), device 00.0 has PCI ID 8086:2c40.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: some fixes at memory error parser  Mauro Carvalho Chehab
m->bank is not related to the memory bank but, instead, to the MCA Error register bank. Fix it accordingly. While here, improve the comments for the Nehalem bank. A later fix is needed in order to get bank/rank information from the MCA error log.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: decode mcelog error and send it via edac interface  Mauro Carvalho Chehab
Enriches mcelog error by using the encoded information at MCE status and misc registers (IA32_MCx_STATUS, IA32_MCx_MISC). Some fixes are still needed here, in order to properly fill the EDAC fields.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: maps all sockets as if ther are one MC controller  Mauro Carvalho Chehab
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: add support for more than one MC socket  Mauro Carvalho Chehab
Some Nehalem architectures have more than one MC socket. Socket 0 is located at bus 255. Currently, the driver handles up to 2 sockets, but raising that limit is just a matter of increasing the MAX_SOCKETS definition. This seems to be required to properly support Xeon 55xx. Still needs testing with Xeon 55xx.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Add a code to probe Xeon 55xx bus  Mauro Carvalho Chehab
This changes the detection procedure of i7core_edac. Instead of directly probing for MC registers, it probes for another register found on Nehalem. If found, it tries to pick the first MC PCI bus. This should work fine with Xeon 35xx but, on Xeon 55xx, the devices sit at buses 254 and 255, which are not properly detected by the non-legacy PCI methods. The new detection code scans specifically buses 254 and 255 for the Xeon 55xx devices. This code has not been tested yet. Once it works, a further change will be needed, since i7core is not yet ready to work with 2 sets of MC.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Adds write unlock to MC registers  Mauro Carvalho Chehab
The public Intel Xeon 5500 volume 2 datasheet describes, on page 53, section 2.6.7, a register called MC_CFG_CONTROL that can lock/unlock the Memory Controller configuration registers. Add support for it in the hope that software error injection would work. In my tests with Xeon 35xx, there's still something missing. With a program that does sequential bit writes at dev 0.0, error injection sometimes works after unlocking MC_CFG_CONTROL (and sometimes it just locks up my testing machine). I'll try later to discover by trial and error which register solves this issue on Xeon 35xx.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Add edac_mce glue  Mauro Carvalho Chehab
Adds glue code to allow i7core to work with mcelog. With the glue, i7core registers itself on edac_mce. When mce detects an error, it calls all registered drivers (in this case, i7core) for EDAC error handling.
TODO: It currently just prints the MCE error log using about the same format as mce panic messages. The error message should be enhanced with mcelog userspace info and converted into the proper EDAC format, to feed the EDAC error counts.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  edac/Kconfig: edac_mce can't be module  Mauro Carvalho Chehab
Since mcelog is bool, edac_mce glue should also be bool, or otherwise will not work.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  edac_mce: Add an interface driver to report mce errors via edac  Mauro Carvalho Chehab
edac_mce module is an interface module that gets mcelog data and forwards to any registered edac module that expects to receive data via mce.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
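The register-then-forward pattern this message describes can be sketched as a small callback table in user-space C. Everything here is an assumption made for illustration: `struct mce_sketch`, `register_handler()`, `dispatch_mce()` and `count_errors()` are hypothetical names, not the edac_mce API, and locking and unregistration are left out.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for the kernel's struct mce. */
struct mce_sketch {
    int cpu;
    int bank;
    unsigned long long status;
};

/* A handler returns nonzero when it consumed the error record. */
typedef int (*mce_handler_fn)(void *priv, const struct mce_sketch *m);

struct handler_slot {
    mce_handler_fn fn;
    void *priv;
};

#define MAX_HANDLERS 4
static struct handler_slot handlers[MAX_HANDLERS];
static size_t n_handlers;

/* Register a handler; returns 0 on success, -1 if fn is NULL or the
 * table is full. */
static int register_handler(mce_handler_fn fn, void *priv)
{
    if (fn == NULL || n_handlers == MAX_HANDLERS)
        return -1;
    handlers[n_handlers].fn = fn;
    handlers[n_handlers].priv = priv;
    n_handlers++;
    return 0;
}

/* Forward one record to every registered handler and return how many
 * handlers reported that they consumed it. */
static int dispatch_mce(const struct mce_sketch *m)
{
    int consumed = 0;

    for (size_t i = 0; i < n_handlers; i++)
        if (handlers[i].fn(handlers[i].priv, m))
            consumed++;
    return consumed;
}

/* Example handler: counts the records it sees in an int passed via priv. */
static int count_errors(void *priv, const struct mce_sketch *m)
{
    (void)m;
    (*(int *)priv)++;
    return 1;
}
```

A driver in the role of i7core would call `register_handler()` once at load time; the machine-check side would call `dispatch_mce()` for each record.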
2010-05-10  i7core_edac: CodingStyle fixes  Mauro Carvalho Chehab
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: fill csrows edac sysfs info  Mauro Carvalho Chehab
csrows is still fake, since we can't identify its representation with Nehalem registers.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Memory info fixes and preparation for properly filling cswrow data  Mauro Carvalho Chehab
Now, memory size is properly displayed:
    EDAC i7core: DOD Max limits: DIMMS: 2, 1-ranked, 8-banked
    EDAC i7core: DOD Max rows x colums = 0x4000 x 0x400
    EDAC i7core: Memory channel configuration:
    EDAC i7core: Ch0 phy rd0, wr0 (0x063f7c31): 2 ranks, UDIMMs
    EDAC i7core: dimm 0 (0x00000288) 1024 Mb offset: 0, numbank: 8, numrank: 1, numrow: 0x4000, numcol: 0x400
    EDAC i7core: dimm 1 (0x00001288) 1024 Mb offset: 4, numbank: 8, numrank: 1, numrow: 0x4000, numcol: 0x400
    EDAC i7core: Ch1 phy rd1, wr1 (0x063f7c31): 2 ranks, UDIMMs
    EDAC i7core: dimm 0 (0x00000288) 1024 Mb offset: 0, numbank: 8, numrank: 1, numrow: 0x4000, numcol: 0x400
    EDAC i7core: Ch2 phy rd3, wr3 (0x063f7c31): 2 ranks, UDIMMs
    EDAC i7core: dimm 0 (0x00000288) 1024 Mb offset: 0, numbank: 8, numrank: 1, numrow: 0x4000, numcol: 0x400
Still, as the way to retrieve csrows info is not known, it does a mapping of what's available to csrows basic unit at edac core.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Get more info about the memory DIMMs  Mauro Carvalho Chehab
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Add more information about each active dimm  Mauro Carvalho Chehab
Thanks-to: Aristeu Rozanski <aris@redhat.com> for part of the code
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Improve error handling  Mauro Carvalho Chehab
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Properly fill struct csrow_info  Mauro Carvalho Chehab
Thanks-to: Aristeu Rozanski <aris@redhat.com> for part of the code
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Add additional tests for error detection  Mauro Carvalho Chehab
Properly check the number of channels and improve probing error detection.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Add a memory check routine, based on device 3 function 4  Mauro Carvalho Chehab
This function appears only on the Xeon 5500 datasheet. Yet, testing with a Xeon 3503 showed that it is also implemented on other Nehalem processors. At the first read, MC_TEST_ERR_RCV1 and MC_TEST_ERR_RCV0 can contain any value. Modify the CE error logic to update the error count only after the second read. An alternative approach would be to write to the rcv0 and rcv1 registers, but it seemed better to keep them untouched, since the BIOS might assume that they are exclusively for its own use.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
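The "count only after the second read" rule can be sketched as a baseline-then-delta tracker. This is a user-space illustration, not the driver's code: `struct ce_tracker` and `ce_tracker_update()` are hypothetical names, and treating a decreasing counter as a silent re-baseline is an assumption, since the message does not say how wraparound or resets are handled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tracker for one corrected-error counter field. */
struct ce_tracker {
    bool primed;     /* true once a baseline value has been read */
    uint32_t last;   /* raw counter value seen on the previous read */
    uint64_t total;  /* corrected errors accumulated since priming */
};

/* Feed one raw counter read. The first read only records a baseline,
 * because the register may hold any value at that point. Later reads add
 * the increase since the previous read. A decrease is treated as a reset
 * and only re-baselines (an assumption made for this sketch). */
static void ce_tracker_update(struct ce_tracker *t, uint32_t raw)
{
    if (t->primed && raw >= t->last)
        t->total += raw - t->last;
    t->last = raw;
    t->primed = true;
}
```

This keeps the hardware registers read-only, matching the choice in the message not to write to rcv0 and rcv1.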
2010-05-10  i7core_edac: need mci->edac_check, otherwise module removal doesn't work  Mauro Carvalho Chehab
There are some locking troubles with edac_core: if you don't declare an edac_check, module may suffer from soft lock.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: A few fixes at error injection code  Mauro Carvalho Chehab
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Show read/write virtual/physical channel association  Mauro Carvalho Chehab
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Registers all supported MC functions  Mauro Carvalho Chehab
Now, it will try to register on all supported Memory Controller functions. Note that dev 3, function 2 is present only on chips with Registered DIMMs, according to the datasheet. So, the driver doesn't return -ENODEV if all functions but this one were successfully registered and enabled:
    EDAC i7core: Registered device 8086:2c18 fn=3 0
    EDAC i7core: Registered device 8086:2c19 fn=3 1
    EDAC i7core: Device not found: PCI ID 8086:2c1a (dev 3, func 2)
    EDAC i7core: Registered device 8086:2c1c fn=3 4
    EDAC i7core: Registered device 8086:2c20 fn=4 0
    EDAC i7core: Registered device 8086:2c21 fn=4 1
    EDAC i7core: Registered device 8086:2c22 fn=4 2
    EDAC i7core: Registered device 8086:2c23 fn=4 3
    EDAC i7core: Registered device 8086:2c28 fn=5 0
    EDAC i7core: Registered device 8086:2c29 fn=5 1
    EDAC i7core: Registered device 8086:2c2a fn=5 2
    EDAC i7core: Registered device 8086:2c2b fn=5 3
    EDAC i7core: Registered device 8086:2c30 fn=6 0
    EDAC i7core: Registered device 8086:2c31 fn=6 1
    EDAC i7core: Registered device 8086:2c32 fn=6 2
    EDAC i7core: Registered device 8086:2c33 fn=6 3
    EDAC i7core: Driver loaded.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Add more status functions to EDAC driver  Mauro Carvalho Chehab
This patch was co-authored with Aristeu Rozanski.
Signed-off-by: Aristeu Sergio <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Add error insertion code for Nehalem  Mauro Carvalho Chehab
Implements set_inject_error() with the low-level code needed to inject memory errors on Nehalem, and adds some sysfs nodes to allow error injection. The next patch will add an API for error injection.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10  i7core_edac: Add an EDAC memory controller driver for Nehalem chipsets  Mauro Carvalho Chehab
This driver is meant to support i7 core/i7core extreme desktop processors and the Xeon 35xx/55xx series with integrated memory controller. It is likely that it can be expanded in the future to work with other processor series based on the same Memory Controller design. For now, it has just a few MCH status reads.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-04-30  edac, mce: Fix wrong mask and macro usage  Borislav Petkov
Correct two mishaps which prevented reporting error type (CECC vs UECC) and extended error description.
Cc: <stable@kernel.org> # 32.x, 33.x
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-30  include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h  Tejun Heo
percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h, making everything defined by the two files universally available and complicating inclusion dependencies.
The percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities to include those headers directly instead of assuming availability. As this conversion needs to touch a large number of source files, the following script is used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that only the necessary includes are there, i.e. if only gfp is used, gfp.h; if slab is used, slab.h.
* When the script inserts a new include, it looks at the include blocks and tries to put the new include such that its order conforms to its surroundings. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly because the file doesn't have a fitting include block), it prints out an error message indicating which .h file needs to be added to the file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files.
2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files.
3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed, e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically editing them, as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually widely available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq).
   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as a bisection point.
Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers, which should be easily discoverable on most builds of the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-22  edac, mce: Filter out invalid values  Borislav Petkov
Print the CPU associated with the error only when the field is valid.
Cc: <stable@kernel.org> # .32.x .33.x
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2010-03-12  edac: e752x: add dram scrubbing support  Peter Tyser
Add support to scrub DRAM using the e752x integrated memory scrubbing engine. The e7320/7520/e7525 chipsets support scrubbing at one rate while the i3100 chipset supports a normal and a fast rate.
A similar patch was originally sent back in 2008:
http://sourceforge.net/mailarchive/forum.php?thread_name=1204835866.25206.70.camel@localhost.localdomain&forum_name=bluesmoke-devel
This version has the following updates:
- Use 16-bit PCI config cycles to access the MCHSCRB register. The e7320/7520/e7525 docs say the register is 16 bits wide, the i3100 docs say 8. I tested 16 bits on the i3100 to be safe.
- Recalculate and round actual scrub rates
The changes have been tested on an i3100-based board.
Signed-off-by: Peter Tyser <ptyser@xes-inc.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12  edac: e752x fsb ecc  Konstantin Olifer
FSB parity is only supported on the Xeon processor. Previously it was incorrectly enabled for the Celeron as well.
Signed-off-by: Konstantin Olifer <kolifer@gmail.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Cc: Peter Tyser <ptyser@xes-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12  edac: mpc85xx use resource_size instead of raw math  H Hartley Sweeten
Use resource_size() instead of arithmetic.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Acked-by: Dave Jiang <djiang@mvista.com>
Cc: Peter Tyser <ptyser@xes-inc.com>
Cc: Kumar Gala <galak@gate.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
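For readers unfamiliar with the helper, the arithmetic it replaces is the inclusive-range length `end - start + 1`. The sketch below is a user-space stand-in with hypothetical names (`struct resource_sketch`, `resource_size_sketch()`); it mirrors what the kernel's resource_size() computes but is not the kernel definition.

```c
#include <stdint.h>

/* Hypothetical stand-in for the start/end fields of struct resource.
 * Both bounds are inclusive addresses. */
struct resource_sketch {
    uint64_t start;
    uint64_t end;
};

/* Length in bytes of an inclusive [start, end] range. The "+ 1" is the
 * part that is easy to forget when the subtraction is written by hand. */
static uint64_t resource_size_sketch(const struct resource_sketch *res)
{
    return res->end - res->start + 1;
}
```

Centralizing the calculation in one helper avoids off-by-one mistakes when mapping a register window.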
2010-03-12  edac: mpc85xx improve SDRAM error reporting  Peter Tyser
Add the ability to detect the specific data line or ECC line which failed when printing out SDRAM single-bit errors. An example of a single-bit SDRAM ECC error is below:
    EDAC MPC85xx MC1: Err Detect Register: 0x80000004
    EDAC MPC85xx MC1: Faulty data bit: 59
    EDAC MPC85xx MC1: Expected Data / ECC: 0x7f80d000_409effa0 / 0x6d
    EDAC MPC85xx MC1: Captured Data / ECC: 0x7780d000_409effa0 / 0x6d
    EDAC MPC85xx MC1: Err addr: 0x00031ca0
    EDAC MPC85xx MC1: PFN: 0x00000031
Knowing which specific data or ECC line caused an error can be useful in tracking down hardware issues such as improperly terminated signals, loose pins, etc.
Note that this feature is currently only enabled for 64-bit wide data buses; 32-bit wide bus support should be added. I don't have any 32-bit wide systems to test on. If someone has one and is willing to give this patch a shot with the check for a 64-bit data bus removed, it would be much appreciated and I can re-submit with both 32 and 64 bit buses supported.
Signed-off-by: Peter Tyser <ptyser@xes-inc.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Cc: Kumar Gala <galak@gate.crashing.org>
Cc: Dave Jiang <djiang@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
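The relationship between the "Expected" and "Captured" words in the sample log and the reported faulty bit can be checked with a simple XOR. This is a simplification for illustration and not necessarily how the driver derives the bit; `single_faulty_bit()` is a hypothetical name, and numbering bits from the least significant bit (bit 0) is an assumption that happens to agree with the log above.

```c
#include <stdint.h>

/* Returns the index (0 = least significant bit) of the single bit that
 * differs between expected and captured data, or -1 when the two words
 * differ in zero bits or in more than one bit. */
static int single_faulty_bit(uint64_t expected, uint64_t captured)
{
    uint64_t diff = expected ^ captured;
    int bit = 0;

    /* diff & (diff - 1) clears the lowest set bit; a nonzero result means
     * more than one bit is set. */
    if (diff == 0 || (diff & (diff - 1)) != 0)
        return -1;

    while ((diff & 1u) == 0) {
        diff >>= 1;
        bit++;
    }
    return bit;
}
```

For the log above, the high words 0x7f80d000 and 0x7780d000 differ only in the bit with value 0x08000000, which is bit 27 of the high word, that is bit 32 + 27 = 59 of the 64-bit value.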
2010-03-12  edac: mpc85xx mask ecc syndrome correctly  Peter Tyser
With a 64-bit wide data bus only the lowest 8 bits of the ECC syndrome are relevant. With a 32-bit wide data bus only the lowest 16 bits are relevant on most architectures. Without this change, the ECC syndrome displayed can be mildly confusing, e.g.:
    EDAC MPC85xx MC1: syndrome: 0x25252525
when in reality the ECC syndrome is 0x25.
A variety of Freescale manuals say a variety of different things about how to decode the CAPTURE_ECC (syndrome) register. I don't have a system with a 32-bit bus to test on, but I believe the change is correct. It'd be good to get an ACK from someone at Freescale about this change though.
Signed-off-by: Peter Tyser <ptyser@xes-inc.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Cc: Kumar Gala <galak@gate.crashing.org>
Cc: Dave Jiang <djiang@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
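The masking rule stated in this message can be sketched as follows. The function name and the bus-width parameter are hypothetical, chosen for illustration; only the mask widths (8 bits for a 64-bit bus, 16 bits for a 32-bit bus) come from the commit text.

```c
#include <stdint.h>

/* Keep only the relevant low bits of a raw CAPTURE_ECC value.
 * Per the commit message: 8 bits for a 64-bit wide data bus, 16 bits for
 * a 32-bit wide data bus. Any width other than 64 is treated as the
 * 32-bit case here, which is an assumption of this sketch. */
static uint32_t mask_ecc_syndrome(uint32_t raw, unsigned int bus_width_bits)
{
    if (bus_width_bits == 64)
        return raw & 0xffu;
    return raw & 0xffffu;
}
```

Applied to the raw value 0x25252525 from the example, the 64-bit case yields 0x25, the syndrome the message says is the real one.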