From 95e9590bfee2df447c8f4c0fd799e8c514beca80 Mon Sep 17 00:00:00 2001 From: Denys Vlasenko Date: Tue, 10 Dec 2013 13:07:35 +0100 Subject: [ABRT PATCH 24/27] doc/MCE_readme.txt: new file - documentation about MCE handling Signed-off-by: Denys Vlasenko Related to rhbz#1032077 Signed-off-by: Jakub Filak --- doc/MCE_readme.txt | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 86 insertions(+) create mode 100644 doc/MCE_readme.txt diff --git a/doc/MCE_readme.txt b/doc/MCE_readme.txt new file mode 100644 index 0000000..ed5b627 --- /dev/null +++ b/doc/MCE_readme.txt @@ -0,0 +1,86 @@ + Background + +MCEs can be fatal (they panic kernel) or not. +Fatal MCE are delivered as exception#18. +Non-fatal ones sometimes are delivered as exception#18; other times +they are silently recorded in magic MSRs, CPU is not alerted. +Linux kernel periodically (up to 5 mins interval) reads those MSRs +and if MCE is seen there, it is piped in binary form through +/dev/mcelog to whoever listens on it. (Such as mcelog tool in +--daemon mode; but cat Are those magic MSR registers cleared when read via /dev/mcelog? + +Yes. + +> Without mcelog utility, we can directly read only binary form, right? +> Not nice, but still useful, right? +> (could be transferred to nice text form on other machine). + +No, raw /dev/mcelog data is not easy to interpret on other machine. +In fact, it can't be used by mcelog tool even on the same machine. +Technical reason is that mcelog uses an obscure ioctl on /dev/mcelog +in order to know the size of binary blob with MCE information. +When run on a file, ioctl fails, and mcelog bombs out. + +Looks like without mcelog running and processing /dev/mcelog data, +non-fatal MCE's can't be easily decoded with currently existing tools. + +mcelog tool can be configured to write log to /var/log/mcelog +(RHEL6 does that) or to syslog (RHEL7 does that). + + + How ABRT catches MCEs + +Fatal MCEs are caught as any fatal kernel panic is caught - as a vmcore. +The oops text, which goes to "backtrace" element, will be the decoded +MCE message from kernel log buffer. + +Non-fatal MCEs are caught as kernel oopses. +If "Machine check events logged" message is seen in "dmesg" element, +we assume it's a MCE, and create "not-reportable" element with suitable +explanation. +Then we check whether /var/log/mcelog exists, +or whether system log contains "mcelog: Hardware event", +and create a "comment" element with explanatory text, followed by +last 20 lines from either of those files. + + + How to test MCEs + +There is an MCE injection tool and a kernel module, both named mce-inject. +(The tool comes from mce-test project, may be found in ras-utils RHEL7 package). +The script I used is: + +modprobe mce-inject +sync & +sleep 1 +sync +# This can crash the machine: +echo "Injecting MCE from file $1" +mce-inject "$1" +echo "Exitcode:$?" + +It requires files which describe MCE to simulate. I grabbed a few examples +from mce-test.tar.gz (source tarball of mce-test project). +I used this this file to cause a non-fatal MCE: + +CPU 0 BANK 2 +STATUS VAL OVER EN + +And this one to cause a fatal one: + +CPU 0 BANK 4 +MCGSTATUS MCIP +STATUS FATAL S +RIP 12343434 +MISC 11 + +(Not sure what failures exactly they imitate, maybe there are better examples). -- 1.8.3.1