mcelog won't start on PUIAS 6.4 with AMD hardware

Guys,

I am a total Linux n00b. I am trying to deploy mcelog on a compute node running PUIAS 6.4 (x86_64):

    [root@lov3 edac]# uname -a
    Linux lov3.mylab.org 2.6.32-358.18.1.el6.x86_64 #1 SMP Tue Aug 27 22:40:32 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

That is a free rebuild of Red Hat 6.4, running on AMD hardware:

    [root@lov3 mcelog]# lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                64
    On-line CPU(s) list:   0-63
    Thread(s) per core:    2
    Core(s) per socket:    8
    Socket(s):             4
    NUMA node(s):          8
    Vendor ID:             AuthenticAMD
    CPU family:            21
    Model:                 2
    Stepping:              0
    CPU MHz:               1400.000
    BogoMIPS:              4999.30
    Virtualization:        AMD-V
    L1d cache:             16K
    L1i cache:             64K
    L2 cache:              2048K
    L3 cache:              6144K
    NUMA node0 CPU(s):     0-7
    NUMA node1 CPU(s):     8-15
    NUMA node2 CPU(s):     16-23
    NUMA node3 CPU(s):     24-31
    NUMA node4 CPU(s):     32-39
    NUMA node5 CPU(s):     40-47
    NUMA node6 CPU(s):     48-55
    NUMA node7 CPU(s):     56-63

My mcelog.conf file is more or less the default. I want to run mcelog as a daemon and log errors. When I start mcelog, I get:

    [root@lov3 mcelog]# mcelog --config-file mcelog.conf
    AMD Processor family 21: Please load edac_mce_amd module.

However, the module exists:

    [root@lov3 mcelog]# locate edac_mce_amd.ko
    /lib/modules/2.6.32-358.18.1.el6.x86_64/kernel/drivers/edac/edac_mce_amd.ko
    /lib/modules/2.6.32-358.el6.x86_64/kernel/drivers/edac/edac_mce_amd.ko

and it is loaded:

    [root@lov3 edac]# lsmod | grep mce
    edac_mce_amd           14705  1 amd64_edac_mod

Is there anything I can do to get mcelog working? The only reference I have found is this thread:

http://lists.centos.org/pipermail/centos/2012-November/130226.html

The message is expected when you are on CPU family 21, as you can see from the code below:

mcelog.c of the mcelog-1.0pre3_20110718-0.14.el6 package shows where a CPU family of 15 or greater makes is_cpu_supported() return 0:

    int is_cpu_supported(void)
    {
        enum {
            VENDOR = 1,
            FAMILY = 2,
            MODEL = 4,
            MHZ = 8,
            FLAGS = 16,
            ALL = 0x1f
        } seen = 0;
        FILE *f;
        static int checked;

        if (checked)
            return 1;
        checked = 1;

        f = fopen("/proc/cpuinfo","r");
        if (f != NULL) {
            int family = 0;
            int model = 0;
            char vendor[64] = { 0 };
            char *line = NULL;
            size_t linelen = 0;
            double mhz;

            while (getdelim(&line, &linelen, '\n', f) > 0 && seen != ALL) {
                if (sscanf(line, "vendor_id : %63[^\n]", vendor) == 1)
                    seen |= VENDOR;
                if (sscanf(line, "cpu family : %d", &family) == 1)
                    seen |= FAMILY;
                if (sscanf(line, "model : %d", &model) == 1)
                    seen |= MODEL;
                /* We use only Mhz of the first CPU, assuming they are the same
                   (there are more sanity checks later to make this not as wrong
                   as it sounds) */
                if (sscanf(line, "cpu MHz : %lf", &mhz) == 1) {
                    if (!cpumhz_forced)
                        cpumhz = mhz;
                    seen |= MHZ;
                }
                if (!strncmp(line, "flags", 5) && isspace(line[6])) {
                    processor_flags = line;
                    line = NULL;
                    linelen = 0;
                    seen |= FLAGS;
                }

            }
            if (seen == ALL) {
                if (!strcmp(vendor,"AuthenticAMD")) {
                    if (family == 15)
                        cputype = CPU_K8;
                    if (family >= 15)    /* <----------- */
                        fprintf(stderr, "AMD Processor family %d: Please load edac_mce_amd module.\n", family);
                    return 0;
                } else if (!strcmp(vendor,"GenuineIntel"))
                    cputype = select_intel_cputype(family, model);
                /* Add checks for other CPUs here */
            } else {
                Eprintf("warning: Cannot parse /proc/cpuinfo\n");
            }
            fclose(f);
            free(line);
        } else
            Eprintf("warning: Cannot open /proc/cpuinfo\n");

        return 1;
    }

mcelog does not work on AMD CPUs of this family or newer (see the family >= 15 check in mcelog.c). The same problem exists on AMD EPYC processors.
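To see in advance whether a node would hit that branch, the same two fields can be read from /proc/cpuinfo. This is only a rough sketch: the function name cpu_hits_amd_family_check is made up for illustration, and it mirrors only the vendor and family comparison from the listing above, nothing else mcelog does.

```shell
#!/bin/sh
# Hypothetical helper (not part of mcelog): succeeds when the CPU described
# in a cpuinfo-style file is AuthenticAMD with family >= 15, i.e. the case
# in which the quoted code prints "Please load edac_mce_amd module."
# Usage: cpu_hits_amd_family_check [cpuinfo-file]   (default: /proc/cpuinfo)
cpu_hits_amd_family_check() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Take the first "vendor_id" and "cpu family" lines, as the C code does.
    vendor=$(awk -F': *' '/^vendor_id/ { print $2; exit }' "$cpuinfo")
    family=$(awk -F': *' '/^cpu family/ { print $2; exit }' "$cpuinfo")
    [ "$vendor" = "AuthenticAMD" ] && [ "${family:-0}" -ge 15 ]
}
```

On the node from the question (AuthenticAMD, family 21), running cpu_hits_amd_family_check with no argument should succeed, which matches the message mcelog printed.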

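Use the kernel module edac_mce_amd instead of mcelog. It puts the MCE records into the kernel log, which syslog should then write to disk. This time it may have been mcelog that loaded the module for you, but I suggest loading it some other way, for example on Debian-based Linux by listing it in /etc/initramfs-tools/modules and running update-initramfs -u.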
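For the Debian-style route, something like the following could add the module to the list without duplicating it on repeated runs. It is a sketch under assumptions: ensure_module_listed is an illustrative name, and the default path is the one mentioned above. On RHEL-like systems such as PUIAS the mechanism for loading modules at boot is different.

```shell
#!/bin/sh
# Hypothetical helper: append a module name to a modules list file only if
# it is not already present as a whole line.
# Usage: ensure_module_listed MODULE [FILE]
#        (FILE defaults to /etc/initramfs-tools/modules)
ensure_module_listed() {
    module="$1"
    modules_file="${2:-/etc/initramfs-tools/modules}"
    # -x: match the whole line, -F: fixed string, -q: no output
    grep -qxF -- "$module" "$modules_file" 2>/dev/null ||
        printf '%s\n' "$module" >> "$modules_file"
}
```

Run as root, ensure_module_listed edac_mce_amd followed by update-initramfs -u would then rebuild the initramfs with the module included.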

But I could not find anything that describes the log format, so here is a guess pieced together from the Linux source.

In include/linux/printk.h, we see:

 #define HW_ERR "[Hardware Error]: " 

In drivers/edac/mce_amd.c, we see output produced with pr_emerg(HW_ERR ...), like this:

 pr_emerg(HW_ERR "MC0 Error: "); 

followed by more lines printed with pr_cont(...), but without HW_ERR.

So I think you can look for "[Hardware Error]:" in your logs. The lines may also mention edac_mce_amd.
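A quick way to try that guess is to count the lines carrying the tag. The sketch below assumes nothing beyond the "[Hardware Error]:" prefix found in the source; count_hw_errors is an illustrative name, and which log file the kernel messages end up in depends on your syslog setup.

```shell
#!/bin/sh
# Hypothetical helper: print how many lines contain the HW_ERR tag.
# Reads the single file given as argument, or standard input when no
# argument is given.
count_hw_errors() {
    # -F treats the brackets literally; -c prints the number of matching lines.
    # grep exits non-zero on zero matches, so "|| true" keeps the status clean.
    grep -cF -- '[Hardware Error]:' "$@" || true
}
```

For example, dmesg | count_hw_errors checks the current kernel ring buffer, and count_hw_errors /var/log/messages checks a log file (that path is an assumption; use whatever file your syslog writes kernel messages to).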

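Here is a rule that I think will log the first pr_emerg line but not the pr_cont parts (see here). It is an rsyslog.d rule that looks for "[Hardware Error]:". Note that it will also match messages that do not come from the edac_mce_amd module.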

vim /etc/rsyslog.d/09-edac_mce_amd.conf

    if ($syslogfacility-text == 'kern') and \
    ($msg contains '[Hardware Error]:') \
    then -/var/log/edac_mce_amd.log
    #uncomment this to also remove it from the other files
    #& stop

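Getting only the first line is good enough for me, because I will set up a monitoring script that simply checks whether the file size is zero. If anyone knows how to do this properly, please comment.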
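The file-size check itself could be as small as the sketch below. The function name edac_log_has_entries is made up for illustration, and the default path is the one from the rsyslog rule above; how the result is reported (mail, Nagios, etc.) is left open.

```shell
#!/bin/sh
# Hypothetical monitoring check: succeeds when the log file exists and is
# non-empty, i.e. at least one "[Hardware Error]:" line has been written.
# Usage: edac_log_has_entries [FILE]   (default: /var/log/edac_mce_amd.log)
edac_log_has_entries() {
    logfile="${1:-/var/log/edac_mce_amd.log}"
    # test -s is true only for an existing file with size greater than zero
    [ -s "$logfile" ]
}
```

A cron job could then run something like: if edac_log_has_entries; then echo "hardware errors logged on $(hostname)"; fi. A missing file counts as "no errors" here, which may or may not be what you want from a monitoring point of view.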