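How do I find which memory module has CE errors?

I think one of the memory modules in my server has errors, and I would like to know how to find out which one it is.

Server model: Supermicro 6072R-EN3RFT

Memory: 128 GB

CentOS 7 with the latest updates installed

mcelog says: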

:[ 883.230897] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
:[ 883.230904] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: cc0001c7000800c1
:[ 883.230906] EDAC sbridge MC0: TSC 0
:[ 883.230908] EDAC sbridge MC0: ADDR b71b18000
:[ 883.230909] EDAC sbridge MC0: MISC 908401000200e8c
:[ 883.504829] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1469612575 SOCKET 0 APIC 0
:[ 883.504841] mce: [Hardware Error]: Machine check events logged
:[ 883.606151] EDAC MC0: 7 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xb71b18 offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)
:[ 899.306134] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
:[ 899.306143] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: cc000207000800c1
:[ 899.306145] EDAC sbridge MC0: TSC 0
:[ 899.306148] EDAC sbridge MC0: ADDR c71b19000
:[ 899.306150] EDAC sbridge MC0: MISC 908410000200e8c
:[ 899.306153] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1469612590 SOCKET 0 APIC 0
:[ 899.306172] mce: [Hardware Error]: Machine check events logged
:[ 899.644814] EDAC MC0: 8 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xc71b19 offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)
:[ 901.190512] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
:[ 901.190528] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
:[ 901.190533] {1}[Hardware Error]: event severity: corrected
:[ 901.190538] {1}[Hardware Error]: Error 0, type: corrected
:[ 901.190541] {1}[Hardware Error]: fru_text: CorrectedErr
:[ 901.190546] {1}[Hardware Error]: section_type: memory error
:[ 901.190549] [Firmware Warn]: error section length is too small
:[ 4916.540282] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
:[ 4916.540290] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: cc000287000800c1
:[ 4916.540292] EDAC sbridge MC0: TSC 0
:[ 4916.540294] EDAC sbridge MC0: ADDR b743ff000
:[ 4916.540296] EDAC sbridge MC0: MISC 908400800240e8c
:[ 4916.540298] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1469616606 SOCKET 0 APIC 0
:[ 4916.540313] mce: [Hardware Error]: Machine check events logged
:[ 4916.540340] EDAC MC0: 10 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xb743ff offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)
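
Each "CE memory scrubbing error on ..." line in a log like this names the location the kernel attributes the error to (here CPU_SrcID#0_Ha#0_Chan#0_DIMM#0). As a minimal sketch, assuming the log format shown above and labels that contain no spaces, the shell function below tallies how many such lines mention each label. The name tally_ce_labels is made up for illustration; it reads log text on standard input, and it counts log lines, not individual errors.

```shell
#!/bin/sh
# Sketch only: count EDAC "CE ... error on <label>" log lines per label.
# tally_ce_labels is an illustrative name, not part of mcelog or EDAC.
tally_ce_labels() {
    # Keep only the word that follows "CE ... error on ", one per matching line,
    # then count identical labels and print "<lines> <label>".
    sed -n 's/.* CE [a-z ]*error on \([^ ]*\).*/\1/p' |
        sort |
        uniq -c |
        awk '{ print $1, $2 }'
}

# Possible usage (reading the kernel log may require root):
#   dmesg | tally_ce_labels
```

If every line points at the same label, as in the excerpt above, that supports the suspicion that a single DIMM or slot is involved, although the label still has to be matched to a physical slot on the board.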

I tried the following:

 grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
 /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:669
 /sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow0/ch3_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow0/ch2_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow0/ch3_ce_count:0

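Does this mean that I have 8 slots of 16 GB each, and that the module in the first slot is the faulty one?

Any idea which memory module has the errors? I am not a system administrator, so I don't know how to proceed…

Kind regards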

One solution collected from the web for “How do I find which memory module has CE errors?”

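    I would expect your DIMM slots to be labelled something like BANK A DIMM 0, BANK A DIMM 1 and so on, through to BANK B DIMM 3. You could assume that BANK A DIMM 0 is the problem module and, assuming all the modules are identical, swap it with one of the other 7 and repeat the testing until errors are generated again. If a different /sys/devices/system/edac/mc/mc?/csrow0/ch?_ce_count counter increments, you can be reasonably sure that you have found the problem DIMM.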
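
    The swap-and-retest check can be sketched in shell. This is only an illustration, assuming the csrow/channel counter layout shown in the question; EDAC_ROOT, snapshot_ce and changed_ce are made-up names, and the paths are assumed to contain no spaces.

```shell
#!/bin/sh
# Sketch only: record the per-channel CE counters, then report which ones grew.
# EDAC_ROOT, snapshot_ce and changed_ce are illustrative names.
EDAC_ROOT="${EDAC_ROOT:-/sys/devices/system/edac/mc}"

# Print "<path> <count>" for every channel CE counter under $EDAC_ROOT.
snapshot_ce() {
    for f in "$EDAC_ROOT"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue          # skip if the glob matched nothing
        printf '%s %s\n' "$f" "$(cat "$f")"
    done
}

# Compare two snapshot files and print the counters whose value increased.
changed_ce() {
    awk 'FILENAME == ARGV[1] { before[$1] = $2; next }
         $2 + 0 > before[$1] + 0 {
             printf "%s: %d -> %d\n", $1, before[$1], $2
         }' "$1" "$2"
}
```

    These counters start again from zero at boot, so after swapping the DIMMs and powering the server back on, take the first snapshot with snapshot_ce > before.txt, run the workload or memory test, take a second one with snapshot_ce > after.txt, and then run changed_ce before.txt after.txt. If the counter that grows is on a different channel than before the swap, the errors followed the module; if it is still ch0 on mc0, the slot or channel itself is the more likely suspect.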
