Debian 6 Xen主机上的ATA错误,但磁盘没问题

我们pipe理(但没有物理访问)的软件RAID1的Debian 6机器正在喷出有关磁盘(ATA1和ATA2)的各种错误。

我不知道这可能是什么。 磁盘似乎很好。 我们没有注意到挂起或任何与服务器服务的网站。

我知道这是一个“可能是什么”的问题,但我真的希望有人能够帮助我。

眼镜:

  1. Debian 6,运行Xen hypervisor
  2. 磁盘:250 GB WDC WD2500AAKX-00U6AA0
  3. NCQ支持并启用:ata2.00:488397168个扇区,多个16:LBA48 NCQ(深度31/32),AA
  4. SATA控制器:Intel公司的Patsburg 6端口SATA AHCI控制器(06版)
  5. 内核:2.6.32-5-xen-amd64
  6. Ram:16GB
  7. Intel(R)Xeon(R)CPU E5-2620 0 @ 2.00GHz

以下是一些错误:

[2013-05-13 21:36:17] ata1.00: exception Emask 0x10 SAct 0x3 SErr 0x400100 action 0x6 frozen [2013-05-13 21:36:17] ata1.00: irq_stat 0x08000000, interface fatal error [2013-05-13 21:36:17] ata1: SError: { UnrecovData Handshk } [2013-05-13 21:36:17] ata1.00: failed command: WRITE FPDMA QUEUED [2013-05-13 21:36:17] ata1.00: cmd 61/08:00:98:1f:5e/00:00:0d:00:00/40 tag 0 ncq 4096 out [2013-05-13 21:36:17] res 40/00:0c:58:3a:62/00:00:0d:00:00/40 Emask 0x10 (ATA bus error) [2013-05-13 21:36:17] ata1.00: status: { DRDY } [2013-05-13 21:36:17] ata1.00: failed command: WRITE FPDMA QUEUED [2013-05-13 21:36:17] ata1.00: cmd 61/08:08:58:3a:62/00:00:0d:00:00/40 tag 1 ncq 4096 out [2013-05-13 21:36:17] res 40/00:0c:58:3a:62/00:00:0d:00:00/40 Emask 0x10 (ATA bus error) [2013-05-13 21:36:17] ata1.00: status: { DRDY } [2013-05-13 21:36:17] ata1: hard resetting link [2013-05-13 21:36:17] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [2013-05-13 21:36:17] ata1.00: configured for UDMA/133 [2013-05-13 21:36:17] ata1: EH complete 

 [2013-05-15 08:58:29] ata1.00: exception Emask 0x10 SAct 0x40f SErr 0x400100 action 0x6 frozen [2013-05-15 08:58:29] ata1.00: irq_stat 0x08000000, interface fatal error [2013-05-15 08:58:29] ata1: SError: { UnrecovData Handshk } [2013-05-15 08:58:29] ata1.00: failed command: WRITE FPDMA QUEUED [2013-05-15 08:58:29] ata1.00: cmd 61/58:00:48:c4:6b/00:00:0d:00:00/40 tag 0 ncq 45056 out [2013-05-15 08:58:29] res 40/00:1c:78:cb:6b/00:00:0d:00:00/40 Emask 0x10 (ATA bus error) [2013-05-15 08:58:29] ata1.00: status: { DRDY } [2013-05-15 08:58:29] ata1.00: failed command: WRITE FPDMA QUEUED [2013-05-15 08:58:29] ata1.00: cmd 61/10:08:78:c8:6b/01:00:0d:00:00/40 tag 1 ncq 139264 out [2013-05-15 08:58:29] res 40/00:1c:78:cb:6b/00:00:0d:00:00/40 Emask 0x10 (ATA bus error) [2013-05-15 08:58:29] ata1.00: status: { DRDY } [2013-05-15 08:58:29] ata1.00: failed command: WRITE FPDMA QUEUED [2013-05-15 08:58:29] ata1.00: cmd 61/b0:10:c8:ca:6b/00:00:0d:00:00/40 tag 2 ncq 90112 out [2013-05-15 08:58:29] res 40/00:1c:78:cb:6b/00:00:0d:00:00/40 Emask 0x10 (ATA bus error) [2013-05-15 08:58:29] ata1.00: status: { DRDY } [2013-05-15 08:58:29] ata1.00: failed command: WRITE FPDMA QUEUED [2013-05-15 08:58:29] ata1.00: cmd 61/58:18:78:cb:6b/00:00:0d:00:00/40 tag 3 ncq 45056 out [2013-05-15 08:58:29] res 40/00:1c:78:cb:6b/00:00:0d:00:00/40 Emask 0x10 (ATA bus error) [2013-05-15 08:58:29] ata1.00: status: { DRDY } [2013-05-15 08:58:29] ata1.00: failed command: WRITE FPDMA QUEUED [2013-05-15 08:58:29] ata1.00: cmd 61/b0:50:c8:c7:6b/00:00:0d:00:00/40 tag 10 ncq 90112 out [2013-05-15 08:58:29] res 40/00:1c:78:cb:6b/00:00:0d:00:00/40 Emask 0x10 (ATA bus error) [2013-05-15 08:58:29] ata1.00: status: { DRDY } [2013-05-15 08:58:29] ata1: hard resetting link [2013-05-15 08:58:29] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [2013-05-15 08:58:29] ata1.00: configured for UDMA/133 [2013-05-15 08:58:29] ata1: EH complete 

 [2013-05-19 01:21:19] ata2.00: exception Emask 0x10 SAct 0x3 SErr 0x400100 action 0x6 frozen [2013-05-19 01:21:19] ata2.00: irq_stat 0x08000000, interface fatal error [2013-05-19 01:21:19] ata2: SError: { UnrecovData Handshk } [2013-05-19 01:21:19] ata2.00: failed command: WRITE FPDMA QUEUED [2013-05-19 01:21:19] ata2.00: cmd 61/58:00:e8:75:93/00:00:12:00:00/40 tag 0 ncq 45056 out [2013-05-19 01:21:19] res 40/00:0c:40:76:93/00:00:12:00:00/40 Emask 0x10 (ATA bus error) [2013-05-19 01:21:19] ata2.00: status: { DRDY } [2013-05-19 01:21:19] ata2.00: failed command: WRITE FPDMA QUEUED [2013-05-19 01:21:19] ata2.00: cmd 61/b0:08:40:76:93/00:00:12:00:00/40 tag 1 ncq 90112 out [2013-05-19 01:21:19] res 40/00:0c:40:76:93/00:00:12:00:00/40 Emask 0x10 (ATA bus error) [2013-05-19 01:21:19] ata2.00: status: { DRDY } [2013-05-19 01:21:19] ata2: hard resetting link [2013-05-19 01:21:19] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [2013-05-19 01:21:19] ata2.00: configured for UDMA/133 [2013-05-19 01:21:19] ata2: EH complete 

SMART不会给出任何错误。 SDA的智能(SDB类似):

 === START OF INFORMATION SECTION === Device Model: WDC WD2500AAKX-00U6AA0 Serial Number: WD-WCC2H0107714 Firmware Version: 15.01H15 User Capacity: 250,059,350,016 bytes ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0 3 Spin_Up_Time 0x0027 145 145 021 Pre-fail Always - 3750 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 9 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0 9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3430 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 7 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 6 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2 194 Temperature_Celsius 0x0022 110 108 000 Old_age Always - 33 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0 SMART Error Log Version: 1 No Errors Logged 

和sda界面错误:

 # smartctl -l sataphy /dev/sda smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build) Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net General Purpose Logging (GPL) feature set supported SATA Phy Event Counters (GP Log 0x11) ID Size Value Description 0x0001 2 0 Command failed due to ICRC error 0x0002 2 165 R_ERR response for data FIS 0x0003 2 0 R_ERR response for device-to-host data FIS 0x0004 2 165 R_ERR response for host-to-device data FIS 0x0005 2 0 R_ERR response for non-data FIS 0x0006 2 0 R_ERR response for device-to-host non-data FIS 0x0007 2 0 R_ERR response for host-to-device non-data FIS 0x000a 2 49 Device-to-host register FISes sent due to a COMRESET 0x000b 2 79 CRC errors within host-to-device FIS 0x8000 4 12672920 Vendor specific 

和sdb:

 # smartctl -l sataphy /dev/sdb smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build) Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net General Purpose Logging (GPL) feature set supported SATA Phy Event Counters (GP Log 0x11) ID Size Value Description 0x0001 2 0 Command failed due to ICRC error 0x0002 2 45 R_ERR response for data FIS 0x0003 2 0 R_ERR response for device-to-host data FIS 0x0004 2 45 R_ERR response for host-to-device data FIS 0x0005 2 2 R_ERR response for non-data FIS 0x0006 2 0 R_ERR response for device-to-host non-data FIS 0x0007 2 2 R_ERR response for host-to-device non-data FIS 0x000a 2 46 Device-to-host register FISes sent due to a COMRESET 0x000b 2 22 CRC errors within host-to-device FIS 0x8000 4 12672927 Vendor specific 

握手似乎表明,控制器无法与驱动器交谈。 我会怀疑电子干扰,电缆不好,或者可能是一个坏的控制器。 在后一种情况下,主板需要更换。 你只能通过一个淘汰的过程find罪魁祸首,分别testing每一个硬件。