由于磁盘错误，RAID1重build失败

快速信息：戴尔R410在H700适配器上使用RAID1中的2x500GB驱动器

最近在服务器上的RAID1arrays中的一个驱动器失败，我们称之为驱动器0. RAID控制器将其标记为故障并使其脱机。我用新的（相同的系列和制造商，只是更大）replace故障磁盘，并configuration新的磁盘作为热备份。

从Drive1重build立即开始，1.5小时后，我得到消息，驱动器1失败。服务器没有响应（内核恐慌），需要重新启动。鉴于在这个错误重build的半小时之前大约有40％，我估计新驱动器不同步，并试图重新启动驱动器1。

RAID控制器抱怨了一些丢失的RAIDarrays，但它发现驱动器1上的外部RAIDarrays，并将其导入。服务器启动并运行（从降级RAID）。

这里是磁盘的SMART数据。驱动器0（首先失败的驱动器）

ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE 1 Raw_Read_Error_Rate POSR-K 200 200 051 - 1 3 Spin_Up_Time POS--K 142 142 021 - 3866 4 Start_Stop_Count -O--CK 100 100 000 - 12 5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0 7 Seek_Error_Rate -OSR-K 200 200 000 - 0 9 Power_On_Hours -O--CK 086 086 000 - 10432 10 Spin_Retry_Count -O--CK 100 253 000 - 0 11 Calibration_Retry_Count -O--CK 100 253 000 - 0 12 Power_Cycle_Count -O--CK 100 100 000 - 11 192 Power-Off_Retract_Count -O--CK 200 200 000 - 10 193 Load_Cycle_Count -O--CK 200 200 000 - 1 194 Temperature_Celsius -O---K 112 106 000 - 31 196 Reallocated_Event_Count -O--CK 200 200 000 - 0 197 Current_Pending_Sector -O--CK 200 200 000 - 0 198 Offline_Uncorrectable ----CK 200 200 000 - 0 199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0 200 Multi_Zone_Error_Rate ---R-- 200 198 000 - 3

和驱动器1（从控制器报告健康的驱动器，直到重build尝试）

 ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE 1 Raw_Read_Error_Rate POSR-K 200 200 051 - 35 3 Spin_Up_Time POS--K 143 143 021 - 3841 4 Start_Stop_Count -O--CK 100 100 000 - 12 5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0 7 Seek_Error_Rate -OSR-K 200 200 000 - 0 9 Power_On_Hours -O--CK 086 086 000 - 10455 10 Spin_Retry_Count -O--CK 100 253 000 - 0 11 Calibration_Retry_Count -O--CK 100 253 000 - 0 12 Power_Cycle_Count -O--CK 100 100 000 - 11 192 Power-Off_Retract_Count -O--CK 200 200 000 - 10 193 Load_Cycle_Count -O--CK 200 200 000 - 1 194 Temperature_Celsius -O---K 114 105 000 - 29 196 Reallocated_Event_Count -O--CK 200 200 000 - 0 197 Current_Pending_Sector -O--CK 200 200 000 - 3 198 Offline_Uncorrectable ----CK 100 253 000 - 0 199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0 200 Multi_Zone_Error_Rate ---R-- 100 253 000 - 0

在SMART的扩展错误日志中，我发现：

驱动器0只有一个错误

 Error 1 [0] occurred at disk power-on lifetime: 10282 hours (428 days + 10 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER -- ST COUNT LBA_48 LH LM LL DV DC -- -- -- == -- == == == -- -- -- -- -- 10 -- 51 00 18 00 00 00 6a 24 20 40 00 Error: IDNF at LBA = 0x006a2420 = 6956064 Commands leading to the command that caused the error were: CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name -- == -- == -- == == == -- -- -- -- -- --------------- -------------------- 61 00 60 00 f8 00 00 00 6a 24 20 40 00 17d+20:25:18.105 WRITE FPDMA QUEUED 61 00 18 00 60 00 00 00 6a 24 00 40 00 17d+20:25:18.105 WRITE FPDMA QUEUED 61 00 80 00 58 00 00 00 6a 23 80 40 00 17d+20:25:18.105 WRITE FPDMA QUEUED 61 00 68 00 50 00 00 00 6a 23 18 40 00 17d+20:25:18.105 WRITE FPDMA QUEUED 61 00 10 00 10 00 00 00 6a 23 00 40 00 17d+20:25:18.104 WRITE FPDMA QUEUED

但驱动器1有883错误。我只看到最后几个，我看到的所有错误都是这样的：

 Error 883 [18] occurred at disk power-on lifetime: 10454 hours (435 days + 14 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER -- ST COUNT LBA_48 LH LM LL DV DC -- -- -- == -- == == == -- -- -- -- -- 01 -- 51 00 80 00 00 39 97 19 c2 40 00 Error: AMNF at LBA = 0x399719c2 = 966203842 Commands leading to the command that caused the error were: CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name -- == -- == -- == == == -- -- -- -- -- --------------- -------------------- 60 00 80 00 00 00 00 39 97 19 80 40 00 1d+00:25:57.802 READ FPDMA QUEUED 2f 00 00 00 01 00 00 00 00 00 10 40 00 1d+00:25:57.779 READ LOG EXT 60 00 80 00 00 00 00 39 97 19 80 40 00 1d+00:25:55.704 READ FPDMA QUEUED 2f 00 00 00 01 00 00 00 00 00 10 40 00 1d+00:25:55.681 READ LOG EXT 60 00 80 00 00 00 00 39 97 19 80 40 00 1d+00:25:53.606 READ FPDMA QUEUED

鉴于这些错误，有没有什么办法可以重buildRAID，或者我应该做备份，关机服务器，更换新的磁盘和恢复？那么如果我从USB / CD上运行的Linux中将错误的磁盘添加到新的磁盘上呢？

另外，如果有人有更多的经验，这些错误可能是什么原因？蹩脚的控制器或磁盘？磁盘大约1岁，但对于我来说这两个人都会在如此短的时间内死亡是相当令人难以置信的。

实际上，如果磁盘都是来自同一批次的制造商，那么同时出现故障并不令人惊讶。

他们拥有相同的制造工艺，环境和使用模式。这就是为什么我通常会尝试从不同的供应商那里订购相同的型号。

我在这里的首选行动是联系制造商，更换更好的磁盘，从备份恢复。

DD'ing也没有错，但我通常需要尽快获得服务。

在IBM Deskstars惨败的那天，我用了4年的时间，在整整6个星期的时间里，整盘硬盘都坏了。我的资料完好无损。