什么导致SCSI命令超过30秒超时?

系统描述:
CentOS 7。
SuperMicro X9DRW-iF板。
最新的BIOS(和SCU选项ROM),使用SCU选项ROM。
四块300GB的WD硬盘连接到主板的SCU。
六个HDD连接到六个独立的SATA端口。
正常运行的软件为我们的业务。

我build立了这个系统,并在两周内两次,我发现连接到SCU的所有四个驱动器都“不见了”,这意味着我们的挂载点

ls: cannot access disk1: Input/output error ls: cannot access disk2: Input/output error ls: cannot access disk3: Input/output error ls: cannot access disk4: Input/output error 

我从dmesg看到,isci正在放弃命令,试图重置设备等等(dmesg输出在最下面)。

从RedHat控制SCSI命令定时器和设备状态页面 ,我认为这是SCSI命令超时和设备无法重置时的“预期”error handling事件序列。 但是,我不确定是否是总线或主机被重置。 但它会影响SCU上的所有四个驱动器。

我读了这个ServerFault的问题 , 我认为这个问题是一个“循环重置”,但我一直没能find更多关于循环重置的信息。

我validation了这四个驱动器的SCSI命令超时是30秒,这是RHEL / CentOS 7的默认设置。(cat / sys / block / sda / device / timeout)将超时降低到1秒或0秒可以快速导致一个驱动器被禁用,但我很less能够得到一个驱动器故障导致SCU上的所有四个驱动器被禁用。

所以这里是我的一些问题。 最终目标是configuration驱动器,使得这个循环重置或“摆脱所有的SCU驱动器”不会发生。

  1. 什么导致SCSI命令超过30秒? 这似乎非常长。
  2. 什么会导致SCU上的所有四个驱动器受到影响? 这意味着设备重置失败,error handling程序继续重置总线/主机,根据RedHat文章。 什么会导致驱动器重置失败?
  3. 就像我观察到的那样,什么样的事情可能很less发生,每周一次?
  4. SCSI命令超时是否可以接受,还是通过增加超时时间来避免? 对我来说最重要的是保持在线的驱动器; 等待一点从驱动器的反应是不是一个巨大的问题。
  5. 我是否正确地诊断出这是对SCSI命令超时的“预期”响应? (即这是一个SCSI命令超时?)
  6. 什么使得SCU上的驱动器比连接到单个SATA端口的驱动器响应更多/更less? 我没有与他们的问题。
  7. 命令排队与此有关吗? 所有四个驱动器的队列深度都是31。
  8. SCU上驱动器的SCSI命令超时的build议值是什么?
  9. 在一天结束时,我如何重新configuration​​以避免这个问题?

感谢您的帮助。 从第一次失败的Dmesg输出(为简洁起见)复制如下。 请注意,驱动器是sda,b,c,d和ata7,8,9,10。

 17:40:41: isci 0000:08:00.0: isci_task_abort_task: dev = ffff881fce964490 (STP/SATA), task = ffff881fcfc95900, old_request == ffff881fcea13000 17:40:51: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout single 17:41:01: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964490 17:41:01: isci 0000:08:00.0: isci_task_abort_task: SATA/STP request or complete_in_target (0), or IDEV_GONE (0), thus no TMF 17:41:01: isci 0000:08:00.0: isci_task_abort_task: Done; dev = ffff881fce964490, task = ffff881fcfc95900 , old_request == ffff881fcea13000 17:41:01: isci 0000:08:00.0: isci_task_abort_task: dev = ffff881fce964310 (STP/SATA), task = ffff881fd0ce97c0, old_request == ffff881fcf125000 17:41:11: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout single 17:41:21: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964310 17:41:21: isci 0000:08:00.0: isci_task_abort_task: SATA/STP request or complete_in_target (0), or IDEV_GONE (0), thus no TMF 17:41:21: isci 0000:08:00.0: isci_task_abort_task: Done; dev = ffff881fce964310, task = ffff881fd0ce97c0 , old_request == ffff881fcf125000 17:43:50: INFO: task app:20227 blocked for more than 120 seconds. 17:43:50: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 17:43:50: app D ffff881fffcd3680 0 20227 13315 0x00000080 17:43:50: ffff883fcb15f8e0 0000000000000082 ffff883fcb15ffd8 0000000000013680 17:43:50: ffff883fcb15ffd8 0000000000013680 ffff881fd1434fa0 ffff883fc03d13b0 17:43:50: 7fffffffffffffff ffff881fd1434fa0 0000000000000002 0000000000000000 17:43:50: Call Trace: 17:43:50: [<ffffffff81609e39>] schedule+0x29/0x70 17:43:50: [<ffffffff81607d79>] schedule_timeout+0x209/0x2d0 17:43:50: [<ffffffff81609312>] __down_common+0xd2/0x14a 17:43:50: [<ffffffffa02292dd>] ? _xfs_buf_find+0x16d/0x2c0 [xfs] 17:43:50: [<ffffffff816093a7>] __down+0x1d/0x1f 17:43:50: [<ffffffff8109d311>] down+0x41/0x50 17:43:50: [<ffffffffa02290dc>] xfs_buf_lock+0x3c/0xd0 [xfs] 17:43:50: [<ffffffffa02292dd>] _xfs_buf_find+0x16d/0x2c0 [xfs] 17:43:50: [<ffffffffa022945a>] xfs_buf_get_map+0x2a/0x180 [xfs] 17:43:50: [<ffffffffa0229eec>] xfs_buf_read_map+0x2c/0x140 [xfs] 17:43:50: [<ffffffffa028d809>] xfs_trans_read_buf_map+0x2d9/0x4a0 [xfs] 17:43:50: [<ffffffffa02723cd>] xfs_read_agi+0x9d/0x110 [xfs] 17:43:50: [<ffffffffa0272474>] xfs_ialloc_read_agi+0x34/0xd0 [xfs] 17:43:50: [<ffffffffa0273138>] xfs_dialloc+0xe8/0x270 [xfs] 17:43:50: [<ffffffffa02757c1>] xfs_ialloc+0x71/0x6a0 [xfs] 17:43:50: [<ffffffffa0242b77>] ? kmem_zone_alloc+0x77/0x100 [xfs] 17:43:50: [<ffffffffa0275e6a>] xfs_dir_ialloc+0x7a/0x280 [xfs] 17:43:50: [<ffffffff81609222>] ? down_write+0x12/0x30 17:43:50: [<ffffffffa027656a>] xfs_create+0x48a/0x680 [xfs] 17:43:50: [<ffffffffa0237abb>] xfs_vn_mknod+0xbb/0x1e0 [xfs] 17:43:50: [<ffffffffa0237bf6>] xfs_vn_mkdir+0x16/0x20 [xfs] 17:43:50: [<ffffffff811d26c7>] vfs_mkdir+0xb7/0x160 17:43:50: [<ffffffff811d853f>] SyS_mkdirat+0x6f/0xe0 17:43:50: [<ffffffff811d85c9>] SyS_mkdir+0x19/0x20 17:43:50: [<ffffffff81614a29>] system_call_fastpath+0x16/0x1b 17:43:50: INFO: task app:20228 blocked for more than 120 seconds. 17:43:50: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 17:43:50: app D ffff881fffc93680 0 20228 13315 0x00000080 17:43:50: ffff883fcef7b9e8 0000000000000082 ffff883fcef7bfd8 0000000000013680 17:43:50: ffff883fcef7bfd8 0000000000013680 ffff883fcc1d0b60 ffff881fffc93f48 17:43:50: ffff881fd07ea800 ffff883fcc1d0b60 0000000000000000 0000000000000000 17:43:50: Call Trace: 17:43:50: [<ffffffff8160a13d>] io_schedule+0x9d/0x140 17:43:50: [<ffffffff81204593>] do_blockdev_direct_IO+0xc03/0x2620 17:43:50: [<ffffffffa0222e60>] ? xfs_get_blocks+0x20/0x20 [xfs] 17:43:50: [<ffffffff81206005>] __blockdev_direct_IO+0x55/0x60 17:43:50: [<ffffffffa0222e60>] ? xfs_get_blocks+0x20/0x20 [xfs] 17:43:50: [<ffffffffa0222780>] ? xfs_finish_ioend_sync+0x30/0x30 [xfs] 17:43:50: [<ffffffffa0221b2a>] xfs_vm_direct_IO+0xda/0x180 [xfs] 17:43:50: [<ffffffffa0222e60>] ? xfs_get_blocks+0x20/0x20 [xfs] 17:43:50: [<ffffffffa0222780>] ? xfs_finish_ioend_sync+0x30/0x30 [xfs] 17:43:50: [<ffffffff8115872d>] generic_file_direct_write+0xcd/0x190 17:43:50: [<ffffffffa0299636>] xfs_file_dio_aio_write+0x215/0x254 [xfs] 17:43:50: [<ffffffffa022ee0d>] xfs_file_aio_write+0x13d/0x150 [xfs] 17:43:50: [<ffffffff811c64cd>] do_sync_write+0x8d/0xd0 17:43:50: [<ffffffff811c6c6d>] vfs_write+0xbd/0x1e0 17:43:50: [<ffffffff811c76b8>] SyS_write+0x58/0xb0 17:43:50: [<ffffffff81614a29>] system_call_fastpath+0x16/0x1b ... <Many more sequences like the above.>... 17:54:21: ata7.00: exception Emask 0x0 SAct 0x7e0000ff SErr 0x0 action 0x6 frozen 17:54:21: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen 17:54:21: ata8.00: failed command: FLUSH CACHE EXT 17:54:21: ata8.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 30 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 17:54:21: ata8.00: status: { DRDY } 17:54:21: ata8: hard resetting link 17:54:21: ata9.00: exception Emask 0x0 SAct 0x1fff SErr 0x0 action 0x6 frozen 17:54:21: ata10.00: exception Emask 0x0 SAct 0x1fff SErr 0x0 action 0x6 frozen 17:54:21: ata9.00: failed command: WRITE FPDMA QUEUED 17:54:21: ata10.00: failed command: WRITE FPDMA QUEUED 17:54:21: ata9.00: cmd 61/50:00:40:08:00/00:00:00:00:00/40 tag 0 ncq 40960 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 17:54:21: ata10.00: cmd 61/50:00:40:08:00/00:00:00:00:00/40 tag 0 ncq 40960 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 17:54:21: ata9.00: status: { DRDY } 17:54:21: ata10.00: status: { DRDY } 17:54:21: ata9.00: failed command: WRITE FPDMA QUEUED 17:54:21: ata10.00: failed command: WRITE FPDMA QUEUED 17:54:21: ata9.00: cmd 61/18:00:08:08:00/00:00:00:00:00/40 tag 1 ncq 12288 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 17:54:21: ata10.00: cmd 61/18:00:08:08:00/00:00:00:00:00/40 tag 1 ncq 12288 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 17:54:21: ata9.00: status: { DRDY } 17:54:21: ata10.00: status: { DRDY } 17:54:21: ata9.00: failed command: WRITE FPDMA QUEUED 17:54:21: ata10.00: failed command: WRITE FPDMA QUEUED 17:54:21: ata9.00: cmd 61/02:00:01:08:00/00:00:00:00:00/40 tag 2 ncq 1024 out res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 17:54:21: ata10.00: cmd 61/02:00:01:08:00/00:00:00:00:00/40 tag 2 ncq 1024 out res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 17:54:21: ata9.00: status: { DRDY } 17:54:21: ata10.00: status: { DRDY } 17:54:21: ata9.00: failed command: WRITE FPDMA QUEUED 17:54:21: ata10.00: failed command: WRITE FPDMA QUEUED 17:54:22: ata7.00: status: { DRDY } 17:54:22: ata7: hard resetting link 17:54:24: perf interrupt took too long (2510 > 2500), lowering kernel.perf_event_max_sample_rate to 50000 17:54:31: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout all 17:54:31: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout all 17:54:31: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout all 17:54:32: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout all 17:54:56: isci 0000:08:00.0: isci_port_perform_hard_reset: iport = ffff881fce962ab8; hard reset failed (0x21) 17:54:56: isci 0000:08:00.0: isci_port_perform_hard_reset: iport = ffff881fce962bb0; hard reset failed (0x21) 17:54:56: isci 0000:08:00.0: isci_port_perform_hard_reset: iport = ffff881fce9629c0; hard reset failed (0x21) 17:54:57: isci 0000:08:00.0: isci_port_perform_hard_reset: iport = ffff881fce9628c8; hard reset failed (0x21) 17:55:06: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964310 17:55:06: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964490 17:55:06: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce9643d0 17:55:06: sas: ata8: end_device-0:1: reset failed (errno=-16) 17:55:06: sas: ata10: end_device-0:3: reset failed (errno=-16) 17:55:06: ata10: hard resetting link 17:55:06: ata8: hard resetting link 17:55:06: sas: ata9: end_device-0:2: reset failed (errno=-16) 17:55:06: ata9: hard resetting link 17:55:07: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964250 17:55:07: sas: ata7: end_device-0:0: reset failed (errno=-16) 17:55:07: ata7: hard resetting link ...<More of the above.>... 17:56:22: sas: ata10: end_device-0:3: reset failed (errno=-16) 17:56:22: ata10: reset failed, giving up 17:56:22: sas: ata8: end_device-0:1: reset failed (errno=-16) 17:56:22: ata8: reset failed, giving up 17:56:22: ata8.00: disabled 17:56:22: ata8.00: device reported invalid CHS sector 0 17:56:22: ata8: EH complete 17:56:22: ata10.00: disabled 17:56:22: ata10.00: device reported invalid CHS sector 0 17:56:22: ata10.00: device reported invalid CHS sector 0 17:56:22: ata10.00: device reported invalid CHS sector 0 17:56:22: ata10.00: device reported invalid CHS sector 0 17:56:22: ata10.00: device reported invalid CHS sector 0 17:56:22: ata10.00: device reported invalid CHS sector 0 17:56:22: ata10.00: device reported invalid CHS sector 0 17:56:22: ata10.00: device reported invalid CHS sector 0 17:56:22: ata10.00: device reported invalid CHS sector 0 17:56:22: ata10.00: device reported invalid CHS sector 0 17:56:22: ata10.00: device reported invalid CHS sector 0 17:56:22: ata10.00: device reported invalid CHS sector 0 17:56:22: ata10.00: device reported invalid CHS sector 0 17:56:22: ata10: EH complete 17:56:22: sas: ata9: end_device-0:2: reset failed (errno=-16) 17:56:22: ata9: reset failed, giving up 17:56:22: ata9.00: disabled 17:56:22: ata9.00: device reported invalid CHS sector 0 17:56:22: ata9.00: device reported invalid CHS sector 0 17:56:22: ata9.00: device reported invalid CHS sector 0 17:56:22: ata9.00: device reported invalid CHS sector 0 17:56:22: ata9.00: device reported invalid CHS sector 0 17:56:22: ata9.00: device reported invalid CHS sector 0 17:56:22: ata9.00: device reported invalid CHS sector 0 17:56:22: ata9.00: device reported invalid CHS sector 0 17:56:22: ata9.00: device reported invalid CHS sector 0 17:56:22: ata9.00: device reported invalid CHS sector 0 17:56:22: ata9.00: device reported invalid CHS sector 0 17:56:22: ata9.00: device reported invalid CHS sector 0 17:56:22: ata9.00: device reported invalid CHS sector 0 17:56:22: ata9: EH complete 17:56:22: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964250 17:56:22: sas: ata7: end_device-0:0: reset failed (errno=-16) 17:56:22: ata7: reset failed, giving up 17:56:22: ata7.00: disabled 17:56:22: ata7.00: device reported invalid CHS sector 0 17:56:22: ata7.00: device reported invalid CHS sector 0 17:56:22: ata7.00: device reported invalid CHS sector 0 17:56:22: ata7.00: device reported invalid CHS sector 0 17:56:22: ata7.00: device reported invalid CHS sector 0 17:56:22: ata7.00: device reported invalid CHS sector 0 17:56:22: ata7.00: device reported invalid CHS sector 0 17:56:22: ata7.00: device reported invalid CHS sector 0 17:56:22: ata7.00: device reported invalid CHS sector 0 17:56:22: ata7.00: device reported invalid CHS sector 0 17:56:22: ata7.00: device reported invalid CHS sector 0 17:56:22: ata7.00: device reported invalid CHS sector 0 17:56:22: ata7.00: device reported invalid CHS sector 0 17:56:22: ata7.00: device reported invalid CHS sector 0 17:56:22: ata7: EH complete 17:56:22: sd 0:0:0:0: [sda] 17:56:22: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK 17:56:22: sd 0:0:0:0: [sda] CDB: 17:56:22: Write(10): 2a 00 08 70 08 01 00 00 02 00 17:56:22: end_request: I/O error, dev sda, sector 141559809 17:56:22: sd 0:0:0:0: [sda] 17:56:22: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK 17:56:22: XFS (sda1): metadata I/O error: block 0x8700001 ("xfs_buf_iodone_callbacks") error 5 numblks 1 17:56:22: XFS (sdc1): metadata I/O error: block 0x19500040 ("xfs_buf_iodone_callbacks") error 5 numblks 16 17:56:22: XFS (sdd1): metadata I/O error: block 0x19500040 ("xfs_buf_iodone_callbacks") error 5 numblks 16 17:56:22: XFS (sda1): metadata I/O error: block 0x8700002 ("xfs_buf_iodone_callbacks") error 5 numblks 1 17:56:22: XFS (sdc1): metadata I/O error: block 0x19500050 ("xfs_buf_iodone_callbacks") error 5 numblks 16 17:56:22: XFS (sdd1): metadata I/O error: block 0x19500050 ("xfs_buf_iodone_callbacks") error 5 numblks 16 17:56:22: Buffer I/O error on device sdb1, logical block 35424025 17:56:22: lost page write due to I/O error on sdb1