I have two hard drives in a RAID 1 array on my server (Linux, software RAID using mdadm), and one of them just gave me this "present" in syslog:
Nov 23 02:05:29 h2 kernel: [7305215.338153] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 02:05:29 h2 kernel: [7305215.338178] ata1.00: irq_stat 0x40000008
Nov 23 02:05:29 h2 kernel: [7305215.338197] ata1.00: Failed command: READ FPDMA QUEUED
Nov 23 02:05:29 h2 kernel: [7305215.338220] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
Nov 23 02:05:29 h2 kernel: [7305215.338221] res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
Nov 23 02:05:29 h2 kernel: [7305215.338287] ata1.00: status: { DRDY ERR }
Nov 23 02:05:29 h2 kernel: [7305215.338305] ata1.00: error: { UNC }
Nov 23 02:05:29 h2 kernel: [7305215.358901] ata1.00: configured for UDMA/133
Nov 23 02:05:32 h2 kernel: [7305218.269054] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 02:05:32 h2 kernel: [7305218.269081] ata1.00: irq_stat 0x40000008
Nov 23 02:05:32 h2 kernel: [7305218.269101] ata1.00: Failed command: READ FPDMA QUEUED
Nov 23 02:05:32 h2 kernel: [7305218.269125] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
Nov 23 02:05:32 h2 kernel: [7305218.269126] res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
Nov 23 02:05:32 h2 kernel: [7305218.269196] ata1.00: status: { DRDY ERR }
Nov 23 02:05:32 h2 kernel: [7305218.269215] ata1.00: error: { UNC }
Nov 23 02:05:32 h2 kernel: [7305218.341565] ata1.00: configured for UDMA/133
Nov 23 02:05:35 h2 kernel: [7305221.193342] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 02:05:35 h2 kernel: [7305221.193368] ata1.00: irq_stat 0x40000008
Nov 23 02:05:35 h2 kernel: [7305221.193386] ata1.00: Failed command: READ FPDMA QUEUED
Nov 23 02:05:35 h2 kernel: [7305221.193408] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
Nov 23 02:05:35 h2 kernel: [7305221.193409] res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
Nov 23 02:05:35 h2 kernel: [7305221.193474] ata1.00: status: { DRDY ERR }
Nov 23 02:05:35 h2 kernel: [7305221.193491] ata1.00: error: { UNC }
Nov 23 02:05:35 h2 kernel: [7305221.388404] ata1.00: configured for UDMA/133
Nov 23 02:05:38 h2 kernel: [7305224.426316] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 02:05:38 h2 kernel: [7305224.426343] ata1.00: irq_stat 0x40000008
Nov 23 02:05:38 h2 kernel: [7305224.426363] ata1.00: Failed command: READ FPDMA QUEUED
Nov 23 02:05:38 h2 kernel: [7305224.426387] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
Nov 23 02:05:38 h2 kernel: [7305224.426388] res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
Nov 23 02:05:38 h2 kernel: [7305224.426459] ata1.00: status: { DRDY ERR }
Nov 23 02:05:38 h2 kernel: [7305224.426478] ata1.00: error: { UNC }
Nov 23 02:05:38 h2 kernel: [7305224.498133] ata1.00: configured for UDMA/133
Nov 23 02:05:41 h2 kernel: [7305227.400583] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 02:05:41 h2 kernel: [7305227.400608] ata1.00: irq_stat 0x40000008
Nov 23 02:05:41 h2 kernel: [7305227.400627] ata1.00: Failed command: READ FPDMA QUEUED
Nov 23 02:05:41 h2 kernel: [7305227.400649] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
Nov 23 02:05:41 h2 kernel: [7305227.400650] res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
Nov 23 02:05:41 h2 kernel: [7305227.400716] ata1.00: status: { DRDY ERR }
Nov 23 02:05:41 h2 kernel: [7305227.400734] ata1.00: error: { UNC }
Nov 23 02:05:41 h2 kernel: [7305227.472432] ata1.00: configured for UDMA/133
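For anyone who wants to see which sector the drive is complaining about, the "res" line can be decoded by hand. Here is a minimal Python sketch. It assumes the field order libata uses when it prints the result taskfile (status/error:nsect:lbal:lbam:lbah, then the high-order bytes, then device) and a 48-bit LBA command such as READ FPDMA QUEUED. The name lba_from_res is made up for this illustration.

```python
import re

# Assumed field order of a libata "res" line (12 hex bytes):
# STATUS/ERROR:NSECT:LBAL:LBAM:LBAH/HOB_ERROR:HOB_NSECT:HOB_LBAL:HOB_LBAM:HOB_LBAH/DEVICE
RES_RE = re.compile(r"res ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def lba_from_res(line):
    """Return the 48-bit LBA reported in a libata 'res' line, or None."""
    match = RES_RE.search(line)
    if match is None:
        return None
    fields = [int(x, 16) for x in re.split(r"[/:]", match.group(1))]
    lbal, lbam, lbah = fields[3], fields[4], fields[5]
    hob_lbal, hob_lbam, hob_lbah = fields[8], fields[9], fields[10]
    # Low three bytes come from the first group, high three from the HOB group.
    return (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )


line = "res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>"
print(hex(lba_from_res(line)))
```

For the log above this gives 0x3adadfd8 (sector 987422680) on every one of the five retries, so all of these messages concern the same unreadable sector.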
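From what I have read so far, I'm not sure whether read errors mean the hard drive is dying (there have been no write errors so far). I have had hard drive failures before, and the logs always contained errors about being unable to write to a particular sector. Not this time.
Should I replace the drive? Could something else be causing the problem?
I have scheduled a SMART long self-test, which will finish in a few hours. I hope that will give me more information.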
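Update: a miracle happened. Details follow:
I was backing up some files from that machine in preparation for replacing the faulty drive. Then, while I was copying those huge files, I received this logcheck email: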
Security Events for kernel
=-=-=-=-=-=-=-=-=-=-=-=-=-
Nov 23 17:16:24 h2 kernel: [7359837.963597] end_request: I/O error, dev sdb, sector 1202093816
Nov 23 17:16:41 h2 kernel: [7359855.196334] end_request: I/O error, sector 1202093816

System Events
=-=-=-=-=-=-=
Nov 23 17:14:06 h2 kernel: [7359700.193114] ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 17:14:06 h2 kernel: [7359700.193139] ata2.00: irq_stat 0x40000008
Nov 23 17:14:06 h2 kernel: [7359700.193158] ata2.00: Failed command: READ FPDMA QUEUED
Nov 23 17:14:06 h2 kernel: [7359700.193180] ata2.00: cmd 60/08:00:58:03:aa/00:00:47:00:00/40 tag 0 ncq 4096 in
Nov 23 17:14:06 h2 kernel: [7359700.193181] res 41/40:08:58:03:aa/00:00:47:00:00/00 Emask 0x409 (media error) <F>
Nov 23 17:14:06 h2 kernel: [7359700.193247] ata2.00: status: { DRDY ERR }
Nov 23 17:14:06 h2 kernel: [7359700.193265] ata2.00: error: { UNC }
Nov 23 17:14:06 h2 kernel: [7359700.194458] ata2.00: configured for UDMA/133
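Yikes! My hair, if I had any on my shaved head, stood on end. Look, the second drive really is developing bad sectors. What now? With two failing drives, what should I do?
I thought about it and decided that I:
> have one drive that I suspect is failing
> have another that I am 100% sure is failing, given the bad-sector complaints in the logs.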
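So I replaced the second drive, not the one from my original question. I have several partitions, each set up on a separate RAID array, and I hoped I would be able to resync at least root and boot so that I would not have to reinstall everything on the server. I might have to restore the huge data partition from backup, but even so I would save myself some work.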
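To spell out what swapping a drive means when several arrays share it, here is a rough Python sketch that only builds the mdadm command lines and runs nothing. The array and partition names (/dev/md0, /dev/sdb1 and so on) are placeholders, not my actual layout, and the sketch assumes the new drive is partitioned like the old one and shows up under the same device name.

```python
def replacement_commands(members):
    """Build mdadm commands for swapping one drive shared by several arrays.

    members maps an md array to its partition on the drive being replaced.
    Returns (commands to run before the physical swap, commands after it).
    """
    before_swap = []
    after_swap = []
    for array, partition in members.items():
        # Mark the member as failed, then take it out of the array.
        before_swap.append(["mdadm", array, "--fail", partition])
        before_swap.append(["mdadm", array, "--remove", partition])
        # Adding the new partition starts the resync from the healthy mirror.
        after_swap.append(["mdadm", array, "--add", partition])
    return before_swap, after_swap


# Hypothetical layout: boot, root and data arrays, each mirrored on sdb.
layout = {"/dev/md0": "/dev/sdb1", "/dev/md1": "/dev/sdb2", "/dev/md2": "/dev/sdb3"}
before, after = replacement_commands(layout)
for cmd in before + after:
    print(" ".join(cmd))
```

Printing the commands first rather than executing them makes it easier to double-check that the partitions named belong to the drive that is actually leaving.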
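Replaced the drive, started the resyncs. The root and boot partitions (about 50 GB) resynced very quickly. No errors. I'm a happy camper!
Just for kicks, let's try resyncing the huge data partition: it is about 2 TB with 500 GB of data on it. I started the resync and watched it for a while. It seemed to be taking forever, so I brought the server back online and let the users get on with their work while the resync ran in the background. And, what do you know, about 18 hours later the resync finished with no errors. The server is now fully alive.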
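Rather than staring at the resync, the progress can be pulled out of /proc/mdstat. This is a small Python sketch under the assumption that the progress line has the usual "recovery = N%" or "resync = N%" shape. The SAMPLE text is illustrative and not copied from my server, and sync_progress is a made-up name.

```python
import re

PROGRESS_RE = re.compile(r"(resync|recovery)\s*=\s*([0-9.]+)%")
HEADER_RE = re.compile(r"^(md\S*)\s*:")


def sync_progress(mdstat_text):
    """Return {array_name: (action, percent)} for arrays currently rebuilding."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            # Remember which array the following indented lines belong to.
            current = header.group(1)
            continue
        match = PROGRESS_RE.search(line)
        if match and current:
            progress[current] = (match.group(1), float(match.group(2)))
    return progress


# Illustrative /proc/mdstat contents (hypothetical device names and sizes).
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[2] sda1[0]
      48827328 blocks [2/2] [UU]

md2 : active raid1 sdb3[2] sda3[0]
      1953382336 blocks [2/1] [U_]
      [=>...................]  recovery =  5.3% (103600000/1953382336) finish=300.5min speed=102400K/sec

unused devices: <none>
"""

print(sync_progress(SAMPLE))
```

On a live system the same function can be fed the contents of /proc/mdstat; arrays that are already in sync simply do not appear in the result.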
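I am wondering whether I should now replace the original drive as well. I'm sure the server gods of hard drives are laughing their asses off at me.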
Solution
It's not dying... it's already dead.
Replace it as soon as possible, and restore from backup if any data was lost.