3x 1 TB SSD ZFS mirror: two drives are faulted and the third also has errors. How do I troubleshoot?

Problem description

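My laptop is a Dell Precision 7740 with five SSDs installed: one is the Windows 10 system disk, one is the Ubuntu 20.04 system disk (my daily OS), and the other three are 1 TB SSDs that form a three-way mirrored ZFS pool for data storage.
After setting it up I used it for a year without checking the pool status, because all three SSDs were new and in good condition when I got them.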

Eventually something went wrong, so I ran sudo zpool status -v and found that two of the drives were faulted and the remaining one also had many errors.

sudo zpool status -v

pool: tankmain
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: resilvered 91.2G in 0 days 00:20:39 with 0 errors on Wed Feb 10 17:30:23 2021
config:

NAME                                      STATE     READ WRITE CKSUM
tankmain                                  DEGRADED     0     0     0
  mirror-0                                DEGRADED    28     0     0
    nvme-PLEXTORPX-1TM9PGN+_P02952500057  DEGRADED    47     0   220  too many errors
    nvme-PLEXTORPX-1TM9PGN+_P02952500063  FAULTED     32     0     2  too many errors
    nvme-PLEXTORPX-1TM9PGN+_P02952500220  FAULTED     22     0     3  too many errors
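
To compare the counters between snapshots like the one above, the per-device rows can be pulled out of the status output mechanically. This is only a minimal sketch: the function name device_errors is hypothetical, and it assumes the leaf device names start with "nvme-" as they do in this pool.

```shell
#!/bin/sh
# device_errors: read `zpool status` output on stdin and print
# "device state read write cksum" for every leaf device row.
# Assumption: leaf device names start with "nvme-", as in this pool.
device_errors() {
    awk '$1 ~ /^nvme-/ { print $1, $2, $3, $4, $5 }'
}

# Usage on a live system (requires ZFS):
#   sudo zpool status -v tankmain | device_errors
```

Saving this output before and after each reboot or scrub makes it easier to see which drive's READ and CKSUM counters are actually growing.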

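I can't believe these SSDs have all stopped working. I checked the SMART information and found nothing abnormal. Data units written is about 70 TB, against a rated endurance of 640 TB. I checked my data and some of it was corrupted. Then I rebooted the laptop. After the reboot: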
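
The endurance figures quoted above can be checked with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the conversion assumes the standard NVMe definition of one data unit as 1000 blocks of 512 bytes.

```shell
#!/bin/sh
# endurance_pct WRITTEN_TB RATED_TB
# Print the share of the rated write endurance already consumed, in percent.
endurance_pct() {
    awk -v w="$1" -v r="$2" 'BEGIN { printf "%.1f\n", 100 * w / r }'
}

# data_units_to_tb UNITS
# Convert the NVMe SMART "Data Units Written" counter to terabytes.
# One data unit is 1000 blocks of 512 bytes, i.e. 512,000 bytes.
data_units_to_tb() {
    awk -v u="$1" 'BEGIN { printf "%.1f\n", u * 512000 / 1e12 }'
}

# Figures quoted in the question: about 70 TB written, 640 TB rated.
endurance_pct 70 640   # prints 10.9

# The raw counters can be read with, for example (the device path is an assumption):
#   sudo smartctl -a /dev/nvme0
#   sudo nvme smart-log /dev/nvme0
```

At roughly 11% of the rated endurance, write wear alone would not be expected to explain three drives failing at once.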

sudo zpool status -v

pool: tankmain
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sun Feb 14 15:38:53 2021
    40.5G scanned at 251M/s, 11.1G issued at 68.8M/s, 755G total
    23.1G resilvered, 1.47% done, 0 days 03:04:30 to go
config:

NAME                                      STATE     READ WRITE CKSUM
tankmain                                  DEGRADED     0     0     0
  mirror-0                                DEGRADED     0     0     0
    nvme-PLEXTORPX-1TM9PGN+_P02952500057  DEGRADED     0     0     0  too many errors
    nvme-PLEXTORPX-1TM9PGN+_P02952500063  ONLINE       0     0     7  (resilvering)
    nvme-PLEXTORPX-1TM9PGN+_P02952500220  ONLINE       0     0    11  (resilvering)

After the resilver completed:

sudo zpool status -v

pool: tankmain
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P
scan: resilvered 83.5G in 0 days 00:10:26 with 0 errors on Sun Feb 14 15:49:19 2021
config:

NAME                                      STATE     READ WRITE CKSUM
tankmain                                  DEGRADED     0     0     0
  mirror-0                                DEGRADED     0     0     0
    nvme-PLEXTORPX-1TM9PGN+_P02952500057  DEGRADED     2     0     9  too many errors
    nvme-PLEXTORPX-1TM9PGN+_P02952500063  ONLINE       0     0    15
    nvme-PLEXTORPX-1TM9PGN+_P02952500220  ONLINE       0     0    19

errors: No known data errors

Then I scrubbed this zpool:

sudo zpool status -v

pool: tankmain
state: DEGRADED
status: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device repaired.
scan: scrub in progress since Sun Feb 14 15:56:49 2021
    209G scanned at 1.76G/s, 903M issued at 7.59M/s, 755G total
    849K repaired, 0.12% done, no estimated completion time
config:

NAME                                      STATE     READ WRITE CKSUM
tankmain                                  DEGRADED     0     0     0
  mirror-0                                DEGRADED     0     0     0
    nvme-PLEXTORPX-1TM9PGN+_P02952500057  DEGRADED     3     0     9  too many errors (repairing)
    nvme-PLEXTORPX-1TM9PGN+_P02952500063  FAULTED     32     0 1.90K  too many errors (repairing)
    nvme-PLEXTORPX-1TM9PGN+_P02952500220  FAULTED     64     0   419  too many errors (repairing)

After the scrub finished:

sudo zpool status -v

pool: tankmain
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: scrub repaired 970K in 0 days 00:29:42 with 213 errors on Sun Feb 14 16:26:31 2021
config:

NAME                                      STATE     READ WRITE CKSUM
tankmain                                  DEGRADED     0     0     0
  mirror-0                                DEGRADED   168     0     0
    nvme-PLEXTORPX-1TM9PGN+_P02952500057  DEGRADED   327     0 2.34K  too many errors
    nvme-PLEXTORPX-1TM9PGN+_P02952500063  FAULTED     32     0  690K  too many errors
    nvme-PLEXTORPX-1TM9PGN+_P02952500220  FAULTED     64     0  682K  too many errors

Some errors were repaired, but most were not.

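Then I rebooted and disabled P02952500057 in the BIOS. This time Ubuntu booted with only two of the pool's drives attached. I could read and write normally, and all my data was present and uncorrupted. However, P02952500063 was still DEGRADED while P02952500220 was ONLINE. Even after another reboot, both drives start out ONLINE, but after a manual scrub P02952500063 becomes DEGRADED again. The scrub detects some errors and repairs all of them, but the next scrub detects errors again and repairs those too. It looks as if only one disk is really in use, and each manual scrub syncs P02952500063 with P02952500220 once.

I moved my data to a safe place, destroyed the zpool, and created a new one, which now works without errors. But I am still worried that this failure will happen again.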

How can I find the root cause? Are there any suggestions for preventing this from happening again?
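
Because the pool ran for a year without being checked, one preventive measure is a scheduled health check. The following is a minimal sketch, not a complete monitoring setup: the function name pool_is_healthy is hypothetical, and it relies on zpool status -x printing the single line "all pools are healthy" when no pool has problems.

```shell
#!/bin/sh
# pool_is_healthy: read the output of `zpool status -x` on stdin and
# succeed only if it is the single line ZFS prints when nothing is wrong.
pool_is_healthy() {
    [ "$(cat)" = "all pools are healthy" ]
}

# Intended use from cron or a systemd timer (requires ZFS):
#   zpool status -x | pool_is_healthy || echo "ZFS pool needs attention" >&2
```

Pairing a check like this with periodic scrubs would surface growing READ or CKSUM counters long before all three mirrors are affected. It does not identify the root cause by itself.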

Solution

No effective solution to this problem has been found yet.


linux zfs
