debian – Linux mdraid RAID 6: a disk randomly drops out every few days

I have several servers running Debian 8, each configured with 8x 800 GB SSDs in RAID 6. All of the disks are attached to an LSI-3008 controller flashed to IT mode. In each server I also have a two-disk RAID 1 pair for the operating system.

Current state

# dpkg -l|grep mdad
ii  mdadm                          3.3.2-5+deb8u1              amd64        tool to administer Linux MD arrays (software RAID)

# uname -a
Linux R5U32-B 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08) x86_64 GNU/Linux

# more /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md2 : active raid6 sde1[1](F) sdg1[3] sdf1[2] sdd1[0] sdh1[7] sdb1[6] sdj1[5] sdi1[4]
      4687678464 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/7] [U_UUUUUU]
      bitmap: 3/6 pages [12KB], 65536KB chunk

md1 : active (auto-read-only) raid1 sda5[0] sdc5[1]
      62467072 blocks super 1.2 [2/2] [UU]
        resync=PENDING

md0 : active raid1 sda2[0] sdc2[1]
      1890881536 blocks super 1.2 [2/2] [UU]
      bitmap: 2/15 pages [8KB], 65536KB chunk

unused devices: <none>

# mdadm --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Fri Jun 24 04:35:18 2016
     Raid Level : raid6
     Array Size : 4687678464 (4470.52 GiB 4800.18 GB)
  Used Dev Size : 781279744 (745.09 GiB 800.03 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Jul 19 17:36:15 2016
          State : active, degraded
 Active Devices : 7
Working Devices : 7
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : R5U32-B:2  (local to host R5U32-B)
           UUID : 24299038:57327536:4db96d98:d6e914e2
         Events : 2514191

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       2       0        0        2      removed
       2       8       81        2      active sync   /dev/sdf1
       3       8       97        3      active sync   /dev/sdg1
       4       8      129        4      active sync   /dev/sdi1
       5       8      145        5      active sync   /dev/sdj1
       6       8       17        6      active sync   /dev/sdb1
       7       8      113        7      active sync   /dev/sdh1

       1       8       65        -      faulty   /dev/sde1

Problem

Roughly every 1-3 days the RAID 6 array degrades, semi-regularly. The cause is that one of the disks (any one of them) fails with errors like the following:

# dmesg -T
[Sat Jul 16 05:38:45 2016] sd 0:0:3:0: attempting task abort! scmd(ffff8810350cbe00)
[Sat Jul 16 05:38:45 2016] sd 0:0:3:0: [sde] CDB:
[Sat Jul 16 05:38:45 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Sat Jul 16 05:38:45 2016] scsi target0:0:3: handle(0x000d), sas_address(0x500304801707a443), phy(3)
[Sat Jul 16 05:38:45 2016] scsi target0:0:3: enclosure_logical_id(0x500304801707a47f), slot(3)
[Sat Jul 16 05:38:46 2016] sd 0:0:3:0: task abort: SUCCESS scmd(ffff8810350cbe00)
[Sat Jul 16 05:38:46 2016] end_request: I/O error, dev sde, sector 2064
[Sat Jul 16 05:38:46 2016] md: super_written gets error=-5, uptodate=0
[Sat Jul 16 05:38:46 2016] md/raid:md2: disk failure on sde1, disabling device. md/raid:md2: Operation continuing on 7 devices.
[Sat Jul 16 05:38:46 2016] RAID conf printout:
[Sat Jul 16 05:38:46 2016]  --- level:6 rd:8 wd:7
[Sat Jul 16 05:38:46 2016]  disk 0, o:1, dev:sdd1
[Sat Jul 16 05:38:46 2016]  disk 1, o:0, dev:sde1
[Sat Jul 16 05:38:46 2016]  disk 2, dev:sdf1
[Sat Jul 16 05:38:46 2016]  disk 3, dev:sdg1
[Sat Jul 16 05:38:46 2016]  disk 4, dev:sdi1
[Sat Jul 16 05:38:46 2016]  disk 5, dev:sdj1
[Sat Jul 16 05:38:46 2016]  disk 6, dev:sdb1
[Sat Jul 16 05:38:46 2016]  disk 7, dev:sdh1
[Sat Jul 16 05:38:46 2016] RAID conf printout:
[Sat Jul 16 05:38:46 2016]  --- level:6 rd:8 wd:7
[Sat Jul 16 05:38:46 2016]  disk 0, dev:sdd1
[Sat Jul 16 05:38:46 2016]  disk 2, dev:sdh1
[Sat Jul 16 12:40:00 2016] sd 0:0:7:0: attempting task abort! scmd(ffff88000d76eb00)

What I have already tried

I have tried the following, without any improvement:

> Increasing /sys/block/md2/md/stripe_cache_size from 256 to 16384
> Increasing dev.raid.speed_limit_min from 1000 to 50000
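
Roughly, these two changes correspond to commands like the following (a sketch using the values above; md2 is this array, and the sysctl value does not survive a reboot unless it is also added to /etc/sysctl.conf):

# enlarge the RAID 5/6 stripe cache for md2 (entries; RAM used is roughly value x 4KiB x member drives)
echo 16384 > /sys/block/md2/md/stripe_cache_size

# raise the guaranteed minimum resync/rebuild speed, in KiB/s per device
sysctl -w dev.raid.speed_limit_min=50000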

Need your help

Are these errors caused by the mdadm configuration, by the kernel, or by the controller?

Update 20160802

Following the suggestions from ppetraki and others:

> Use raw disks instead of partitions

This did not solve the problem.
> Reduce the chunk size

The chunk size was changed to 128KB and then to 64KB, but the RAID volume still degraded within a few days, with dmesg showing errors similar to the ones above. I forgot to try reducing the chunk size to 32KB.
> Reduce the RAID to 6 disks

I destroyed the existing RAID, zeroed the superblock on every disk, and created a RAID 6 from 6 disks (raw disks) with a 64KB chunk. Reducing the number of disks in the RAID seems to make the array live longer, roughly 4-7 days before it degrades.
> Update the driver

I have just updated the driver to Linux_Driver_RHEL6-7_SLES11-12_P12 (http://www.avagotech.com/products/server-storage/host-bus-adapters/sas-9300-8e). The disk errors still appear, as shown below:

[Tue Aug  2 17:57:48 2016] sd 0:0:6:0: attempting task abort! scmd(ffff880fc0dd1980)
[Tue Aug  2 17:57:48 2016] sd 0:0:6:0: [sdg] CDB:
[Tue Aug  2 17:57:48 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Tue Aug  2 17:57:48 2016] scsi target0:0:6: handle(0x0010), sas_address(0x50030480173ee946), phy(6)
[Tue Aug  2 17:57:48 2016] scsi target0:0:6: enclosure_logical_id(0x50030480173ee97f), slot(6)
[Tue Aug  2 17:57:49 2016] sd 0:0:6:0: task abort: SUCCESS scmd(ffff880fc0dd1980)
[Tue Aug  2 17:57:49 2016] end_request: I/O error, dev sdg, sector 0

Just a short while ago my array degraded again. This time /dev/sdf and /dev/sdg showed the "attempting task abort! scmd" errors:

[Tue Aug  2 21:26:02 2016]  
[Tue Aug  2 21:26:02 2016] sd 0:0:5:0: [sdf] CDB:
[Tue Aug  2 21:26:02 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Tue Aug  2 21:26:02 2016] scsi target0:0:5: handle(0x000f), sas_address(0x50030480173ee945), phy(5)
[Tue Aug  2 21:26:02 2016] scsi target0:0:5: enclosure logical id(0x50030480173ee97f), slot(5)
[Tue Aug  2 21:26:02 2016] scsi target0:0:5: enclosure level(0x0000), connector name(     ^A)
[Tue Aug  2 21:26:03 2016] sd 0:0:5:0: task abort: SUCCESS scmd(ffff88103beb5240)
[Tue Aug  2 21:26:03 2016] sd 0:0:5:0: attempting task abort! scmd(ffff88107934e080)
[Tue Aug  2 21:26:03 2016] sd 0:0:5:0: [sdf] CDB:
[Tue Aug  2 21:26:03 2016] Read(10): 28 00 04 75 3b f8 00 00 08 00
[Tue Aug  2 21:26:03 2016] scsi target0:0:5: handle(0x000f), phy(5)
[Tue Aug  2 21:26:03 2016] scsi target0:0:5: enclosure logical id(0x50030480173ee97f), slot(5)
[Tue Aug  2 21:26:03 2016] scsi target0:0:5: enclosure level(0x0000), connector name(     ^A)
[Tue Aug  2 21:26:03 2016] sd 0:0:5:0: task abort: SUCCESS scmd(ffff88107934e080)
[Tue Aug  2 21:26:04 2016] sd 0:0:5:0: [sdf] CDB:
[Tue Aug  2 21:26:04 2016] Read(10): 28 00 04 75 3b f8 00 00 08 00
[Tue Aug  2 21:26:04 2016] mpt3sas_cm0:         sas_address(0x50030480173ee945), phy(5)
[Tue Aug  2 21:26:04 2016] mpt3sas_cm0:         enclosure logical id(0x50030480173ee97f), slot(5)
[Tue Aug  2 21:26:04 2016] mpt3sas_cm0:         enclosure level(0x0000), connector name(     ^A)
[Tue Aug  2 21:26:04 2016] mpt3sas_cm0:         handle(0x000f), ioc_status(success)(0x0000), smid(35)
[Tue Aug  2 21:26:04 2016] mpt3sas_cm0:         request_len(4096), underflow(4096), resid(-4096)
[Tue Aug  2 21:26:04 2016] mpt3sas_cm0:         tag(65535), transfer_count(8192), sc->result(0x00000000)
[Tue Aug  2 21:26:04 2016] mpt3sas_cm0:         scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[Tue Aug  2 21:26:04 2016] mpt3sas_cm0:         [sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)
[Tue Aug  2 22:14:51 2016] sd 0:0:6:0: attempting task abort! scmd(ffff880931d8c840)
[Tue Aug  2 22:14:51 2016] sd 0:0:6:0: [sdg] CDB:
[Tue Aug  2 22:14:51 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Tue Aug  2 22:14:51 2016] scsi target0:0:6: handle(0x0010), phy(6)
[Tue Aug  2 22:14:51 2016] scsi target0:0:6: enclosure logical id(0x50030480173ee97f), slot(6)
[Tue Aug  2 22:14:51 2016] scsi target0:0:6: enclosure level(0x0000), connector name(     ^A)
[Tue Aug  2 22:14:51 2016] sd 0:0:6:0: task abort: SUCCESS scmd(ffff880931d8c840)
[Tue Aug  2 22:14:52 2016] sd 0:0:6:0: [sdg] CDB:
[Tue Aug  2 22:14:52 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Tue Aug  2 22:14:52 2016] mpt3sas_cm0:         sas_address(0x50030480173ee946), phy(6)
[Tue Aug  2 22:14:52 2016] mpt3sas_cm0:         enclosure logical id(0x50030480173ee97f), slot(6)
[Tue Aug  2 22:14:52 2016] mpt3sas_cm0:         enclosure level(0x0000), connector name(     ^A)
[Tue Aug  2 22:14:52 2016] mpt3sas_cm0:         handle(0x0010), smid(85)
[Tue Aug  2 22:14:52 2016] mpt3sas_cm0:         request_len(0), underflow(0), resid(-8192)
[Tue Aug  2 22:14:52 2016] mpt3sas_cm0:         tag(65535), sc->result(0x00000000)
[Tue Aug  2 22:14:52 2016] mpt3sas_cm0:         scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[Tue Aug  2 22:14:52 2016] mpt3sas_cm0:         [sense_key,count(18)
[Tue Aug  2 22:14:52 2016] end_request: I/O error, sector 16
[Tue Aug  2 22:14:52 2016] md: super_written gets error=-5, uptodate=0
[Tue Aug  2 22:14:52 2016] md/raid:md2: disk failure on sdg, disabling device. md/raid:md2: Operation continuing on 5 devices.
[Tue Aug  2 22:14:52 2016] RAID conf printout:
[Tue Aug  2 22:14:52 2016]  --- level:6 rd:6 wd:5
[Tue Aug  2 22:14:52 2016]  disk 0, dev:sdc
[Tue Aug  2 22:14:52 2016]  disk 1, dev:sdd
[Tue Aug  2 22:14:52 2016]  disk 2, dev:sde
[Tue Aug  2 22:14:52 2016]  disk 3, dev:sdf
[Tue Aug  2 22:14:52 2016]  disk 4, dev:sdg
[Tue Aug  2 22:14:52 2016]  disk 5, dev:sdh
[Tue Aug  2 22:14:52 2016] RAID conf printout:
[Tue Aug  2 22:14:52 2016]  --- level:6 rd:6 wd:5
[Tue Aug  2 22:14:52 2016]  disk 0, dev:sdf
[Tue Aug  2 22:14:52 2016]  disk 5, dev:sdh

I assume the "attempting task abort! scmd" errors are what degrade the array, but I do not know what causes them.

Update 20160806

I have set up another server with the same specifications, this time without mdadm RAID: each disk is mounted directly with an ext4 filesystem. After a while the kernel log shows "attempting task abort! scmd" on some of the disks. This leads to errors on /dev/sdd1, which is then remounted read-only:

$ dmesg -T
[Sat Aug  6 05:21:09 2016] sd 0:0:3:0: [sdd] CDB:
[Sat Aug  6 05:21:09 2016] Read(10): 28 00 2d 29 21 00 00 00 20 00
[Sat Aug  6 05:21:09 2016] scsi target0:0:3: handle(0x000a), sas_address(0x4433221103000000), phy(3)
[Sat Aug  6 05:21:09 2016] scsi target0:0:3: enclosure_logical_id(0x500304801a5d3f01), slot(3)
[Sat Aug  6 05:21:09 2016] sd 0:0:3:0: task abort: SUCCESS scmd(ffff88006b206800)
[Sat Aug  6 05:21:09 2016] sd 0:0:3:0: attempting task abort! scmd(ffff88019a3a07c0)
[Sat Aug  6 05:21:09 2016] sd 0:0:3:0: [sdd] CDB:
[Sat Aug  6 05:21:09 2016] Read(10): 28 00 08 46 8f 80 00 00 20 00
[Sat Aug  6 05:21:09 2016] scsi target0:0:3: handle(0x000a), slot(3)
[Sat Aug  6 05:21:09 2016] sd 0:0:3:0: task abort: SUCCESS scmd(ffff88019a3a07c0)
[Sat Aug  6 05:21:10 2016] sd 0:0:3:0: attempting device reset! scmd(ffff880f9a49ac80)
[Sat Aug  6 05:21:10 2016] sd 0:0:3:0: [sdd] CDB:
[Sat Aug  6 05:21:10 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Sat Aug  6 05:21:10 2016] scsi target0:0:3: handle(0x000a), phy(3)
[Sat Aug  6 05:21:10 2016] scsi target0:0:3: enclosure_logical_id(0x500304801a5d3f01), slot(3)
[Sat Aug  6 05:21:10 2016] sd 0:0:3:0: device reset: SUCCESS scmd(ffff880f9a49ac80)
[Sat Aug  6 05:21:10 2016] mpt3sas0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[Sat Aug  6 05:21:10 2016] mpt3sas0: log_info(0x31110e03): originator(PL), sub_code(0x0e03)
[Sat Aug  6 05:21:11 2016] end_request: I/O error, dev sdd, sector 780443696
[Sat Aug  6 05:21:11 2016] Aborting journal on device sdd1-8.
[Sat Aug  6 05:21:11 2016] EXT4-fs error (device sdd1): ext4_journal_check_start:56: Detected aborted journal
[Sat Aug  6 05:21:11 2016] EXT4-fs (sdd1): Remounting filesystem read-only
[Sat Aug  6 05:40:35 2016] sd 0:0:5:0: attempting task abort! scmd(ffff88024fc08340)
[Sat Aug  6 05:40:35 2016] sd 0:0:5:0: [sdf] CDB:
[Sat Aug  6 05:40:35 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Sat Aug  6 05:40:35 2016] scsi target0:0:5: handle(0x000c), sas_address(0x4433221105000000), phy(5)
[Sat Aug  6 05:40:35 2016] scsi target0:0:5: enclosure_logical_id(0x500304801a5d3f01), slot(5)
[Sat Aug  6 05:40:35 2016] sd 0:0:5:0: task abort: Failed scmd(ffff88024fc08340)
[Sat Aug  6 05:40:35 2016] sd 0:0:5:0: attempting task abort! scmd(ffff88019a12ee00)
[Sat Aug  6 05:40:35 2016] sd 0:0:5:0: [sdf] CDB:
[Sat Aug  6 05:40:35 2016] Read(10): 28 00 27 c8 b4 e0 00 00 20 00
[Sat Aug  6 05:40:35 2016] scsi target0:0:5: handle(0x000c), slot(5)
[Sat Aug  6 05:40:35 2016] sd 0:0:5:0: task abort: SUCCESS scmd(ffff88019a12ee00)
[Sat Aug  6 05:40:35 2016] sd 0:0:5:0: attempting task abort! scmd(ffff88203eaddac0)

Update 20160930

After upgrading the controller firmware to the latest (current) version, 12.00.02, the problem went away.
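
For anyone checking their own controllers: the running firmware version can be read from Linux with Avago's sas3flash utility, assuming it is installed (the controller index below is just an example):

# list every SAS3 HBA together with its firmware and BIOS versions
sas3flash -listall

# detailed information for controller 0
sas3flash -c 0 -list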

Conclusion

Problem solved.

Solution

That is a pretty big stripe: 8-2 = 6 data disks × 512K = 3MiB, and it is not an even (power-of-two) size either. Take your array up to 10 disks (8 data plus 2 parity) or down to 4 data plus 2 parity, for a total stripe size of 256K or 64K per drive. It may be that the cache is getting upset with you over unaligned writes. Before attempting to reconfigure the array, you could try putting all the drives into write-through mode.
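
A minimal sketch of putting the drives into write-through mode, i.e. disabling their volatile write caches, assuming sdparm/hdparm are available; /dev/sdb is a placeholder and the command would be repeated for every member drive:

# SAS/SCSI drives: clear the Write Cache Enable (WCE) bit, saved persistently
sdparm --clear WCE --save /dev/sdb

# SATA drives: switch the drive's write cache off
hdparm -W 0 /dev/sdb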

Update 2016-07-20.

At this point I am convinced that your RAID configuration is the problem. A 3MiB stripe is simply odd, even though it is a multiple of your partition offset [1] (1MiB); it is a sub-optimal stripe size for any RAID, SSD or otherwise. It is probably generating lots of unaligned writes, which force your SSDs to free more pages than they have on hand, pushing them into the garbage collector constantly and shortening their lifetime. The drives cannot free pages fast enough for the writes, so when you finally flush the cache to disk (a synchronous write) it actually fails. You do not have a crash-consistent array, i.e. your data is not safe.

That is my theory based on the information available and the time I can spend on it. You now have a "growth opportunity" to become a storage expert ;)

Start over. Do not use partitions. Set that system aside and build an array with a total stripe size of 128K (be a bit conservative to start). In a RAID 6 configuration of N total drives, only N-2 drives hold data at any one time; the remaining two store parity information. So if N = 6, a 128K stripe needs a 32K chunk. You should now be able to see why 8 drives is an awkward number for RAID 6.
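
As a rough sketch of that rebuild, assuming six whole drives /dev/sdb through /dev/sdg (placeholder names) and that destroying the existing array and its data is acceptable:

# wipe any old md metadata from the member drives
mdadm --zero-superblock /dev/sd[b-g]

# six whole drives, RAID 6, 32K chunk: (6-2) data drives x 32K = 128K total stripe
mdadm --create /dev/md2 --level=6 --raid-devices=6 --chunk=32 /dev/sd[b-g]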

Then run fio [2] against the "raw disk" in direct mode and beat on it until you are convinced it is reliable. Next, add a filesystem and tell it about the underlying stripe size (man mkfs.???). Run fio again, this time against files (otherwise you will destroy the filesystem), and confirm that the array stays intact.
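
For example, something along these lines, assuming the array is /dev/md2 with a 32K chunk and four data drives, and that fio is installed; block sizes, runtimes and paths are only illustrative:

# soak the raw md device with direct I/O (destroys any data on it)
fio --name=raw-soak --filename=/dev/md2 --direct=1 --ioengine=libaio \
    --rw=randwrite --bs=128k --iodepth=32 --time_based --runtime=3600

# create the filesystem and tell it the array geometry:
# stride = 32K chunk / 4K block = 8; stripe-width = 8 x 4 data drives = 32
mkfs.ext4 -b 4096 -E stride=8,stripe-width=32 /dev/md2
mount /dev/md2 /mnt/test

# repeat the test against a file so the filesystem itself is not destroyed
fio --name=fs-soak --directory=/mnt/test --size=8G --direct=1 --ioengine=libaio \
    --rw=randwrite --bs=128k --iodepth=32 --time_based --runtime=3600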

I know this is a lot of "stuff". Just start small, try to understand what it is doing, and stick with it. Tools like blktrace and iostat can help you understand how your applications actually write, which will tell you the best stripe/chunk size to use.
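
For example, assuming the sysstat and blktrace packages are installed:

# per-device request sizes, queue depth and latency, refreshed every 5 seconds
iostat -xm 5

# per-request trace of the I/O hitting the md device
blktrace -d /dev/md2 -o - | blkparse -i -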

1. https://www.percona.com/blog/2011/06/09/aligning-io-on-a-hard-disk-raid-the-theory/
2. https://wiki.mikejung.biz/Benchmarking#Fio_Random_Write_and_Random_Read_Command_Line_Examples (my fio cheatsheet)
