linux – Input/output error on a clean ext3 partition – how to check what is wrong with a data block

I have a problem with a file on an ext3 partition on a CentOS 5 server (kernel version 2.6.18-164.15.1.el5) that uses an HP RAID controller:

hpacucli ctrl all show detail

Smart Array P410 in Slot 1
   Bus Interface: PCI
   ...

The HP tool does not report any problems.

This is an ordinary ext3 partition with the block size set to 2k, and it is clean. fsck output:

fsck 1.39 (29-May-2006)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

The file's inode looks fine too:

File: `name.xxx'
Size: 3126962       Blocks: 6124       IO Block: 4096   regular file
Device: 6851h/26705d    Inode: 64579729    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2014-07-28 09:02:59.000000000 -0400
Modify: 2014-07-28 09:02:59.000000000 -0400
Change: 2014-07-28 09:02:59.000000000 -0400

One of the operations I cannot perform is copying the file:

> cp /long_path/name.xxx .
 cp: reading `/long_path/name.xxx': Input/output error

To find out where the problem is, I ran dd to copy the file:

> dd if=/long_path/name.xxx bs=2048 of=test
 dd: reading `/long_path/name.xxx': Input/output error
 222+0 records in
 222+0 records out
 454656 bytes (455 kB) copied, 0.042867 seconds, 10.6 MB/s

So I guess the problem is in file block 223.
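
The dd figures can be checked with a little arithmetic. This is a minimal sketch using only the numbers from the dd output above; the variable names are illustrative and not part of the original post.

```shell
#!/bin/sh
# Values taken from the dd output above.
records_ok=222      # complete records copied before the error
bs=2048             # dd block size, equal to the filesystem block size

# Byte offset at which the read failed.
fail_offset=$(( records_ok * bs ))

# 0-based index of the first file block that could not be read,
# and the same block counted from 1.
fail_index=$(( fail_offset / bs ))
fail_ordinal=$(( fail_index + 1 ))

echo "read failed at byte offset $fail_offset"
echo "file block index $fail_index (0-based), block number $fail_ordinal counting from 1"
```

The offset matches the 454656 bytes that dd reports as copied.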

debugfs should help to locate that block on the disk:

debugfs  -R "stat name.xxx" /dev/sdf
debugfs 1.39 (29-May-2006)
Inode: 64579729   Type: regular    Mode:  0644   Flags: 0x0   Generation: 2900468317
User:     0   Group:     0   Size: 3126962
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 6124
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x53d64a03 -- Mon Jul 28 09:02:59 2014
atime: 0x53d64a03 -- Mon Jul 28 09:02:59 2014
mtime: 0x53d64a03 -- Mon Jul 28 09:02:59 2014
BLOCKS:
(0):130402311,(1-4):130402844-130402847,(5-6):130484033-130484034,(7):130484036,(8-10):130484049-130484051,(11):130484055,(IND):130761221,(12-13):130761222-130761223,(14):130763791,(15):130763942,(16):130765268,(17-23):130838937-130838943,(24-46):130853946-130853968,(47-48):130855126-130855127,(49):130855215,(50-53):130856428-130856431,(54-104):130856533-130856583,(105-341):130856748-130856984,...
[MORE BLOCKS]     
....
TOTAL: 1531

So I guess the problematic data is in block 130856866.
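
The lookup in the BLOCKS list can be written out as arithmetic. The sketch below assumes the extent notation `(first-last):start-end` means that logical block `first` is stored in filesystem block `start`, and so on consecutively; `logical_to_physical` is a hypothetical helper, not a debugfs feature. Note that debugfs numbers logical blocks from 0 (the list starts with `(0):130402311`), so index 223 gives the block used in this post, while index 222 gives the adjacent block just before it.

```shell
#!/bin/sh
# Map a logical file block to a filesystem block using one extent
# from the debugfs BLOCKS list, written as "(first-last):start-end".
# Arguments: logical index, first logical index of the extent,
# first filesystem block of the extent.
logical_to_physical() {
    echo $(( $3 + ($1 - $2) ))
}

# Extent (105-341):130856748-130856984 from the output above.
logical_to_physical 222 105 130856748
logical_to_physical 223 105 130856748
```

The two calls print 130856865 and 130856866 respectively.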

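How can I get more information about that block? I ran badblocks and it listed bad blocks. My guess is that I have to multiply the block number by 2 (the filesystem block size is 2K, while badblocks uses 1K by default). Also, badblocks checks the disk, not the partition, so maybe I should add some offset (there is only one partition on that disk, so maybe not).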
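
The unit conversion guessed at above looks like this as a sketch. It only rescales the block number from 2048-byte to 1024-byte units and deliberately leaves out any partition offset, which is a separate question; the variable names are illustrative.

```shell
#!/bin/sh
# Convert a filesystem block number to the 1024-byte units that
# badblocks uses by default. No partition offset is applied here.
fs_block=130856866
fs_block_size=2048
bb_block_size=1024

ratio=$(( fs_block_size / bb_block_size ))
bb_first=$(( fs_block * ratio ))
bb_last=$(( bb_first + ratio - 1 ))

echo "filesystem block $fs_block covers badblocks blocks $bb_first-$bb_last"
```

Alternatively, running badblocks with `-b 2048` makes it report in the same units as the filesystem, so no rescaling is needed.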

> fdisk -l /dev/sdf

Disk /dev/sdf: 2000.3 GB, 2000365379584 bytes
255 heads, 63 sectors/track, 243197 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
       Device Boot      Start         End      Blocks   Id  System
/dev/cciss/c0d5p1   *       1      243197  1953479871   83  Linux

I also thought about using smartd. What should I look for?

Error counter log:
       Errors Corrected by           Total   Correction     Gigabytes    Total
           ECC          rereads/    errors   algorithm      processed    uncorrected
       fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0     1457         0  2887405961          0      65948.712          18
write:         0        0         0         0          0      15056.493           0
verify:        0        1         0  361901613          0       3591.720           0

Non-medium error count:      226

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
   Description                              number   (hours)
# 1  Background long   Failed in segment -->       -   34479          16845361 [0x3 0x11 0x0]
# 2  Background short  Completed                   -      44                 - [-   -    -]
# 3  Background short  Completed                   -      39                 - [-   -    -]
# 4  Background long   Completed                   -       6                 - [-   -    -]

Long (extended) Self Test duration: 18500 seconds [308.3 minutes]

Background scan results log
Status: scan is active
  Accumulated power on time, hours:minutes 34541:56 [2072516 minutes]
  Number of background scans performed: 1139,  scan progress: 38.18%
  Number of background medium scans performed: 1139

 #  when        lba(hex)    [sk,asc,ascq]    reassign_status
 1 19215:06  0000000000014c61  [3,11,0]   Recovered via rewrite in-place
 2 19215:07  0000000000014c66  [3,0]   Recovered via rewrite in-place
 3 19413:28  0000000001010a31  [3,0]   Require Write or Reassign Blocks command
 4 19943:24  000000000001ea99  [3,0]   Recovered via rewrite in-place
 5 20152:23  00000000000232b8  [3,0]   Recovered via rewrite in-place
 6 31229:34  810000004087f984  [3,0]   Require Write or Reassign Blocks command
 7 33021:51  810000004087ba85  [3,0]   Require Write or Reassign Blocks command
 8 33021:51  000000004087ba9f  [3,0]   Require Write or Reassign Blocks command
 9 33021:52  000000004087bad6  [3,0]   Require Write or Reassign Blocks command
10 33029:43  000000004087baa5  [3,0]   Require Write or Reassign Blocks command
11 33055:27  000000004087bac3  [3,0]   Require Write or Reassign Blocks command
12 33244:40  810000004087f9d6  [3,0]   Require Write or Reassign Blocks command
13 33431:58  990000004087f105  [0,0]   Reassignment by disk failed
14 33480:17  00000000463d7713  [3,0]   Require Write or Reassign Blocks command
15 33480:19  00000000463d7723  [3,0]   Require Write or Reassign Blocks command
16 33480:20  00000000463d7725  [3,0]   Require Write or Reassign Blocks command
17 33480:28  81000000463d774e  [3,0]   Require Write or Reassign Blocks command
18 33686:17  8100000044e50edc  [3,0]   Require Write or Reassign Blocks command
19 34154:17  81000000432bef27  [3,0]   Require Write or Reassign Blocks command
20 34463:43  810000001f32decd  [3,0]   Require Write or Reassign Blocks command
21 34463:43  0000000030080000  [3,0]   Require Write or Reassign Blocks command

How should I tie the smartctl output (or any other output from a SMART run) back to my original problem?

Shouldn't the hard drive's own software fix this problem as well?

BTW, I found the following link helpful for understanding the 'debugfs -R' output. Maybe the link will be useful to others.

UPDATE

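Investigating further, I found that operations involving the problematic inode (such as the cp command above) trigger the following line in the kernel log: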

kernel: cciss: cmd ffff810037e00000 has CHECK CONDITION sense key = 0x3

The 'sense key' is part of the 'status' and part of the SCSI standard (list here and more description here).
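
For reading such log lines, the common sense keys can be looked up with a small table. This is a sketch covering only a subset of the standard SCSI sense keys; `sense_key_name` is a hypothetical helper. Sense key 0x3 is MEDIUM ERROR, and it is the same key that appears as the first field of the `[3,11,0]` triples in the SMART logs, where ASC 0x11 with ASCQ 0x00 denotes an unrecovered read error.

```shell
#!/bin/sh
# Translate a SCSI sense key (one hex digit, with or without a 0x
# prefix) into its standard name. Only common keys are listed.
sense_key_name() {
    case "${1#0x}" in
        0) echo "NO SENSE" ;;
        1) echo "RECOVERED ERROR" ;;
        2) echo "NOT READY" ;;
        3) echo "MEDIUM ERROR" ;;
        4) echo "HARDWARE ERROR" ;;
        5) echo "ILLEGAL REQUEST" ;;
        6) echo "UNIT ATTENTION" ;;
        7) echo "DATA PROTECT" ;;
        b|B) echo "ABORTED COMMAND" ;;
        *) echo "other (see the SCSI sense key table)" ;;
    esac
}

# The key from the cciss kernel message above.
sense_key_name 0x3
```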

Solution

So, to work this out, I did the following.

Take your block number, multiply it by 4 and add one:

(130856866 * 4) + 1 = 523427465

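This gives the sector that is reported as producing the I/O error. The block size is 2k and a sector is 512 bytes, hence the factor of 4. The extra one accounts for the partition's starting sector offset.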

To correlate this with SMART, we need to convert the value to hexadecimal.

$ printf "0x%x\n" 523427465
0x1f32de89
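
The two steps above can be combined into one helper. This is a sketch that simply reproduces the arithmetic of this answer; `fs_block_to_sector` is a hypothetical name, and the one-sector partition offset is the assumption made above rather than something read from the partition table.

```shell
#!/bin/sh
# Convert a filesystem block number to a disk sector number.
# Arguments: block number, block size in bytes (default 2048),
# sector size in bytes (default 512), partition offset in
# sectors (default 1, as assumed in this answer).
fs_block_to_sector() {
    echo $(( $1 * (${2:-2048} / ${3:-512}) + ${4:-1} ))
}

sector=$(fs_block_to_sector 130856866)
printf '%d = 0x%x\n' "$sector" "$sector"
```

A different partition start can be passed as the fourth argument if the real offset turns out not to be one sector.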

Now, when you correlate that with what SMART shows, one line is suspiciously close...

20 34463:43  810000001f32decd  [3,0]   Require Write or Reassign Blocks command
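
Rather than scanning the log by eye, the comparison can be sketched as a loop. It keeps only the last 8 hex digits of each `lba(hex)` value, as the comparison above does; what the leading `81` or `99` byte means is not established here. `near_lbas` and the 128-sector window are illustrative choices.

```shell
#!/bin/sh
# Print the scan-log LBAs whose low 32 bits lie within a window of
# a target sector. Arguments: target sector, window in sectors,
# then any number of 16-digit hex LBAs copied from the log.
near_lbas() {
    nl_target=$1
    nl_window=$2
    shift 2
    for nl_lba in "$@"; do
        # Drop the first 8 hex digits, keep the last 8.
        nl_low=$(( 0x${nl_lba#????????} ))
        nl_diff=$(( nl_low - nl_target ))
        # ${nl_diff#-} strips a leading minus sign (absolute value).
        if [ "${nl_diff#-}" -le "$nl_window" ]; then
            echo "$nl_lba is $nl_diff sectors from the target"
        fi
    done
}

# A few lba(hex) values copied from the background scan log above.
near_lbas 523427465 128 \
    810000004087f984 00000000463d7713 810000001f32decd 0000000030080000
```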

How far off is it?

$ bc -l
bc 1.06.95
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
obase=16
ibase=16
1F32DECD-1F32DE89
44

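0x44 is 68 sectors, so the two addresses are only between 32768 and 34816 bytes apart (64 to 68 sectors of 512 bytes), but we cannot tell which of the four sectors that make up the block is damaged.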
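
The byte figures follow directly from the hex difference. A minimal sketch, using the two hex values from the bc session and the sector and block sizes stated earlier; the variable names are illustrative.

```shell
#!/bin/sh
smart_lba=$(( 0x1f32decd ))     # low 32 bits of the SMART log entry
block_start=$(( 0x1f32de89 ))   # first sector computed for the block
sector_size=512
sectors_per_block=4

# Distance in sectors between the two addresses.
gap=$(( smart_lba - block_start ))

# Distance in bytes measured from the start and from the end of the block.
gap_from_start=$(( gap * sector_size ))
gap_from_end=$(( (gap - sectors_per_block) * sector_size ))

echo "gap: $gap sectors"
echo "from the start of the block: $gap_from_start bytes"
echo "from the end of the block:   $gap_from_end bytes"
```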

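If I had to hazard a guess, I would say there is probably a whole run of blocks around the same address that will report I/O errors (assuming a RAID stripe size of 32k or so).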

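Also, reads may not hit the problem if the RAID fetches the chunk from another disk. Writes must propagate to all disks in a RAID1 setup, so a write may fail where a read succeeds. Furthermore, if we assume that the RAID card's chunk size is 32k, we can also assume that the damaged block and the block reported by SMART were both damaged by whatever has happened on that platter. It is just that the SMART test read the first 32k from the good disk and the next 32k from the bad disk.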
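
To see what the 32k guess implies, the two sectors can be placed into chunks. This is a sketch under two assumptions that the source does not confirm: that the chunk size really is 32k, and that chunk boundaries are aligned to sector 0 of the logical drive. `chunk_index` is a hypothetical helper.

```shell
#!/bin/sh
# Return the index of the chunk that contains a given sector.
# Arguments: sector number, chunk size in bytes (default 32768,
# an assumption), sector size in bytes (default 512).
chunk_index() {
    echo $(( $1 / (${2:-32768} / ${3:-512}) ))
}

chunk_index 523427465   # sector computed from the filesystem block
chunk_index 523427533   # 0x1f32decd, from the SMART scan log
```

Under those assumptions the two sectors land in adjacent chunks, which is consistent with the guess above but does not prove it.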

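Modern hard disks keep "reserve sectors" so that damaged sectors like these can be replaced with new sector locations. Since you are seeing this now, together with the "Reassignment by disk failed" message, I would say the disk has run out of them.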

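As for doing something about it, that is a bit tricky. LBA addressing is an abstraction over the real disks underneath. You need to identify the disk that is causing the problem, take it out of the RAID array and replace it.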

In any case, you have a bad disk and you should replace it as soon as possible.
