bad blocks, what to do?

Eric

2010-05-22 18:44:45 UTC

Hope there's enough traffic on this group.

I was backing up the root partition on an Ultra10 (Solaris 8) when
ufsdump quit with a SIGBUS fault:
DUMP: Writing 32 Kilobyte records
DUMP: Date of this level 0 dump: Sat 22 May 2010 12:03:55 PM CDT
DUMP: Date of last level 0 dump: the epoch
DUMP: Dumping /dev/rdsk/c0t0d0s7 (xxxxxx.xxxxx.xxx:/) to /dev/rmt/0c.
DUMP: Mapping (Pass I) [regular files]
DUMP: Mapping (Pass II) [directories]
DUMP: Estimated 4679110 blocks (2284.72MB).
DUMP: Dumping (Pass III) [directories]
DUMP: Dumping (Pass IV) [regular files]
DUMP: 43.15% done, finished in 0:13
DUMP: SIGBUS() ABORTING!
DUMP: SIGBUS() ABORTING!
DUMP: Error reading command pipe: Error 0
DUMP: The ENTIRE dump is aborted.

The console displayed the following:
# May 22 12:19:05 xxxxxx.xxxxx.xxx dada: WARNING: /***@1f,0/***@1,1/***@3/***@0,0 (dad0):
May 22 12:19:05 xxxxxx.xxxxx.xxx Uncorrectable data Error: Block f93140
May 22 12:19:05 xxxxxx.xxxxx.xxx
May 22 12:19:06 xxxxxx.xxxxx.xxx dada: WARNING: /***@1f,0/***@1,1/***@3/***@0,0 (dad0):
May 22 12:19:06 xxxxxx.xxxxx.xxx disk not responding to selection
May 22 12:19:06 xxxxxx.xxxxx.xxx
May 22 12:19:06 xxxxxx.xxxxx.xxx dada: dad0: disk okay
May 22 12:19:10 xxxxxx.xxxxx.xxx dada: WARNING: /***@1f,0/***@1,1/***@3/***@0,0 (dad0):
May 22 12:19:10 xxxxxx.xxxxx.xxx Uncorrectable data Error: Block f93150
May 22 12:19:10 xxxxxx.xxxxx.xxx
May 22 12:19:11 xxxxxx.xxxxx.xxx dada: WARNING: /***@1f,0/***@1,1/***@3/***@0,0 (dad0):
May 22 12:19:11 xxxxxx.xxxxx.xxx disk not responding to selection
May 22 12:19:11 xxxxxx.xxxxx.xxx
May 22 12:19:11 xxxxxx.xxxxx.xxx dada: dad0: disk okay
May 22 12:19:16 xxxxxx.xxxxx.xxx dada: WARNING: /***@1f,0/***@1,1/***@3/***@0,0 (dad0):
May 22 12:19:16 xxxxxx.xxxxx.xxx Uncorrectable data Error: Block f93140
May 22 12:19:16 xxxxxx.xxxxx.xxx
May 22 12:19:17 xxxxxx.xxxxx.xxx dada: WARNING: /***@1f,0/***@1,1/***@3/***@0,0 (dad0):
May 22 12:19:17 xxxxxx.xxxxx.xxx disk not responding to selection
May 22 12:19:17 xxxxxx.xxxxx.xxx
May 22 12:19:17 xxxxxx.xxxxx.xxx dada: dad0: disk okay
May 22 12:19:20 xxxxxx.xxxxx.xxx dada: WARNING: /***@1f,0/***@1,1/***@3/***@0,0 (dad0):
May 22 12:19:20 xxxxxx.xxxxx.xxx Uncorrectable data Error: Block f93150
May 22 12:19:20 xxxxxx.xxxxx.xxx
May 22 12:19:22 xxxxxx.xxxxx.xxx dada: WARNING: /***@1f,0/***@1,1/***@3/***@0,0 (dad0):
May 22 12:19:22 xxxxxx.xxxxx.xxx disk not responding to selection
May 22 12:19:22 xxxxxx.xxxxx.xxx
May 22 12:19:22 xxxxxx.xxxxx.xxx dada: dad0: disk okay

What I gather is that there are two bad blocks, at absolute addresses
0xf93140 and 0xf93150, and that I should use "format" to verify and
repair them. Since this is on the root partition, I'll need to boot
off some other disk.

Am I on the right track? Also, is there anything else I need to be
aware of? Any tips to make this job go quickly and safely?

This "dad0" device is my IDE hard drive, correct?

Also, is there a way to find out what files are associated with these
blocks?

TIA,
eric
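
For anyone following along with the block numbers in the console log above, here is a minimal Python sketch that turns the two hex values into decimal sector numbers and byte offsets. It assumes the driver is reporting 512-byte sectors counted from the start of the disk, which the log itself does not confirm; `describe_block` is just an illustrative helper name.

```python
# Sketch only: convert the hex block numbers from the console log into
# decimal sector numbers and byte offsets.
SECTOR_BYTES = 512  # assumption: 512-byte sectors, not stated in the log


def describe_block(hex_block, sector_bytes=SECTOR_BYTES):
    """Return the decimal sector number and byte offset for a hex block."""
    sector = int(hex_block, 16)
    return {
        "hex": hex_block,
        "sector": sector,
        "byte_offset": sector * sector_bytes,
    }


reported = ["f93140", "f93150"]  # block numbers taken from the log
blocks = [describe_block(b) for b in reported]

for b in blocks:
    print(f"0x{b['hex']}: sector {b['sector']}, byte offset {b['byte_offset']}")

# Distance between the two failing blocks, in sectors.
gap = blocks[1]["sector"] - blocks[0]["sector"]
print(f"blocks are {gap} sectors apart")
```

To relate these numbers to a particular slice you would still have to subtract that slice's starting sector, which is not given anywhere in this thread.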

Doug McIntyre

2010-05-22 18:56:21 UTC

Post by Eric
Hope there's enough traffic on this group.
I was backing up the root partition on an Ultra10 (Solaris 8) when
ufsdump quit with a sigbus
DUMP: Writing 32 Kilobyte records
DUMP: Date of this level 0 dump: Sat 22 May 2010 12:03:55 PM CDT
DUMP: Date of last level 0 dump: the epoch
DUMP: Dumping /dev/rdsk/c0t0d0s7 (xxxxxx.xxxxx.xxx:/) to /dev/rmt/
0c.
DUMP: Mapping (Pass I) [regular files]
DUMP: Mapping (Pass II) [directories]
DUMP: Estimated 4679110 blocks (2284.72MB).
DUMP: Dumping (Pass III) [directories]
DUMP: Dumping (Pass IV) [regular files]
DUMP: 43.15% done, finished in 0:13
DUMP: SIGBUS() ABORTING!
DUMP: SIGBUS() ABORTING!
DUMP: Error reading command pipe: Error 0
DUMP: The ENTIRE dump is aborted.

Back up immediately all the data that you want to save. You may not
have the choice to ufsdump the whole thing; grab what you can by any
means you can. It should be fairly straightforward to grab your home area.
Most everything else can probably be restored in some fashion later.
Probably grab /etc/ for most of the config files there. It's small anyway.
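
As a rough illustration of "grab what you can", here is a minimal Python sketch of a file-by-file copy that keeps going when one file hits a read error instead of aborting the whole run the way the dump did. It assumes a Python 3 interpreter is available on whatever machine does the copying, which may well not be true of a stock Solaris 8 box; `salvage_tree` is a made-up helper name, not a standard tool.

```python
# Sketch only: copy a directory tree file by file, recording failures
# instead of stopping at the first unreadable file.
import os
import shutil


def salvage_tree(src_root, dst_root):
    """Copy regular files under src_root into dst_root.

    Returns (copied, failed), where failed holds (path, error text) pairs.
    """
    copied, failed = [], []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            if not os.path.isfile(src):
                continue  # skip devices, pipes, and broken symlinks
            try:
                shutil.copy2(src, os.path.join(target_dir, name))
                copied.append(src)
            except OSError as exc:
                # A bad block under this file shows up here as an I/O error.
                failed.append((src, str(exc)))
    return copied, failed


# Hypothetical usage; both paths are placeholders:
# copied, failed = salvage_tree("/etc", "/mnt/rescue/etc")
```

The list of failed paths is also a crude way to see which files sit on unreadable blocks.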

Post by Eric
May 22 12:19:05 xxxxxx.xxxxx.xxx Uncorrectable data Error: Block
May 22 12:19:06 xxxxxx.xxxxx.xxx disk not responding to selection

Bad news, your disk is on its way to hard-drive heaven.

Post by Eric
That I should use "format" to verify and repair these blocks. Since
this is on the root partition,
I'll need to boot off some other disk.

format won't be able to repair these blocks. IDE drives (and any
modern drive made in at least the last 15 years) are smart, and have
already mapped out bad blocks long before the OS sees them. If the
drive can't autocorrect a bad block, the drive is going to be history
very soon. Replace the drive and restore as soon as possible.

Post by Eric
Am I on the right track? Also, is there anything else I need to be
aware of? Any tips to make
this job go quickly and safely?

Sure, back up as much as you can and replace the bad hard drive.

Post by Eric
This "dad0" device is my ide hard drive, correct?

Yes it is.

Post by Eric
Also, is there a way to find out what files are associated with these
blocks?

Not easily.

Eric

2010-05-22 21:56:28 UTC

Post by Doug McIntyre
Bad news, your disk is on its way to hard-drive heaven.

Well, crap, it figures.
Thanks for the warning.

BTW, do you know, or can you point me towards some info on, the ins and
outs of format's analyze function? The present S8 I'm using is limited
to 32GB, and all I have are a couple of old disks that fit the bill. I
want to have some confidence that they'll work (for a while at least).
What I'm looking for is stuff like what settings will give me an
accurate test in the shortest amount of time, what's the difference
between the various tests, and why you would choose one over another.
That sort of thing.

Thanks again,
eric

Richard B. Gilbert

2010-05-23 00:45:00 UTC

Post by Doug McIntyre

Post by Eric
May 22 12:19:05 xxxxxx.xxxxx.xxx Uncorrectable data Error: Block
May 22 12:19:06 xxxxxx.xxxxx.xxx disk not responding to selection

Bad news, your disk is on its way to hard-drive heaven.

One bad sector does not make a dead disk. The hardware and software
should respond by attempting to copy the contents of the bad sector to a
spare track. If that succeeds and you don't get any more bad sectors,
you are done. Get on with your life!

If you keep getting errors it means your disk's remaining life can be
measured in hours or, at best, days.

Post by Doug McIntyre

Post by Eric
That I should use "format" to verify and repair these blocks. Since
this is on the root partition,
I'll need to boot off some other disk.

Generally, there is NO WAY to "repair" bad blocks. Frequently the
driver and the disk controller can "revector" a block to a "spare"
block. If this works you are good to go. Keep an eye on the error log;
if you get more errors, the disk is probably headed for "disk heaven"
and you need to make a final backup, replace the drive and restore.
<snip>

Eric

2010-05-23 03:16:06 UTC

Post by Richard B. Gilbert
One bad sector does not make a dead disk. The hardware and software
should respond by attempting to copy the contents of the bad sector to a
spare track. If that succeeds and you don't get any more bad sectors,
you are done. Get on with your life!
If you keep getting errors it means your disk's remaining life can be
measured in hours or, at best, days.

Post by Eric
That I should use "format" to verify and repair these blocks. Since
this is on the root partition,
I'll need to boot off some other disk.

Okay, but if the computer automatically reassigned the bad blocks, why
does ufsdump get these sigbus faults? (It did this consistently the last
four or so times I ran ufsdump; I can't say if it was the same blocks.)
Does ufsdump work file-wise like tar or block-wise like dd?

The last zero level I have is now a year old and I can't do any more
because of this problem. I figure it's only a matter of time before it
screws up my nine levels as well.

On your second point, since the blocks aren't being repaired,
whatever data was on them is gone, correct?

As for the "get on with your life" comment, the disk is something like
10+ years old and has been running pretty much 24/7. When the computer
goes down, the mass-spec that it controls goes down with it, and then I
have a bunch of people waiting on me. I don't much care for that, so I
do what I can to minimize unscheduled downtime. It sounds like I can
prepare in a somewhat leisurely manner, so that's good.

regards,
eric

Richard B. Gilbert

2010-05-23 12:22:08 UTC

Post by Eric
It sounds like I can prepare in a somewhat leisurely manner, so that's good.

Prepare quickly! That disk might last another six months or it might be
gone in six hours! You've already had all the warning you are likely to
get.

The next sound you hear will be a disk drive tearing itself to pieces!
Try to ensure that your data will not perish with it!

Doug McIntyre

2010-05-24 02:57:28 UTC

In my experience, since IDE (and other family) disks automatically
remap bad blocks before presenting to the OS, if the OS sees bad
blocks, it means that the drive is accumulating more bad blocks
than the IDE drive controller can handle. The drive has a fixed,
largish pool of spare blocks to reallocate from automatically.
The OS should never see a bad block unless the IDE drive controller can't
automatically handle remapping everything.

Once I see bad sectors in the kernel logs, I've had complete hard
drive failure shortly thereafter. I have not seen many systems limp
along with bad sectors in the kernel log for very long (although
there have been a few).
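
Since the advice throughout this thread is to watch the kernel log for further bad sectors, here is a minimal Python sketch that counts how many distinct blocks have been reported. The message wording it matches is copied from the original post and may differ on other drivers or Solaris releases; the sample lines are a shortened excerpt of that log, and `count_bad_blocks` is an illustrative name.

```python
# Sketch only: count distinct blocks reported as uncorrectable in a log.
import re
from collections import Counter

# \s+ also matches a newline, so wrapped log lines still parse.
BLOCK_ERROR = re.compile(r"Uncorrectable data Error:\s+Block\s+([0-9a-fA-F]+)")


def count_bad_blocks(log_text):
    """Map each reported block number (hex string) to its error count."""
    return Counter(block.lower() for block in BLOCK_ERROR.findall(log_text))


sample = """\
May 22 12:19:05 xxxxxx.xxxxx.xxx Uncorrectable data Error: Block f93140
May 22 12:19:10 xxxxxx.xxxxx.xxx Uncorrectable data Error: Block f93150
May 22 12:19:16 xxxxxx.xxxxx.xxx Uncorrectable data Error: Block f93140
May 22 12:19:20 xxxxxx.xxxxx.xxx Uncorrectable data Error: Block f93150
"""

counts = count_bad_blocks(sample)
for block, hits in sorted(counts.items()):
    print(f"block 0x{block}: {hits} error(s)")
print(f"{len(counts)} distinct bad block(s)")
```

A growing count of distinct blocks between runs is the signal discussed above; repeated hits on the same two blocks, as in the original log, only show the same reads being retried.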

Richard B. Gilbert

2010-05-24 08:19:57 UTC

Post by Doug McIntyre

Once I see bad sectors in the kernel logs, I've had complete hard
drive failure shortly thereafter.

If I were the OP I would be paying a great deal of attention to my
backups. I'd reseat both ends of the ribbon cable. I would also be
giving serious consideration to replacing the disk drive! At such times
I feel much better if I have a replacement drive on hand!

Eric

2010-05-24 19:36:40 UTC

Post by Richard B. Gilbert
If I were the OP I would be paying a great deal of attention to my
backups. I'd reseat both ends of the ribbon cable. I would also be
giving serious consideration to replacing the disk drive! At such times
I feel much better if I have a replacement drive on hand!

Just replaced the disk this morning (didn't think
to reseat the cable though). Restored and applied
patches, doesn't look like the patches broke
anything. _Now_ I can get on with my life.

Thanks to all for your help.
eric

David Mathog

2010-07-14 22:48:46 UTC

Post by Doug McIntyre
In my experience, since IDE (and other family) disks automatically
remap bad blocks before presenting to the OS, if the OS sees bad
blocks, it means that the drive is accumulating more bad blocks
than the IDE drive controller can handle.

This is a very late response, but....

The disk will automatically fix a problem it sees on a write, since
neither the OS nor you really care where on the disk it keeps logical
block N. However, if, as in the original post, there is an error on
read, the last thing anybody wants is for the disk to silently swap out
the missing data with who knows what bytes from a spare block. So it
throws the read error and will keep on doing so until something
explicitly writes (or tries to write) the bad blocks. Once that happens
they will be swapped out and the read errors will go away, at least
until the next block fails on read. For more information see:

http://smartmontools.sourceforge.net/badblockhowto.html

Regards,

David Mathog
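
The read-versus-write behaviour described in the post above can be illustrated with a toy model in Python. This is only a sketch of the idea, not how any real drive firmware is written: the `ToyDisk` class, its method names, and the size of the spare pool are all invented for the example.

```python
# Toy model only: a block that fails on read keeps failing until it is
# written, at which point it is remapped to a spare block.
class ToyDisk:
    def __init__(self, spares):
        self.data = {}           # logical block number -> bytes
        self.unreadable = set()  # blocks that currently fail on read
        self.spares = spares     # spare blocks left (invented number)

    def damage(self, lba):
        """Simulate the media under a block going bad."""
        self.unreadable.add(lba)

    def read(self, lba):
        if lba in self.unreadable:
            # The drive refuses to invent data, so it reports an error.
            raise OSError(f"uncorrectable read error at block {lba:#x}")
        return self.data.get(lba, b"")

    def write(self, lba, payload):
        if lba in self.unreadable:
            if self.spares == 0:
                raise OSError("no spare blocks left to remap to")
            self.spares -= 1              # remap to a spare block
            self.unreadable.discard(lba)
        self.data[lba] = payload


disk = ToyDisk(spares=2)
bad = 0xF93140
disk.write(bad, b"original contents")
disk.damage(bad)

try:
    disk.read(bad)
except OSError as exc:
    print("read failed:", exc)

disk.write(bad, b"\x00" * 8)  # an explicit write triggers the remap
print("after rewrite:", disk.read(bad))
print("spares left:", disk.spares)
```

In this model the rewrite makes the block readable again, but what it returns is the newly written data; the original contents are gone, which matches the earlier question in the thread about whether the data on those blocks is lost.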