Discussion:
raidz1 problem after removing and inserting hard drive
Scott
2012-03-01 19:37:48 UTC
A Sun Ultra 45 is connected via SCSI-LVD to a JetStor 416S JBOD enclosure,
which talks SCSI to the host and holds (12) 1TB SATA drives.

The JBOD enclosure was used to build (1) raidz1 pool, using all (12) drives,
about six months ago.
# zpool upgrade -v
This system is currently running ZFS version 4

# cat /etc/release
Solaris 10 5/08 s10s_u5wos_10 SPARC

# uname -a
SunOS bahamas 5.10 Generic_127127-11 sun4u sparc SUNW,A70


Yesterday I started a zpool scrub on the pool.
About 20 minutes into it I pushed a drive enclosure release button,
ejecting a drive (c2t1d0).
I didn't realize the drive was a member of the raidz1 pool.
10 seconds later I re-inserted the drive.
ZFS started resilvering the pool, pushing the one spare drive into
service (c2t2d5).

This resilvering is taking some time; it is expected to finish late
today.

When issuing
# format -e
I could see the drive c2t1d0, but instead of the usual string describing the disk,
"Hitachi-HDS721010KLA330-R001-931.51GB"
it was showing a string indicating 0GB.

The physical path for c2t1d0 is
/***@1e,600000/***@0/***@3/***@0/***@8/***@1,0

I was seeing a lot of kern.warning entries in /var/adm/messages:

WARNING: /***@1e,600000/***@0/***@3/***@0/***@8/***@1,0 (sd6)
Corrupt label; wrong magic number

The load on the system was up around 12, and it was sluggish in responding
to keyboard and mouse.

I issued:
# zpool status -x
and saw (roughly)

        raid-412S
          raidz1
            c2t0d0     ONLINE
            spare      DEGRADED
              c2t1d0   UNAVAIL  (sd6)
              c2t2d5   ONLINE
            c2t1d1     ONLINE
            c2t1d2     ONLINE
            c2t1d4     ONLINE
            c2t1d5     ONLINE
            c2t2d0     ONLINE
            c2t2d1     ONLINE
            c2t2d2     ONLINE
            c2t1d3     ONLINE
            c2t1d4     ONLINE
        spares
          c2t2d5       INUSE    (sd24)


After about half an hour, with the kern.warning messages in
/var/adm/messages showing no decrease in frequency, I issued:

# cfgadm -c unconfigure c2::dsk/c2t1d0

That succeeded; the drive was reported offline in /var/adm/messages,
and the kern.warning messages stopped.

I then tried:

# cfgadm -c configure c2::dsk/c2t1d0

and I get
cfgadm: Hardware specific failure: failed to configure SCSI device: I/O error

Using cfgadm with a -f option does not change the output.
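(For reference, the forced variant referred to above would be something
like the following; just a sketch, using the same ap_id as before:

# cfgadm -f -c configure c2::dsk/c2t1d0
)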

When I issue:

# cfgadm -l c2::dsk/c2t1d0
I see
Ap_Id            Type   Receptacle   Occupant       Condition
c2::dsk/c2t1d0   disk   Connected    unconfigured   unknown


# zpool status
  pool: raid-412S
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist
        for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: resilver in progress, 76.75% done, 5h27m to go

Disk's label: Hitachi-HDS721010KLA330-R001-931.51GB
Hitachi Deskstar 1TB 7200rpm SATA 3.0Gb/s P/N 0A35155 Aug-2007
S/N PAG89X7E


Could I get some help on how to get the disk connected again?
At this time I don't think the disk could have burned out just because
I ejected it and then re-inserted it into the JBOD enclosure 10 seconds
later.
John D Groenveld
2012-03-02 20:36:02 UTC
Post by Scott
Yesterday I started a zpool scrub on the pool.
About 20 minutes into it I pushed a drive enclosure release button,
ejecting a drive (c2t1d0).
Whoops.
Post by Scott
I didn't realize the drive was a member of the raidz1 pool.
10 seconds later I re-inserted the drive.
ZFS started resilvering the pool, pushing the one spare drive into
service (c2t2d5).
This resilvering is taking some time; it is expected to finish late
today.
Did it finish?

If so,
# zpool replace raid-412S c2t1d0 c2t2d5
And then add c2t1d0 back as your new spare:
# zpool add raid-412S spare c2t1d0

John
***@acm.org
Scott
2012-03-03 01:51:24 UTC
Post by John D Groenveld
Post by Scott
Yesterday I started a zpool scrub on the pool.
About 20 minutes into it I pushed a drive enclosure release button,
ejecting a drive (c2t1d0).
Whoops.
Post by Scott
I didn't realize the drive was a member of the raidz1 pool.
10 seconds later I re-inserted the drive.
ZFS started resilvering the pool, pushing the one spare drive into
service (c2t2d5).
This resilvering is taking some time; it is expected to finish late
today.
Did it finish?
It finished (it took about 28 hours). All lights are quiet on the
JBOD front :)
Post by John D Groenveld
If so,
# zpool replace raid-412S c2t1d0 c2t2d5
# zpool add raid-412S spare c2t1d0
I could try that (thanks).
But c2t1d0, though visible at the "zpool status" level, is not there
at the "format -e" level.
Further complicating that, it's listed as unconfigured at the cfgadm
level:
# cfgadm -l c2::dsk/c2t1d0
I see
Ap_Id            Type   Receptacle   Occupant       Condition
c2::dsk/c2t1d0   disk   Connected    unconfigured   unknown

I suppose I'm prejudiced a little by what I want to do vs. what you're
saying to do, because I'm thinking it was working this way before, so it
should work this way again.

Setting that aside for a moment but staying at the lower layers, what I
think I need to focus on is getting the disk visible at a lower layer, so
that it will show its face at the format -e layer.
I am thinking that once I can get it to show up there, I can proceed to
issue zpool commands.
I can't see it at the prtconf -v layer either.
Does this sound correct?
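
For what it's worth, a rough checklist of those layers, bottom to top, might
look like the following (just a sketch, not something I've run yet; the c2
controller and the sd driver name are taken from the messages above):

# cfgadm -al c2                (attachment-point state for the controller)
# devfsadm -Cv                 (rebuild /dev links, prune stale ones)
# prtconf -D | grep -i disk    (is an sd instance attached for the disk?)
# format < /dev/null           (does the disk appear at the format level?)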

Regards, Scott
John D Groenveld
2012-03-03 14:09:02 UTC
Post by Scott
Setting that aside for a moment but staying at the lower layers, what I
think I need to focus on is getting the disk visible at a lower layer, so
that it will show its face at the format -e layer.
I am thinking that once I can get it to show up there, I can proceed to
issue zpool commands.
I can't see it at the prtconf -v layer either.
Does this sound correct?
Which HBA are you using to connect to the JBOD?
$ prtconf -D

Perhaps there's a bug that's preventing you from configuring
c2::dsk/c2t1d0 without bouncing your host.

John
***@acm.org
Scott
2012-03-06 02:51:52 UTC
The JetStor JBOD enclosure also does RAID, and it seems that the disk I'm
having problems with was uniquely configured as a Volume, a RAID-0.
The rest of the disks are configured as "pass-through devices".
(The enclosure takes SATA drives and presents them to the host as
SCSI-attached.)

I reconfigured the disk in question to be like the rest and issued
# cfgadm -v -c configure c2

and got the device back, though under a different device tree:
old: /***@1e,600000/***@0/***@3/***@0/***@8/***@1,0
new: /***@1e,600000/***@0/***@3/***@0/***@8/***@0,1

It got a new device name.
old: /dev/dsk/c2t1d0
new: /dev/dsk/c2t0d1

The data, including the four vdev labels, were still on the drive:
# zdb -l /dev/dsk/c2t0d1s0
(lists 4 labels)
so it wouldn't allow me to issue
# zpool replace raid-412S c2t1d0 c2t0d1

I talked with tech support, who told me I needed to read doc ID 1005473.1,
which says you have to overwrite a used drive with zeroes in order to
re-use it.
Well, a 1TB drive, written with a 1kB dd block size, would take a very
long time, so I wrote a small script to just overwrite the four vdev
labels (two at the beginning of the drive and two at the end).
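
A minimal sketch of that kind of label-wipe script (not the exact script I
used; the device name and the sector count below are assumptions, and each
ZFS vdev label is 256 KiB, two at the front of the device and two at the end):

#!/bin/sh
# Rough sketch only -- device name and size are assumptions.
DISK=/dev/rdsk/c2t0d1s0        # raw slice holding the ZFS labels (assumed)
SECTORS=1953525168             # size of that slice in 512-byte sectors (check with prtvtoc)

# labels L0 and L1 live in the first 512 KiB
dd if=/dev/zero of=$DISK bs=512 count=1024

# labels L2 and L3 live at the end; wipe the last 1 MiB to cover both
dd if=/dev/zero of=$DISK bs=512 oseek=`expr $SECTORS - 2048` count=2048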

I then could issue the above zpool replace command.
It's resilvering; probably in another 30 hours the pool will switch
from DEGRADED to ONLINE.

Thanks for the help.

Regards, Scott
