Unofficial empeg BBS

#356509 - 25/11/2012 10:56 Classic one: bitten by RAID :(
julf
veteran

Registered: 01/10/2001
Posts: 1307
Loc: Amsterdam, The Netherlands
Ouch. How much more textbook can you get?

Have a 4-drive RAID server as main storage (and staging post for long-term backups - yes, RAID is not a backup, except temporarily). So last week I got a couple of SMART notices of recoverable errors on one of the disks. Time to replace it... Got the new disk, pulled the old one - but got bitten by the usual inconsistent mapping of logical to physical drives, and pulled the wrong one. And, having a brain fart, put it back. Rebuild triggered.
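
(Lesson learned: next time I'll match serial numbers before pulling anything. Roughly the checks I should have done - the device name is just an example:)

    # map logical names to physical disks via serial numbers
    ls -l /dev/disk/by-id/        # persistent names that include the serial
    smartctl -i /dev/sdb          # model and serial of this particular device
    smartctl -A /dev/sdb          # attributes: watch Reallocated_Sector_Ct
                                  # and Current_Pending_Sector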

Oh well, the failing disk hasn't actually failed yet, so I just have to wait for the rebuild to finish before replacing the right disk. Except... Yes, you know where this is going... At "96.7% complete" there is a non-recoverable bad sector, and the second disk gets failed out of the array before the rebuild completes. :(

OK, now running ddrescue to recover as much as I can from the disk with bad sectors before I do anything else...
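
In case it's useful to anyone else, the invocation is roughly this (sdb is the dying disk, sdc a blank one of at least the same size - don't get those two backwards!):

    # first pass: copy everything that reads cleanly, keep a log so runs can resume
    ddrescue -f -n /dev/sdb /dev/sdc rescue.log
    # second pass: go back and retry the bad areas a few times
    ddrescue -f -r3 /dev/sdb /dev/sdc rescue.log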

#356512 - 25/11/2012 12:15 Re: Classic one: bitten by RAID :( [Re: julf]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14496
Loc: Canada
Then patch the kernel (Linux, right?) to just ignore the bad sector and continue, instead of voiding the entire fricken array, and you can then recover nearly all of the data.

Next time, use unRAID rather than RAID. Or mhddfs.
Or *something* (anything) other than horribly unrobust RAID.
It simply is not suitable for huge TB+ drives at home.

Cheers

#356514 - 25/11/2012 12:47 Re: Classic one: bitten by RAID :( [Re: mlord]
julf
veteran

Registered: 01/10/2001
Posts: 1307
Loc: Amsterdam, The Netherlands
Originally Posted By: mlord
Then patch the kernel (Linux, right?) to just ignore the bad sector and continue, instead of voiding the entire fricken array, and you can then recover nearly all of the data.


Yes, definitely tempted. But not entirely trivial (until now I have had no need to look at the kernel RAID code).

Quote:
Next time, use unRAID rather than RAID. Or mhddfs.
Or *something* (anything) other than horribly unrobust RAID.
It simply is not suitable for huge TB+ drives at home.


Have to agree - it's the classic problem of starting out with something that worked OK under the then-prevailing conditions, and then making small upgrades without ever biting the bullet and replacing the whole thing...

#356515 - 25/11/2012 13:18 Re: Classic one: bitten by RAID :( [Re: julf]
peter
carpal tunnel

Registered: 13/07/2000
Posts: 4180
Loc: Cambridge, England
Originally Posted By: julf
pull the wrong one. And having a brain fart, put it back. Rebuild triggered

Why a brain fart? It was game over then anyway, wasn't it? Unless your array has a "whoops, sorry, didn't mean to eject that" button - which seems unlikely, especially if any writes happened in the interim - even copying the degraded array off to a known good location would have failed 96.7% of the way through reading the duff disk.

Peter

#356516 - 25/11/2012 13:47 Re: Classic one: bitten by RAID :( [Re: peter]
julf
veteran

Registered: 01/10/2001
Posts: 1307
Loc: Amsterdam, The Netherlands
Originally Posted By: peter
Why a brain fart? It was game over then anyway, wasn't it?


Well, a very quick stopping/unmounting of the array might just have saved the situation, if there were no dirty buffers to write out...

Also, the disk that got pulled was of course 100% OK at that point; it was just that the other disks considered it failed/unclean. Patching that state (instead of allowing a full reconstruction) would probably have been a way out.
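
Something along these lines might have done it (untested, device names made up):

    # stop the half-rebuilt array, then force-assemble it with the pulled disk;
    # --force tells md to accept the stale event count instead of rebuilding
    mdadm --stop /dev/md0
    mdadm --assemble --force /dev/md0 /dev/sd[abcd]1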

#356525 - 26/11/2012 14:41 Re: Classic one: bitten by RAID :( [Re: mlord]
julf
veteran

Registered: 01/10/2001
Posts: 1307
Loc: Amsterdam, The Netherlands
Originally Posted By: mlord
Next time, use unRAID rather than RAID. Or mhddfs.


I guess unRAID is not available just as a file system - you have to run the complete dedicated server/utility OS?

I don't see how mhddfs solves the "reliable redundancy" issue.

Might have to look into ZFS...
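
From a quick skim, a redundant pool would look something like this (disk names made up, untested):

    # four disks in a single-parity raidz vdev - survives one disk failure,
    # and per-block checksums catch silent corruption during rebuilds
    zpool create tank raidz1 sdb sdc sdd sde
    # a periodic scrub finds latent bad sectors before a resilver trips over them
    zpool scrub tank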

#356541 - 26/11/2012 21:21 Re: Classic one: bitten by RAID :( [Re: julf]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14496
Loc: Canada
Originally Posted By: julf
I don't see how mhddfs solves the "reliable redundancy" issue.

It doesn't. It solves the "make one big filesystem from a bunch of drives" problem, without losing everything when one drive goes bad.
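
e.g. something like this (paths made up):

    # pool three existing filesystems into one big mount point;
    # each drive keeps its own independent filesystem underneath,
    # so a dead drive only takes its own files with it
    mhddfs /mnt/disk1,/mnt/disk2,/mnt/disk3 /mnt/pool -o allow_other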

unRAID is similar, except it adds a parity drive in parallel with the data drives, permitting loss of a single drive with no data loss. And since each data drive carries its own complete filesystem, losing further drives only loses what was on those drives, not everything.

Yeah, pity unRAID wants to be standalone (or in a VM).

#356546 - 27/11/2012 11:39 Re: Classic one: bitten by RAID :( [Re: mlord]
julf
veteran

Registered: 01/10/2001
Posts: 1307
Loc: Amsterdam, The Netherlands
Right, ZFS looks like the best solution right now.

Anyway, I am a happy bunny - (g)ddrescue managed, after a couple of tries, to read all blocks off the failing disk. Replaced the failing disk with the ddrescued copy, forced a resync, and everything is hunky dory again. For now.
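
For the archives, the final steps were roughly (device names are placeholders):

    # slot the rescued copy back into the array and let md resynchronise it
    mdadm /dev/md0 --re-add /dev/sdb1
    # then force a full check/repair pass over the whole array
    echo repair > /sys/block/md0/md/sync_action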

#356548 - 27/11/2012 16:14 Re: Classic one: bitten by RAID :( [Re: julf]
drakino
carpal tunnel

Registered: 08/06/1999
Posts: 7868
Please share your experience with ZFS when you implement it. I've been thinking of doing something similar here, but haven't implemented it yet. There's even a ZFS stack for OS X that has been evolving a bit, which I'm following.

#356552 - 27/11/2012 17:18 Re: Classic one: bitten by RAID :( [Re: drakino]
julf
veteran

Registered: 01/10/2001
Posts: 1307
Loc: Amsterdam, The Netherlands
Will do!

#356553 - 27/11/2012 18:46 Re: Classic one: bitten by RAID :( [Re: drakino]
andy
carpal tunnel

Registered: 10/06/1999
Posts: 5916
Loc: Wivenhoe, Essex, UK
I haven't tried the OS X ZFS stack myself, but this is a quote from a friend who tried it back in June:

"Currently shuffling all the data off my Zevo zfs formatted drive on my MBP so I can reformat it back with bad old HFS+

I can cause system crashes that are directly caused by Zevo not handling particularly intensive bouts of disk activity on large numbers of tiny files. Which pretty much describes how Aperture hits its database. Which is pretty mission critical for me - could do without any instability, least of all something that hits my photo databases."
_________________________
Remind me to change my signature to something more interesting someday
