The Treehouse Blog


on Mar.17, 2007, under Computers, Linux

Warning – long and boring. This is
as much for my reference as anything else.

It started Friday – several weeks
ago. In the evening, one of my drives, a 250GB SATA, threw some
errors. The RAID5 wasn’t terribly concerned; it corrected the reads
and was happy. There were probably fewer than a dozen errors, and it
didn’t kick the drive from the array. I made a note of it, but
didn’t bother kicking it out manually.

Early Saturday, a 400GB drive dropped
out of the array. This drive does this from time to time, going
utterly unresponsive but fine upon reboot. I re-added it to the
array, and it began to sync up. Chris and I went to the new Circuit
City in Chambersburg, as he needed to get a power supply for
debugging a box lockup issue. I decided to buy one of those
new-fangled DVD burner thingies, as it was probably about time I had
one. Upon getting home, my array was not happy. I had 3 active
members on a 5 device RAID5. Rebuilding the 400GB had sent the
ailing 250GB over the edge, kicking them both out of the array. It’s
a curious thing to see in /proc/mdstat. The metadevice stayed
active, but degraded. Ext3 freaked out and dropped to read-only. I
really would have expected the metadevice to deactivate under those
conditions… or better yet, be very reluctant to kick a drive from
an already degraded array. If only I had kicked the 250GB manually,
this would have been a bit less stressful. So, then the contingency
planning starts. Do I force the array back together, and try
resyncing the 400 again? Will the 250 be so badly corrupted that it
makes more sense to force the mostly-current 400 back in the array
instead of the 250? Should I dd the 250 to another drive, since dd
should at least keep going instead of giving up on the errored
sectors? Not pleasant thoughts or options. SMART data was
indicating that the temperature of the drive was over 60°C, hotter
than the box’s CPU. I moved it to another machine for diagnostics,
which didn’t turn up anything. The Hardware_ECC_Recovered count was
varying rapidly (not that that necessarily means anything…), so I
decided it was time to replace the drive. I ordered a 500G (WD5000YS) and
another Promise SATA-II TX4 PCI card from Newegg. Later that night,
I put the 250G back in the box and tried the resync again. I watched
the resync most of the night (until something like 4am), waiting for it
to either fail or complete. I wanted to kick the 250G from the array at
completion, so this wouldn’t happen again. Yes, I could have and
should have scripted it. I was worried about my data! The resync
completed successfully with no errors. Seems the 250G was much
happier after it had flagged its bad sectors.
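
The scripting I skipped that night could look something like the sketch below. It is a rough illustration, not what I actually ran: it polls /proc/mdstat until the rebuild is finished and every member is active, and only then fails and removes the suspect drive using mdadm's manage-mode --fail and --remove options. The sample mdstat text, the device names (/dev/md0, /dev/sdb1) and all of the function names are made up for the example.

```python
import re
import subprocess
import time

# Sample /proc/mdstat text in the usual md layout. The device names and
# numbers are invented for illustration.
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid5]
md0 : active raid5 sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
      976783360 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
      [==>..................]  recovery = 12.6% (30789120/244195840) finish=127.5min speed=27894K/sec

unused devices: <none>
"""


def array_status(mdstat_text, array="md0"):
    """Return member counts and rebuild progress for one md array, or None."""
    lines = mdstat_text.splitlines()
    for i, line in enumerate(lines):
        if not line.startswith(array + " :"):
            continue
        # The array's details are on the indented lines that follow it.
        block = [line]
        for follow in lines[i + 1:]:
            if not follow.startswith(" "):
                break
            block.append(follow)
        text = "\n".join(block)
        counts = re.search(r"\[(\d+)/(\d+)\]", text)  # e.g. [5/4]
        progress = re.search(r"(recovery|resync)\s*=\s*([\d.]+)%", text)
        return {
            "total": int(counts.group(1)) if counts else None,
            "active": int(counts.group(2)) if counts else None,
            "rebuilding": progress is not None,
            "percent": float(progress.group(2)) if progress else None,
        }
    return None


def safe_to_kick(status):
    """True only when no rebuild is running and every member is active."""
    return (status is not None and not status["rebuilding"]
            and status["active"] == status["total"])


def kick_command(array_dev, member_dev):
    """Build the mdadm command that fails and then removes one member."""
    return ["mdadm", array_dev, "--fail", member_dev, "--remove", member_dev]


def read_mdstat():
    with open("/proc/mdstat") as f:
        return f.read()


def run_checked(cmd):
    subprocess.run(cmd, check=True)


def wait_and_kick(array_dev, member_dev, array="md0", poll_seconds=60,
                  read=read_mdstat, run=run_checked):
    """Poll until the array is whole again, then kick member_dev."""
    while True:
        status = array_status(read(), array)
        if status is None:
            raise RuntimeError(f"{array} not found in mdstat")
        if safe_to_kick(status):
            break
        time.sleep(poll_seconds)
    run(kick_command(array_dev, member_dev))
```

Something like wait_and_kick("/dev/md0", "/dev/sdb1") run as root would have let me go to bed. The read and run parameters exist only so the loop can be tried against canned text without touching a real array.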

On Sunday, I really couldn’t do
anything about the array, so I started down the second storage path
of death for the week: the DVD-R drive. I installed it in my
desktop, fired up k3b, burned a backup DVD of several years of
photos, and it seemed fine. But I couldn’t mount it anywhere. Turns
out that it (k3b and/or growisofs) wants to burn DVD+Rs as unclosed
multi-session discs. Fine. Turned that off, and burned myself
another one. It was fine. It was nice to have something work for

On Monday evening, feeling lucky from
the day before, I tried burning some more photos to DVD, but it was
not to be. IDE errors would start spewing into dmesg, and growisofs
(which had elevated itself to a nice of -20) would begin consuming the
entire machine, making it unusable. I tried different speed
settings and just about every option k3b had to offer. I moved the IDE
cable to a different controller, tried changing cables, tried different
DMA settings, anything… I looked for firmware, but the thing is a
no-name OEM drive, probably originally from Lite-On. Their firmware
won’t load on it, and the site that supposedly hosts the firmware
genericizer was down. Of course, I gave up at some point and burned
something in Windows, which was fine… ARRGH!

Tuesday was supposed to be the day of
productivity. The new drive and controller arrived, and I installed
them. I spent a little time tooling the partition table and began
the resync. The mirror resync’d very quickly at 30-40MB/s. The
RAID5 resync stayed around 27MB/s when the system was idle, but
dropped considerably otherwise. The old setup would only resync
around 20MB/s, and was otherwise usable. But at 27MB/s, the system
crawled, yet wasn’t using up 100% CPU. I think this is the surreal
PCI bus exhaustion experience… 27MB/s × 5 drives = 135MB/s, and 133MB/s is
the theoretical maximum for a 33MHz, 32-bit PCI bus. But many of my PCI devices
(including the northbridge) are 66MHz capable, and from what I’ve read, 33MHz
devices shouldn’t be holding back the 66MHz ones entirely, but I
couldn’t find out how to test/debug this further. Later I found out
that the 66MHz-capable bit doesn’t mean very much, and what you
really need is a 66MHz-capable PCI bus – which mine isn’t. MythTV
wasn’t happy about the slow bus, as the ivtv capture device wasn’t being
read from fast enough. The system otherwise felt very sluggish. I left the box up
to resync overnight. I fought with the DVD drive some more, too.
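
The back-of-the-envelope math behind the bus exhaustion theory can be written out as a small sketch. It assumes that all five drives hang off controllers on the same 32-bit PCI bus, and it uses the nominal 33.33MHz clock; the function names are made up for the example. The result is a theoretical peak only, since real transfers lose some of it to bus overhead and to other devices on the bus.

```python
def pci_peak_mb_per_s(clock_mhz, width_bits):
    """Theoretical peak of a parallel PCI bus in MB/s (one transfer per clock)."""
    return clock_mhz * width_bits / 8


def resync_bus_load_mb_per_s(per_drive_mb_per_s, drives):
    """Total traffic if every drive moves data at the resync rate at once."""
    return per_drive_mb_per_s * drives


if __name__ == "__main__":
    peak_33 = pci_peak_mb_per_s(33.33, 32)   # the bus in this box
    peak_66 = pci_peak_mb_per_s(66.66, 32)   # what a 66MHz bus would offer
    load = resync_bus_load_mb_per_s(27, 5)   # 27MB/s on each of 5 drives
    print(f"33MHz bus peak: {peak_33:.0f} MB/s")
    print(f"66MHz bus peak: {peak_66:.0f} MB/s")
    print(f"resync load:    {load} MB/s")
    print("bus saturated" if load >= peak_33 else "headroom left")
```

On those assumptions the resync alone asks for slightly more than the bus can deliver, before the capture card gets a turn.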

Wednesday morning, I checked on the
status of the resync. But the box had locked up. I rebooted it and
checked the logs. The resync did complete, but sometime later, there
was an unhandled interrupt on the IRQ shared between a SATA
controller and the video card. Linux then disabled the IRQ, causing
all of the drives to fall out incrementally. I brought the box back
up, and had to force the array to be “clean” so that it would
re-assemble (echo clean > /sys/block/md0/md/array_state … there
appears to be no mdadm equivalent of this action, and you then have
to write something to the device for the superblocks to get
updated). I eventually and experimentally determined that running
SMART commands on the new 500G drive is what causes the unhandled
interrupt. It takes time for the problem to manifest though –
maybe it’s a race condition somewhere. I haven’t found anything in
the kernel mailing lists about this, so I will have to research
further and maybe post about it.
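
For spotting which devices share an interrupt line in the first place, here is a minimal sketch that scans text in the format of /proc/interrupts for IRQs with more than one device attached. The sample text is invented: sata_promise is the kernel driver for the Promise cards, but the IRQ numbers, counts and the nvidia entry are assumptions for illustration, and shared_irqs is a made-up name.

```python
# Sample text in the layout of /proc/interrupts on a single-CPU box.
# All numbers and the "nvidia" entry are invented for illustration.
SAMPLE_INTERRUPTS = """\
           CPU0
  0:   52342345    XT-PIC  timer
  1:      10234    XT-PIC  i8042
 10:    3452345    XT-PIC  eth0
 11:    9876543    XT-PIC  sata_promise, nvidia
NMI:          0
"""


def shared_irqs(interrupts_text):
    """Map each IRQ label to its device names, for IRQs shared by 2+ devices."""
    shared = {}
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        # Shared lines list their devices separated by commas.
        if not sep or "," not in rest:
            continue
        first, *others = rest.split(",")
        if not first.split():
            continue
        # The first device name is the last word before the first comma.
        names = [first.split()[-1]] + [o.strip() for o in others]
        shared[label.strip()] = names
    return shared


if __name__ == "__main__":
    for irq, names in shared_irqs(SAMPLE_INTERRUPTS).items():
        print(f"IRQ {irq} is shared by: {', '.join(names)}")
```

On a real box the input would come from reading /proc/interrupts. The comma test is a heuristic that relies on the kernel listing shared handlers separated by commas.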

As for the DVD drive, I tried
different media with equally mixed results. I eventually returned it
to Circuit City and bought a better and cheaper Samsung from Newegg.
It seems to work much better… no spewing of IDE messages. I almost
made myself buy a SATA one, but I didn’t want to buy yet another SATA
controller and risk more problems with compatibility.

And the 250G seems fine now, so it
didn’t get tossed either. Sigh.


Content Copyright © 2004 - 2019 Brady Alleman. All Rights Reserved.
