rentzsch.com: tales from the red shed

Linux Kernel Seize

Suck
Scenario setup: I went to a client's ISP to take his second (currently dark) firewall live. This was to serve two purposes: to front another mail server (for office use; his current one handles list server traffic), and, with two live firewalls and two IPs, to let us perform remote recovery if a single firewall failed.

Hilarity ensues.

Since I anticipated minimal downtime (on the order of 10 seconds), we could do this in the afternoon and not the early morning. I don't like getting up early (grin).

Tuesday, 1:30p
Install parallel firewall. Forget that it has same internal IP address as original firewall. New firewall eats traffic for half an hour until we realize we're not getting hits on master server any more.

Configure parallel firewall. All's working, leave ISP and hit CompUSA for printer ink cartridges (since I'm already out-and-about). Unbeknownst to me, the master server's kernel seized a minute after I leave. That is, the machine locked at the kernel level. Even hitting Caps Lock/Num Lock doesn't toggle keyboard LED. This was no kernel panic, either -- nothing was dumped to screen or disk.

Tuesday, 5:00p
After getting back into office, learn site is down. Drive back to ISP (trip #2). Found seized master, yank second firewall, and force reboot. Boot fsck unhappy. Had to run fsck manually, and consent to inode clearing. Not liking that I'm zeroing out chunks of the disk, but not much of a choice right now.

Did second firewall kill master? That's all that changed. But 2nd slave box didn't hiccup, why did master? Try plugging 2nd firewall back in. Master seizes again, same way. Slave stays happy, again. 2nd firewall yanked again, master rebooted. Boot fsck hangs on 32.2% progress. Reboot. Cleared, but must fsck 2nd mail spool drive. Both drives clean, master boots, services restart successfully. Watch box for five minutes. Seems happy. Drive back to office.

While driving back to office, master seizes again. It's quitting time, and can execute emergency master/slave switchover plan remotely, so I drive home and help along switchover process.

Tuesday, 7:00p
Issue: I learned late Monday night, before all of this, the database sync between master and slave happens every 24 hours. I'd like at least once per hour. Even this is incomplete, but we haven't implemented clustering and/or replication yet. Anyway, database backup on slave is old. It's inadvisable to restart DB apps as then we'd have to manually synchronize old master database against newer slave database. Ick.

The website's manager says ISP is open till 10pm. He wants to fix this now and not suffer ~14 hours of DB app downtime. I put clothes back on, pick him up at office, copy current tarball of the static site to TiBook (permission issues apparently meant part of static site backup failed) and go back to ISP (trip #3).

Tuesday, 8:50p
Back at ISP. Mission: bring up master enough to copy databases to slave and/or TiBook. Master boot fails -- fsck must be run manually again. Try to, but INIT and then fsck segfaults on manual invocation. Rebooting causes hang right after "freeing unused kernel memory" stage. Try original kernel from Red Hat, it actually gets to manual fsck but fails with signal 11 (segfault) again.

We need bootable CDs to fix this, but left them at office. Race back to office.

Tuesday, 9:46p
Back at ISP (trip #4), Red Hat 7 and 8 CDs in hand, 30 minutes until ISP closes door. Cue Mission: Impossible theme. Boot rescue CD. Try copying fresh /sbin/init to boot disk. cksums reveal nothing is same size as anything else. Ick, whatever. Doesn't help anyway. Reboot rescue CD, now both "new" /sbin/init and /sbin/init.original are gone! Filesystem is in sorry state. fsck.ext3 tells me both disks are "clean". It lies.

Nothing works, I upload static site tarball, and yank hard drives and take back home. Idea: use my FireWire ATA adapter to mount the disk on my TiBook and use Virtual PC magic to have Linux work with disk.

Tuesday, 11:30p
Dropped off website manager, go home and try VPC trick. Nogo: VPC won't mount disks, just disk images it made itself. VPC is a Carbon app, which means there's no way I can force it to look at /dev/disk2 in raw block mode. Oh well. Too bad Mac OS X doesn't do ext2 or ext3.

Will go back to ISP tomorrow morning when they open.

Wednesday, 7:11a
Get into office, stolen Dell in tow. Yank out disk, plug in expendable disk. Wipe and install Red Hat 8. Foolishly install Gnome thinking multiple terminal windows and GUI editors will save time later on.

RH8 installed, power down and install old master boot disk on second ATA. Bring Dell back up. Gnome sucks big time. Corrupted drawing, impossible to read. Windows with controls off the screen, unresizeable dialogs, impossible to hit "OK" and "Cancel" buttons. Try failsafe login thinking basic screen settings will be happy, nogo. Instead takes down entire machine when I attempt to exit. Full-screen bash is better, that's how bad it is. Control-Alt-F1 brings it up.

Can't bring 2nd disk (old master) online. BIOS sees it, early kernel sees it, but it disappears in later kernel stage. "/dev/hdc1 is not valid block device". Try setting access mode explicitly in BIOS. From auto to CHS. Nogo. From CHS to LBA. Nogo. This won't work.

Learn about http://sourceforge.net/projects/ext2fsx/ Early alpha ext2 support for Mac OS X. Pull from cvs. Again. Again. Again. SourceForge CVS server flaky. Got it. Try to build PBX project. Missing headers. Need headers from Darwin. No time to track down dependency crap. Bye bye.

Drive to ISP with drives in tow (trip #5).

Wednesday, 9:49a
Back at ISP. Wipe expendable disk with RH8 again. Plug in 2nd disk, boot from fresh install disk. Successfully mount disk. Oops, this is unimportant mail spool disk. Power down and swap and boot again. Messages spool by, too fast to read. Auto restart. Repeats. Repeats. Too fast to read. Have digital camera (it's always in my bag) take photo of screen, pop CF card into TiBook which fires iPhoto.

Errors: EXT3-fs: recovery required on readonly file system and then write access will be enabled during recovery and then recovery complete. Sounds good until trying to remount root filesystem read write... EXT3-fs warning: mounting unchecked fs, running e2fsck is recommended.

This happens early in boot, not controlled by fstab. Try single user mode. Nogo. Kernel is attempting automount and failing, and can't stop it. Can't boot with disk attached, and can't dynamically attach after boot. Catch 22.

Given the 2U space, the IDE cables aren't long enough to attach two disks and CD at same time. Detach boot disk, attach CD and boot from CD, rescue mode. Disk mounted, read-only, with unspecified errors. Doesn't look too bad, can see disk's contents OK. Tell ifconfig to get network setup right. Can ping slave now. I have tar. I have ftp. I have ssh/scp. Let's go.

Trying to scp Databases directory. scp can't find ssh. Feed it explicit -S switch with ssh path. Nogo, ssh itself fails. Home directory is read-only, so ssh can't write .ssh directory with slave cert info. Ick.

Turn on ftpd on slave. Fails. ftp states 331 Password required for wolf. but doesn't actually wait for password to be entered. Don't know what's wrong, but probably related to readonly boot filesystem.

Try to start sshd on rescue machine. Straight sshd fails. Config file not found. Of course, minimal readonly boot CD doesn't have such config file. Pass sshd explicit config file path to mounted rescue disk with valid sshd config file. Now complains about missing key files referenced in config file. Doesn't matter if I explicitly put alternate valid keypath in commandline switch -- config file always preferred. Suck.

Reboot in rescue read/write mode, so I can mod sshd config file to point to valid keypath. Before boot, run fsck.ext3 on disk. Clean. Force check. Still clean. It's crazy this thing can't boot.

Mod sshd config file, got sshd running. Try to ssh into rescue box. Fails. Set PermitRootLogin to yes. Kill, restart sshd. Nogo. Set PermitEmptyPasswords to yes. Kill, restart. Fail. sshd -d -D. Try to login again. Error message says root user not allowed since /bin/bash does not exist. True, this is sh. bash not loaded on rescue image. /etc/passwd can't be changed to list /bin/sh since it's on readonly image. Another Catch 22. sshd is not going to happen.

It's now 1pm. A bunch of friends have rented a limo. Field trip to the Chicago Auto Show. Come hell or high water, I'm going. Limo leaves at 3pm. I'm not letting this machine beat me. Guy behind me in line at ISP waiting for KVM access to his server. I yank entire machine and bring it back with me to office.

Wednesday, 1:30p
Back at office. Set up master box. We're going to attach two hard drives and CD at same time. Couldn't be done at ISP -- not enough room. Pop case, lay paper over motherboard. Unscrew hard disks, lay across paper. Just enough cable to attach everything, with case popped and drives splayed across motherboard. Ick, but it works.

Boot rescue CD. Figure out which disk is which, via fdisk and df -h. Mount old master boot disk readonly. Success. Mount freshly installed disk read/write. Successful. Throw cp -R at Databases directory. Solid disk lights. Three minutes later, realize multiple gigabyte database version of apache log database was there. Stop copy, get a little more selective. Pegging hard drive again, but this time for only 50 seconds.

Wednesday, 2:30p
Power down, eject CD and unplug master boot disk. Tarball databases. 85 megs. Try to upload directly. For some reason, can't get outside local network. Turn on sshd on TiBook. scp Database tarball. Success. Will take a while to upload over cable modem to server, copy to always-on stationary G4. Upload started from there. Email tarball cksum to friend, who will take over slave service restoration.
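For the curious, the checksum handoff can be sketched in a few lines of Python. This is only an illustration: I used plain cksum, while the sketch swaps in SHA-256 from hashlib, and the function names file_digest and matches are made up for the example.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`.

    Assumption: SHA-256 stands in for the cksum CRC used in the story.
    The file is read in chunks so a large tarball never sits in memory.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches(path, expected_hex):
    """True if the file's digest equals the digest the sender emailed."""
    return file_digest(path) == expected_hex.strip().lower()
```

The receiver runs matches() on the uploaded tarball with the emailed digest before restoring anything from it.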

Finish Time: 3:13p. Limo's not even here yet. I declare victory, and attend the Auto Show. Total static website downtime: ~3.5 hours. Total commerce downtime: ~26 hours.

Postmortem
I believe master's kernel seize was caused by a race condition in ipchains. I have little to back this up except that the main difference between master and slave is that slave is a single-processor while master is a dual-processor, opening the SMP bug window. Plus, ipchains is rather state driven, and writing efficient multiprocessor-safe state machines is rather hard. It's easy to slip in an optimization that always works on uniprocessors that can break on multiprocessors under obscure conditions.
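To show the class of bug I mean (this is not ipchains code, and I have no evidence this is what actually happened), here is a toy Python sketch of shared state updated with and without a lock. ConnTable, bump_unsafe, bump_safe and run are all hypothetical names for the illustration.

```python
import threading


class ConnTable:
    """Toy shared state table, standing in for firewall connection state."""

    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()

    def bump_unsafe(self):
        # Read-modify-write with no lock: two threads can read the same
        # value and one update is silently lost.
        current = self.count
        self.count = current + 1

    def bump_safe(self):
        # The lock makes the read-modify-write a single atomic step.
        with self._lock:
            self.count += 1


def run(method_name, threads=4, iterations=50_000):
    """Call the named method from several threads; return the final count."""
    table = ConnTable()

    def worker():
        bump = getattr(table, method_name)
        for _ in range(iterations):
            bump()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return table.count
```

run("bump_safe") always returns threads * iterations. run("bump_unsafe") may come up short, or may not, depending on scheduling, which is exactly what makes this kind of bug so hard to catch on a box where it happens to never trigger.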

In terms of why the master didn't bounce back from the unclean reboots, I have two explanations, or a deadly combination of the two:

  • ext3 suckage. ext3 simply isn't a good "journalled" file system. I think we'll go with IBM's jfs next.
  • DeathStar. The master boot disk was an IBM DeskStar, which has been dubbed the DeathStar by some unlucky folks. Chuck Goolsbee, of digital.forest recently said:
    One notable item... all Xserves use the infamous IBM/Hitatchi Deskstar ATA drives. Around here we call them "DeathStars" as they have the highest failure rate of any hard drive since the era of Rodime & Jasmine. I cannot believe that Apple would ship an *server*, much less an "enterprise class server" with such a drive installed. Sigh.

    I'm unsure if this was a hardware failure. We'll see if the client wants to spend money on determining if the disk was faulty in any way. In any case, a couple of new Seagates are now heading our way.

Finally, the rescue mission. First, it was a mistake not to have clustering/replication going from the start. Even at short backup intervals, it's a hack, not a real solution.

If I had known it was going to take ~26 hours to get the original database back online, I would have put the stale backup into production. That would still require rescuing the original, and then we'd have to diff the original against the stale backup, and then rewrite primary keys and reinsert records into the now live backup. A lot more work, but that work could have proceeded on a nonemergency schedule, which would have saved rescue money (I would need little involvement at my comparatively higher rates) and minimized lost sales.
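A rough Python sketch of that diff-and-reinsert step, under loud assumptions: integer primary keys, tables modeled as plain dicts, rows only added (never edited) on the original after the backup, and no foreign keys pointing at the rewritten keys. merge_back and its parameters are hypothetical, not anything we actually ran.

```python
def merge_back(snapshot_keys, original, live):
    """Reinsert rows created on the original after the stale backup was taken.

    snapshot_keys: primary keys that existed when the backup was made
    original:      dict of key -> row rescued from the old master
    live:          dict of key -> row in the backup now in production
                   (mutated in place)
    Returns a dict mapping each old primary key to its new one.
    """
    next_key = max(live, default=0) + 1
    remap = {}
    for old_key in sorted(original):
        if old_key in snapshot_keys:
            continue  # row was already in the backup; nothing to reinsert
        # The live copy may have reused this key for a newer row, so the
        # rescued row always gets a fresh key past the current maximum.
        remap[old_key] = next_key
        live[next_key] = original[old_key]
        next_key += 1
    return remap
```

The returned mapping is what you would need to fix up any references to the old keys, which is the part that makes this "a lot more work" in practice.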

Another issue that held up the rescue mission was the general lameness of the rescue environment. The lack of real network connectivity and the readonly boot image were showstoppers. We need better tools for this stuff. Better tools would have saved hours. I seem to remember folks offering "professional" rescue bootable CDs for Linux, some for free, some for money. Will need to investigate these.

Update: Turns out there were two drives in the broken box. One was an IBM, the other a Seagate. Neither was marked. I assumed that the IBM drive was the root drive, while the Seagate was the perishable mail spool disk. Wrong. It was the Seagate that held root. I'm becoming more convinced it was general ext3 suckage that killed the recovery.

Also Bill Bumgarner and Travis Cripps pointed out Knoppix to me. I'll have to check it out!

Sunday, February 23, 2003
12:00 AM