This is a note to myself in the future, because I apparently keep forgetting.
I have three rack-mounted Linux servers at a co-location facility run by digital forest, a company whose name I shall embroider in gold thread on a sampler to be mounted above my office computer. I've had a few peculiarities with them recently, including a weird problem getting PHP to work reliably with Apache in specific ways, so I decided to upgrade from Red Hat 9 to Fedora Core 4. Fedora is a project that continues the open-source, freely available Linux distribution that was at the heart of Red Hat's software. Red Hat now focuses on support-driven commercial versions of Linux designed for enterprises. (In fact, I also bought a copy of their Enterprise Linux distribution for one of these servers, a deployment I'm now rethinking.)
After performing some tests, including a full upgrade of a plain Red Hat 9 installation on an unused Linux box, I went in Sunday night expecting to be there three to four hours in the best case and six to seven in the worst case. Instead, I was there from about 7:45 pm until about 6:45 am.
What went wrong? I forgot a lesson I'd learned before. Many people have had successful Linux upgrades, including using the yum software update tool to upgrade Linux while it was running and then rebooting into an entirely upgraded installation, but I have rarely had good luck. Linux upgrades have poor fallback positions. The Red Hat and Fedora installers don't leave an intact system, but rather write over software as they go. Unlike Mac OS X, with its Archive and Install option and its basic behavior of not rewriting boot blocks or replacing items until the installation is complete, Red Hat and Fedora just plow ahead. I should probably look into safer Linux distributions, but Red Hat works fine for me as a platform.
My path should have been to migrate services from one of the boxes to another, copying all non-system data, and then to wipe that box and install. In the event of failure, I'd still have working services. With a successful install, I'd customize it with my settings and move services back. Repeat as need be.
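For my future self, that procedure can be sketched as a small planning script. This is only a sketch under assumptions: the server and service names are made up, the "next box in the list" choice of temporary home is arbitrary, and the script only lists the steps in order rather than running anything.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    # Hypothetical record: a box and the services it normally runs.
    name: str
    services: set[str] = field(default_factory=set)


def plan_rolling_rebuild(servers: list[Server]) -> list[str]:
    """List the steps to rebuild each server, one box at a time.

    Each box's services are moved to the next box in the list before
    the wipe, so every service keeps a working home throughout.
    """
    if len(servers) < 2:
        raise ValueError("need at least two servers to keep services running")

    steps = []
    for i, target in enumerate(servers):
        # Assumption: the next server in the list has room to host these.
        host = servers[(i + 1) % len(servers)]
        moved = sorted(target.services)
        for service in moved:
            steps.append(f"copy {service} data from {target.name} to {host.name}")
            steps.append(f"start {service} on {host.name}")
        steps.append(f"wipe and install {target.name}")
        steps.append(f"verify {target.name} boots and joins the network")
        for service in moved:
            steps.append(f"restore {service} from {host.name} to {target.name}")
    return steps


if __name__ == "__main__":
    boxes = [Server("alpha", {"web"}), Server("beta", {"db"})]
    for number, step in enumerate(plan_rolling_rebuild(boxes), start=1):
        print(f"{number}. {step}")
```

The point of the ordering is that the wipe of any box comes only after its services are already running somewhere else, and the next box is not touched until the first one is verified and restored.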
Instead, I wound up with two servers down and hours of unhappiness sorting out what went wrong. The best thing I had was an installer that let me run "linux rescue," a limited environment that let me mount volumes, copy files, and try again.
Before starting, I had made good backups of critical files on a fourth server, an Apple Xserve, and thus wasn't worried about losing critical data. Less important data has been backed up by digital forest, and I need to review whether any of those files need retrieval. Probably just a few.
I left in the morning (during rush hour, no less) having gotten my third machine to run all the services save one that the other two handled. This allowed me to bring up my most important Web site. The one I had to leave down with an apology note in response to all requests was isbn.nu, which has a high database dependency.
Before leaving, I'd sent email about the problems to Penguin Computing, the source of all goodness, I now know. By the time I got home, they'd answered my emails, and a series of emails and phone calls continued after I woke up from three hours' sleep to head back in. With their advice, we determined that there was probably a hardware fault with my highest-performance server, which used SCSI instead of IDE. It had a SCSI card not supported in Red Hat 9, so they had installed a customized version. But Fedora Core 4 and the Red Hat commercial version should both have worked fine. (I managed to do a full install of FC4, which hung on reboot at the SCSI load stage, while the commercial Red Hat booted once, refused to join a network, and then hung on reboots.)
They helped get me to a position where I was able to wipe and install one of the servers with Red Hat 9 and get it to a working position. Last night after my boy went to sleep, I turned it into a database server and re-enabled isbn.nu, which is a real moneymaker.
Today, Penguin issued me an RMA for the SCSI-equipped server, and the fine folks at digital forest unracked it, packed it, and shipped it overnight for me. Despite my installing unsupported software, the Penguin folks are going above and beyond by helping out. They may replace the SCSI hardware--I have a three-year on-site warranty, so they could have sent a tech, but that tech doesn't do software--but they'll also wipe and replace the faulty OS and make sure it works.
So, my lesson learned, I say to anyone reading this far and to my future self: Copy, wipe, install, restore. No more of this upgrade nonsense for production systems. Life's too short for server room all-nighters. The other tip: never try to handle two servers at once. If I'd tried on one and it had failed, I could have moved databases with great ease. Instead, I tried on two at the same time.
All praise to digital forest, for being a great, great place and having a 24-by-7 network operations center (new since their move to a wonderful new facility south of Boeing), and to Penguin Computing for their prompt and incredibly helpful efforts to get me running.