
I.T. Happens is a collection of true stories about the horrors and nightmarish situations we've had to deal with over the years. No matter how hard you try, sometimes you just can't avoid the crap.

2006-06-18: Arvo shocked to death

This is quoted from an e-mail that was sent out to users after the recovery was complete.

 If you don't know by now, there was a major suso server outage
yesterday starting at 05:30 GMT (1:30am EDT) and going until 23:00 GMT
(7pm EDT), making this the longest suso outage ever and honestly one of
the worst system administration experiences I've ever had.  So it was
pretty bad.  Despite all that, we didn't lose very much: only database
data (MySQL or PostgreSQL) from Tuesday has been lost.

  The problem was caused mostly by a faulty power supply in arvo that
literally was smoking when we came in to look at it. You know that smell
of burning electronics?  It seems that the power supply also fried the
motherboard and one of the hard drives on the computer. During the
course of trying to recover, the /var partition also became very
corrupted.  So it was a long day, mostly spent trying to fix hardware.

  There are still some services that are having problems.  The MySQL
server is bogged down right now because of all the mail coming in, and
that is causing some people's websites to display errors.  I will fix
these things this morning.  I think SSH is also having issues.  Any
mail that was sent to you during the outage should come into your
mailbox later because the mail server that it came from should have
queued the message during the outage. 

  As far as data loss goes, since we back up to tape every night and also
do a full backup on Sundays, we only lost one day of data (Tuesday) on
the /var partition, which makes this only relevant to those who have
databases.  There seems to be no other data loss on the other partitions
like /home, /var/log and /.  So all of your website files, logs and mail
should be intact.

  The reason things took so long is that the hardware wasn't
cooperating.  At one point we were going to move the server to another
temporary machine, but that didn't go any better, so we found ways to
make it work on the original server.  Fixing the hardware problems took
the first 12 hours.  The second major problem we had was the
corruption on the /var partition.  Over 35,000 files were lost when we
tried to just run a filesystem check.  Also, every time I tried to find
files in the /var partition that were recent so that I could recover
them, Linux would hard lock, making recovering these files impossible.
I did, however, copy the raw partition as is over to our backup machine
so that I can take a look at it later and possibly recover some other
things.

  In the end I had to reformat the /var partition and copy over the
backups from Sunday and Monday.  Tuesday is gone, meaning that any MySQL
or PostgreSQL inserts or updates that were made on Tuesday after about
0600 GMT are lost.

  We've had more problems with this server than any other
suso server ever.  Usually our servers run for as much as a year before
crashing or needing to be rebooted.  Marina is convinced that this is
because we installed it on her birthday.  ;-)

  Anyways, I've had enough of these problems and we don't trust arvo
anymore so we're going to replace it ASAP.  This means that there will
be another outage this month or next.  I'll keep you posted.

  My apologies for the inconvenience this outage has caused.  Let me
know if you have any questions.

Later we determined that arvo was on a UPS system that was failing and shorting out power supplies. And it was a nice APC rack mount UPS too. Once we replaced that, the problems stopped.