[YLUG] Server purchase
Doug Winter
doug at isotoma.com
Sat May 20 10:30:23 BST 2006
Richard G. Clegg wrote:
> I wonder if I could get the advice of the massed ranks of the group for
> two things here. I'm trying to get a high-availability server going
> within the dept of Mathematics here at York. We've had lots of
> reliability issues associated with hardware failure recently (three
> downtime incidents in four weeks after a few years of trouble-free
> running -- three independent hardware failures).
Always the way :)
> 1) Some kind and lovely person from a different dept is going to buy us
> a shiny new server and I've no real idea what I am looking for. What I
> want is (I think) hardware RAID, about 150GB of disk space and 2GB of
> main memory. This must all work with Linux (Debian sarge). I don't
> need to buy it pre-installed; hardware with no OS is better, since I
> would just wipe whatever was on it. (Due to the hardware failures I can
> now reinstall Debian in 20 minutes and get the system back up and
> running in an hour, the majority of which is just waiting for copies of
> stuff to FTP across from backup.) I've had a brief scout around, and
> the hardware RAID I could see involved stupidly expensive SCSI disks --
> is this unavoidable?
Hardware RAID gives far better performance than software RAID, and the
3ware cards are excellent. Just mirror the disks. I'm wary of using
anything like RAID 5 because of how difficult it is to get your data off
if anything goes really *really* badly wrong (I have been here, and
would not wish to do this ever again). Whatever system you get, get
hotswap drive caddies - it's not much more expensive but saves a lot of
grief when a disk goes. Also RTFM for the RAID cards, and test
rebuilding a mirror by pulling live disks out. Write down what to do,
which disk is called which and so on. Do not risk pulling the wrong
disk and losing everything in an emergency.
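As an aside, a quick script along these lines (just a sketch - it
assumes the disks are visible to the kernel directly and that udev is
populating /dev/disk/by-id, which recent kernels do; behind a hardware
controller you'd use the card's own management tool instead) will show
which device node belongs to which drive serial number, so you can
label the caddies before you ever need to pull one in anger:

    #!/usr/bin/env python
    # List the /dev/disk/by-id symlinks so you can match drive serial
    # numbers (printed on the disk label) to kernel device nodes.
    import os

    BYID = "/dev/disk/by-id"   # populated by udev on recent kernels

    for name in sorted(os.listdir(BYID)):
        target = os.path.realpath(os.path.join(BYID, name))
        # 'name' usually embeds the model and serial number;
        # 'target' is the device node it points at (/dev/sda etc.)
        print("%-60s -> %s" % (name, target))

Stick the output on the front of the case next to the labelled caddies.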
> 2) The idea at the moment seems to be to have two machines, a main
> machine and a backup, and to use a virtual IP address, ipchains and
> heartbeat to get the machines to switch over on failure. The MySQL
> database will be mirrored (we've set that up before), and I guess there
> are other things we might need. There is also a daily backup to another
> machine at a geographically remote location (in case of fire/theft of
> our main servers). Does this seem a reasonable setup? Anything I need
> to think about?
Why do you need heartbeat and a hot spare? Can you really not cope
with even 30 minutes of downtime in the case of catastrophic failure?
My experience is that hot failover is *way* more hassle than it's
really worth, and it introduces a load of complex failure modes that
are difficult to test. Keep it simple, unless you really, really can't
cope with any downtime at all. In that case you probably want a
load-balanced cluster, because it's easier to plan for maintaining a
running cluster than to hope your failover works - which is very hard
to test in real-world operation.
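For what it's worth, this is the sort of thing I mean by keeping it
simple - a rough sketch (the hostname, port and mail addresses are
made up) that just checks whether the main box is answering and mails
a human, rather than trying to switch anything over automatically:

    #!/usr/bin/env python
    # Minimal "is it up?" check - run it from cron on the backup box.
    # If the main machine stops answering, mail a human rather than
    # trying to fail over automatically.
    import socket, smtplib

    HOST, PORT = "maths1.example.ac.uk", 80     # made-up name and port
    ALERT_FROM = "monitor@example.ac.uk"        # made-up addresses
    ALERT_TO = "admin@example.ac.uk"

    def is_up(host, port, timeout=10):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(timeout)
        try:
            s.connect((host, port))
            return True
        except socket.error:
            return False
        finally:
            s.close()

    if not is_up(HOST, PORT):
        msg = ("Subject: %s:%d is not answering\n\n"
               "Time to go and look at the server." % (HOST, PORT))
        smtplib.SMTP("localhost").sendmail(ALERT_FROM, [ALERT_TO], msg)

Run it from cron on the second machine every few minutes, and let a
person decide whether it's time to bring the spare up.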
Like some others on this thread I'd say any serious server hardware
should have hardware RAID as a matter of course, ECC RAM, and an
overspecified PSU (vendor PSUs are often quite underpowered).
Make sure you enable SMART monitoring and get alerted about problems -
often SMART will pick up issues nice and early, in time for you to do
something about them. The same goes for your RAID. I have seen setups
where people had RAID, but never got notified when one half of a
mirror died. And then, years later, the other half died too. Duh.
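Something along these lines, run nightly from cron, covers the basics.
It's only a sketch - the device list and mail addresses are made up,
and it assumes smartmontools is installed:

    #!/usr/bin/env python
    # Nightly SMART health check - run from cron, mail on any failure.
    # Relies on smartmontools' "smartctl -H", which reports an overall
    # health assessment and exits non-zero when a drive is failing.
    import os, smtplib

    DISKS = ["/dev/sda", "/dev/sdb"]            # adjust for your box
    ALERT_FROM = "monitor@example.ac.uk"        # made-up addresses
    ALERT_TO = "admin@example.ac.uk"

    failing = []
    for disk in DISKS:
        status = os.system("smartctl -H %s > /dev/null 2>&1" % disk)
        if status != 0:
            failing.append(disk)

    if failing:
        msg = ("Subject: SMART problem on %s\n\n"
               "Run smartctl -a on these disks and order replacements."
               % ", ".join(failing))
        smtplib.SMTP("localhost").sendmail(ALERT_FROM, [ALERT_TO], msg)

If the disks sit behind a 3ware card you'll need smartctl's 3ware
pass-through options rather than plain /dev/sdX names; the
smartmontools documentation covers that.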
The things that break are the things with moving parts: PSUs, disks
and fans. RAM goes sometimes, although that's rarer. Expect the moving
parts to fail. MTBF on disks has gone through the floor with their
increasing capacity. I used to reckon on 100 years, but now I think
it's more like 20 - and with, say, four disks in a box, that works out
at a failure roughly every five years.
Having a disaster recovery plan is vital, and so is making sure you
have hardware available to recover onto - if you have a cold spare,
TEST YOUR RECOVERY PLAN. Test it at regular intervals too, because
configuration changes can introduce incompatibilities, and you don't
want to discover these when the excrement has hit the fan.
"Which serial port is the terminal on again?" is not a 2am question.
Finally, and I think more important than anything else in the entire
world of running servers: TEST YOUR BACKUPS. Make it a once-a-month
job, and test it by doing a real restore of some real data. Nobody
cares whether you can back up - what they care about is that you can
restore :D
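As a starting point, something like this does the comparison for you.
Again it's only a sketch - the paths are invented, and it assumes
you've already restored a sample of last night's backup somewhere:

    #!/usr/bin/env python
    # Rough restore test: compare a random sample of restored files
    # against the live copies by checksum. Paths are made up - point
    # them at your data and at wherever the test restore ended up.
    import hashlib, os, random

    LIVE = "/var/www"                  # live data
    RESTORED = "/tmp/restore-test"     # the restored copy of it
    SAMPLE = 20                        # how many files to check

    def checksum(path):
        f = open(path, "rb")
        digest = hashlib.md5(f.read()).hexdigest()
        f.close()
        return digest

    files = []
    for root, dirs, names in os.walk(LIVE):
        for n in names:
            files.append(os.path.join(root, n))

    picked = random.sample(files, min(SAMPLE, len(files)))
    bad = 0
    for path in picked:
        restored = os.path.join(RESTORED, os.path.relpath(path, LIVE))
        if not os.path.isfile(restored) or checksum(restored) != checksum(path):
            print("MISMATCH: %s" % path)
            bad += 1

    print("%d problems in %d files checked" % (bad, len(picked)))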
Cheers,
Doug.
--
doug at isotoma.com / Isotoma, Open Source Software Consulting
Tel: 020 7620 1446 / Mobile: 07879 423002 / Fax: 020 7900 6980
Skype: dougwinter / http://www.isotoma.com
Lincoln House, 75 Westminster Bridge Road, London, SE1 7HS