[YLUG] Server purchase
Doug Winter
doug at isotoma.com
Sat May 20 10:30:23 BST 2006
Richard G. Clegg wrote:
> I wonder if I could get the advice of the massed ranks of the group for
> two things here. I'm trying to get a high-availability server going
> within the dept of Mathematics here at York. We've had lots of
> reliability issues associated with hardware failure recently (three
> downtime incidents in four weeks after a few years of trouble-free
> running -- three independent hardware failures).
Always the way :)
> 1) Some kind and lovely person from a different dept is going to buy us
> a shiny new server and I've no real idea what I am looking for. What I
> want is (I think) hardware RAID, about 150GB of disk space and 2GB of
> main memory. This must all work with Linux (Debian sarge). I don't
> need to buy it pre-installed; hardware with no OS is better, since I
> would just wipe whatever was on it. (Due to the hardware failures I can
> now reinstall Debian in 20 minutes and get the system back up and
> running in an hour, the majority of which is just waiting for copies of
> stuff to FTP across from backup.) I've had a brief scout around, and
> the hardware RAID I could see involved stupidly expensive SCSI disks --
> is this unavoidable?
Hardware RAID gives far better performance than software RAID, and the
3ware cards are excellent. Just mirror the disks. I'm wary of using
anything like RAID 5 because of how difficult it is to get your data off
if anything goes really *really* badly wrong (I have been here, and
would not wish to do this ever again). Whatever system you get, get
hotswap drive caddies - it's not much more expensive but saves a lot of
grief when a disk goes. Also RTFM for the RAID cards, and test
rebuilding a mirror by pulling live disks out. Write down what to do,
which disk is called which and so on. Do not risk pulling the wrong
disk and losing everything in an emergency.
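As an aside, a quick script along these lines (just a sketch - it
assumes the disks are visible to the kernel directly and that udev is
populating /dev/disk/by-id, which recent kernels do; behind a hardware
controller you'd use the card's own management tool instead) will show
which device node belongs to which drive serial number, so you can
label the caddies before you ever need to pull one in anger:

    #!/usr/bin/env python
    # List the /dev/disk/by-id symlinks so you can match drive serial
    # numbers (printed on the disk label) to kernel device nodes.
    import os

    BYID = "/dev/disk/by-id"   # populated by udev on recent kernels

    for name in sorted(os.listdir(BYID)):
        target = os.path.realpath(os.path.join(BYID, name))
        # 'name' usually embeds the model and serial number;
        # 'target' is the device node it points at (/dev/sda etc.)
        print("%-60s -> %s" % (name, target))

Stick the output on the front of the case next to the labelled caddies.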
> 2) The idea at the moment seems to be to have two machines, a main
> machine and a backup, and to use a virtual IP address, ipchains and
> heartbeat to get the machines to switch over on failure. The MySQL
> database will be mirrored (we've set that up before), and I guess there
> are other things we might need. There is also a daily backup to another
> machine at a geographically remote location (in case of fire/theft of
> our main servers). Does this seem a reasonable setup? Anything I need
> to think about?
Why do you need heartbeat and a hot spare? Can you really not cope
with even 30 minutes of downtime in the case of catastrophic failure?
My experience is that hot failover is *way* more hassle than it's
really worth, and it introduces a load of complex failure modes that
are difficult to test. Keep it simple, unless you really, really can't
cope with any downtime at all. In that case you probably want a
load-balanced cluster, because it's easier to plan for maintaining a
running cluster than to hope your failover works - which is very hard
to test in real-world operation.
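For what it's worth, this is the sort of thing I mean by keeping it
simple - a rough sketch (the hostname, port and mail addresses are
made up) that just checks whether the main box is answering and mails
a human, rather than trying to switch anything over automatically:

    #!/usr/bin/env python
    # Minimal "is it up?" check - run it from cron on the backup box.
    # If the main machine stops answering, mail a human rather than
    # trying to fail over automatically.
    import socket, smtplib

    HOST, PORT = "maths1.example.ac.uk", 80     # made-up name and port
    ALERT_FROM = "monitor@example.ac.uk"        # made-up addresses
    ALERT_TO = "admin@example.ac.uk"

    def is_up(host, port, timeout=10):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(timeout)
        try:
            s.connect((host, port))
            return True
        except socket.error:
            return False
        finally:
            s.close()

    if not is_up(HOST, PORT):
        msg = ("Subject: %s:%d is not answering\n\n"
               "Time to go and look at the server." % (HOST, PORT))
        smtplib.SMTP("localhost").sendmail(ALERT_FROM, [ALERT_TO], msg)

Run it from cron on the second machine every few minutes, and let a
person decide whether it's time to bring the spare up.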
Like some others on this thread I'd say any serious server hardware
should have hardware RAID as a matter of course, ECC RAM, and an
overspecified PSU (vendor PSUs are often quite underpowered).
Make sure you enable SMART monitoring and get alerted about problems -
often SMART will pick up issues nice and early, in time for you to do
something about them. The same goes for your RAID. I have seen setups
where people had RAID, but never got notified when one half of a
mirror died. And then, years later, the other half died too. Duh.
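Something along these lines, run nightly from cron, covers the basics.
It's only a sketch - the device list and mail addresses are made up,
and it assumes smartmontools is installed:

    #!/usr/bin/env python
    # Nightly SMART health check - run from cron, mail on any failure.
    # Relies on smartmontools' "smartctl -H", which reports an overall
    # health assessment and exits non-zero when a drive is failing.
    import os, smtplib

    DISKS = ["/dev/sda", "/dev/sdb"]            # adjust for your box
    ALERT_FROM = "monitor@example.ac.uk"        # made-up addresses
    ALERT_TO = "admin@example.ac.uk"

    failing = []
    for disk in DISKS:
        status = os.system("smartctl -H %s > /dev/null 2>&1" % disk)
        if status != 0:
            failing.append(disk)

    if failing:
        msg = ("Subject: SMART problem on %s\n\n"
               "Run smartctl -a on these disks and order replacements."
               % ", ".join(failing))
        smtplib.SMTP("localhost").sendmail(ALERT_FROM, [ALERT_TO], msg)

If the disks sit behind a 3ware card you'll need smartctl's 3ware
pass-through options rather than plain /dev/sdX names; the
smartmontools documentation covers that.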
The things that break are the things with moving parts: PSUs, disks
and fans. RAM goes sometimes, although that's rarer. Expect the moving
parts to fail. MTBF on disks has gone through the floor with their
increasing capacity. I used to reckon on 100 years, but now I think
it's more like 20 - and with, say, four disks in a box, that works out
at a failure roughly every five years.
Having a disaster recovery plan is vital, and so is making sure you
have hardware available to recover onto - if you have a cold spare,
TEST YOUR RECOVERY PLAN. Test it at regular intervals too, because
configuration changes can introduce incompatibilities, and you don't
want to discover these when the excrement has hit the fan.
"Which serial port is the terminal on again?" is not a 2am question.
Finally, and I think more important than anything else in the entire
world of running servers: TEST YOUR BACKUPS. Make it a once-a-month
job, and test it by doing a real restore of some real data. Nobody
cares whether you can back up - what they care about is that you can
restore :D
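As a starting point, something like this does the comparison for you.
Again it's only a sketch - the paths are invented, and it assumes
you've already restored a sample of last night's backup somewhere:

    #!/usr/bin/env python
    # Rough restore test: compare a random sample of restored files
    # against the live copies by checksum. Paths are made up - point
    # them at your data and at wherever the test restore ended up.
    import hashlib, os, random

    LIVE = "/var/www"                  # live data
    RESTORED = "/tmp/restore-test"     # the restored copy of it
    SAMPLE = 20                        # how many files to check

    def checksum(path):
        f = open(path, "rb")
        digest = hashlib.md5(f.read()).hexdigest()
        f.close()
        return digest

    files = []
    for root, dirs, names in os.walk(LIVE):
        for n in names:
            files.append(os.path.join(root, n))

    picked = random.sample(files, min(SAMPLE, len(files)))
    bad = 0
    for path in picked:
        restored = os.path.join(RESTORED, os.path.relpath(path, LIVE))
        if not os.path.isfile(restored) or checksum(restored) != checksum(path):
            print("MISMATCH: %s" % path)
            bad += 1

    print("%d problems in %d files checked" % (bad, len(picked)))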
Cheers,
Doug.
--
doug at isotoma.com / Isotoma, Open Source Software Consulting
Tel: 020 7620 1446 / Mobile: 07879 423002 / Fax: 020 7900 6980
Skype: dougwinter / http://www.isotoma.com
Lincoln House, 75 Westminster Bridge Road, London, SE1 7HS