[Scottish] Re: Large Linux environments

Fri Apr 11 21:45:11 BST 2008

Hi Rob,

> how companies manage 
> large numbers of Linux servers (500-700 machines).

Badly?! :-) Just jesting!

> What distros do 
> data-centers use

There was a time when it was unequivocally Red Hat. These days I don't think it'd
be fair to say that; the playing field has opened up slightly (for the better!).
I know of 2 separate companies near Glasgow going down the Ubuntu & SUSE routes
for medium (100ish node) clusters.

If you've got budget for good support, I've found the SUSE sales reps are
particularly co-operative on price both times I've used them.

> what management software (Elwell says he uses 
> cfengine).

cfengine (crazy config language though ;-), puppet (quite buggy!), ad hoc shell
scripts with SSH & keyed logins. Where I work we use a mixture of these & a
distributed job scheduling tool (and lots of prayers!) for our sizable compute
cluster (2,000ish nodes).
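To give a flavour of the "ad hoc scripts with SSH & keyed logins" approach, here
is a minimal sketch in Python rather than shell. It is an illustration under
assumptions, not what we actually run: the function names, the default "root"
user and the timeouts are all made up for the example. It relies on key-based
logins already being in place, since BatchMode makes ssh fail rather than
prompt for a password.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_ssh_command(host, command, user="root"):
    """Build an argv list for a non-interactive, key-only SSH call.

    The default user is an assumption for this sketch.
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never prompt for a password; fail instead
        "-o", "ConnectTimeout=10",  # don't hang forever on a dead host
        f"{user}@{host}",
        command,
    ]


def run_on_hosts(hosts, command, runner=subprocess.run, max_workers=20):
    """Run `command` on every host in parallel.

    Returns {host: (returncode, stdout)}; a host that times out or cannot
    be reached at all is reported with returncode -1 and the error text.
    """
    def one(host):
        try:
            proc = runner(
                build_ssh_command(host, command),
                capture_output=True, text=True, timeout=60,
            )
            return host, (proc.returncode, proc.stdout.strip())
        except (subprocess.TimeoutExpired, OSError) as exc:
            return host, (-1, str(exc))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(one, hosts))


# Example call (host names are hypothetical):
#   results = run_on_hosts(["node001", "node002"], "uname -r")
```

The `runner` parameter is only there so the fan-out logic can be exercised
without touching the network. Something like this gets unwieldy well before
2,000 nodes, which is roughly where cfengine or puppet start to earn their keep.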

It's not too difficult to come up with something that works well; if you get
stuck feel free to give me a shout, it's been about a year since I last got my
teeth into a decent new environment setup :-)

> How are updates managed and rolled out?

There are quite a few good, paid-for solutions out there. I only really have
hands-on experience with the Sun stuff, and I believe it's only supported for
specific versions of SUSE & Red Hat distros. If going down the Debian/Ubuntu
route, then you're absolutely laughing. It's trivial to set up a cracking
workflow for patches that just works (with separate testing & production
streams). I believe the exact same will be possible with CentOS/Red Hat these
days with their yum/createrepo tools.

In general though, the flow is as you'd expect: vendor repo -> testing/incoming
repo -> production repo. There'd likely also be a "local" repo for your own code.
You assign the prod hosts to get their updates from the prod repo and install
them automatically. You maybe assign some dev/uat/sit hosts to the testing repo.
After you're happy a given patch works (or doesn't) you can promote or pin it
appropriately with 1 command. It's not fire and forget; you need to allocate time
every week for managing the process, but it's rewarding and not frustrating or
too boring when done properly.
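The promote/pin flow above can be sketched as a toy model in Python. This is
only an illustration of the idea: the class and method names are invented, the
repos are plain dicts instead of real apt or yum repositories, and "pin" here
simply means "hold production at its current version".

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    name: str
    packages: dict = field(default_factory=dict)  # package name -> version


@dataclass
class PatchFlow:
    """Toy model of vendor -> testing -> production (names are hypothetical)."""
    testing: Repo = field(default_factory=lambda: Repo("testing"))
    production: Repo = field(default_factory=lambda: Repo("production"))
    pinned: dict = field(default_factory=dict)  # package name -> held version

    def sync_from_vendor(self, vendor_packages):
        """New vendor versions land in testing only, never straight in prod."""
        self.testing.packages.update(vendor_packages)

    def promote(self, name):
        """The patch worked on the test hosts: copy it to production."""
        if name in self.pinned:
            raise ValueError(f"{name} is pinned at {self.pinned[name]}")
        self.production.packages[name] = self.testing.packages[name]

    def pin(self, name):
        """The patch broke something: hold production at its current version."""
        self.pinned[name] = self.production.packages[name]
```

With made-up package versions, a weekly session might look like
`flow.sync_from_vendor({...})`, then `flow.promote("openssl")` for what passed
and `flow.pin("kernel")` for what didn't. The point is that production only
ever changes through an explicit promote.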

> What other 
> considerations are there when running an operation that large? These 
> machines would be accessible via the public internet.

Well, it's genuinely difficult to keep on top of that many hosts if they're
visible from the outside. Keeping them secure would probably keep 2 full-timers
on their toes all day :-) In general, lock down everything you can, be as anal
as your imagination will let you. Setting the noexec mount option on /data may
just be the difference between success and failure for a scripted attack.
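With that many hosts it's worth checking such settings automatically rather
than trusting that every box was built right. A minimal sketch in Python,
assuming Linux hosts where /proc/mounts is readable; the function names are
made up for the example, and /data is just the mount point mentioned above.

```python
def mount_options(mounts_text, mount_point):
    """Return the set of mount options for mount_point, or None if not mounted.

    Expects text in /proc/mounts format:
        device mount_point fstype options dump pass
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # If something is mounted twice on the same point, the later
            # entry is the one on top, so keep overwriting.
            found = set(fields[3].split(","))
    return found


def has_noexec(mounts_text, mount_point="/data"):
    """True only if mount_point is mounted and carries the noexec option."""
    opts = mount_options(mounts_text, mount_point)
    return opts is not None and "noexec" in opts


# On a real host:
#   with open("/proc/mounts") as f:
#       print(has_noexec(f.read()))
```

Run from whatever management tool you settle on, this gives a quick yes/no per
host that can be collected centrally.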

Have a plan B -- how do you keep providing service in the case that some hosts
become compromised?

Hope this helps!

-c