[Gllug] [OTish] Running an oncall rota

Wed Jul 14 07:07:44 UTC 2010

On 14 July 2010 00:35, Steve Parker <steve at steve-parker.org> wrote:
> On 14/07/10 00:11, Martin A. Brooks wrote:
>
> Written to scratch an itch. If you do oncall, or have to manage it, this
> might be of interest:
>
> http://blog.hinterlands.org/2010/07/running-an-oncall-rota/
>
>
> I find this one interesting:
>
> Have realistic expectations of oncall response times. If you need to
> guarantee that problems are attended to within 20-30 minutes then you should
> be running an overnight shift, not oncall.
> Where I'm working, they have 15-minute response times, so people end up
> dumping a half-full trolley of shopping in the supermarket to go home and
> dial in to work, etc.  However, the volume is relatively low (on average
> less than 1 call per shift), which would make it very difficult to justify
> paying a sensible rate for a rota of decent engineers, plus out-of-hours
> compensation, to deal with maybe one issue per day; out-of-hours is 5x18h
> Monday-Friday plus 2x24h weekend. That's 138 hours, or 3.5 full-time
> positions. Add to that, they have a first-line and second-line support, so
> realistically 7 full-time positions to deal with maybe 5 out-of-hours
> incidents per week.

+1

Thankfully I've not had to be on call for over 10 years now. However,
my experience of working for (multi)national organisations with a
substantial server estate is that they require an on-call response
within 15 - 30 minutes. Of those I worked for I cannot imagine them
being receptive to the idea of an overnight shift. You have operations
staff for overnight and to filter alerts and decide who to page etc.
If you have to drop your shopping or be otherwise slightly
inconvenienced? Well, that's why you get standby and call-out charges.

> There is also an Ops team in the Production datacentre, who should deal with
> any initial alerts; what I don't have visibility of, is how many issues they
> deal with, that therefore don't go to 1st/2nd-line support. Realistically,
> an estate of <2000 servers should have very few out-of-hours emergencies;
> surely replacing a mirrored root-disk can wait til morning (unless it fails
> at 6pm Friday at the start of a bank holiday weekend)?

This all depends on things such as whether the servers are all little
commodity boxes or large many-way SMP, and whether you have developers
that write applications/jobs which run away, loop and crash etc. I've
been on call for server estates far smaller than 2,000 in number and
where I've had 5 or 6 calls throughout the night, ensuring that I get
next to no sleep. Yes, this may indicate that there is a problem
elsewhere in the IT organisation, but sometimes (frequently) this is
not in your power to fix, and if a system you are responsible for is
suffering as a result...

Cheers,

Andrew

-- 
Andrew Back
mailto:andrew at osmosoft.com
http://carrierdetect.com
-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug