[Gllug] Diagnosing hardware faults

Mon Nov 29 11:23:14 UTC 2010

On 29 November 2010 11:00, John Edwards <john at cornerstonelinux.co.uk> wrote:

>
> For desktops which are used during office hours this is usually
> not noticed, but when you have a server on 24x7 for several years
> you should use ECC RAM to prevent problems. I think someone
> (John Hearns?) wrote about this in more detail several months ago.

Well, I wouldn't claim that ECC RAM prevents problems, but your
summary is accurate!

For x86 type machines you should be looking at mcelog regularly:
http://freshmeat.net/projects/mcelog/

I would say an error per week is acceptable - if you start getting
multiple errors per day on the same DIMM, it's time to swap it out.
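
As a rough sketch of that rule of thumb: assuming you have already
pulled (timestamp, DIMM label) pairs out of your error log by whatever
means, something like the following would flag DIMMs that log more than
one error on a single day. The event list, the DIMM labels and the
function name are made up for illustration, and nothing here parses
real mcelog output.

```python
from collections import Counter
from datetime import datetime


def dimms_to_swap(events, per_day_limit=1):
    """Return the sorted DIMM labels that logged more than per_day_limit
    errors on any single calendar day.

    events: iterable of (datetime, dimm_label) pairs, assumed to have
    been extracted from the machine check log already (hypothetical input).
    """
    # Count errors per (DIMM, calendar day).
    per_day = Counter((dimm, when.date()) for when, dimm in events)
    # A DIMM is suspect if any single day exceeds the limit.
    return sorted({dimm for (dimm, _day), n in per_day.items()
                   if n > per_day_limit})


if __name__ == "__main__":
    # Made-up sample: DIMM_A2 errors three times in one day,
    # DIMM_B1 errors twice, a week apart.
    sample = [
        (datetime(2010, 11, 21, 3, 10), "DIMM_B1"),
        (datetime(2010, 11, 28, 4, 45), "DIMM_B1"),
        (datetime(2010, 11, 28, 9, 0), "DIMM_A2"),
        (datetime(2010, 11, 28, 13, 30), "DIMM_A2"),
        (datetime(2010, 11, 28, 22, 15), "DIMM_A2"),
    ]
    print(dimms_to_swap(sample))
```

The one-per-day limit is only my threshold from above turned into a
default argument; tune it to whatever your own hardware tolerates.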

The SGI ICE clusters implement the 'worm' kernel module for reporting
memory errors.

Of course, this gets me started on a hobbyhorse. Those of you who
follow the Register will be used to referring to the Itanic, and
laughing along with El Reg when they make snide comments.
But the Itanium has superb RAS features - for instance when I have any
problem on an Itanium I can go to /var/log/salinfo/decoded
and get an accurate timestamped report of any CPU which saw a fault,
any NUMA routing chip which saw a fault, and a detailed report on any
DIMM errors - so all you do is send that report off to field service
and an engineer appears as if by magic (Mr Benn reference there) with
the correct part and knows which socket to fit it in.

Nehalem now implements similar features.

Remembering the talk by Eng Lim on the SGI Altix: if you recall, it
actively monitors ECC errors - any faulty DIMM banks are known,
and that space in memory is marked bad - so as time goes on a system's
available memory shrinks a tiny little bit.
Preferable to halting the system for a hardware replacement!
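
To make the idea concrete, here is a toy model of marking bad memory
and carrying on. It is only an illustration of the principle, not how
the Altix actually does it; the class, the method names and the
page-sized granularity are all my own assumptions.

```python
class MemoryMap:
    """Toy model of a system that retires memory regions showing ECC errors."""

    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.bad_pages = set()  # pages retired so far

    def report_ecc_error(self, page):
        """Mark a page bad so it is never handed out again."""
        if not 0 <= page < self.total_pages:
            raise ValueError("page out of range")
        # A set means repeated errors on one page retire it only once.
        self.bad_pages.add(page)

    def available_pages(self):
        """Usable pages left: the total minus everything retired."""
        return self.total_pages - len(self.bad_pages)


if __name__ == "__main__":
    mem = MemoryMap(total_pages=1000)
    for faulty in (5, 5, 7):  # made-up error reports
        mem.report_ecc_error(faulty)
    print(mem.available_pages())
```

The machine stays up and simply has slightly less memory to hand out.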
-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug