[Malvern] FW: Memory Reliability

Ian Pascoe ianpascoe at btinternet.com
Tue Aug 14 20:01:04 BST 2007


Hi Folks

I have had quite a comprehensive answer from Dr Tony Travis on this subject
which I copy below.

#####################

My comments apply to any COTS hardware: Most people don't test their
memory unless they suspect something is wrong. Basically, if you're
running a server you should test the memory, otherwise you have no idea
whether the server is reliable or not. In the past, people used to
'burn-in' new computers by running CPU and memory stress tests before
using them.

I have some nodes that have never failed a memory test (memtest86 +
memtester) and others that were unreliable when I first tested them. I
replaced the unreliable memory and tested them again before allowing the
suspect nodes to join the cluster. It is common sense to make sure that
your COTS hardware is reliable before using it, that's why I do it.

Using a computer without testing its memory is like driving a car
downhill without testing the brakes. They probably do work, but you
can't be sure until you actually try them. This is called 'defensive'
computing (like 'defensive' programming). Another example is doing
backups, but never checking that you can actually restore them...

#####################
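
To make the idea of a memory test concrete, here is a rough sketch of the
write-a-pattern-then-read-it-back principle. This is my own illustration and
not Tony's method: the function name pattern_test, the buffer size and the
pattern values are all assumptions chosen for the example. Because it runs in
user space on whatever memory the OS hands out, it is no substitute for
memtest86 or memtester, which exercise far more of the RAM and use far more
thorough patterns.

```python
def pattern_test(size_bytes=1 << 20, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each byte pattern and read it back.

    Returns a list of (pattern, offset) pairs for every byte that did not
    read back as written. An empty list means no mismatch was seen.
    Hypothetical helper for illustration only.
    """
    buf = bytearray(size_bytes)
    errors = []
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buf[:] = expected              # write the pattern across the buffer
        if buf != expected:            # read back and compare in one go
            # Only walk the buffer byte by byte when something is wrong.
            for offset, value in enumerate(buf):
                if value != pattern:
                    errors.append((pattern, offset))
    return errors


if __name__ == "__main__":
    bad = pattern_test()
    print("mismatches found:", len(bad))
```

An empty result only says that this small buffer behaved itself on this run;
it says nothing about the rest of the memory, which is why the dedicated tools
are the ones to use before trusting a node.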

Guy, you were almost right about the way openMosix works. It doesn't
actually clone the central computer, only the process. But as you say,
if the process hits a compute node with bad memory then that process is
forever corrupt.

In addition, and I didn't realise this, you can have a network of PCs
connected together and being used normally, but with their spare capacity
stolen by openMosix to supplement a central node - clever stuff eh? Still
digging into what uses people have put this to .... sorry, probably going to
bore the list with even more sensational reports like the ones "The Enquirer"
or "Quibbler" produce.

E

-----Original Message-----
From: Ian Pascoe [mailto:ianpascoe at btinternet.com]
Sent: 11 August 2007 22:53
To: Malvern at mailman.lug.org.uk
Subject: Memory Reliability


Hi Folks

Came across an interesting Q & A on the openMosix site.

http://howto.x-tend.be/openMosixWiki/index.php/Additions_to_the_FAQ

If I understand the Q & A properly, what is being said is that standard
modern-day RAM in off-the-shelf consumer boxes cannot deal with long
on-times, in this case those of compute nodes in a cluster, though the same
must also apply to other boxes such as servers.

Anyone agree / disagree with what the article says?

My instant response to this was that it was a load of splutter-bluff, but
having sat down and thought about it I'm now not quite so sure.  For
instance, when you speak to any SysAdmins they generally recommend
re-booting both MS and Linux servers on a regular basis to clear them out
and therefore make them run better....

E




