[Malvern] Memory Reliability

Sun Aug 12 21:26:19 BST 2007

On Sat, 11 Aug 2007 22:53:29, Ian Pascoe <ianpascoe at btinternet.com> 
wrote:
>Hi Folks
>
>Came across an interesting Q & A on the openMosix site.
>
>http://howto.x-tend.be/openMosixWiki/index.php/Additions_to_the_FAQ
>
>If I understand the Q & A properly, what is being said is that standard
>modern-day RAM in off-the-shelf consumer boxes cannot cope with long 'on'
>times, in this case for compute nodes in a cluster, but the same must also
>apply to other boxes such as servers.
>
>Anyone agree / disagree with what the article says?
>
>My instant response to this was that it was a load of splutter-bluff, but
>having sat down and thought about it I'm now not quite so sure.  For
>instance, when you speak to any SysAdmins they generally recommend
>re-booting both MS and Linux servers on a regular basis to clear them out
>and therefore make them run better....

The main problem with long 'on' times is not the RAM chips themselves, 
but a software problem called a memory leak. When an application finishes 
a task or closes, it should release all the RAM it used back to the OS, 
ready to be used again. But this does not always happen. The OS keeps a 
memory allocation table recording which apps are using which bits of RAM. 
If an entry is not erased when it should be, the OS thinks the memory is 
still in use and will not free it up for re-use.
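
To make the bookkeeping idea concrete, here is a toy sketch in Python. It 
is not how any real kernel works; the class AllocationTable and the 
function run_tasks are names made up purely for illustration. It only 
shows what happens when table entries are not erased after a job ends.

```python
# Toy model of an OS memory allocation table (illustrative only; real
# kernels track memory in far more elaborate per-process structures).

class AllocationTable:
    def __init__(self, total_blocks):
        # Map each block number to its owner, or None if the block is free.
        self.owner = {block: None for block in range(total_blocks)}

    def allocate(self, app, count):
        """Hand `count` free blocks to `app`, or raise MemoryError."""
        free = [b for b, o in self.owner.items() if o is None]
        if len(free) < count:
            raise MemoryError(
                f"{app} wants {count} blocks, only {len(free)} free")
        granted = free[:count]
        for b in granted:
            self.owner[b] = app
        return granted

    def release(self, blocks):
        """Erase the table entries so the blocks can be re-used."""
        for b in blocks:
            self.owner[b] = None

    def free_blocks(self):
        return sum(1 for o in self.owner.values() if o is None)


def run_tasks(table, tasks, leaky):
    """Run `tasks` short jobs of 2 blocks each and return the free count.

    In a leaky run the entries are never erased when a job finishes.
    """
    for n in range(tasks):
        blocks = table.allocate(f"job{n}", 2)
        if not leaky:
            table.release(blocks)
    return table.free_blocks()
```

With 8 blocks and 3 jobs, the well-behaved run leaves all 8 blocks free, 
while the leaky run leaves only 2, even though every job has finished. 
In this toy model nothing short of building a fresh table (the 
equivalent of a reboot) gets the lost blocks back.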

In its early days, Linux used to suffer from this problem. So too did 
many GNU and Linux-based apps. The Linux world got rid of this problem 
some time ago.

Some Windows apps, notably IIS and SQL Server, also used to suffer badly 
from this. Windows itself also used to suffer from another problem: huge 
temporary files building up on the hard disk in a very tangled and 
piecemeal way until it lost track of them. Microsoft has cleaned up its 
act only slowly, and many operational systems still suffer (I am not 
sure about the latest versions).

There is only one cure for a sick box whose RAM has all leaked away to 
where the OS cannot find it, and/or whose temporary files have got in a 
hopeless twist - reboot the OS.

A lesser problem is an application that leaks memory within its own code 
and keeps asking the OS for more and more. Here, you need only restart 
the app regularly. But if you don't know which one it is, or it has a 
customised boot sequence which is set to kick in when the OS is booted, 
then you will probably reboot the OS anyway.

The issue with openMosix clusters is a different one. Here, the OS is 
first copied into RAM on one box, and then cloned across to the many 
other boxes. The RAM in the first box must be perfect, and the OS must 
be copied in there without any glitches creeping in. Otherwise, the 
error would be copied across to every box in the cluster. For such 
high-reliability needs, various special types of error-checking or even 
error-correcting RAM chips are available. These are more costly, and 
usually slower, than the consumer RAM found in everyday PCs.
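
For a feel of how error-correcting memory can repair a flipped bit, here 
is a minimal sketch of the classic Hamming(7,4) code in Python. This is 
a textbook toy, not what any particular RAM module implements: as far as 
I know, real ECC modules use wider codes of the same family over much 
larger words. The function names are my own.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code):
    """Return (data_bits, error_position); position 0 means no error seen.

    A single flipped bit is located by the syndrome and flipped back.
    """
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # 1-based index of the bad bit
    if position:
        c[position - 1] ^= 1  # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], position
```

Flipping any one of the seven stored bits still decodes to the original 
four data bits, at the cost of three extra bits per four stored. That 
overhead, plus the checking step on every read, is one way to see why 
such memory is dearer and a little slower.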

Sorry I never have time these days to make the meetings. Ubuntu is cool, 
but still no decent vector graphics package or click'n'run .exe install 
under WINE, so Win98SE soldiers on alongside.

-- 
Cheers,
Guy