[Gllug] W2K stability problems
Xander D Harkness
xander at harkness.co.uk
Fri Sep 7 16:27:10 UTC 2001
For those who are currently working in an environment where they are
threatened with M$ implementations on the desktop - I thought that you
would like to read the following ;-) (Great to see that they only have
to reboot the servers every two days - unlike NT4 where it was nightly!)
This is a status report on a Terminal Services implementation on Windows 2000.
Cheers
Xander
Over the past 3-4 months we have had a number of hard system hangs, where
the system stops and cannot be contacted even from the console.
Initially these were quite rare (1 or 2 a month), but over the last
month the number increased to the point where they occurred 1 or 2
times a week. A call was raised with Microsoft to assist with
diagnosing the source of the hard hangs.
During the same period we also experienced soft hangs, where new users
would log on but not receive their desktop, but existing users would
continue to function correctly.
The number of these events had increased significantly in the last month.
Following a significant amount of metrics gathering, we were able to
identify that the two events are related and are caused by the same problem.
We appear to have a memory leak in the kernel, specifically in the Page
Pool memory area, and the leak looks as if it relates to registry keys
that are opened but never closed.
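The metrics gathering described above comes down to watching Page Pool
usage climb over time without ever falling back. One hedged way to turn
such samples into a leak rate is an ordinary least-squares slope. This is
a sketch only: it assumes samples are (hours since reboot, MB in use)
pairs, and both the function name `leak_rate_mb_per_hour` and the sample
values are hypothetical, not figures from this report.

```python
from typing import Sequence, Tuple


def leak_rate_mb_per_hour(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of Page Pool usage (MB) against time (hours).

    A steady positive slope while the user count is roughly constant is
    the signature of a leak; a healthy pool rises and falls with load.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one point in time")
    return cov / var


# Hypothetical samples: (hours since reboot, Page Pool MB in use)
samples = [(0, 60.0), (6, 75.0), (12, 90.0), (18, 105.0), (24, 120.0)]
print(f"{leak_rate_mb_per_hour(samples):.2f} MB/hour")  # prints "2.50 MB/hour"
```

A fitted slope is less sensitive to a single noisy reading than comparing
just the first and last samples.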
The reason for the sudden increase in soft and hard hangs is the
increase in the active user population. The default limit for Page Pool
Bytes is 160M. We had been operating just below this, and the extra
users pushed us over the limit. This caused the soft hangs, which in
turn meant that users would log on to an alternate box, which would then
be pushed over the edge itself, causing another soft hang. This
explains the increase in the number of incidents we have seen over the
past month.
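The cascade described above can be sketched as a small simulation. This
is a toy model under loudly stated assumptions: each logon is assumed to
add a fixed, hypothetical amount of Page Pool (`mb_per_logon`), the
starting usages are invented, a box over the limit refuses new desktops
while existing sessions carry on, and refused users simply retry on the
next box in round-robin order. Only the 160M default limit comes from
the report; `simulate_logons` is a hypothetical name.

```python
from typing import List


def simulate_logons(servers: List[float], new_users: int,
                    mb_per_logon: float,
                    limit_mb: float = 160.0) -> List[bool]:
    """Return which boxes end up soft-hung after new_users logons.

    servers: current Page Pool usage in MB for each box (hypothetical).
    A box whose usage exceeds limit_mb refuses new desktops, so the user
    tries the next box that is still accepting logons.
    """
    usage = list(servers)
    hung = [u > limit_mb for u in usage]
    nxt = 0
    for _ in range(new_users):
        # Round-robin search for a box that still accepts logons.
        for step in range(len(usage)):
            i = (nxt + step) % len(usage)
            if not hung[i]:
                usage[i] += mb_per_logon
                hung[i] = usage[i] > limit_mb  # this logon may tip it over
                nxt = i + 1
                break
        else:
            break  # every box is soft-hung; nobody else can log on
    return hung


# Hypothetical farm of three boxes running just below the 160M default.
print(simulate_logons([150.0, 155.0, 158.0], 12, 2.0))
# prints [True, True, True]
print(simulate_logons([150.0, 155.0, 158.0], 12, 2.0, limit_mb=340.0))
# prints [False, False, False]
```

In this model, once the first box tips over, its refused logons land on
the remaining boxes and push them over in turn.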
To counter this problem we have increased the Page Pool Allocation to
340M. This should prevent the soft hangs, but does not solve the
underlying memory leak. This will give us and Microsoft time to diagnose
the root cause of the memory leak and implement a fix.
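The effect of raising the limit can be put as simple arithmetic: if the
leak grows roughly linearly, the time from a clean reboot to the limit is
the headroom divided by the leak rate. This is a sketch under that
linearity assumption; the baseline and leak-rate figures and the name
`days_until_limit` are hypothetical, and only the 160M and 340M limits
come from the report.

```python
def days_until_limit(baseline_mb: float, limit_mb: float,
                     leak_mb_per_day: float) -> float:
    """Days from a clean reboot until leaked Page Pool reaches the limit.

    baseline_mb: pool in use shortly after reboot under normal user load.
    Assumes the leak grows linearly, which is only an approximation.
    """
    if leak_mb_per_day <= 0:
        return float("inf")  # no leak: the limit is never reached
    headroom = max(limit_mb - baseline_mb, 0.0)
    return headroom / leak_mb_per_day


# Hypothetical figures: 120M in use after reboot, leaking 20M per day.
print(days_until_limit(120.0, 160.0, 20.0))  # prints 2.0
print(days_until_limit(120.0, 340.0, 20.0))  # prints 11.0
```

With these invented numbers the larger limit stretches the time to
exhaustion from 2 days to 11, but the figure stays finite while the leak
rate is positive, so the interval between reboots still has to stay
comfortably below it until the leak itself is fixed.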
Where we go from here:
We have now managed to get the Microsoft escalation team involved with
this problem, which improves the quality of the resources working on the
problem and should help us reach a resolution more quickly.
We are gathering further metrics on the current servers for analysis by
MS. Microsoft are also keen for us to deploy IE 5.5 SP2 and operating
system SP2, to bring the environment up to the latest release.
So our action is to continue gathering metrics for analysis and to
reboot the servers every other day to clear down the leaked memory.
--
Gllug mailing list - Gllug at linux.co.uk
http://list.ftech.net/mailman/listinfo/gllug