[Gllug] W2K stability problems

Xander D Harkness xander at harkness.co.uk
Fri Sep 7 16:27:10 UTC 2001


For those who are currently working in an environment where they are 
threatened with M$ implementations on the desktop - I thought that you 
would like to read the following ;-)  (Great to see that they only have 
to reboot the servers every two days - unlike NT4 where it was nightly!)

This is a status report on Terminal Services implementation on Windows 2000.

Cheers
Xander

Over the past 3-4 months we have had a number of hard system hangs, where 
the system stops and cannot even be contacted from the console. 
Initially these were quite rare (1 or 2 a month), but over the last 
month the number increased to the point where they occurred 1 or 2 
times a week. A call was raised with Microsoft to assist with 
diagnosing the source of the hard hangs.

During the same period we also experienced soft hangs, where new users 
would log on but not receive their desktop, but existing users would 
continue to function correctly.

The number of these events had increased significantly in the last month.

Following a significant amount of metrics gathering we were able to 
identify that the two events are related and are caused by the same problem.

We appear to have a memory leak in the kernel, specifically in the Page 
Pool memory area. The leak looks like it relates to open registry keys 
that are not closed.

The reason for the sudden increase in soft and hard hangs is the 
increase in the active user population. The default limit for Page Pool 
Bytes is 160M. We have been operating just below this and the increase 
in users has pushed us over this limit. This caused the soft hangs, 
which in turn meant that users would log on to an alternate box, which 
would in turn push that box over the edge, causing another soft hang. 
This explains the increase in the number of incidents we have seen over 
the past month.

To counter this problem we have increased the Page Pool Allocation to 
340M. This should prevent the soft hangs, but does not solve the 
underlying memory leak. This will give us and Microsoft time to diagnose 
the root cause of the memory leak and implement a fix.
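
As a rough illustration of why raising the limit buys time rather than 
fixing anything, here is a minimal sketch in Python. It assumes the leak 
grows at a steady rate, which the report does not state. The function name, 
the 150M starting point and the 4M/hour leak rate are hypothetical values 
chosen for the example, not measurements from these servers.

```python
def hours_until_limit(current_mb, limit_mb, leak_mb_per_hour):
    """Estimate hours until pool usage reaches the limit.

    Assumes a constant leak rate (an assumption for illustration only).
    """
    if leak_mb_per_hour <= 0:
        raise ValueError("leak rate must be positive")
    remaining_mb = max(limit_mb - current_mb, 0)
    return remaining_mb / leak_mb_per_hour


# Hypothetical figures: 150M in use after a reboot, leaking 4M per hour.
old_limit_hours = hours_until_limit(150, 160, 4)   # against the 160M default
new_limit_hours = hours_until_limit(150, 340, 4)   # against the raised 340M
print(old_limit_hours, new_limit_hours)
```

With any steady leak the limit is still reached eventually. A higher limit 
only stretches the interval between reboots.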

Where we are going from here:

We have now managed to get the Microsoft escalation team involved with 
this problem, which improves the quality of resource looking at the 
problem and should help us achieve a resolution more quickly.

We are gathering further metrics on the current servers for analysis by 
MS. Microsoft are also keen for us to deploy IE 5.5 SP2 and the 
Operating system SP2, to bring the environment up to the latest release.

So our action is to continue to gather metrics for analysis and reboot 
the servers every other day to clear down the memory leak.
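
One way the gathered metrics could be turned into a leak rate is a simple 
least-squares fit of pool usage against time since reboot. The sketch below 
is in Python and is only an illustration: the report does not say how the 
metrics are analysed, and the function name and sample readings are 
hypothetical.

```python
def leak_rate_mb_per_hour(samples):
    """Least-squares slope of pool usage (MB) against hours since reboot.

    samples: list of (hours_since_reboot, pool_mb) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den


# Hypothetical hourly readings taken after a reboot.
readings = [(0, 100), (1, 104), (2, 108), (3, 112)]
print(leak_rate_mb_per_hour(readings))
```

A slope that stays positive while the user count is flat would point to a 
leak rather than to normal load.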

 

 


-- 
Gllug mailing list  -  Gllug at linux.co.uk
http://list.ftech.net/mailman/listinfo/gllug



