[Gllug] Identifying what processes are waiting on IO

Sun Dec 7 11:09:46 UTC 2008

I have a system whose load is very high for several hours at a time.
It's a quad-core machine, and I'm seeing a load average of around 20.

The system becomes somewhat unresponsive during this time - and 
customers are complaining.

Investigating further, I see a drop in the amount of memory the kernel 
uses for buffers, and I notice that all 4 cores spend a very large 
amount of time in iowait.

Looking at the disk performance, I see device saturation - the 
percentage of CPU time during which I/O requests were issued to the 
device is at 100%.
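
As far as I understand it, that percentage can be derived from the 
per-device busy-time counter in /proc/diskstats by sampling it twice. 
Below is a minimal sketch of the arithmetic, assuming the 2.6 layout 
where a whole-disk line has 14 columns and the 13th is the number of 
milliseconds spent doing I/O. The function names are my own, and I have 
avoided syntax newer than Python 2.4 but have not tested it on 2.4.

```python
import time

# Assumption: on 2.6 kernels a whole-disk line in /proc/diskstats has 14
# columns, and column 13 (index 12) is the time in ms spent doing I/O.
MS_DOING_IO = 12


def parse_diskstats(text):
    """Map device name -> cumulative milliseconds spent doing I/O."""
    busy = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) <= MS_DOING_IO:
            continue  # shorter lines (e.g. partitions) lack this counter
        busy[fields[2]] = int(fields[MS_DOING_IO])
    return busy


def utilisation(before, after, elapsed_ms):
    """Percentage of the interval each device spent with I/O in flight."""
    result = {}
    for name in after:
        if name in before:
            result[name] = 100.0 * (after[name] - before[name]) / elapsed_ms
    return result


def read_diskstats(path='/proc/diskstats'):
    """Read and parse the current counters."""
    f = open(path)
    try:
        return parse_diskstats(f.read())
    finally:
        f.close()


def measure(interval=5.0):
    """Sample twice, `interval` seconds apart; return name -> percent."""
    before = read_diskstats()
    time.sleep(interval)
    after = read_diskstats()
    return utilisation(before, after, interval * 1000.0)
```

Calling measure() while the load is high should show which device, if 
there is more than one, is the one sitting near 100.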

I would like to find out which processes are using the disks and 
waiting on IO, so that I can consider either moving them to other 
machines or reconfiguring the storage in a way that is better suited to 
the application profile.

I'm using CentOS 5, so I don't have a kernel with IO accounting, nor do 
I have Python 2.5 or 2.6, so I can't use iotop.

I've heard good things about systemtap, but I wouldn't know where to start.

So far all I've done is use lsof and look at the major and minor device 
numbers and mount points - but this really doesn't tell me much.
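
The only other approach I can think of that needs nothing beyond /proc 
is to sample process states repeatedly and count how often each process 
is in state D (uninterruptible sleep). This is only a rough sketch: 
state D usually means waiting on IO but not always on the local disks 
(NFS, for example), and it says nothing about how much IO a process 
does. The function names are my own, and the code avoids syntax newer 
than Python 2.4 but is untested there.

```python
import os
import sys
import time


def parse_stat_line(line):
    """Return (pid, command, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so use the first '(' and the last ')'.
    start = line.index('(')
    end = line.rindex(')')
    pid = int(line[:start])
    command = line[start + 1:end]
    state = line[end + 1:].split()[0]
    return pid, command, state


def blocked_processes(proc_root='/proc'):
    """List (pid, command) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            f = open(os.path.join(proc_root, entry, 'stat'))
            try:
                pid, command, state = parse_stat_line(f.read())
            finally:
                f.close()
        except (IOError, OSError, ValueError):
            continue  # process exited, or its stat line was unreadable
        if state == 'D':
            blocked.append((pid, command))
    return blocked


def sample(rounds=30, interval=1.0):
    """Count how many times each process was seen in state D."""
    counts = {}
    for _ in range(rounds):
        for key in blocked_processes():
            counts[key] = counts.get(key, 0) + 1
        time.sleep(interval)
    return counts


def report(counts):
    """Print the counts, most frequently blocked process first."""
    ranked = [(n, pid, cmd) for (pid, cmd), n in counts.items()]
    ranked.sort()
    ranked.reverse()
    for n, pid, cmd in ranked:
        sys.stdout.write('%5d  %6d  %s\n' % (n, pid, cmd))
```

Running report(sample()) during a busy period should put the processes 
that block most often at the top of the list.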

Any pointers for further troubleshooting would be gratefully received.

Thanks,

S.
-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug