[GLLUG] Monitoring memory usage

Adam Monsen info at selfhostbook.com
Wed Oct 23 16:00:39 UTC 2024


I appreciate this thread. Henrik, I know you specifically asked about 
memory, but processes will suffer from any resource starvation, and one 
kind of resource starvation can bleed into another. This is why 
thrashing is particularly bad, right? Thrashing impacts disk I/O as well 
as memory (assuming you're using a swap file on disk). Maybe that's not 
the best example, but I do think it's useful to consider multiple 
different kinds of resources when scheduling workloads (running processes).

So apologies in advance, I think I've gone off-topic below, looking at 
multiple forms of resource starvation. You could ignore the parts about 
CPU, GPU, network I/O, and disk I/O. Or we could start a new thread.

Henrik wrote:
> Of course, the ultimate solution would be to measure thrashing but 
> that would be tricky and by the time anyone noticed it might be too late.

I agree that by the point of thrashing, it's already too late. It gets so 
difficult to do
any useful manual sysadmin tasks on a Linux server when a process 
/approaches/ the point of thrashing (especially if/once the OOM killer 
is "helping"), right?

Are you working with one or multiple actual Linux servers or desktops, 
or is your original question academic? I'm assuming you're talking about 
one single machine, is that right? Single or multi-user? Are you also 
considering CPU usage?

I like how this thread is approaching monitoring of specific 
services/applications/workloads/programs/processes (I'm fudging and 
treating all these terms as roughly equivalent). Processes can have 
unique resource usage profiles over time. I don't think you'll find a 
single memory metric that can meaningfully tell you whether the next 
process you start will suffer, because it depends on how the process
you're trying to run behaves, what else is running right then (and what 
it is doing), and what resources you have to work with.
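
For example, here's a quick Python sketch (assuming Linux and a readable 
/proc; the pid argument and the 10-sample loop are arbitrary) that 
samples one process's resident set over time:

#!/usr/bin/env python3
"""Sample one process's resident memory over time (Linux /proc sketch)."""
import sys
import time

def rss_kib(pid):
    # VmRSS in /proc/<pid>/status is the resident set size in KiB
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return None  # kernel threads have no VmRSS line

pid = int(sys.argv[1])
for _ in range(10):
    print(f"pid {pid} VmRSS: {rss_kib(pid)} KiB")
    time.sleep(1)

Plot a few runs of that and you start to see the kind of profile I mean.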

Can you say more about the particular workloads you're trying to 
schedule? Are they bursty, is someone sitting there waiting/watching for 
hopefully not too long, are they I/O heavy, can they be nice'd, can they 
co-exist peacefully... stuff like that. And as others have mentioned: 
sitting in memory is one thing, but paging in and out is another.
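
On that last point: if you want a feel for the paging side specifically, 
a minimal sketch like this works (again assuming Linux; pswpin/pswpout 
in /proc/vmstat are cumulative page counts, and the 5-second window is 
arbitrary):

#!/usr/bin/env python3
"""Watch swap paging activity via /proc/vmstat (Linux-only sketch)."""
import time

def swap_counters():
    # pswpin/pswpout are cumulative counts of pages swapped in and out
    counters = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in ("pswpin", "pswpout"):
                counters[key] = int(value)
    return counters

before = swap_counters()
time.sleep(5)
after = swap_counters()
for key in ("pswpin", "pswpout"):
    rate = (after[key] - before[key]) / 5
    print(f"{key}: {rate:.1f} pages/sec")

Sustained nonzero rates on both counters at once are a decent thrashing 
smell test.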

Zooming out to all resources in general (not just memory), I want to get 
back to using some combination of metrics to make decisions. It doesn't 
have to be complex to work. Personally I've found it useful enough to do 
rough estimations of available CPU cores and RAM, and what my workloads 
require of both.

For example, say you have a server with 4GB RAM and 4 CPU cores. You 
want to run two services for a family of five. Assume usage on both is 
intermittent--you happen to know it's unlikely they'll all be streaming 
video and uploading and downloading huge files simultaneously.

service   | purpose      | CPU cores | RAM
----------|--------------|-----------|----
jellyfin  | stream music | 2         | 2GB
nextcloud | file share   | 2         | 2GB
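
If you wanted to make that back-of-the-envelope math executable, 
something like this Python sketch would do (the per-service numbers are 
the same guesses as in the table; only the host totals are measured):

#!/usr/bin/env python3
"""Back-of-the-envelope capacity check (Linux sketch)."""
import os

# Rough per-service requirements from the table (guesses, not measurements)
services = {
    "jellyfin":  {"cores": 2, "ram_gib": 2},
    "nextcloud": {"cores": 2, "ram_gib": 2},
}

def total_ram_gib():
    # MemTotal in /proc/meminfo is reported in KiB
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) / (1024 * 1024)

have_cores = os.cpu_count()
have_ram = total_ram_gib()
need_cores = sum(s["cores"] for s in services.values())
need_ram = sum(s["ram_gib"] for s in services.values())

print(f"cores: need {need_cores}, have {have_cores}")
print(f"RAM:   need {need_ram} GiB, have {have_ram:.1f} GiB")
if need_cores > have_cores or need_ram > have_ram:
    print("oversubscribed -- fine if usage really is intermittent")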

This is obviously missing a lot. What about GPUs? What about disk and 
network I/O? Personally I have a feel for these resources in my head for 
the relatively few services I self-host, but if I were being more careful 
I'd look at those too.

Have you heard of PSI (Pressure Stall Information) -- 
https://docs.kernel.org/accounting/psi.html ? It's another "trailing 
indicator" (not a "leading indicator") but maybe that approaches 
something like one or a few useful longitudinal metrics in the manner 
you're seeking.
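
A minimal sketch for reading it (assuming a kernel built with 
CONFIG_PSI=y, so /proc/pressure/memory exists; the avg fields are 
percentages of wall time stalled):

#!/usr/bin/env python3
"""Read memory pressure stall averages from PSI."""

def memory_pressure():
    # Each line looks like:
    # some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
    pressure = {}
    with open("/proc/pressure/memory") as f:
        for line in f:
            kind, *fields = line.split()
            pressure[kind] = dict(
                (k, float(v)) for k, v in
                (field.split("=") for field in fields)
            )
    return pressure

p = memory_pressure()
# "some" = at least one task stalled; "full" = all non-idle tasks stalled
print(f"some avg10: {p['some']['avg10']}%  full avg10: {p['full']['avg10']}%")

Watching avg10 climb over time here (and in the sibling files for cpu 
and io) is about as close as the kernel gets to handing you that single 
longitudinal number.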