[GLLUG] Monitoring memory usage
Adam Monsen
info at selfhostbook.com
Wed Oct 23 16:00:39 UTC 2024
I appreciate this thread. Henrik, I know you specifically asked about
memory, but processes will suffer from any resource starvation, and one
kind of resource starvation can bleed into another. This is why
thrashing is particularly bad, right? Thrashing impacts disk I/O as well
as memory (assuming you're using a swap file on disk). Maybe that's not
the best example, but I do think it's useful to consider multiple kinds
of resources when scheduling workloads (running processes).
So apologies in advance, I think I've gone off-topic below, looking at
multiple forms of resource starvation. You could ignore the parts about
CPU, GPU, network I/O, and disk I/O. Or we could start a new thread.
Henrik wrote:
> Of course, the ultimate solution would be to measure thrashing but
> that would be tricky and by the time anyone noticed it might be too late.
I agree that by the time you're thrashing, it's already too late. It gets
so difficult to do any useful manual sysadmin work on a Linux server when
a process /approaches/ the point of thrashing (especially if/once the OOM
killer is "helping"), right?
Are you working with one or more actual Linux servers or desktops, or is
your original question academic? I'm assuming you're talking about a
single machine, is that right? Single or multi-user? Are you also
considering CPU usage?
I like how this thread is approaching monitoring of specific
services/applications/workloads/programs/processes (I'm fudging and
treating all these terms as roughly equivalent). Processes can have
unique resource usage profiles over time. I don't think you'll find a
single memory metric that can meaningfully tell you whether the next
process you start will suffer, because that depends on how the process
behaves, what else is running at that moment (and what it is doing), and
what resources you have to work with.
Can you say more about the particular workloads you're trying to
schedule? Are they bursty, is someone sitting there waiting/watching for
hopefully not too long, are they I/O heavy, can they be nice'd, can they
co-exist peacefully... stuff like that. And as others have mentioned:
sitting in memory is one thing, but paging in and out is another.
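(If you want something concrete for the "paging in and out" part: here's a
rough, untested sketch that samples the swap-in/swap-out counters in
/proc/vmstat twice and reports the difference, i.e. how much swapping
happened during the window. The 10-second interval is arbitrary.)

    #!/usr/bin/env python3
    # Rough sketch: sample pswpin/pswpout (pages swapped in/out) from
    # /proc/vmstat before and after a sleep, and print the deltas.
    import time

    def read_swap_counters():
        counters = {}
        with open("/proc/vmstat") as f:
            for line in f:
                name, value = line.split()
                if name in ("pswpin", "pswpout"):
                    counters[name] = int(value)
        return counters

    INTERVAL = 10  # seconds; arbitrary sampling window
    before = read_swap_counters()
    time.sleep(INTERVAL)
    after = read_swap_counters()
    for name in ("pswpin", "pswpout"):
        print(f"{name}: {after[name] - before[name]} pages in {INTERVAL}s")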
Zooming out to all resources in general (not just memory), I want to get
back to using some combination of metrics to make decisions. It doesn't
have to be complex to work. Personally I've found it useful enough to do
rough estimations of available CPU cores and RAM, and what my workloads
require of both.
For example, say you have a server with 4GB RAM and 4 CPU cores. You
want to run two services for a family of five. Assume usage on both is
intermittent--you happen to know it's unlikely they'll all be streaming
video and uploading and downloading huge files simultaneously.
service   | purpose      | CPU cores | RAM
----------|--------------|-----------|----
jellyfin  | stream music | 2         | 2GB
nextcloud | file share   | 2         | 2GB
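(If you wanted to sanity-check a plan like that table in code rather than
in your head, a rough sketch along these lines would do. The per-service
numbers are just my guesses from the table above, not anything measured.)

    #!/usr/bin/env python3
    # Rough capacity check: compare planned per-service CPU/RAM against
    # what this machine actually has.
    import os

    planned = {
        "jellyfin":  {"cores": 2, "ram_gb": 2},
        "nextcloud": {"cores": 2, "ram_gb": 2},
    }

    total_cores = os.cpu_count()

    # MemTotal in /proc/meminfo is reported in kB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                total_ram_gb = int(line.split()[1]) / (1024 * 1024)
                break

    want_cores = sum(s["cores"] for s in planned.values())
    want_ram_gb = sum(s["ram_gb"] for s in planned.values())

    print(f"planned: {want_cores} cores, {want_ram_gb} GB RAM")
    print(f"have:    {total_cores} cores, {total_ram_gb:.1f} GB RAM")
    if want_cores > total_cores or want_ram_gb > total_ram_gb:
        print("overcommitted -- may still be fine if usage is intermittent")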
This is obviously missing a lot. What about GPUs? What about disk and
network I/O? Personally I have a feel for these resources in my head for
the relatively few services I self-host, but if I were being more careful
I'd look at those too.
Have you heard of PSI (Pressure Stall Information) --
https://docs.kernel.org/accounting/psi.html ? It's another "trailing
indicator" (not a "leading indicator"), but it may come close to the kind
of useful longitudinal metric(s) you're seeking.
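(For example -- a rough sketch, assuming your kernel has PSI enabled so
/proc/pressure/memory exists -- you could read the memory pressure
averages like this:)

    #!/usr/bin/env python3
    # Rough sketch: parse /proc/pressure/memory (requires PSI support)
    # and print the "some" and "full" stall percentages averaged over
    # the 10s, 60s and 300s windows.
    def read_memory_pressure(path="/proc/pressure/memory"):
        pressure = {}
        with open(path) as f:
            for line in f:
                kind, *fields = line.split()  # kind is "some" or "full"
                pressure[kind] = dict(field.split("=") for field in fields)
        return pressure

    p = read_memory_pressure()
    for kind in ("some", "full"):
        avgs = p.get(kind, {})
        print(kind, avgs.get("avg10"), avgs.get("avg60"), avgs.get("avg300"))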