[Gllug] clueless at IPMI, help! (was Re: Hardware monitoring on IBM X335)

Nix nix at esperi.org.uk
Sun May 17 15:05:30 UTC 2009


On 13 May 2009, Henrik Morsing stated:
> Been looking at monitoring of our xSeries today and it was relatively
> easy getting IPMI to work on the X336s but it appears that the X335s
> are different. Is it really impossible to monitor fan speeds and
> temperature on these systems or has anyone done this?

On a quasi-unrelated front, I've just been exposed to IPMI for the first
time and it's somewhat confusing. I have so far completely failed to get
IPMI-over-the-network working through ipmitool. The BMC has a MAC of its
own, so presumably it should have an IP of its own as well. But it is
unclear what port it advertises service on, and it doesn't appear to
respond to pings, or anything else, on that IP address. How does one
configure this?  The motherboard manual and all the howtos I've read are
completely opaque, and the Intel docs drown you in acronyms.

Is it usually configured by giving the BMC the same IP as one of the
network cards and letting it snoop some port (which port?), or by
picking a different IP (in which case, which network card does it
listen on if you have more than one)?
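
For what it's worth, IPMI-over-LAN is normally carried over RMCP on UDP
port 623, and a BMC that is listening should answer an ASF "presence
ping" with a "pong" even before any login. The following is a minimal
sketch of such a probe, not a tested diagnostic for this board: the
function names build_presence_ping and bmc_answers are made up for
illustration, and whether this particular BMC answers at all depends on
its LAN channel being configured and enabled, which is an assumption.

```python
import socket

RMCP_PORT = 623  # UDP port assigned to RMCP, the transport used by IPMI-over-LAN


def build_presence_ping(tag: int = 0) -> bytes:
    """Build a 12-byte RMCP/ASF Presence Ping datagram."""
    # RMCP header: version 0x06, reserved, sequence 0xFF (no ACK), class 0x06 (ASF)
    rmcp_header = bytes([0x06, 0x00, 0xFF, 0x06])
    # ASF body: IANA enterprise number 4542, type 0x80 (ping), tag, reserved, length 0
    asf_body = (4542).to_bytes(4, "big") + bytes([0x80, tag & 0xFF, 0x00, 0x00])
    return rmcp_header + asf_body


def bmc_answers(host: str, timeout: float = 2.0) -> bool:
    """Return True if something at host replies with an ASF Presence Pong."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_presence_ping(), (host, RMCP_PORT))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable hosts
            return False
    # Byte 8 of the reply is the ASF message type; 0x40 means Presence Pong.
    return len(reply) >= 9 and reply[8] == 0x40
```

Calling bmc_answers() with the address you believe the BMC has and
getting False does not prove the BMC is dead, only that nothing answered
on UDP 623 from where you are standing.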

If it uses a different MAC, as it seems to, how do I teach the rest of
the network about it? 'lan set 1 arp respond on' fails:

,----
| ipmitool> lan set 1 arp respond on
| Enabling BMC-generated ARP responses
| LAN Parameter Data does not match!  Write may have failed.
`----

What's going on? Anyone got more clue than me? (That would not be hard.)

The temperature sensors on this box are somewhat demented, as well (note
the classy hostname, this box is brand new and utterly unconfigured so
far):

debian5567:~# ipmitool sensor
CPU0 below Tmax  | 50.000     | degrees C  | ok    | na  | na      | na | na | 0.000 | na

*below* Tmax? So, er, what's the actual temperature? What's Tmax? From this:

CPU1 below Tmax  | -39.000    | degrees C  | cr    | na  | na      | na | na | 0.000 | na

I speculate that Tmax is 89C, which would make this reading, from a sensor that isn't even connected, 128C.
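
The arithmetic behind that guess, as a minimal sketch. The 89C Tmax is
only the speculation above, not a documented figure for this CPU, and
absolute_temp is a hypothetical helper name.

```python
ASSUMED_TMAX_C = 89.0  # speculative value from the text, not from a datasheet


def absolute_temp(below_tmax_c: float, tmax_c: float = ASSUMED_TMAX_C) -> float:
    """Convert a 'degrees below Tmax' reading into an absolute temperature."""
    return tmax_c - below_tmax_c


cpu0 = absolute_temp(50.0)   # CPU0 reports 50 degrees below Tmax
cpu1 = absolute_temp(-39.0)  # CPU1 reports -39, i.e. 39 degrees *above* Tmax
print(cpu0, cpu1)
```

Under that assumption CPU0 sits at 39C, and the absent CPU1 works out to
128C, a suspiciously round number for a raw sensor value.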

CPU0 VCORE       | 0.888      | Volts      | cr    | na  | 0.896   | na | na | 1.496 | na

Obviously this one is *not* critical: the CPU is working flawlessly and
is quite happy to do things like make -j8 GCC build-and-test runs, with
no problems at all beyond those attributable to memory balancing in the
rather aged kernel this box is running right now. I suspect the lower
threshold is wrong, but currently I have no idea what a Xeon L5520's
actual voltage limits are.

CPU1 VCORE       | 0.928      | Volts      | ok    | na  | 0.896   | na | na | 1.496 | na

(Another unconnected sensor. I'm not sure what the figures on the right
are: min and max figures, sure, but what are all the columns full of 'na' for,
other than repeating my initials time and time again?)
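
A sketch of how one of these lines could be split into named fields. The
column names are an assumption: they follow the order ipmitool
conventionally uses for its sensor listing (three lower thresholds, then
three upper thresholds), and parse_sensor_line is a hypothetical helper,
not part of ipmitool.

```python
# Assumed column order of `ipmitool sensor` output; 'na' means "not set".
COLUMNS = (
    "name", "reading", "unit", "status",
    "lower_non_recoverable", "lower_critical", "lower_non_critical",
    "upper_non_critical", "upper_critical", "upper_non_recoverable",
)


def parse_sensor_line(line: str) -> dict:
    """Split one pipe-separated sensor line into a dict keyed by COLUMNS."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


row = parse_sensor_line(
    "CPU0 VCORE       | 0.888      | Volts      | cr    "
    "| na  | 0.896   | na | na | 1.496 | na"
)
# The reading is below the lower critical threshold, hence the 'cr' status.
print(float(row["reading"]) < float(row["lower_critical"]))
```

If that column order is right, the 'na' columns are simply thresholds
the BMC has not been given, and only the two critical ones are set here.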

3.3V             | 3.280      | Volts      | ok    | na  | 2.992   | na | na | 3.536  | na
+12V             | 12.000     | Volts      | ok    | na  | 10.272  | na | na | 18.816 | na
VBAT             | 3.072      | Volts      | ok    | na  | 2.992   | na | na | 3.536  | na
5V               | 4.896      | Volts      | ok    | na  | 4.488   | na | na | 5.472  | na

This system has at least two batteries. It is unclear which one VBAT
refers to (probably the BIOS backup battery rather than the RAID array's
backup battery).

Sys.1(CPU 0)     | 2120.000   | RPM        | ok    | na  | 530.000 | na | na | na     | na

A fan. It's odd that this is always shown at *precisely* 2120RPM with no variation.

Sys.2(CPU 1)     | 0.000      | RPM        | cr    | na  | 530.000 | na | na | na     | na
Sys.3(Front 1)   | 0.000      | RPM        | cr    | na  | 530.000 | na | na | na     | na
Sys.4(Front 2)   | 0.000      | RPM        | cr    | na  | 530.000 | na | na | na     | na

More unconnected sensors. I'll have to learn how to use PEF to get rid
of these. (There *is* a front fan, but it's connected to the RAID
controller, which talks via SNMP, something else I've never used, rather
than IPMI.)

Sys.5(Rear 1)    | 530.000    | RPM        | cr    | na  | 530.000 | na | na | na     | na

Another fan with an oddly unchanging RPM value: it's spun down due to
the box being lightly loaded so I suppose it makes sense that it's at
the minimum threshold. (Perhaps this is also connected to the mysterious
green light inside the case that flashes off and on forever even when
the power is off. I'll have to ask the supplier what that's about.)

Sys.6            | 0.000      | RPM        | cr    | na  | 530.000 | na | na | na     | na
Sys.7            | 0.000      | RPM        | cr    | na  | 530.000 | na | na | na     | na
Sys.8            | 0.000      | RPM        | cr    | na  | 530.000 | na | na | na     | na
Sys.9            | 0.000      | RPM        | cr    | na  | 530.000 | na | na | na     | na
Sys.10           | 0.000      | RPM        | cr    | na  | 530.000 | na | na | na     | na

More unconnected sensors.
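
Until the BMC can be told to ignore them, the stopped fans can at least
be picked out of the listing mechanically. A minimal sketch, assuming
the pipe-separated layout shown above and treating a reading of exactly
0 RPM as "probably nothing plugged in"; stopped_fans is a hypothetical
helper and changes nothing on the BMC itself.

```python
SENSOR_LINES = """\
Sys.1(CPU 0)     | 2120.000   | RPM        | ok    | na  | 530.000 | na | na | na     | na
Sys.2(CPU 1)     | 0.000      | RPM        | cr    | na  | 530.000 | na | na | na     | na
Sys.5(Rear 1)    | 530.000    | RPM        | cr    | na  | 530.000 | na | na | na     | na
Sys.6            | 0.000      | RPM        | cr    | na  | 530.000 | na | na | na     | na
"""


def stopped_fans(text: str) -> list:
    """Return the names of RPM sensors whose reading is exactly zero."""
    names = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3 or fields[2] != "RPM":
            continue  # not a fan sensor, or not a sensor line at all
        try:
            rpm = float(fields[1])
        except ValueError:
            continue  # readings such as 'na' carry no number
        if rpm == 0.0:
            names.append(fields[0])
    return names


print(stopped_fans(SENSOR_LINES))
```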

ID_BTN_STATUS_L  | 0x0        | discrete   | 0x0000| na  | na      | na | na | na     | na
PLTRST2_N        | 0x0        | discrete   | 0x0000| na  | na      | na | na | na     | na

Utterly mysterious.


All this makes the system event logs completely useless: they're full of spam like this:

 234 | 05/17/2009 | 16:02:04 | Voltage #0x0b | Lower Critical going low
 235 | 05/17/2009 | 16:02:05 | Voltage #0x0b | Lower Critical going low
 236 | 05/17/2009 | 16:02:41 | Voltage #0x0b | Lower Critical going low
 237 | 05/17/2009 | 16:02:42 | Voltage #0x0b | Lower Critical going low
 238 | 05/17/2009 | 16:02:47 | Temperature #0x02 | Upper Critical going high
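
To see how much of the log is the same complaint repeated, the entries
can be tallied per sensor and event. A minimal sketch, assuming the
five-field layout of the excerpt above (id, date, time, sensor, event);
count_events is a hypothetical helper name.

```python
from collections import Counter

SEL_EXCERPT = """\
 234 | 05/17/2009 | 16:02:04 | Voltage #0x0b | Lower Critical going low
 235 | 05/17/2009 | 16:02:05 | Voltage #0x0b | Lower Critical going low
 236 | 05/17/2009 | 16:02:41 | Voltage #0x0b | Lower Critical going low
 237 | 05/17/2009 | 16:02:42 | Voltage #0x0b | Lower Critical going low
 238 | 05/17/2009 | 16:02:47 | Temperature #0x02 | Upper Critical going high
"""


def count_events(text: str) -> Counter:
    """Count event log entries grouped by (sensor, event description)."""
    counts = Counter()
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 5:
            continue  # skip anything that is not an id|date|time|sensor|event line
        _entry_id, _date, _time, sensor, event = fields
        counts[(sensor, event)] += 1
    return counts


for (sensor, event), n in count_events(SEL_EXCERPT).most_common():
    print(f"{n:4d}  {sensor}: {event}")
```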

Is this a manufacturing fault or me being an idiot and not understanding
how to configure things? (Very probably the latter: if so, er, how?)
-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug



