[Nottingham] Various kernel oops (& SATA NCQ, IRQ 18, ohci-ehci,
and lockups...)
Martin
martin at ml1.co.uk
Mon Jun 16 19:45:30 BST 2008
Martin wrote:
[---]
> Well, the story rolls on...
>
> The 'fix' appears to have been to clobber the SATA HDDs NCQ down to '1'
> (no NCQ) and to boot with the kernel parameter mem=3584M to avoid ever
> going over the 4 GByte memory boundary. (I'm sure mem=4000M or even
> mem=4096M would work just as well.)
And that was just a small part of the story...
I think I've been on the painful bleedin' edge on this one!
To set the scene:
Gigabyte GA-MA790FX-DS5 (rev. 1.0) AMD 790FX Chipset motherboard;
nVidia 8600 PCIe graphics card;
Hitachi HDP725032GLA360, GM3OA52A, max UDMA/133 sata2 HDD;
(And most recently)
Linux 2.6.24.5-server-1mnb #1 SMP Tue May 27 13:49:03 EDT 2008
x86_64 AMD Athlon(tm) 64 X2 Dual Core Processor 6400+ GNU/Linux
So... The 'fun' and fixes, roughly grouped:
1: BIOS
The factory-installed BIOS is "F2 2007/11/23". This can fail to boot.
Also, very strangely, selecting the option to clear the 'opened case'
status scrambled the CMOS contents. The system was also very
temperamental when using the AHCI settings for the sata ports, although
that might be confused with the IDE - SATA clash problems in the Linux
libata kernel module.
Updating to "F5 2008/03/27" seemed more reliable, but that would
randomly fail its checksum on boot and revert to "F2"... A re-update
was required each time, until the next version became available.
Updating to "F6 2008/05/21" has been good and stable so far.
Note: if you get a freeze before or immediately after the POST memory
test, and you are sure that your memory and CPU are properly in place,
then upgrade to the latest BIOS version.
See:
http://www.gigabyte.com.tw/Support/Motherboard/BIOS_Model.aspx?ProductID=2694
Or reload the BIOS default settings and try again.
2: Reset
Hard reset (pushing the reset button on the case) does NOT guarantee
reset of the SATA HDDs! You must do a full power off!!
3: NCQ on Hitachi HDP725032GLA360 SATA HDD
With the 2.6.24 kernel, NCQ caused lots of "exception" error messages in
/var/log/messages and ultimately caused the drive to be soft reset and
to slow to a crawl or even freeze. Setting the NCQ queue depth down to 1
(effectively no NCQ) was a good workaround.
That now appears fixed in the Mandriva 2.6.24.5 kernel. NCQ at the
maximum of 31 now looks to be fine.
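For anyone wanting to try the same workaround, here is a minimal Python
sketch of reading and changing the queue depth through sysfs. It assumes
the kernel exposes /sys/block/<dev>/device/queue_depth and that the
affected drive is "sda"; the helper names are made up for illustration.
Writing needs root, and the setting is lost on reboot.

```python
from pathlib import Path


def queue_depth_path(dev, sys_block="/sys/block"):
    """Return the sysfs file holding the queue depth for a block device."""
    return Path(sys_block) / dev / "device" / "queue_depth"


def get_queue_depth(dev, sys_block="/sys/block"):
    """Read the current queue depth as an integer."""
    return int(queue_depth_path(dev, sys_block).read_text().strip())


def set_queue_depth(dev, depth, sys_block="/sys/block"):
    """Write a new queue depth; 1 effectively turns NCQ off."""
    # 31 is the maximum mentioned in this post.
    if not 1 <= depth <= 31:
        raise ValueError("queue depth must be between 1 and 31")
    queue_depth_path(dev, sys_block).write_text(f"{depth}\n")


# Example (as root, "sda" being an assumption):
#   print(get_queue_depth("sda"))
#   set_queue_depth("sda", 1)
```

The sys_block argument is only there so the helpers can be tried against
a scratch directory before touching the real /sys.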
4: Onboard Gigabit ethernet crash
Yep. Guaranteed quick Oops. Dead. It only needs a little bit of data and
it locks up. I think my ksymoops has sent quite a few reports for that!
It looks like there is a serious problem with the r8169 kernel module in
combination with interrupt sharing, the chipset, or both.
lspci:
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)
The fix: Simply don't physically connect it to a network and don't
enable it from the OS. Idle, it appears to stay benign.
5: Channel ata7 means death
The motherboard has a total of 8 SATA2 channels, 6 internal and 2
"esata" on the back connectors. Very strangely, any activity on the ata7
channel (as reported in /var/log/messages) produces a trickle of
exception errors. ata7 and ata8 both use the Gigabyte sata ASIC, yet
ata8 works fine.
There are some forum comments that this might be an IRQ 18 sharing
problem with the nVidia graphics card. However, the graphics have never
been a problem throughout.
The fix is simply not to plug into that connector.
Looking at the motherboard from the back edge, with the ps2 sockets
leftmost, ata7 is the nearest of the cluster of two purple sata
connectors on the far right.
6: USB ohci means death
Plugging a USB2 flash memory stick into any of the USB ports calls up
an ehci interface. However, plugging in a USB2 Maxtor external HDD, for
most of the USB ports, will alternately call up an ohci interface and
then an ehci interface on reconnecting (and then the ohci next, and so
on).
Using the ohci and transferring data will quickly cause a system freeze.
Using the ehci appears good and has worked fine for many GBytes of data
so far.
The two USB ports in the same block as the esata connectors seem to
alternate ohci - ehci. The other USB ports on the back connectors seem
to be ehci always. The two USB ports on the front of the case seem to
randomly stay ehci or do the alternating trick.
Look in /var/log/messages to see whether you get ehci or ohci called up
for a USB interface.
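That check can be roughly automated. The Python sketch below only looks
for the driver names ohci_hcd and ehci_hcd (as they appear in the
interrupt listings further down) anywhere in a log line; it assumes the
kernel messages name the driver that way, and the function names are
made up for illustration.

```python
import re

# Driver names as they appear in the interrupt listings in this post.
USB_DRIVER = re.compile(r"\b(ohci_hcd|ehci_hcd)\b")


def usb_driver_lines(lines):
    """Yield (driver, line) for every log line that names a USB host driver."""
    for line in lines:
        match = USB_DRIVER.search(line)
        if match:
            yield match.group(1), line.rstrip("\n")


def report(path="/var/log/messages"):
    """Print matching lines, flagging the ohci ones that froze this system."""
    with open(path, errors="replace") as log:
        for driver, line in usb_driver_lines(log):
            flag = "OHCI!" if driver == "ohci_hcd" else "ehci "
            print(flag, line)
```

Calling report() after plugging in a device shows at a glance which
interface it came up on, before any data is transferred.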
Further notes:
7: I've disabled the IDE interface in the BIOS because the libata kernel
module did not find a CF card on there in any case. There's also an
error message about there being too many IDE interfaces if the IDE
interface is left enabled...
Meanwhile, an Ubuntu kernel worked fine. See my earlier post. A guess is
that libata isn't being used in Ubuntu.
8: Ensure that there are no physical kinks in the sata leads! You have
gigabit data flying along there and the bits get reflected/mangled at
any sharp kinks...
Example errors are listed below. Good tests are to transfer a few GBytes
using tar over an nfs mount, thrash with bonnie, and use Boinc to keep
the CPUs busy. Install ksymoops for the oops debug.
How do you get this info to the kernel people? I wouldn't want them
wasting time on multiple Oops that are already fixed/avoided!
Good luck to anyone else,
Cheers,
Martin
The system:
lspci
00:00.0 Host bridge: ATI Technologies Inc RD790 Northbridge only dual
slot PCI-e_GFX and HT3 K8 part
00:02.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge
(external gfx0 port A)
00:04.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI
express gpp port A)
00:07.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI
express gpp port D)
00:09.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI
express gpp port E)
00:0a.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI
express gpp port F)
00:12.0 SATA controller: ATI Technologies Inc SB600 Non-Raid-5 SATA
00:13.0 USB Controller: ATI Technologies Inc SB600 USB (OHCI0)
00:13.1 USB Controller: ATI Technologies Inc SB600 USB (OHCI1)
00:13.2 USB Controller: ATI Technologies Inc SB600 USB (OHCI2)
00:13.3 USB Controller: ATI Technologies Inc SB600 USB (OHCI3)
00:13.4 USB Controller: ATI Technologies Inc SB600 USB (OHCI4)
00:13.5 USB Controller: ATI Technologies Inc SB600 USB Controller (EHCI)
00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 14)
00:14.1 IDE interface: ATI Technologies Inc SB600 IDE
00:14.2 Audio device: ATI Technologies Inc SBx00 Azalia
00:14.3 ISA bridge: ATI Technologies Inc SB600 PCI to LPC Bridge
00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
01:00.0 VGA compatible controller: nVidia Corporation GeForce 8600 GT
(rev a1)
02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E
Gigabit Ethernet Controller (rev 12)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)
04:00.0 SATA controller: JMicron Technologies, Inc. JMicron 20360/20363
AHCI Controller (rev 02)
04:00.1 IDE interface: JMicron Technologies, Inc. JMicron 20360/20363
AHCI Controller (rev 02)
05:00.0 SATA controller: JMicron Technologies, Inc. JMicron 20360/20363
AHCI Controller (rev 02)
05:00.1 IDE interface: JMicron Technologies, Inc. JMicron 20360/20363
AHCI Controller (rev 02)
06:0e.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23
IEEE-1394a-2000 Controller (PHY/Link)
cat /proc/irq_stats
0: 41 timer
1: 1480 i8042
8: 0 rtc0
9: 1 acpi
12: 2797 i8042
16: 0 ohci_hcd:usb2
16: 2893 HDA Intel
17: 0 ohci_hcd:usb3
17: 0 ohci_hcd:usb5
17: 0 ahci
18: 0 ohci_hcd:usb4
18: 0 ohci_hcd:usb6
18: 0 ahci
18: 16119011 nvidia
19: 1512435 ehci_hcd:usb1
22: 1373655 ahci
22: 3 ohci1394
1274: 236084 eth1
cat /proc/interrupts
CPU0 CPU1
0: 40 1 IO-APIC-edge timer
1: 2 1478 IO-APIC-edge i8042
4: 0 2 IO-APIC-edge
8: 0 0 IO-APIC-edge rtc0
9: 0 1 IO-APIC-fasteoi acpi
12: 7 2791 IO-APIC-edge i8042
16: 5 2888 IO-APIC-fasteoi ohci_hcd:usb2, HDA Intel
17: 0 0 IO-APIC-fasteoi ohci_hcd:usb3, ohci_hcd:usb5, ahci
18: 1921 16115644 IO-APIC-fasteoi ohci_hcd:usb4, ohci_hcd:usb6, ahci, nvidia
19: 2993 1509538 IO-APIC-fasteoi ehci_hcd:usb1
22: 2473 1371096 IO-APIC-fasteoi ahci, ohci1394
1274: 386 235623 PCI-MSI-edge eth1
NMI: 0 0 Non-maskable interrupts
LOC: 31231805 30558165 Local timer interrupts
RES: 744523 595624 Rescheduling interrupts
CAL: 2544 1511 function call interrupts
TLB: 7962 3842 TLB shootdowns
TRM: 0 0 Thermal event interrupts
THR: 0 0 Threshold APIC interrupts
SPU: 0 0 Spurious interrupts
ERR: 0
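To pick out the shared interrupts from /proc/interrupts without reading
it by eye, something like the following Python sketch can be used. It
assumes the 2.6.24-era layout shown here (a CPU header row, then per
line: IRQ number, one count per CPU, one controller-type word, and a
comma-separated handler list); other kernels may lay the columns out
differently. The function name is made up for illustration.

```python
def shared_irqs(text):
    """Map IRQ number -> handler names, for IRQs with two or more handlers."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue                      # skip NMI, LOC, ERR and similar rows
        fields = rest.split()
        # fields: one count per CPU, then the controller type, then handlers
        handlers = " ".join(fields[ncpu + 1:])
        names = [h.strip() for h in handlers.split(",") if h.strip()]
        if len(names) > 1:
            shared[label.strip()] = names
    return shared


# Example:
#   with open("/proc/interrupts") as f:
#       for irq, names in shared_irqs(f.read()).items():
#           print(irq, names)
```

On this system that would list IRQ 18 with the nvidia, ahci and two
ohci_hcd handlers all sharing one line.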
3: NCQ: example burst of errors prior to the kernel bug fix:
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:38:10:bd/00:00:01:00:00/e1 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:40:10:bd/00:00:01:00:00/e1 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: end_request: I/O error, dev sda, sector 25089057
kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:48:f3:80/00:00:01:00:00/e1 Emask 0x40
(internal error)
kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:90:f9:7f/00:00:01:00:00/e1 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: end_request: I/O error, dev sda, sector 25089057
kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:50:10:bd/00:00:01:00:00/e1 Emask 0x40
(internal error)
kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: res 50/00:00:e0:10:bd/00:00:01:00:00/e1 Emask 0x40
(internal error)
kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
5: Typical ata7 connection errors:
kernel: ata7.00: exception Emask 0x0 SAct 0x1b SErr 0x180000 action 0x2
frozen
kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x580100 action 0x2
kernel: res 50/00:00:af:34:54/00:00:1b:00:00/eb Emask 0x10 (ATA
bus error)
kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x580100 action 0x2
kernel: res 50/00:00:1f:e5:5c/00:00:1b:00:00/eb Emask 0x10 (ATA
bus error)
kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x780100 action 0x2
kernel: res 50/00:00:df:ee:78/00:00:1b:00:00/eb Emask 0x10 (ATA
bus error)
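To see quickly which ports are throwing errors like the ones above, the
exception lines can be tallied per port. This is a rough Python sketch
that assumes the "ataN.NN: exception Emask" wording quoted in this post;
the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the first line of a libata exception report as quoted above,
# e.g. "kernel: ata7.00: exception Emask 0x10 SAct 0x0 ..."
EXCEPTION = re.compile(r"\b(ata\d+)\.\d+: exception Emask ")


def count_exceptions(lines):
    """Count libata exception reports per port (ata2, ata7, ...)."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Example:
#   with open("/var/log/messages", errors="replace") as log:
#       for port, n in count_exceptions(log).most_common():
#           print(port, n)
```

A port that keeps climbing while the others stay at zero is a good hint
for a bad connector or cable rather than a drive or kernel problem.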
--
----------------
Martin Lomas
martin at ml1.co.uk
----------------