[Nottingham] Various kernel oops (& SATA NCQ, IRQ 18, ohci-ehci, and lockups...)

Martin martin at ml1.co.uk
Mon Jun 16 19:45:30 BST 2008


Martin wrote:
[---]
> Well, the story rolls on...
> 
> The 'fix' appears to have been to clobber the SATA HDDs NCQ down to '1'
> (no NCQ) and to boot with the kernel parameter mem=3584M to avoid ever
> going over the 4 GByte memory boundary. (I'm sure mem=4000M or even
> mem=4096M would work just as well.)

And that was just a small part of the story...

I think I've been on the painful bleedin' edge on this one!

To set the scene:

Gigabyte GA-MA790FX-DS5 (rev. 1.0) AMD 790FX Chipset motherboard;
nVidia 8600 PCIe graphics card;
Hitachi HDP725032GLA360, GM3OA52A, max UDMA/133 sata2 HDD;

(And most recently)
Linux 2.6.24.5-server-1mnb #1 SMP Tue May 27 13:49:03 EDT 2008
x86_64 AMD Athlon(tm) 64 X2 Dual Core Processor 6400+ GNU/Linux


So... The 'fun' and fixes, roughly grouped:

1: BIOS

The factory-installed BIOS is "F2 2007/11/23". This can fail to boot.
Also, very strangely, choosing the option to clear the 'opened case'
status caused the CMOS contents to be scrambled. The system was also
very temperamental when using the AHCI settings for the SATA ports,
although that might be confused with the IDE - SATA clash problems for
the Linux libata kernel module.

Updating to "F5 2008/03/27" seemed more reliable, but then that would
randomly fail its checksum on boot and revert back to "F2"... Re-update
required, until the next version became available.

Updating to "F6 2008/05/21" has been good and stable so far.


Note: if you get a freeze before or immediately after the POST memory
test, and you are sure that your memory and CPU are properly in place,
then upgrade to the latest BIOS version.
See:
http://www.gigabyte.com.tw/Support/Motherboard/BIOS_Model.aspx?ProductID=2694

Or reload the BIOS default settings and try again.


2: Reset

Hard reset (pushing the reset button on the case) does NOT guarantee
reset of the SATA HDDs! You must do a full power off!!


3: NCQ on Hitachi HDP725032GLA360 SATA HDD

With the 2.6.24 kernel, NCQ caused lots of "exception" error messages in
/var/log/messages and ultimately caused the drive to be soft reset and
slow to a crawl or even freeze. Setting NCQ down to 1 (effectively no
NCQ) was a good workaround.

That now appears fixed in the Mandriva 2.6.24.5 kernel. NCQ at the
maximum of 31 now looks to be fine.
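
For anyone wanting to try the same workaround, here is a minimal Python
sketch. It assumes the drive shows up as sda and that the kernel exposes
the usual libata sysfs attribute /sys/block/<dev>/device/queue_depth
(writing to it needs root). The function names are only for
illustration, not part of any tool.

```python
from pathlib import Path


def queue_depth_path(dev, sysfs_root="/sys"):
    """Return the sysfs file that holds the NCQ queue depth for a disk."""
    return Path(sysfs_root) / "block" / dev / "device" / "queue_depth"


def get_queue_depth(dev, sysfs_root="/sys"):
    """Read the current queue depth (1 means NCQ is effectively off)."""
    return int(queue_depth_path(dev, sysfs_root).read_text().strip())


def set_queue_depth(dev, depth, sysfs_root="/sys"):
    """Write a new queue depth: 1 disables NCQ, 31 is the maximum here."""
    if not 1 <= depth <= 31:
        raise ValueError("queue depth must be between 1 and 31")
    queue_depth_path(dev, sysfs_root).write_text(f"{depth}\n")
```

As root, set_queue_depth("sda", 1) would clobber NCQ for that drive. The
setting lives in sysfs only, so it is lost on reboot and has to be
reapplied from a boot script.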


4: Onboard Gigabit ethernet crash

Yep. Guaranteed quick Oops. Dead. It only needs a little bit of data and
it locks up. I think my ksymoops has sent quite a few reports for that!

Looks like there is a serious problem with the r8169 kernel module in
combination with interrupt sharing, the chipset, or both.

lspci:
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)

The fix: Simply don't physically connect it to a network and don't
enable it from the OS. Idle, it appears to stay benign.


5: Channel ata7 means death

The motherboard has a total of 8 SATA2 channels, 6 internal and 2
"esata" on the back connectors. Very strangely, any activity on the ata7
channel (as reported in /var/log/messages) produces a trickle of
exception errors. ata7 and ata8 both use the Gigabyte sata ASIC, yet
ata8 works fine.

There are some forum comments that this might be an IRQ 18 sharing
problem with the nVidia graphics card. However, the graphics have never
been a problem throughout.

The fix is simply not to plug into that connector.

Looking at the motherboard from the back edge, with the ps2 sockets
leftmost, ata7 is the nearest of the cluster of two purple sata
connectors on the far right.


6: USB ohci means death

Plugging a USB2 flash memory stick into any of the USB ports calls up
an ehci interface. However, plugging in a USB2 Maxtor external HDD, for
most of the USB ports, will alternately call up an ohci interface and
then an ehci interface on reconnecting (and then the ohci next, and so
on).

Using the ohci and transferring data will quickly cause a system freeze.

Using the ehci appears good and has worked fine for many GBytes of data
so far.

The two USB ports in the same block as the esata connectors seem to
alternate ohci - ehci. The other USB ports on the back connectors seem
to be ehci always. The two USB ports on the front of the case seem to
randomly stay ehci or do the alternating trick.

Look in /var/log/messages to see whether you get ehci or ohci called up
for a USB interface.
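
To save eyeballing the log after every replug, something like the
following Python sketch could do the check. It assumes the relevant log
lines name the host driver as ohci_hcd or ehci_hcd (the same names that
show up in /proc/interrupts). The exact wording of the rest of the line
is not relied upon, and the function names are hypothetical.

```python
import re

# Driver names as they appear elsewhere in this post (e.g. /proc/interrupts).
DRIVER_RE = re.compile(r"\b(ohci_hcd|ehci_hcd)\b")


def usb_driver_lines(lines):
    """Yield (driver, line) for every log line that names a USB host driver."""
    for line in lines:
        match = DRIVER_RE.search(line)
        if match:
            yield match.group(1), line.rstrip("\n")


def last_usb_driver(lines):
    """Return the driver named in the most recent matching line, or None."""
    latest = None
    for driver, _ in usb_driver_lines(lines):
        latest = driver
    return latest
```

Passing an open /var/log/messages file to last_usb_driver() right after
plugging in the drive should tell you whether it is safe to start
copying (ehci_hcd) or whether to unplug and try again (ohci_hcd).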


Further notes:

7: I've disabled the IDE interface in the BIOS because the libata kernel
module did not find a CF card on there in any case. There's also an
error message about there being too many IDE interfaces if the IDE
interface is left enabled...

Meanwhile, an Ubuntu kernel worked fine. See my earlier post. A guess is
that libata isn't being used in Ubuntu.


8: Ensure that there are no physical kinks in the sata leads! You have
gigabit data flying along there and the bits get reflected/mangled at
any sharp kinks...



Example errors are listed below. Good tests are to transfer a few GBytes
using tar over an nfs mount, thrash with bonnie, and use Boinc to keep
the CPUs busy. Install ksymoops for the oops debug.


How do you get this info to the kernel people? I wouldn't want them
wasting time on multiple Oops that are already fixed/avoided!

Good luck to anyone else,

Cheers,
Martin



The system:

lspci
00:00.0 Host bridge: ATI Technologies Inc RD790 Northbridge only dual
slot PCI-e_GFX and HT3 K8 part
00:02.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge
(external gfx0 port A)
00:04.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI
express gpp port A)
00:07.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI
express gpp port D)
00:09.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI
express gpp port E)
00:0a.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI
express gpp port F)
00:12.0 SATA controller: ATI Technologies Inc SB600 Non-Raid-5 SATA
00:13.0 USB Controller: ATI Technologies Inc SB600 USB (OHCI0)
00:13.1 USB Controller: ATI Technologies Inc SB600 USB (OHCI1)
00:13.2 USB Controller: ATI Technologies Inc SB600 USB (OHCI2)
00:13.3 USB Controller: ATI Technologies Inc SB600 USB (OHCI3)
00:13.4 USB Controller: ATI Technologies Inc SB600 USB (OHCI4)
00:13.5 USB Controller: ATI Technologies Inc SB600 USB Controller (EHCI)
00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 14)
00:14.1 IDE interface: ATI Technologies Inc SB600 IDE
00:14.2 Audio device: ATI Technologies Inc SBx00 Azalia
00:14.3 ISA bridge: ATI Technologies Inc SB600 PCI to LPC Bridge
00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
01:00.0 VGA compatible controller: nVidia Corporation GeForce 8600 GT
(rev a1)
02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E
Gigabit Ethernet Controller (rev 12)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)
04:00.0 SATA controller: JMicron Technologies, Inc. JMicron 20360/20363
AHCI Controller (rev 02)
04:00.1 IDE interface: JMicron Technologies, Inc. JMicron 20360/20363
AHCI Controller (rev 02)
05:00.0 SATA controller: JMicron Technologies, Inc. JMicron 20360/20363
AHCI Controller (rev 02)
05:00.1 IDE interface: JMicron Technologies, Inc. JMicron 20360/20363
AHCI Controller (rev 02)
06:0e.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23
IEEE-1394a-2000 Controller (PHY/Link)


cat /proc/irq_stats
  0:         41   timer
  1:       1480   i8042
  8:          0   rtc0
  9:          1   acpi
 12:       2797   i8042
 16:          0   ohci_hcd:usb2
 16:       2893   HDA Intel
 17:          0   ohci_hcd:usb3
 17:          0   ohci_hcd:usb5
 17:          0   ahci
 18:          0   ohci_hcd:usb4
 18:          0   ohci_hcd:usb6
 18:          0   ahci
 18:   16119011   nvidia
 19:    1512435   ehci_hcd:usb1
 22:    1373655   ahci
 22:          3   ohci1394
1274:     236084   eth1


cat /proc/interrupts
           CPU0       CPU1
  0:         40          1   IO-APIC-edge      timer
  1:          2       1478   IO-APIC-edge      i8042
  4:          0          2   IO-APIC-edge
  8:          0          0   IO-APIC-edge      rtc0
  9:          0          1   IO-APIC-fasteoi   acpi
 12:          7       2791   IO-APIC-edge      i8042
 16:          5       2888   IO-APIC-fasteoi   ohci_hcd:usb2, HDA Intel
 17:          0          0   IO-APIC-fasteoi   ohci_hcd:usb3, ohci_hcd:usb5, ahci
 18:       1921   16115644   IO-APIC-fasteoi   ohci_hcd:usb4, ohci_hcd:usb6, ahci, nvidia
 19:       2993    1509538   IO-APIC-fasteoi   ehci_hcd:usb1
 22:       2473    1371096   IO-APIC-fasteoi   ahci, ohci1394
1274:        386     235623   PCI-MSI-edge      eth1
NMI:          0          0   Non-maskable interrupts
LOC:   31231805   30558165   Local timer interrupts
RES:     744523     595624   Rescheduling interrupts
CAL:       2544       1511   function call interrupts
TLB:       7962       3842   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
SPU:          0          0   Spurious interrupts
ERR:          0
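
Since IRQ sharing keeps coming up as a suspect, here is a rough Python
sketch for picking the shared IRQs out of /proc/interrupts. It assumes
the layout shown above (a CPU header row, then one IRQ per line as the
kernel prints it, not as a mail client wraps it). The function name is
made up for this example.

```python
def shared_irqs(text):
    """Map IRQ number -> handler list, for IRQs with more than one handler."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())            # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():
            continue                          # skip NMI, LOC, ERR and friends
        # Per-CPU counts, then the interrupt type, then the handler names.
        fields = rest.split(None, ncpus + 1)
        if len(fields) < ncpus + 2:
            continue                          # nothing registered on this IRQ
        handlers = [h.strip() for h in fields[ncpus + 1].split(",") if h.strip()]
        if len(handlers) > 1:
            shared[int(label)] = handlers
    return shared
```

Fed the contents of /proc/interrupts from this box, it should report
IRQs 16, 17, 18 and 22, with IRQ 18 being the one where nvidia sits
alongside ahci and two of the ohci controllers.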


3: NCQ: example burst of errors prior to the kernel bug fix:

kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:38:10:bd/00:00:01:00:00/e1 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:40:10:bd/00:00:01:00:00/e1 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: end_request: I/O error, dev sda, sector 25089057
kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:48:f3:80/00:00:01:00:00/e1 Emask 0x40
(internal error)
kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:90:f9:7f/00:00:01:00:00/e1 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: end_request: I/O error, dev sda, sector 25089057
kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:50:10:bd/00:00:01:00:00/e1 Emask 0x40
(internal error)
kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:af:ea:42/00:00:25:00:00/e5 Emask 0x40
(internal error)
kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel:          res 50/00:00:e0:10:bd/00:00:01:00:00/e1 Emask 0x40
(internal error)
kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0


5: Typical errors for the ata7 connection:

kernel: ata7.00: exception Emask 0x0 SAct 0x1b SErr 0x180000 action 0x2
frozen

kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x580100 action 0x2
kernel:          res 50/00:00:af:34:54/00:00:1b:00:00/eb Emask 0x10 (ATA
bus error)

kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x580100 action 0x2
kernel:          res 50/00:00:1f:e5:5c/00:00:1b:00:00/eb Emask 0x10 (ATA
bus error)

kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x780100 action 0x2
kernel:          res 50/00:00:df:ee:78/00:00:1b:00:00/eb Emask 0x10 (ATA
bus error)
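
To see at a glance which channel is misbehaving, the exception lines can
be tallied per port. This is only a sketch in Python, assuming the
"ataN.NN: exception" message format quoted above. The function name is
hypothetical.

```python
import re
from collections import Counter

# Matches the "ataN.NN: exception" lines quoted in this post.
EXCEPTION_RE = re.compile(r"\b(ata\d+)\.\d+: exception\b")


def count_ata_exceptions(lines):
    """Return a Counter of exception messages per ata port (e.g. 'ata7')."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Run over /var/log/messages before and after a test transfer, a port
whose count keeps climbing is the one to unplug or to drop the NCQ
depth on.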






-- 
----------------
Martin Lomas
martin at ml1.co.uk
----------------



