[Gllug] exim segfault puzzle

Sat Dec 22 09:38:38 UTC 2007

Exim has been segfaulting on my main server for a few days now.  This
is a real puzzler because I haven't upgraded or changed the software,
and the folks who run the hardware claim that it's not a hardware
problem[1].

It always segfaults when delivering the same kind of mail: an
automatically generated delivery failure message for one mail account
for which it forwards mail.  It is always the same account, and the
segfault always happens on the return message, not the original
message.  The original message is, naturally, spam.

I've deleted the message(s) which cause the segfault, only to have the
same problem recur some hours later when another faulty return
message is generated.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1078686048 (LWP 23605)]
0x400ec4f2 in __ham_copy_item () from /usr/lib/libdb3.so.3
(gdb) bt
#0  0x400ec4f2 in __ham_copy_item () from /usr/lib/libdb3.so.3
#1  0x400ebb37 in __ham_split_page () from /usr/lib/libdb3.so.3
#2  0x400e06f9 in __ham_c_dup () from /usr/lib/libdb3.so.3
#3  0x400e02e3 in __ham_c_dup () from /usr/lib/libdb3.so.3
#4  0x400c2e77 in __db_c_put () from /usr/lib/libdb3.so.3
#5  0x400beb55 in __db_put () from /usr/lib/libdb3.so.3
#6  0x08052101 in ?? ()
#7  0x080ca748 in ?? ()
#8  0x00000000 in ?? ()

It always seems to segfault in the same place: inside __db_put() in
libdb3, as in the backtrace above.

I thought I'd cracked it last night, assuming it was because my spam
folder was getting large and had crossed some limit such as 2 or 4 GB,
but when it segfaults it's not holding any large files open:

# ls -l /proc/23605/fd
total 0
lrwx------ 1 root root 64 2007-12-22 09:19 0 -> /dev/pts/2
lrwx------ 1 root root 64 2007-12-22 09:19 1 -> /dev/pts/2
lrwx------ 1 root root 64 2007-12-22 09:25 10 -> /var/spool/exim/db/retry.lockfile
lrwx------ 1 root root 64 2007-12-22 09:25 11 -> /var/spool/exim/db/retry
lrwx------ 1 root root 64 2007-12-22 09:19 2 -> /dev/pts/2
lr-x------ 1 root root 64 2007-12-22 09:19 3 -> pipe:[843558]
l-wx------ 1 root root 64 2007-12-22 09:19 4 -> pipe:[843558]
lr-x------ 1 root root 64 2007-12-22 09:19 5 -> /usr/sbin/exim
lrwx------ 1 root root 64 2007-12-22 09:19 6 -> /var/spool/exim/input/1J5uHV-0006Tb-00-D
l-wx------ 1 root root 64 2007-12-22 09:19 7 -> /var/spool/exim/msglog/1J5uHV-0006Tb-00
l-wx------ 1 root root 64 2007-12-22 09:19 8 -> /var/spool/exim/input/1J5uHV-0006Tb-00-J
lrwx------ 1 root root 64 2007-12-22 09:19 9 -> /var/log/exim/mainlog
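
Note that ls -l on /proc/<pid>/fd only shows the symlinks, not the sizes
of the files behind them.  A minimal Python sketch of a more explicit
size check follows, assuming a Linux-style /proc.  The function names and
the 2 GiB threshold are illustrative choices of mine, not anything taken
from exim.

```python
import os

# Assumed threshold: the classic signed 32-bit file offset limit (2 GiB).
LIMIT_2GIB = 2 ** 31


def open_file_sizes(fd_dir):
    """Return (fd name, link target, size in bytes) for each symlink in fd_dir.

    Entries that are not symlinks, or whose targets cannot be stat()ed,
    are skipped.
    """
    results = []
    # Sort numerically-looking names in natural order: 0, 1, 2, ..., 10, 11.
    for name in sorted(os.listdir(fd_dir), key=lambda n: (len(n), n)):
        link = os.path.join(fd_dir, name)
        try:
            target = os.readlink(link)
            size = os.stat(link).st_size  # os.stat follows the symlink
        except OSError:
            continue
        results.append((name, target, size))
    return results


def large_files(entries, limit=LIMIT_2GIB):
    """Return the entries whose size is at or above the limit."""
    return [entry for entry in entries if entry[2] >= limit]
```

For the process above, something like
large_files(open_file_sizes("/proc/23605/fd")) would list any open file
at or over 2 GiB.  It needs to run as root (or as the exim user) to read
another process's fd directory.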

I've copied the message files here:

http://www.annexia.org/1J5uHV-0006Tb-00
http://www.annexia.org/1J5uHV-0006Tb-00-D
http://www.annexia.org/1J5uHV-0006Tb-00-J (empty file)
http://www.annexia.org/1J5uHV-0006Tb-00-H

Memory usage when exim segfaults is negligible, so it doesn't seem to
be a problem with running out of memory (or disk space for that
matter).

Nothing notable in the logfile.

I've no idea where to look on this now.  Is it something about the
message?  Some sort of data-driven problem in exim ...?

Rich.

[1] And if it were a hardware thing, why would exim be the only thing
which segfaults?

-- 
Richard Jones
Red Hat
-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug