Bug 199700 - mail-filter/dspam segfaults if postgres is used
Bug#: 199700 Product:  Gentoo Linux Version: unspecified Platform: All
OS/Version: Linux Status: RESOLVED Severity: major Priority: P2
Resolution: FIXED Assigned To: mrness@gentoo.org Reported By: jmjoseph@andrew.cmu.edu
Component: Applications
URL: 
Summary: mail-filter/dspam segfaults if postgres is used
Keywords:  
Status Whiteboard: 
Opened: 2007-11-19 20:23 0000
Description:   Opened: 2007-11-19 20:23 0000
If the postgres driver is used, dspam segfaults on very many messages.  I
initially expected a migration from mysql to postgres to be at fault, but
training new users (with no information in the database) also fails.  I did not
experience these crashes when the mysql driver was used.  This defect is
serious enough that dspam cannot be used.  

I have attached one such crash-prone message, include a little information from
gdb, and list contents of dspam.debug below.  I'd greatly appreciate any help
with this issue and would be happy to assist in any way.

For the record, I am running dspam 3.8.0-r7 along with a postgres 8.0.13
server.  Note that the particular crash below occurs only when showing factors,
but disabling this only delays the crash to the next time CTX->factors->first
is accessed.

Note that I had originally posted this to the dspam-dev mailing list, but found
the project to be inactive.

GDB:
Short story: CTX->factors->first->ptr at dspam.c:3092 does not appear to be
initialized
----------------------------------------------------------------
Starting program: /usr/bin/dspam --debug --user jacob --mode=notrain --client
--stdout --deliver=innocent,spam <
/home/jacob/dbg_mail-linux-cluster-bounces@redhat.com-1194995056
[Thread debugging using libthread_db enabled]
[New Thread -1212545344 (LWP 17038)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1212545344 (LWP 17038)]
add_xdspam_headers (CTX=0x80705e0, ATX=0xbfe51b18) at dspam.c:3092
3092                    snprintf(scratch, sizeof(scratch), "%s, %2.5f",
(gdb) list
3087                node_ft = c_nt_first(CTX->factors, &c_ft);
3088                while(node_ft != NULL) {
3089                  struct dspam_factor *f = (struct dspam_factor *)
node_ft->ptr;
3090                  if (f) {
3091                    strlcat(data, ",\n\t", sizeof(data));
3092                    snprintf(scratch, sizeof(scratch), "%s, %2.5f",
3093                             f->token_name, f->value);
3094                    strlcat(data, scratch, sizeof(data));
3095                  }
3096                  node_ft = c_nt_next(CTX->factors, &c_ft);
(gdb) bt
#0  add_xdspam_headers (CTX=0x80705e0, ATX=0xbfe51b18) at dspam.c:3092
#1  0x08053442 in process_message (ATX=0xbfe51b18, message=0x8062bb8,
    username=0x8062930 "jacob", result_string=0xbfe519d0) at dspam.c:727
#2  0x08054295 in process_users (ATX=0xbfe51b18, message=0x8062b10)
    at dspam.c:1797
#3  0x08054f30 in main (argc=Cannot access memory at address 0x0
) at dspam.c:258
(gdb) print CTX->factors
$11 = (struct nt *) 0x806e688
(gdb) print *CTX->factors
$12 = {
  first = 0x806e570,
  insert = 0x616d6c69,
  items = 1700146542,
  nodetype = 1869181810
}
(gdb) print *CTX->factors->first
$13 = {
  ptr = 0x38,
  next = 0x20
}
-----------------------------------------------------------

dspam.debug
-----------------------------------------------------------
17124: [11/14/2007 00:19:02] No QuarantineAgent option found. Using standard
quarantine.
17124: [11/14/2007 00:19:02] DSPAM Instance Startup
17124: [11/14/2007 00:19:02] input args: /usr/bin/dspam --debug --user jacob
--mode=notrain --client --stdout --deliver=innocent,spam
17124: [11/14/2007 00:19:02] pass-thru args:
17124: [11/14/2007 00:19:02] processing user jacob
17124: [11/14/2007 00:19:02] uid = 0, euid = 0, gid = 0, egid = 503
17124: [11/14/2007 00:19:02] loading preferences for user jacob
17124: [11/14/2007 00:19:02] Loading preferences for uid 680
17124: [11/14/2007 00:19:02] Loading preferences for uid 0
17124: [11/14/2007 00:19:02] default preferences empty. reverting to dspam.conf
preferences.
17124: [11/14/2007 00:19:02] Loading preferences from dspam.conf
17124: [11/14/2007 00:19:02] using /var/spool/dspam/opt-in/local/jacob.dspam as
path
17124: [11/14/2007 00:19:02] using /var/spool/dspam/opt-out/local/jacob.nodspam
as path
17124: [11/14/2007 00:19:02] sedation level set to: 0
17124: [11/14/2007 00:19:02] Connecting to 127.0.0.1:3310 for virus check
17124: [11/14/2007 00:19:02] Loading 278 BNR patterns
17124: [11/14/2007 00:19:02] bnr reported snr of 6.597
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.05_0.05_0.40_
0.01000 0s 3i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.05_0.40_0.40_
0.01000 0s 7i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.05_0.05_0.05_
0.01000 0s 41i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.00_0.00_0.05_
0.01000 0s 38i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.05_0.40_0.05_
0.01000 0s 3i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.40_0.05_0.40_
0.01000 0s 7i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.40_0.40_0.40_
0.01000 0s 8i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.t| 0.01000 0s 46i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.40_0.05_0.05_
0.01000 0s 3i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.40_0.40_0.05_
0.01000 0s 7i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.00_0.05_0.05_
0.01000 0s 38i
17124: [11/14/2007 00:19:02] Whitelist threshold: 10
17124: [11/14/2007 00:19:02] [burton] [0.001190] is+> (1frq, 0s, 614i)
17124: [11/14/2007 00:19:02] [burton] [0.002290] >+I (2frq, 8s, 2550i)
17124: [11/14/2007 00:19:02] [burton] [0.002290] >+I (2frq, 8s, 2550i)
17124: [11/14/2007 00:19:02] [burton] [0.002382] >+The (2frq, 3s, 919i)
17124: [11/14/2007 00:19:02] [burton] [0.002382] >+The (2frq, 3s, 919i)
17124: [11/14/2007 00:19:02] [burton] [0.002888] https+// (1frq, 11s, 2778i)
17124: [11/14/2007 00:19:02] [burton] [0.003946] X-Mailman-Version*2.1.5 (1frq,
29s, 5355i)
17124: [11/14/2007 00:19:02] [burton] [0.004760] >+> (60frq, 38s, 5812i)
17124: [11/14/2007 00:19:02] [burton] [0.004760] >+> (60frq, 38s, 5812i)
17124: [11/14/2007 00:19:02] [burton] [0.005031] wrote+> (2frq, 39s, 5643i)
17124: [11/14/2007 00:19:02] [burton] [0.005031] wrote+> (2frq, 39s, 5643i)
17124: [11/14/2007 00:19:02] [burton] [0.006364] https (1frq, 26s, 2970i)
17124: [11/14/2007 00:19:02] [burton] [0.006518] List-Post*<mailto (1frq, 62s,
6913i)
17124: [11/14/2007 00:19:02] [burton] [0.006790] List-Help*request (1frq, 55s,
5886i)
17124: [11/14/2007 00:19:02] [burton] [0.008260] Errors-To*bounces (1frq, 67s,
5885i)
17124: [11/14/2007 00:19:02] [burton] [0.008264] Sender*bounces (1frq, 67s,
5882i)
17124: [11/14/2007 00:19:02] [burton] [0.008533] Url*redhat (1frq, 0s, 85i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] List-Help*cluster+request
(1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] List-Post*cluster+redhat.com>
(1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634]
List-Subscribe*<https+//www.redhat.com/mailman/listinfo/linux (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] cluster+redhat (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] Subject*[Linux+cluster] (1frq,
0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] list+Linux (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] Sender*linux+cluster (1frq,
0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634]
List-Subscribe*//www.redhat.com/mailman/listinfo/linux+cluster> (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] redhat+com (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] Return-Path*cluster+bounces
(1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] Burton-Bayesian Probability: 0.000000 Samples: 27
17124: [11/14/2007 00:19:02] no factors specified; using default
17124: [11/14/2007 00:19:02] Result Confidence: 0.99
17124: [11/14/2007 00:19:02] [burton] [0.001190] is+> (1frq, 0s, 614i)
17124: [11/14/2007 00:19:02] [burton] [0.002290] >+I (2frq, 8s, 2550i)
17124: [11/14/2007 00:19:02] [burton] [0.002290] >+I (2frq, 8s, 2550i)
17124: [11/14/2007 00:19:02] [burton] [0.002382] >+The (2frq, 3s, 919i)
17124: [11/14/2007 00:19:02] [burton] [0.002382] >+The (2frq, 3s, 919i)
17124: [11/14/2007 00:19:02] [burton] [0.002888] https+// (1frq, 11s, 2778i)
17124: [11/14/2007 00:19:02] [burton] [0.003946] X-Mailman-Version*2.1.5 (1frq,
29s, 5355i)
17124: [11/14/2007 00:19:02] [burton] [0.004760] >+> (60frq, 38s, 5812i)
17124: [11/14/2007 00:19:02] [burton] [0.004760] >+> (60frq, 38s, 5812i)
17124: [11/14/2007 00:19:02] [burton] [0.005031] wrote+> (2frq, 39s, 5643i)
17124: [11/14/2007 00:19:02] [burton] [0.005031] wrote+> (2frq, 39s, 5643i)
17124: [11/14/2007 00:19:02] [burton] [0.006364] https (1frq, 26s, 2970i)
17124: [11/14/2007 00:19:02] [burton] [0.006518] List-Post*<mailto (1frq, 62s,
6913i)
17124: [11/14/2007 00:19:02] [burton] [0.006790] List-Help*request (1frq, 55s,
5886i)
17124: [11/14/2007 00:19:02] [burton] [0.008260] Errors-To*bounces (1frq, 67s,
5885i)
17124: [11/14/2007 00:19:02] [burton] [0.008264] Sender*bounces (1frq, 67s,
5882i)
17124: [11/14/2007 00:19:02] [burton] [0.008533] Url*redhat (1frq, 0s, 85i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] List-Help*cluster+request
(1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] List-Post*cluster+redhat.com>
(1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634]
List-Subscribe*<https+//www.redhat.com/mailman/listinfo/linux (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] cluster+redhat (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] Subject*[Linux+cluster] (1frq,
0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] list+Linux (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] Sender*linux+cluster (1frq,
0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634]
List-Subscribe*//www.redhat.com/mailman/listinfo/linux+cluster> (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] redhat+com (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] Return-Path*cluster+bounces
(1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] Burton-Bayesian Probability: 0.000000 Samples: 27
17124: [11/14/2007 00:19:02] Result Confidence: 0.99
17124: [11/14/2007 00:19:02] BNR Decision Concurs
17124: [11/14/2007 00:19:02] total processing time: 0.08533s
17124: [11/14/2007 00:19:02] saving signature as 473a8546171241968551123
17124: [11/14/2007 00:19:02] libdspam returned probability of 0.000000
17124: [11/14/2007 00:19:02] message result: NOT SPAM
--------------------------------------------------------------

------- Comment #1 From Jacob Joseph 2007-11-19 20:24:36 0000 -------
Created an attachment (id=136430) [details]
An example message that crashes dspam

------- Comment #2 From Alin Năstac 2007-11-20 05:04:51 0000 -------
Please attach your dspam.conf. I also need the output of "equery uses
mail-filter/dspam" command. 

------- Comment #3 From Jacob Joseph 2007-11-20 21:06:26 0000 -------
I have attached my dspam.conf, as well as a (1-message) training corpus that
will crash from an empty database.  In particular, clearing the user 'jacob'
from the database, and running dspam_train on the attached tarball crashes
immediately. I am running 'dspam_train jacob t_spam t_nonspam'.

Thanks for the assistance.  Note that I have received a reply on the
dspam-users mailing list, but no resolution as of yet.  I will post here if any
is found.

# equery uses mail-filter/dspam
[ Searching for packages matching mail-filter/dspam... ]
[ Colour Code : set unset ]
[ Legend : Left column  (U) - USE flags from make.conf              ]
[        : Right column (I) - USE flags packages was installed with ]
[ Found these USE variables for mail-filter/dspam-3.8.0-r7 ]
 U I
 + + clamav        : Adds support for Clam AntiVirus software (usually with a
plugin)
 + + daemon        : Enable support for DSPAM to run in --daemon mode
 + + debug         : Enable extra debug codepaths, like asserts and extra
output. If you want to get meaningful backtraces see
http://www.gentoo.org/proj/en/qa/backtraces.xml .
 - - large-domain  : Builds for large domain rather than for domain scale
 + + ldap          : Adds LDAP support (Lightweight Directory Access Protocol)
 + + mysql         : Adds mySQL Database support
 + + postgres      : Adds support for the postgresql database
 - - sqlite        : Adds support for sqlite - embedded sql database
 - - syslog        : Enables support for syslog
 - - user-homedirs : Build with user homedir support
 - - virtual-users : Build with virtual-users support

------- Comment #4 From Jacob Joseph 2007-11-20 21:06:48 0000 -------
Created an attachment (id=136538) [details]
offending dspam.conf

------- Comment #5 From Jacob Joseph 2007-11-20 21:08:01 0000 -------
Created an attachment (id=136539) [details]
training corpus

This tarball contains a single good message.  When dspam_train is used with it,
and an empty database schema, I observe the described segmentation fault
immediately.

------- Comment #6 From Alin Năstac 2007-11-22 09:31:40 0000 -------
I am unable to reproduce.

Test environment:
  - postgres-8.0.13 installed with USE="doc kerberos nls pam perl python
readline ssl tcl test xml zlib
  - dspam-3.8.0-r7 instaled with USE="clamav daemon ldap mysql postgres sqlite"
  - setup postgresql server using emerge --config postgresql
  - start postgresql server
  - install dspam; replace dspam.conf with the attached conf
  - start dspam service
  - run dspam_train mrness t_spam t_nonspam

Result: work as expected

Output:
dspam_train mrness t_spam t_nonspam
Taking Snapshot...
mrness
    TP:     0 TN:     0 FP:     0 FN:     0 SC:     0 NC:     0
Training t_nonspam / t_spam corpora...
[test: nonspam] 1195138440.M148448P15692V0000000 result: PASS
TRAINING COMPLETE

Training Snapshot:
mrness
    TP:     0 TN:     1 FP:     0 FN:     0 SC:     0 NC:     0
    SHR:  100.00%       HSR:    0.00%       OCA:  100.00%

Overall Statistics:
mrness
    TP:     0 TN:     1 FP:     0 FN:     0 SC:     0 NC:     0
    SHR:  100.00%       HSR:    0.00%       OCA:  100.00%


Did I've missed something?

------- Comment #7 From Alin Năstac 2007-11-22 09:36:53 0000 -------
Ah, I forgot the step between "install dspam" and "start dspam":
  - configure postgresql database by running emerge --config dspam

------- Comment #8 From steveb 2007-11-28 23:35:54 0000 -------
@Jacob Joseph: Could you modify your ebuild and add the verbose debug option to
it and re-emerge DSPAM and then post the output of the new debug log?

Changes needed to be done:
                        $(use_enable syslog) \
                        $(use_enable debug) \
                        $(use_enable debug bnr-debug) \
+                       $(use_enable debug verbose-debug) \
                        --enable-long-usernames \
                        --with-dspam-group=dspam \
                        --with-dspam-home-group=dspam \

------- Comment #9 From Jacob Joseph 2007-12-03 23:22:35 0000 -------
Steve, Alin, thank you for your testing.  

I have tested further mysql on this and other machines.  I have tested with
3.8.0-r7 and 3.8.0-r8, with verbose debugging on.  With repeated compilations,
and repeated runs, it does seem that dspam, and dspam_train crash only
sometimes.  The backtrace is always as previously posted.  While the posted
training corpus sometimes does not cause a crash, some other message always
does.  For obvious reasons, I cannot post all of my training corpus to fully
illustrate this.

I will attach the verbose debug log shortly.

------- Comment #10 From Jacob Joseph 2007-12-03 23:24:15 0000 -------
Created an attachment (id=137657) [details]
verbose debug output during a crash

------- Comment #11 From Jacob Joseph 2007-12-03 23:25:10 0000 -------
Created an attachment (id=137659) [details]
Message associated with the above verbose output

Command line used:
/usr/bin/dspam --user jacob --deliver=summary --stdout <
'1195501649.M638339P24186V0000000000000302I002DB2C4_5967.jjoseph.org,S=29974:2,S'

------- Comment #12 From steveb 2007-12-04 01:13:22 0000 -------
(In reply to comment #9 #10 #11)
> 
Are you fixed on PGSql or would you consider moving to MySQL as an option?

The reason fro the crashes with MySQL are probably:
- In DSPAM you MaxMessageSize (let's abbreviate this to MMS) xxx
- In MySQL you have max_allowed_packet (let's abbreviate this to MAP) xxx
- You got a mail with total size (let's abbreviate this to TS) xxx

Now the problem is that DSPAM will create tokens and data for the message you
get. In one point it time it will insert the data into MySQL. Since one part of
the data inserted into MySQL is in binary (a blob) DSPAM will call MySQL C API
functions to escape the binary data. This C API calls need maximum double
amount of memory plus one byte then the real binary data has. Now the problem
starts if:
((size of binary data) * 2 + 1) + (space needed for the insert SQL commands) >
MAP

Another problem is if:
((MMS * 2 + 1) > MAP

This is one reason why DSPAM in the current release can crash.

Another reason is that when DSPAM does tokenize the mail it searches the
storage backend for tokens maching the new tokenized tokens. It creates a huge
SQL query to search for the tokens. The generation of the SQL query can grow
pretty big. Depending how much tokens you have for the user or any group the
user belongs to. If you have a lot of tokens DSPAM can easy create a SQL query
bigger then MAP and fail when executing that query. Unfortunately DSPAM then
closes the handle to the storage backend and becomes unusable until you restart
the daemon. Or if you don't use the client/server (or agent) mode it will just
crash the DSPAM binary and not deliver any result.

Does this sound logical to you?

I have taken some time to fix that issue in the MySQL driver. But not in
PostgreSQL. I first want to get MySQL running and then focus on PostgreSQL.

If you are not limited in what storage engine you use in DSPAM, then I would
offer you to post here a patch for MySQL driver in DSPAM. I am testing since 3
days this driver on my end and it works. But I am not 100% finished with it. If
you can wait then I would be very thankful. I promise to soon post the patch
here.

Is that okay with you?

In the mean time you could switch to MySQL and increase max_allowed_packet to
lets say 128MB or any thing above 60MB and then test again. I am very confident
that you will have less crashes. Do decrease the MaxMessageSize as well. Try to
stay with max_allowed_packet to at least MaxMessageSize * 3.

btw: The message you posted has already DSPAM headers. If you want to train
correctly, then you need to wipe out the DSPAM headers and the DSPAM signature
in the body (if you have any).

Running something like this here should clean most of the irrelevant stuff from
your corpus:
for foo in *
do
        sed -i
"/^\(X\-Quarantine\-ID:\|X\-OSBF\-Lua\-Score:\|X\-CRM114\-[a-zA-Z]*:\|X\-DKIM:\|X\-Virus\-Scanned:\|X\-Greylist:\|X\-DCC\-.*\-Metrics:\|X\-Virus\-Status:\|X\-Delivery\-Agent:\|Received\-SPF:\|X\-policyd\-weight:\|X\-Spam\-[^:]*:\)
.*$/d;/^X\-Amavis\-OS\-Fingerprint:/,+1d;/^X\-DSPAM\-Result\:/,/^X\-DSPAM\-Signature:
[0-9a-f,]*$/d;s/^Subject: \(ADV\|UNS\): /Subject: /;s/^Subject: \[SPAM\][\t
]*/Subject: /;s/^Subject: \[\(\+\|\-\)\{1,2\}\] /Subject:
/;s/\!DSPAM:[0-9]\{0,9\},\{0,1\}[0-9a-f]\{1,32\}\!//g" ${foo}
done

// SteveB

------- Comment #13 From Jacob Joseph 2007-12-04 01:29:41 0000 -------
Steve,
I do appreciate your detailed reply.  As I mentioned in my first message, all
of the tests mentioned here do not occur when mysql is used.  Indeed, I only
encountered any of this upon migrating to postgres, my preferred database. 

Without fault to dspam, I would be surprised to see that such things as the
maximum packet size would cause intermittent failure.  It sounds very much like
memory corruption to me.  Additionally, postgres reports no other error than
that the client dropped the connection.  When the dspam daemon is used, it will
continue to service 'dspam --client...' with other messages.

As for the dspam headers being present in the message already, I mean these
messages only as an example of failure I was seeing on new, incoming messages. 
I'd be happy to retest my corpus after their removal to either show this is a
bug, or eliminate it as a possibility.

My goal in posting this bug was two-fold.  First, I'd like dspam to work for
me.  Additionally, there is little point in someone else suffering similar
troubles.

I still would very much appreciate further suggestions to diagnose the problem
at hand.  I'm unsure what more can be done to reproduce it your systems.

Thanks again. ~Jacob

------- Comment #14 From Alin Năstac 2007-12-07 09:01:38 0000 -------
Unfortunatelly no one (including upstream) were able to reproduce your
segfault. Since even you don't have a safe way to reproduce the bug I can only
conclude 2 things:
  a) you are affected by some kind of race (btw, how many processors you
have?). I had in the past some bad experiences with gdb when I tried to debug
multi-threaded code, therefore I must ask you to compare gdb results with
strace.
  b) your problem is caused by an external factor. Please verify your hardware
by running a complete memtest pass.

If you don't detect a hardware problem, please give complete step-by-step
instructions for reproducing the segfault starting from an empty database. If
you cannot obtain a segfault every time you try it, give us a frequency
estimate .

------- Comment #15 From steveb 2007-12-09 17:47:40 0000 -------
(In reply to comment #0)
> GDB:
> Short story: CTX->factors->first->ptr at dspam.c:3092 does not appear to be
> initialized
> ----------------------------------------------------------------
> Starting program: /usr/bin/dspam --debug --user jacob --mode=notrain --client
> --stdout --deliver=innocent,spam <
> /home/jacob/dbg_mail-linux-cluster-bounces@redhat.com-1194995056
> [Thread debugging using libthread_db enabled]
> [New Thread -1212545344 (LWP 17038)]
> 
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread -1212545344 (LWP 17038)]
> add_xdspam_headers (CTX=0x80705e0, ATX=0xbfe51b18) at dspam.c:3092
> 3092                    snprintf(scratch, sizeof(scratch), "%s, %2.5f",
> (gdb) list
> 3087                node_ft = c_nt_first(CTX->factors, &c_ft);
> 3088                while(node_ft != NULL) {
> 3089                  struct dspam_factor *f = (struct dspam_factor *)
> node_ft->ptr;
> 3090                  if (f) {
> 3091                    strlcat(data, ",\n\t", sizeof(data));
> 3092                    snprintf(scratch, sizeof(scratch), "%s, %2.5f",
> 3093                             f->token_name, f->value);
> 3094                    strlcat(data, scratch, sizeof(data));
> 3095                  }
> 3096                  node_ft = c_nt_next(CTX->factors, &c_ft);
> (gdb) bt
> #0  add_xdspam_headers (CTX=0x80705e0, ATX=0xbfe51b18) at dspam.c:3092
> #1  0x08053442 in process_message (ATX=0xbfe51b18, message=0x8062bb8,
>     username=0x8062930 "jacob", result_string=0xbfe519d0) at dspam.c:727
> #2  0x08054295 in process_users (ATX=0xbfe51b18, message=0x8062b10)
>     at dspam.c:1797
> #3  0x08054f30 in main (argc=Cannot access memory at address 0x0
> ) at dspam.c:258
> (gdb) print CTX->factors
> $11 = (struct nt *) 0x806e688
> (gdb) print *CTX->factors
> $12 = {
>   first = 0x806e570,
>   insert = 0x616d6c69,
>   items = 1700146542,
>   nodetype = 1869181810
> }
> (gdb) print *CTX->factors->first
> $13 = {
>   ptr = 0x38,
>   next = 0x20
> }
> -----------------------------------------------------------
> 
How did you manage to get that backtrace in GDB? I can't get DSPAM to produce
me a backtrace when it crashes. How did you compile DSPAM? What CPU are you
using? Did you compiled other packages with the debug USE flag? Do you use PAM
and have you enabled core dumps in your system?

// SteveB

------- Comment #16 From steveb 2007-12-09 20:19:33 0000 -------
BTW: Could you emerge DSPAM without debug USE flag and try if your installation
works more stable then with debug USE flag?

------- Comment #17 From Jacob Joseph 2007-12-10 01:38:09 0000 -------
Steve,
I'm unsure what aspect of debugging with gdb is failing for you, but I've
configured the 3.8.0-r8 ebuild with:

./configure --prefix=/usr --host=i686-pc-linux-gnu --mandir=/usr/share/man
--infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc
--localstatedir=/var/lib --with-storage-driver=hash_drv,mysql_drv,pgsql_drv
--with-dspam-home=/var/spool/dspam --sysconfdir=/etc/mail/dspam --enable-daemon
--enable-ldap --enable-clamav --disable-large-scale --enable-domain-scale
--disable-syslog --enable-debug --enable-bnr-debug --enable-verbose-debug
--enable-long-usernames --with-dspam-group=dspam --with-dspam-home-group=dspam
--with-dspam-mode=2511 --with-logdir=/var/log/dspam --disable-virtual-users
--enable-preferences-extension --disable-homedir
--with-mysql-includes=/usr/include/mysql --with-mysql-libraries=/usr/lib/mysql
--with-pgsql-includes=/usr/include/postgresql
--with-pgsql-libraries=/usr/lib/postgresql --build=i686-pc-linux-gnu

I'm also using CFLAGS="-march=athlon-xp -O1 -g -pipe".  To be
absolutely clear, I have recompiled glibc, postgres, clamav, and
openldap with these CFLAGS.  Much of the rest of the system has
"-march=athlon-xp -03 -pipe".  You may also find it usefull to add
'nostrip' and 'noclean' to your FEATURES in /etc/make.conf.  The
latter, in particular, will keep the sources around where gdb can find them.

To get a backtrace on a message that I know causes a segfault, I
perform the following:
<delete all tokens for the user jacob from the database, begin training>
# dspam_train jacob t_spam t_nonspam  

<Wait for one message to segfault>
...
[test: spam   ] 1195501649.M366542P24186V0000000 result: sh: line 1:  8860
Segmentation fault      /usr/bin/dspam --user jacob --deliver=summary --stdout
<
't_spam/1195501649.M366542P24186V0000000000000302I002DB0F9_5508.jjoseph.org,S=24044:2,S'
BROKEN result!!
...

<Verify we still segfault, and run in gdb>
# /usr/bin/dspam --user jacob --deliver=summary --stdout <
't_spam/1195501649.M366542P24186V0000000000000302I002DB0F9_5508.jjoseph.org,S=24044:2,S'
Segmentation fault
# gdb /usr/bin/dspam
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...
Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) run --user jacob --deliver=summary --stdout <
't_spam/1195501649.M366542P24186V0000000000000302I002DB0F9_5508.jjoseph.org,S=24044:2,S'
Starting program: /usr/bin/dspam --user jacob --deliver=summary --stdout <
't_spam/1195501649.M366542P24186V0000000000000302I002DB0F9_5508.jjoseph.org,S=24044:2,S'
[Thread debugging using libthread_db enabled]
[New Thread -1212250432 (LWP 8965)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1212250432 (LWP 8965)]
add_xdspam_headers (CTX=0x807c728, ATX=0xbfc989c8) at dspam.c:3086
3086                  struct dspam_factor *f = (struct dspam_factor *)
node_ft->ptr;
(gdb) bt
#0  add_xdspam_headers (CTX=0x807c728, ATX=0xbfc989c8) at dspam.c:3086
#1  0x08053672 in process_message (ATX=0xbfc989c8, message=0x80629f8,
    username=0x8062930 "jacob", result_string=0xbfc98880) at dspam.c:724
#2  0x080544cb in process_users (ATX=0xbfc989c8, message=0x8062950) at
dspam.c:1794
#3  0x08055171 in main (argc=5, argv=0xbfc998a4) at dspam.c:255
(gdb) list
3081              if (CTX->factors != NULL) {
3082                snprintf(data, sizeof(data), "X-DSPAM-Factors: %d",
3083                         CTX->factors->items);
3084                node_ft = c_nt_first(CTX->factors, &c_ft);
3085                while(node_ft != NULL) {
3086                  struct dspam_factor *f = (struct dspam_factor *)
node_ft->ptr;
3087                  if (f) {
3088                    strlcat(data, ",\n\t", sizeof(data));
3089                    snprintf(scratch, sizeof(scratch), "%s, %2.5f",
3090                             f->token_name, f->value);
(gdb)      


Though the particular message involved in a crash seems to vary from
compile to compile, I always experience a crash on my training
corpus.  When configured without any of the debug options (but the same
CFLAGS), I am not currently able to reproduce the above segfault thus far.

------- Comment #18 From Jacob Joseph 2007-12-10 01:41:14 0000 -------
I have posted a large training corpus at
http://www.jjoseph.org/files/training_corpus.tar.bz2.  With the debug use flag,
I have never been able to train on this data set without experiencing a
reproducible segfault on a particular message.

------- Comment #19 From steveb 2007-12-10 02:37:40 0000 -------
(In reply to comment #18)
> I have posted a large training corpus at
> http://www.jjoseph.org/files/training_corpus.tar.bz2.  With the debug use flag,
> I have never been able to train on this data set without experiencing a
> reproducible segfault on a particular message.
> 
I downloaded your corpus and started training on my DSPAM 3.8.0 with my own
improved/fixed MySQL driver.

The system is not the fastest in the world:
mail ~ # cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 4
model name      : AMD Athlon (TM)
stepping        : 4
cpu MHz         : 1400.180
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips        : 2803.66
clflush size    : 32

mail ~ # free -m
             total       used       free     shared    buffers     cached
Mem:          1012        990         21          0          0        333
-/+ buffers/cache:        657        355
Swap:         1535        765        770
mail ~ #


So far no crashes with the MySQL backend:
-----------------------------------------
Training Snapshot:
jmjoseph
    TP:  6807 TN:   880 FP:    12 FN:    75 SC:    25 NC:     0
    SHR:   98.91%       HSR:    1.35%       OCA:   98.88%

Overall Statistics:
jmjoseph
    TP:  6807 TN:   880 FP:    12 FN:    75 SC:    25 NC:     0
    SHR:   98.91%       HSR:    1.35%       OCA:   98.88%

real    31m26.005s
user    0m58.616s
sys     0m44.377s
mail ~ # dspam_stats -s -H jmjoseph
jmjoseph:
                TP True Positives:                  6807
                TN True Negatives:                   880
                FP False Positives:                   12
                FN False Negatives:                   75
                SC Spam Corpusfed:                    25
                NC Nonspam Corpusfed:                  0
                TL Training Left:                      0
                SHR Spam Hit Rate                 98.91%
                HSR Ham Strike Rate:               1.35%
                PPV Positive predictive value:    99.82%
                OCA Overall Accuracy:             98.88%

mail ~ #
-----------------------------------------

Tomorrow after work I will test with DSPAM 3.8.0-r8 and PostgreSQL and one
small patch I made one week ago for PostgreSQL.

The MySQL patch I made is not small. I added 511 lines and removed 249 lines in
the MySQL driver. While I was working on the MySQL driver I have looked inside
the PostgreSQL driver to see how things are there solved. Now I know that the
PostgreSQL driver is as well not in the best shape. I need to take my time and
do the same changes to PostgreSQL diver as I did for MySQL.

------- Comment #20 From steveb 2007-12-10 20:27:14 0000 -------
Created an attachment (id=138206) [details]
dspam-3.8.0-fix_bnr_debug.patch

This patch fixes (hopefully) the crash when using --enable-bnr-debug. Have fun
:)

@Alin: If you are going to include that patch into Gentoo, could you then as
well change the ebuild to have this:

        # Debug build
        if use debug; then
                filter-flags "-fomit-frame-pointer" "-O?"
                append-flags "-O0" "-ggdb"
        fi


And this here:
        $(use_enable debug bnr-debug) \
        $(use_enable debug verbose-debug) \


--enable-verbose-debug does implicit set --enable-debug so it is not needed.
But does no harm if you do:
        $(use_enable debug) \
        $(use_enable debug bnr-debug) \
        $(use_enable debug verbose-debug) \


Can you both test this patch and enable:
--enable-debug
--enable-bnr-debug
--enable-verbose-debug

And then check if DSPAM still crashes?

// SteveB

------- Comment #21 From Jacob Joseph 2007-12-10 23:05:36 0000 -------
Steve, your patch does seem to fix my crashes.  I'll post back if a more
complete test reveals any further news.  Thank you.  ~Jacob

------- Comment #22 From steveb 2007-12-10 23:30:29 0000 -------
(In reply to comment #21)
> Steve, your patch does seem to fix my crashes.  I'll post back if a more
> complete test reveals any further news.  Thank you.  ~Jacob
> 
Perfect! You have to thank Alin. Without him I would never know any thing about
this bug. Thanks Alin!

------- Comment #23 From Alin Năstac 2007-12-11 21:03:55 0000 -------
Fixed in -r9, thanks!

I've added $(use_enable debug verbose-debug), but I didn't mess with CFLAGS as
you suggested in comment #20. AFAIK users who want to debug a certain program
have to know how to mangle CFLAGS and FEATURES at package level. I see no
reason to override user's CFLAGS.

------- Comment #24 From steveb 2007-12-12 01:12:06 0000 -------
(In reply to comment #23)
> Fixed in -r9, thanks!
> 
> I've added $(use_enable debug verbose-debug), but I didn't mess with CFLAGS as
> you suggested in comment #20. AFAIK users who want to debug a certain program
> have to know how to mangle CFLAGS and FEATURES at package level. I see no
> reason to override user's CFLAGS.
> 
I am not on your line. If some one enables debug USE flag, then I would expect
that the package gets compiled with debug options. I would understand your
intervention if DSPAM would be the only package overwriting/filtering/enforcing
certain CFLAGS with debug USE flags. But we are not.

Allow me to ask the other way around: Why USE flags at all? If some one wants
to use certain switches (which are right now provided with the help of the USE
flags) then we could expect the user to know CFLAGS, FEATURES, LDFLAGS and
friends at package level.

My question does not make sense. I know. But I think not setting the CFLAGS for
a package where the user explicitly requests "debug" in the USE flags is as
well not making sense.

What do you think?

------- Comment #25 From Alin Năstac 2007-12-12 06:20:21 0000 -------
Debug flag is never used (or shouldn't be used) as a synonym to "build the
program with -O0 -gddb and don't strip the debug info". In this case, it
activates debug logging. Yeah, not exactly intuitive, but this is how upstream
decided to implement things. If I were the author, these would have been
accomplished through command line options, not compilation defines.

Anyway, the debug USE flag reason of being has been discussed over the time on
dev ml. I distinctly remember that adding -ggdb to CFLAGS is a no-no.

------- Comment #26 From steveb 2007-12-12 06:28:08 0000 -------
(In reply to comment #25)
> Debug flag is never used (or shouldn't be used) as a synonym to "build the
> program with -O0 -gddb and don't strip the debug info". In this case, it
> activates debug logging. Yeah, not exactly intuitive, but this is how upstream
> decided to implement things. If I were the author, these would have been
> accomplished through command line options, not compilation defines.
> 
I did not say anything about "don't strip the debug info".


> Anyway, the debug USE flag reason of being has been discussed over the time on
> dev ml. I distinctly remember that adding -ggdb to CFLAGS is a no-no.
> 
Well... then I have a nice list of such "no-no's":
grep -R "\-ggdb" /usr/portage/


Anyway... it is okay.