199700 – mail-filter/dspam segfaults if postgres is used

Bug 199700 - mail-filter/dspam segfaults if postgres is used

Summary: mail-filter/dspam segfaults if postgres is used

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages (show other bugs)
Hardware:	All Linux

Importance:	High major (vote)
Assignee:	Alin Năstac (RETIRED)

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2007-11-19 20:23 UTC by Jacob Joseph
Modified:	2007-12-12 06:28 UTC (History)
CC List:	1 user (show)

See Also:
Package list:
Runtime testing required:	---

Attachments
An example message that crashes dspam (dbg_mail-linux-cluster-bounces@redhat.com-1194995056,5.54 KB, text/plain) 2007-11-19 20:24 UTC, Jacob Joseph	Details
offending dspam.conf (dspam.conf,29.97 KB, text/plain) 2007-11-20 21:06 UTC, Jacob Joseph	Details
training corpus (train.tar.gz,3.08 KB, application/octet-stream) 2007-11-20 21:08 UTC, Jacob Joseph	Details
verbose debug output during a crash (dspam.debug,110.46 KB, text/plain) 2007-12-03 23:24 UTC, Jacob Joseph	Details
Message associated with the above verbose output (1195501649.M638339P24186V0000000000000302I002DB2C4_5967.jjoseph.org,S=29974:2,S,29.27 KB, text/plain) 2007-12-03 23:25 UTC, Jacob Joseph	Details
dspam-3.8.0-fix_bnr_debug.patch (dspam-3.8.0-fix_bnr_debug.patch,450 bytes, patch) 2007-12-10 20:27 UTC, steveb	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jacob Joseph 2007-11-19 20:23:34 UTC

If the postgres driver is used, dspam segfaults on very many messages.  I initially expected a migration from mysql to postgres to be at fault, but training new users (with no information in the database) also fails.  I did not experience these crashes when the mysql driver was used.  This defect is serious enough that dspam cannot be used.  

I have attached one such crash-prone message, include a little information from gdb, and list contents of dspam.debug below.  I'd greatly appreciate any help with this issue and would be happy to assist in any way.

For the record, I am running dspam 3.8.0-r7 along with a postgres 8.0.13 server.  Note that the particular crash below occurs only when showing factors, but disabling this only delays the crash to the next time CTX->factors->first is accessed.

Note that I had originally posted this to the dspam-dev mailing list, but found the project to be inactive.

GDB:
Short story: CTX->factors->first->ptr at dspam.c:3092 does not appear to be initialized
----------------------------------------------------------------
Starting program: /usr/bin/dspam --debug --user jacob --mode=notrain --client --stdout --deliver=innocent,spam < /home/jacob/dbg_mail-linux-cluster-bounces@redhat.com-1194995056
[Thread debugging using libthread_db enabled]
[New Thread -1212545344 (LWP 17038)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1212545344 (LWP 17038)]
add_xdspam_headers (CTX=0x80705e0, ATX=0xbfe51b18) at dspam.c:3092
3092                    snprintf(scratch, sizeof(scratch), "%s, %2.5f",
(gdb) list
3087                node_ft = c_nt_first(CTX->factors, &c_ft);
3088                while(node_ft != NULL) {
3089                  struct dspam_factor *f = (struct dspam_factor *) node_ft->ptr;
3090                  if (f) {
3091                    strlcat(data, ",\n\t", sizeof(data));
3092                    snprintf(scratch, sizeof(scratch), "%s, %2.5f",
3093                             f->token_name, f->value);
3094                    strlcat(data, scratch, sizeof(data));
3095                  }
3096                  node_ft = c_nt_next(CTX->factors, &c_ft);
(gdb) bt
#0  add_xdspam_headers (CTX=0x80705e0, ATX=0xbfe51b18) at dspam.c:3092
#1  0x08053442 in process_message (ATX=0xbfe51b18, message=0x8062bb8,
    username=0x8062930 "jacob", result_string=0xbfe519d0) at dspam.c:727
#2  0x08054295 in process_users (ATX=0xbfe51b18, message=0x8062b10)
    at dspam.c:1797
#3  0x08054f30 in main (argc=Cannot access memory at address 0x0
) at dspam.c:258
(gdb) print CTX->factors
$11 = (struct nt *) 0x806e688
(gdb) print *CTX->factors
$12 = {
  first = 0x806e570,
  insert = 0x616d6c69,
  items = 1700146542,
  nodetype = 1869181810
}
(gdb) print *CTX->factors->first
$13 = {
  ptr = 0x38,
  next = 0x20
}
-----------------------------------------------------------

dspam.debug
-----------------------------------------------------------
17124: [11/14/2007 00:19:02] No QuarantineAgent option found. Using standard quarantine.
17124: [11/14/2007 00:19:02] DSPAM Instance Startup
17124: [11/14/2007 00:19:02] input args: /usr/bin/dspam --debug --user jacob --mode=notrain --client --stdout --deliver=innocent,spam
17124: [11/14/2007 00:19:02] pass-thru args:
17124: [11/14/2007 00:19:02] processing user jacob
17124: [11/14/2007 00:19:02] uid = 0, euid = 0, gid = 0, egid = 503
17124: [11/14/2007 00:19:02] loading preferences for user jacob
17124: [11/14/2007 00:19:02] Loading preferences for uid 680
17124: [11/14/2007 00:19:02] Loading preferences for uid 0
17124: [11/14/2007 00:19:02] default preferences empty. reverting to dspam.conf preferences.
17124: [11/14/2007 00:19:02] Loading preferences from dspam.conf
17124: [11/14/2007 00:19:02] using /var/spool/dspam/opt-in/local/jacob.dspam as path
17124: [11/14/2007 00:19:02] using /var/spool/dspam/opt-out/local/jacob.nodspam as path
17124: [11/14/2007 00:19:02] sedation level set to: 0
17124: [11/14/2007 00:19:02] Connecting to 127.0.0.1:3310 for virus check
17124: [11/14/2007 00:19:02] Loading 278 BNR patterns
17124: [11/14/2007 00:19:02] bnr reported snr of 6.597
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.05_0.05_0.40_ 0.01000 0s 3i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.05_0.40_0.40_ 0.01000 0s 7i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.05_0.05_0.05_ 0.01000 0s 41i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.00_0.00_0.05_ 0.01000 0s 38i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.05_0.40_0.05_ 0.01000 0s 3i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.40_0.05_0.40_ 0.01000 0s 7i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.40_0.40_0.40_ 0.01000 0s 8i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.t| 0.01000 0s 46i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.40_0.05_0.05_ 0.01000 0s 3i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.40_0.40_0.05_ 0.01000 0s 7i
17124: [11/14/2007 00:19:02] Interesting BNR Pattern: bnr.s|0.00_0.05_0.05_ 0.01000 0s 38i
17124: [11/14/2007 00:19:02] Whitelist threshold: 10
17124: [11/14/2007 00:19:02] [burton] [0.001190] is+> (1frq, 0s, 614i)
17124: [11/14/2007 00:19:02] [burton] [0.002290] >+I (2frq, 8s, 2550i)
17124: [11/14/2007 00:19:02] [burton] [0.002290] >+I (2frq, 8s, 2550i)
17124: [11/14/2007 00:19:02] [burton] [0.002382] >+The (2frq, 3s, 919i)
17124: [11/14/2007 00:19:02] [burton] [0.002382] >+The (2frq, 3s, 919i)
17124: [11/14/2007 00:19:02] [burton] [0.002888] https+// (1frq, 11s, 2778i)
17124: [11/14/2007 00:19:02] [burton] [0.003946] X-Mailman-Version*2.1.5 (1frq, 29s, 5355i)
17124: [11/14/2007 00:19:02] [burton] [0.004760] >+> (60frq, 38s, 5812i)
17124: [11/14/2007 00:19:02] [burton] [0.004760] >+> (60frq, 38s, 5812i)
17124: [11/14/2007 00:19:02] [burton] [0.005031] wrote+> (2frq, 39s, 5643i)
17124: [11/14/2007 00:19:02] [burton] [0.005031] wrote+> (2frq, 39s, 5643i)
17124: [11/14/2007 00:19:02] [burton] [0.006364] https (1frq, 26s, 2970i)
17124: [11/14/2007 00:19:02] [burton] [0.006518] List-Post*<mailto (1frq, 62s, 6913i)
17124: [11/14/2007 00:19:02] [burton] [0.006790] List-Help*request (1frq, 55s, 5886i)
17124: [11/14/2007 00:19:02] [burton] [0.008260] Errors-To*bounces (1frq, 67s, 5885i)
17124: [11/14/2007 00:19:02] [burton] [0.008264] Sender*bounces (1frq, 67s, 5882i)
17124: [11/14/2007 00:19:02] [burton] [0.008533] Url*redhat (1frq, 0s, 85i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] List-Help*cluster+request (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] List-Post*cluster+redhat.com> (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] List-Subscribe*<https+//www.redhat.com/mailman/listinfo/linux (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] cluster+redhat (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] Subject*[Linux+cluster] (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] list+Linux (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] Sender*linux+cluster (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] List-Subscribe*//www.redhat.com/mailman/listinfo/linux+cluster> (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] redhat+com (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] Return-Path*cluster+bounces (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] Burton-Bayesian Probability: 0.000000 Samples: 27
17124: [11/14/2007 00:19:02] no factors specified; using default
17124: [11/14/2007 00:19:02] Result Confidence: 0.99
17124: [11/14/2007 00:19:02] [burton] [0.001190] is+> (1frq, 0s, 614i)
17124: [11/14/2007 00:19:02] [burton] [0.002290] >+I (2frq, 8s, 2550i)
17124: [11/14/2007 00:19:02] [burton] [0.002290] >+I (2frq, 8s, 2550i)
17124: [11/14/2007 00:19:02] [burton] [0.002382] >+The (2frq, 3s, 919i)
17124: [11/14/2007 00:19:02] [burton] [0.002382] >+The (2frq, 3s, 919i)
17124: [11/14/2007 00:19:02] [burton] [0.002888] https+// (1frq, 11s, 2778i)
17124: [11/14/2007 00:19:02] [burton] [0.003946] X-Mailman-Version*2.1.5 (1frq, 29s, 5355i)
17124: [11/14/2007 00:19:02] [burton] [0.004760] >+> (60frq, 38s, 5812i)
17124: [11/14/2007 00:19:02] [burton] [0.004760] >+> (60frq, 38s, 5812i)
17124: [11/14/2007 00:19:02] [burton] [0.005031] wrote+> (2frq, 39s, 5643i)
17124: [11/14/2007 00:19:02] [burton] [0.005031] wrote+> (2frq, 39s, 5643i)
17124: [11/14/2007 00:19:02] [burton] [0.006364] https (1frq, 26s, 2970i)
17124: [11/14/2007 00:19:02] [burton] [0.006518] List-Post*<mailto (1frq, 62s, 6913i)
17124: [11/14/2007 00:19:02] [burton] [0.006790] List-Help*request (1frq, 55s, 5886i)
17124: [11/14/2007 00:19:02] [burton] [0.008260] Errors-To*bounces (1frq, 67s, 5885i)
17124: [11/14/2007 00:19:02] [burton] [0.008264] Sender*bounces (1frq, 67s, 5882i)
17124: [11/14/2007 00:19:02] [burton] [0.008533] Url*redhat (1frq, 0s, 85i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] List-Help*cluster+request (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] List-Post*cluster+redhat.com> (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] List-Subscribe*<https+//www.redhat.com/mailman/listinfo/linux (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] cluster+redhat (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] Subject*[Linux+cluster] (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] list+Linux (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] Sender*linux+cluster (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] List-Subscribe*//www.redhat.com/mailman/listinfo/linux+cluster> (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] redhat+com (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] [burton] [0.008634] Return-Path*cluster+bounces (1frq, 0s, 84i)
17124: [11/14/2007 00:19:02] Burton-Bayesian Probability: 0.000000 Samples: 27
17124: [11/14/2007 00:19:02] Result Confidence: 0.99
17124: [11/14/2007 00:19:02] BNR Decision Concurs
17124: [11/14/2007 00:19:02] total processing time: 0.08533s
17124: [11/14/2007 00:19:02] saving signature as 473a8546171241968551123
17124: [11/14/2007 00:19:02] libdspam returned probability of 0.000000
17124: [11/14/2007 00:19:02] message result: NOT SPAM
--------------------------------------------------------------

Comment 1 Jacob Joseph 2007-11-19 20:24:36 UTC

Created attachment 136430 [details]
An example message that crashes dspam

Comment 2 Alin Năstac (RETIRED) gentoo-dev

2007-11-20 05:04:51 UTC

Please attach your dspam.conf. I also need the output of "equery uses mail-filter/dspam" command.

Comment 3 Jacob Joseph 2007-11-20 21:06:26 UTC

I have attached my dspam.conf, as well as a (1-message) training corpus that will crash from an empty database. In particular, clearing the user 'jacob' from the database, and running dspam_train on the attached tarball crashes immediately. I am running 'dspam_train jacob t_spam t_nonspam'.

Thanks for the assistance. Note that I have received a reply on the dspam-users mailing list, but no resolution as of yet. I will post here if any is found.

# equery uses mail-filter/dspam
[ Searching for packages matching mail-filter/dspam... ]
[ Colour Code : set unset ]
[ Legend : Left column (U) - USE flags from make.conf ]
[ : Right column (I) - USE flags packages was installed with ]
[ Found these USE variables for mail-filter/dspam-3.8.0-r7 ]
U I
+ + clamav : Adds support for Clam AntiVirus software (usually with a plugin)
+ + daemon : Enable support for DSPAM to run in --daemon mode
+ + debug : Enable extra debug codepaths, like asserts and extra output. If you want to get meaningful backtraces see http://www.gentoo.org/proj/en/qa/backtraces.xml .
- - large-domain : Builds for large domain rather than for domain scale
+ + ldap : Adds LDAP support (Lightweight Directory Access Protocol)
+ + mysql : Adds mySQL Database support
+ + postgres : Adds support for the postgresql database
- - sqlite : Adds support for sqlite - embedded sql database
- - syslog : Enables support for syslog
- - user-homedirs : Build with user homedir support
- - virtual-users : Build with virtual-users support

Comment 4 Jacob Joseph 2007-11-20 21:06:48 UTC

Created attachment 136538 [details]
offending dspam.conf

Comment 5 Jacob Joseph 2007-11-20 21:08:01 UTC

Created attachment 136539 [details]
training corpus

This tarball contains a single good message.  When dspam_train is used with it, and an empty database schema, I observe the described segmentation fault immediately.

Comment 6 Alin Năstac (RETIRED) gentoo-dev

2007-11-22 09:31:40 UTC

I am unable to reproduce.

Test environment:
  - postgres-8.0.13 installed with USE="doc kerberos nls pam perl python readline ssl tcl test xml zlib
  - dspam-3.8.0-r7 instaled with USE="clamav daemon ldap mysql postgres sqlite"
  - setup postgresql server using emerge --config postgresql
  - start postgresql server
  - install dspam; replace dspam.conf with the attached conf
  - start dspam service
  - run dspam_train mrness t_spam t_nonspam

Result: work as expected

Output:
dspam_train mrness t_spam t_nonspam
Taking Snapshot...
mrness
    TP:     0 TN:     0 FP:     0 FN:     0 SC:     0 NC:     0
Training t_nonspam / t_spam corpora...
[test: nonspam] 1195138440.M148448P15692V0000000 result: PASS
TRAINING COMPLETE

Training Snapshot:
mrness
    TP:     0 TN:     1 FP:     0 FN:     0 SC:     0 NC:     0
    SHR:  100.00%       HSR:    0.00%       OCA:  100.00%

Overall Statistics:
mrness
    TP:     0 TN:     1 FP:     0 FN:     0 SC:     0 NC:     0
    SHR:  100.00%       HSR:    0.00%       OCA:  100.00%


Did I've missed something?

Comment 7 Alin Năstac (RETIRED) gentoo-dev

2007-11-22 09:36:53 UTC

Ah, I forgot the step between "install dspam" and "start dspam":
  - configure postgresql database by running emerge --config dspam

Comment 8 steveb 2007-11-28 23:35:54 UTC

@Jacob Joseph: Could you modify your ebuild and add the verbose debug option to it and re-emerge DSPAM and then post the output of the new debug log?

Changes needed to be done:
                        $(use_enable syslog) \
                        $(use_enable debug) \
                        $(use_enable debug bnr-debug) \
+                       $(use_enable debug verbose-debug) \
                        --enable-long-usernames \
                        --with-dspam-group=dspam \
                        --with-dspam-home-group=dspam \

Comment 9 Jacob Joseph 2007-12-03 23:22:35 UTC

Steve, Alin, thank you for your testing.  

I have tested further mysql on this and other machines.  I have tested with 3.8.0-r7 and 3.8.0-r8, with verbose debugging on.  With repeated compilations, and repeated runs, it does seem that dspam, and dspam_train crash only sometimes.  The backtrace is always as previously posted.  While the posted training corpus sometimes does not cause a crash, some other message always does.  For obvious reasons, I cannot post all of my training corpus to fully illustrate this.

I will attach the verbose debug log shortly.

Comment 10 Jacob Joseph 2007-12-03 23:24:15 UTC

Created attachment 137657 [details]
verbose debug output during a crash

Comment 11 Jacob Joseph 2007-12-03 23:25:10 UTC

Created attachment 137659 [details]
Message associated with the above verbose output

Command line used:
/usr/bin/dspam --user jacob --deliver=summary --stdout < '1195501649.M638339P24186V0000000000000302I002DB2C4_5967.jjoseph.org,S=29974:2,S'

Comment 12 steveb 2007-12-04 01:13:22 UTC

(In reply to comment #9 #10 #11)
> 
Are you fixed on PGSql or would you consider moving to MySQL as an option?

The reason fro the crashes with MySQL are probably:
- In DSPAM you MaxMessageSize (let's abbreviate this to MMS) xxx
- In MySQL you have max_allowed_packet (let's abbreviate this to MAP) xxx
- You got a mail with total size (let's abbreviate this to TS) xxx

Now the problem is that DSPAM will create tokens and data for the message you get. In one point it time it will insert the data into MySQL. Since one part of the data inserted into MySQL is in binary (a blob) DSPAM will call MySQL C API functions to escape the binary data. This C API calls need maximum double amount of memory plus one byte then the real binary data has. Now the problem starts if:
((size of binary data) * 2 + 1) + (space needed for the insert SQL commands) > MAP

Another problem is if:
((MMS * 2 + 1) > MAP

This is one reason why DSPAM in the current release can crash.

Another reason is that when DSPAM does tokenize the mail it searches the storage backend for tokens maching the new tokenized tokens. It creates a huge SQL query to search for the tokens. The generation of the SQL query can grow pretty big. Depending how much tokens you have for the user or any group the user belongs to. If you have a lot of tokens DSPAM can easy create a SQL query bigger then MAP and fail when executing that query. Unfortunately DSPAM then closes the handle to the storage backend and becomes unusable until you restart the daemon. Or if you don't use the client/server (or agent) mode it will just crash the DSPAM binary and not deliver any result.

Does this sound logical to you?

I have taken some time to fix that issue in the MySQL driver. But not in PostgreSQL. I first want to get MySQL running and then focus on PostgreSQL.

If you are not limited in what storage engine you use in DSPAM, then I would offer you to post here a patch for MySQL driver in DSPAM. I am testing since 3 days this driver on my end and it works. But I am not 100% finished with it. If you can wait then I would be very thankful. I promise to soon post the patch here.

Is that okay with you?

In the mean time you could switch to MySQL and increase max_allowed_packet to lets say 128MB or any thing above 60MB and then test again. I am very confident that you will have less crashes. Do decrease the MaxMessageSize as well. Try to stay with max_allowed_packet to at least MaxMessageSize * 3.

btw: The message you posted has already DSPAM headers. If you want to train correctly, then you need to wipe out the DSPAM headers and the DSPAM signature in the body (if you have any).

Running something like this here should clean most of the irrelevant stuff from your corpus:
for foo in *
do
        sed -i "/^\(X\-Quarantine\-ID:\|X\-OSBF\-Lua\-Score:\|X\-CRM114\-[a-zA-Z]*:\|X\-DKIM:\|X\-Virus\-Scanned:\|X\-Greylist:\|X\-DCC\-.*\-Metrics:\|X\-Virus\-Status:\|X\-Delivery\-Agent:\|Received\-SPF:\|X\-policyd\-weight:\|X\-Spam\-[^:]*:\) .*$/d;/^X\-Amavis\-OS\-Fingerprint:/,+1d;/^X\-DSPAM\-Result\:/,/^X\-DSPAM\-Signature: [0-9a-f,]*$/d;s/^Subject: \(ADV\|UNS\): /Subject: /;s/^Subject: \[SPAM\][\t ]*/Subject: /;s/^Subject: \[\(\+\|\-\)\{1,2\}\] /Subject: /;s/\!DSPAM:[0-9]\{0,9\},\{0,1\}[0-9a-f]\{1,32\}\!//g" ${foo}
done

// SteveB

Comment 13 Jacob Joseph 2007-12-04 01:29:41 UTC

Steve,
I do appreciate your detailed reply.  As I mentioned in my first message, all of the tests mentioned here do not occur when mysql is used.  Indeed, I only encountered any of this upon migrating to postgres, my preferred database. 

Without fault to dspam, I would be surprised to see that such things as the maximum packet size would cause intermittent failure.  It sounds very much like memory corruption to me.  Additionally, postgres reports no other error than that the client dropped the connection.  When the dspam daemon is used, it will continue to service 'dspam --client...' with other messages.

As for the dspam headers being present in the message already, I mean these messages only as an example of failure I was seeing on new, incoming messages.  I'd be happy to retest my corpus after their removal to either show this is a bug, or eliminate it as a possibility.

My goal in posting this bug was two-fold.  First, I'd like dspam to work for me.  Additionally, there is little point in someone else suffering similar troubles.

I still would very much appreciate further suggestions to diagnose the problem at hand.  I'm unsure what more can be done to reproduce it your systems.

Thanks again. ~Jacob

Comment 14 Alin Năstac (RETIRED) gentoo-dev

2007-12-07 09:01:38 UTC

Unfortunatelly no one (including upstream) were able to reproduce your segfault. Since even you don't have a safe way to reproduce the bug I can only conclude 2 things:
  a) you are affected by some kind of race (btw, how many processors you have?). I had in the past some bad experiences with gdb when I tried to debug multi-threaded code, therefore I must ask you to compare gdb results with strace.
  b) your problem is caused by an external factor. Please verify your hardware by running a complete memtest pass.

If you don't detect a hardware problem, please give complete step-by-step instructions for reproducing the segfault starting from an empty database. If you cannot obtain a segfault every time you try it, give us a frequency estimate .

Comment 15 steveb 2007-12-09 17:47:40 UTC

(In reply to comment #0)
> GDB:
> Short story: CTX->factors->first->ptr at dspam.c:3092 does not appear to be
> initialized
> ----------------------------------------------------------------
> Starting program: /usr/bin/dspam --debug --user jacob --mode=notrain --client
> --stdout --deliver=innocent,spam <
> /home/jacob/dbg_mail-linux-cluster-bounces@redhat.com-1194995056
> [Thread debugging using libthread_db enabled]
> [New Thread -1212545344 (LWP 17038)]
> 
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread -1212545344 (LWP 17038)]
> add_xdspam_headers (CTX=0x80705e0, ATX=0xbfe51b18) at dspam.c:3092
> 3092                    snprintf(scratch, sizeof(scratch), "%s, %2.5f",
> (gdb) list
> 3087                node_ft = c_nt_first(CTX->factors, &c_ft);
> 3088                while(node_ft != NULL) {
> 3089                  struct dspam_factor *f = (struct dspam_factor *)
> node_ft->ptr;
> 3090                  if (f) {
> 3091                    strlcat(data, ",\n\t", sizeof(data));
> 3092                    snprintf(scratch, sizeof(scratch), "%s, %2.5f",
> 3093                             f->token_name, f->value);
> 3094                    strlcat(data, scratch, sizeof(data));
> 3095                  }
> 3096                  node_ft = c_nt_next(CTX->factors, &c_ft);
> (gdb) bt
> #0  add_xdspam_headers (CTX=0x80705e0, ATX=0xbfe51b18) at dspam.c:3092
> #1  0x08053442 in process_message (ATX=0xbfe51b18, message=0x8062bb8,
>     username=0x8062930 "jacob", result_string=0xbfe519d0) at dspam.c:727
> #2  0x08054295 in process_users (ATX=0xbfe51b18, message=0x8062b10)
>     at dspam.c:1797
> #3  0x08054f30 in main (argc=Cannot access memory at address 0x0
> ) at dspam.c:258
> (gdb) print CTX->factors
> $11 = (struct nt *) 0x806e688
> (gdb) print *CTX->factors
> $12 = {
>   first = 0x806e570,
>   insert = 0x616d6c69,
>   items = 1700146542,
>   nodetype = 1869181810
> }
> (gdb) print *CTX->factors->first
> $13 = {
>   ptr = 0x38,
>   next = 0x20
> }
> -----------------------------------------------------------
> 
How did you manage to get that backtrace in GDB? I can't get DSPAM to produce me a backtrace when it crashes. How did you compile DSPAM? What CPU are you using? Did you compiled other packages with the debug USE flag? Do you use PAM and have you enabled core dumps in your system?

// SteveB

Comment 16 steveb 2007-12-09 20:19:33 UTC

BTW: Could you emerge DSPAM without debug USE flag and try if your installation works more stable then with debug USE flag?

Comment 17 Jacob Joseph 2007-12-10 01:38:09 UTC

Steve,
I'm unsure what aspect of debugging with gdb is failing for you, but I've
configured the 3.8.0-r8 ebuild with:

./configure --prefix=/usr --host=i686-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib --with-storage-driver=hash_drv,mysql_drv,pgsql_drv --with-dspam-home=/var/spool/dspam --sysconfdir=/etc/mail/dspam --enable-daemon --enable-ldap --enable-clamav --disable-large-scale --enable-domain-scale --disable-syslog --enable-debug --enable-bnr-debug --enable-verbose-debug --enable-long-usernames --with-dspam-group=dspam --with-dspam-home-group=dspam --with-dspam-mode=2511 --with-logdir=/var/log/dspam --disable-virtual-users --enable-preferences-extension --disable-homedir --with-mysql-includes=/usr/include/mysql --with-mysql-libraries=/usr/lib/mysql --with-pgsql-includes=/usr/include/postgresql --with-pgsql-libraries=/usr/lib/postgresql --build=i686-pc-linux-gnu

I'm also using CFLAGS="-march=athlon-xp -O1 -g -pipe".  To be
absolutely clear, I have recompiled glibc, postgres, clamav, and
openldap with these CFLAGS.  Much of the rest of the system has
"-march=athlon-xp -03 -pipe".  You may also find it usefull to add
'nostrip' and 'noclean' to your FEATURES in /etc/make.conf.  The
latter, in particular, will keep the sources around where gdb can find them.

To get a backtrace on a message that I know causes a segfault, I
perform the following:
<delete all tokens for the user jacob from the database, begin training>
# dspam_train jacob t_spam t_nonspam  

<Wait for one message to segfault>
...
[test: spam   ] 1195501649.M366542P24186V0000000 result: sh: line 1:  8860 Segmentation fault      /usr/bin/dspam --user jacob --deliver=summary --stdout < 't_spam/1195501649.M366542P24186V0000000000000302I002DB0F9_5508.jjoseph.org,S=24044:2,S'
BROKEN result!!
...

<Verify we still segfault, and run in gdb>
# /usr/bin/dspam --user jacob --deliver=summary --stdout < 't_spam/1195501649.M366542P24186V0000000000000302I002DB0F9_5508.jjoseph.org,S=24044:2,S'
Segmentation fault
# gdb /usr/bin/dspam
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...
Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) run --user jacob --deliver=summary --stdout < 't_spam/1195501649.M366542P24186V0000000000000302I002DB0F9_5508.jjoseph.org,S=24044:2,S'
Starting program: /usr/bin/dspam --user jacob --deliver=summary --stdout < 't_spam/1195501649.M366542P24186V0000000000000302I002DB0F9_5508.jjoseph.org,S=24044:2,S'
[Thread debugging using libthread_db enabled]
[New Thread -1212250432 (LWP 8965)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1212250432 (LWP 8965)]
add_xdspam_headers (CTX=0x807c728, ATX=0xbfc989c8) at dspam.c:3086
3086                  struct dspam_factor *f = (struct dspam_factor *) node_ft->ptr;
(gdb) bt
#0  add_xdspam_headers (CTX=0x807c728, ATX=0xbfc989c8) at dspam.c:3086
#1  0x08053672 in process_message (ATX=0xbfc989c8, message=0x80629f8,
    username=0x8062930 "jacob", result_string=0xbfc98880) at dspam.c:724
#2  0x080544cb in process_users (ATX=0xbfc989c8, message=0x8062950) at dspam.c:1794
#3  0x08055171 in main (argc=5, argv=0xbfc998a4) at dspam.c:255
(gdb) list
3081              if (CTX->factors != NULL) {
3082                snprintf(data, sizeof(data), "X-DSPAM-Factors: %d",
3083                         CTX->factors->items);
3084                node_ft = c_nt_first(CTX->factors, &c_ft);
3085                while(node_ft != NULL) {
3086                  struct dspam_factor *f = (struct dspam_factor *) node_ft->ptr;
3087                  if (f) {
3088                    strlcat(data, ",\n\t", sizeof(data));
3089                    snprintf(scratch, sizeof(scratch), "%s, %2.5f",
3090                             f->token_name, f->value);
(gdb)      


Though the particular message involved in a crash seems to vary from
compile to compile, I always experience a crash on my training
corpus.  When configured without any of the debug options (but the same
CFLAGS), I am not currently able to reproduce the above segfault thus far.

Comment 18 Jacob Joseph 2007-12-10 01:41:14 UTC

I have posted a large training corpus at http://www.jjoseph.org/files/training_corpus.tar.bz2.  With the debug use flag, I have never been able to train on this data set without experiencing a reproducible segfault on a particular message.

Comment 19 steveb 2007-12-10 02:37:40 UTC

(In reply to comment #18)
> I have posted a large training corpus at
> http://www.jjoseph.org/files/training_corpus.tar.bz2.  With the debug use flag,
> I have never been able to train on this data set without experiencing a
> reproducible segfault on a particular message.
> 
I downloaded your corpus and started training on my DSPAM 3.8.0 with my own improved/fixed MySQL driver.

The system is not the fastest in the world:
mail ~ # cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 4
model name      : AMD Athlon (TM)
stepping        : 4
cpu MHz         : 1400.180
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips        : 2803.66
clflush size    : 32

mail ~ # free -m
             total       used       free     shared    buffers     cached
Mem:          1012        990         21          0          0        333
-/+ buffers/cache:        657        355
Swap:         1535        765        770
mail ~ #


So far no crashes with the MySQL backend:
-----------------------------------------
Training Snapshot:
jmjoseph
    TP:  6807 TN:   880 FP:    12 FN:    75 SC:    25 NC:     0
    SHR:   98.91%       HSR:    1.35%       OCA:   98.88%

Overall Statistics:
jmjoseph
    TP:  6807 TN:   880 FP:    12 FN:    75 SC:    25 NC:     0
    SHR:   98.91%       HSR:    1.35%       OCA:   98.88%

real    31m26.005s
user    0m58.616s
sys     0m44.377s
mail ~ # dspam_stats -s -H jmjoseph
jmjoseph:
                TP True Positives:                  6807
                TN True Negatives:                   880
                FP False Positives:                   12
                FN False Negatives:                   75
                SC Spam Corpusfed:                    25
                NC Nonspam Corpusfed:                  0
                TL Training Left:                      0
                SHR Spam Hit Rate                 98.91%
                HSR Ham Strike Rate:               1.35%
                PPV Positive predictive value:    99.82%
                OCA Overall Accuracy:             98.88%

mail ~ #
-----------------------------------------

Tomorrow after work I will test with DSPAM 3.8.0-r8 and PostgreSQL and one small patch I made one week ago for PostgreSQL.

The MySQL patch I made is not small. I added 511 lines and removed 249 lines in the MySQL driver. While I was working on the MySQL driver I have looked inside the PostgreSQL driver to see how things are there solved. Now I know that the PostgreSQL driver is as well not in the best shape. I need to take my time and do the same changes to PostgreSQL diver as I did for MySQL.

Comment 20 steveb 2007-12-10 20:27:14 UTC

Created attachment 138206 [details, diff]
dspam-3.8.0-fix_bnr_debug.patch

This patch fixes (hopefully) the crash when using --enable-bnr-debug. Have fun :)

@Alin: If you are going to include that patch into Gentoo, could you then as well change the ebuild to have this:

	# Debug build
	if use debug; then
		filter-flags "-fomit-frame-pointer" "-O?"
		append-flags "-O0" "-ggdb"
	fi


And this here:
	$(use_enable debug bnr-debug) \
	$(use_enable debug verbose-debug) \


--enable-verbose-debug does implicit set --enable-debug so it is not needed. But does no harm if you do:
	$(use_enable debug) \
	$(use_enable debug bnr-debug) \
	$(use_enable debug verbose-debug) \


Can you both test this patch and enable:
--enable-debug
--enable-bnr-debug
--enable-verbose-debug

And then check if DSPAM still crashes?

// SteveB

Comment 21 Jacob Joseph 2007-12-10 23:05:36 UTC

Steve, your patch does seem to fix my crashes.  I'll post back if a more complete test reveals any further news.  Thank you.  ~Jacob

Comment 22 steveb 2007-12-10 23:30:29 UTC

(In reply to comment #21)
> Steve, your patch does seem to fix my crashes.  I'll post back if a more
> complete test reveals any further news.  Thank you.  ~Jacob
> 
Perfect! You have to thank Alin. Without him I would never know any thing about this bug. Thanks Alin!

Comment 23 Alin Năstac (RETIRED) gentoo-dev

2007-12-11 21:03:55 UTC

Fixed in -r9, thanks!

I've added $(use_enable debug verbose-debug), but I didn't mess with CFLAGS as you suggested in comment #20. AFAIK users who want to debug a certain program have to know how to mangle CFLAGS and FEATURES at package level. I see no reason to override user's CFLAGS.

Comment 24 steveb 2007-12-12 01:12:06 UTC

(In reply to comment #23)
> Fixed in -r9, thanks!
> 
> I've added $(use_enable debug verbose-debug), but I didn't mess with CFLAGS as
> you suggested in comment #20. AFAIK users who want to debug a certain program
> have to know how to mangle CFLAGS and FEATURES at package level. I see no
> reason to override user's CFLAGS.
> 
I am not on your line. If some one enables debug USE flag, then I would expect that the package gets compiled with debug options. I would understand your intervention if DSPAM would be the only package overwriting/filtering/enforcing certain CFLAGS with debug USE flags. But we are not.

Allow me to ask the other way around: Why USE flags at all? If some one wants to use certain switches (which are right now provided with the help of the USE flags) then we could expect the user to know CFLAGS, FEATURES, LDFLAGS and friends at package level.

My question does not make sense. I know. But I think not setting the CFLAGS for a package where the user explicitly requests "debug" in the USE flags is as well not making sense.

What do you think?

Comment 25 Alin Năstac (RETIRED) gentoo-dev

2007-12-12 06:20:21 UTC

Debug flag is never used (or shouldn't be used) as a synonym to "build the program with -O0 -gddb and don't strip the debug info". In this case, it activates debug logging. Yeah, not exactly intuitive, but this is how upstream decided to implement things. If I were the author, these would have been accomplished through command line options, not compilation defines.

Anyway, the debug USE flag reason of being has been discussed over the time on dev ml. I distinctly remember that adding -ggdb to CFLAGS is a no-no.

Comment 26 steveb 2007-12-12 06:28:08 UTC

(In reply to comment #25)
> Debug flag is never used (or shouldn't be used) as a synonym to "build the
> program with -O0 -gddb and don't strip the debug info". In this case, it
> activates debug logging. Yeah, not exactly intuitive, but this is how upstream
> decided to implement things. If I were the author, these would have been
> accomplished through command line options, not compilation defines.
> 
I did not say anything about "don't strip the debug info".


> Anyway, the debug USE flag reason of being has been discussed over the time on
> dev ml. I distinctly remember that adding -ggdb to CFLAGS is a no-no.
> 
Well... then I have a nice list of such "no-no's":
grep -R "\-ggdb" /usr/portage/


Anyway... it is okay.