207943 – dev-db/firebird-2.0.3.12981.0-r2: server does not respond any more, after some weeks uptime

Status:	RESOLVED INVALID

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	New packages
Hardware:	x86 Linux

Importance:	High normal
Assignee:	William L. Thomson Jr. (RETIRED)

Reported:	2008-01-28 18:29 UTC by Matthias Hanft
Modified:	2008-04-10 07:30 UTC
CC List:	0 users

Description Matthias Hanft 2008-01-28 18:29:47 UTC

After the Firebird superserver has been running for some weeks, it stops responding to client requests: the client can still connect to port 3050 on the server, but the server sends no data at all, and the client hangs forever. Nothing unusual is written to the log file, and ps -ef looks the same as always (one fbguard, some fbserver).

In this situation, "/etc/init.d/firebird stop" hangs forever, too. The "solution" is to "killall fbserver", "/etc/init.d/firebird zap", and "/etc/init.d/firebird start".

This always happens about 4..6 weeks after startup. With the former version (the one which was located in /opt), this never happened, not even after half a year of uptime or so.

I know that this kind of bug is very hard to trace - and I'm ready to do any additional debugging you may request (as long as you keep in mind that it is a "production server" - this means, any downtime must not be longer than a few minutes).

Reproducible: Always

Steps to Reproduce:
1. /etc/init.d/firebird start
2. let some client interact with the server for some weeks
3. after 4..6 weeks, client hangs forever because server sends no responses any more

Actual Results:  
Client is hanging forever.

Expected Results:  
Client gets responses from the server.

I used "Severity: Critical" for this bug because its explanation "The software hangs" is exactly true. In fact, "hanging" every six weeks is only "critical" every six weeks - and not at all in the meantime... :-)

Comment 1 Auke Booij (tulcod) 2008-01-28 18:55:10 UTC

Some quick calculations showed that if time were stored in milliseconds, it would take just over 7 weeks to overflow an unsigned 32-bit int. Maybe this has to do with the bug.
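
The arithmetic behind this estimate can be sketched as follows. This is only an illustration: the idea that some counter holds milliseconds in a 32-bit integer is the commenter's hypothesis, not something confirmed for Firebird, and the function name is made up for the example.

```python
MS_PER_WEEK = 7 * 24 * 60 * 60 * 1000  # milliseconds in one week


def weeks_until_overflow(bits: int, signed: bool) -> float:
    """Weeks until a millisecond counter of the given width wraps around."""
    # Hypothetical counter: a signed type loses one bit to the sign.
    usable_bits = bits - 1 if signed else bits
    return (2 ** usable_bits) / MS_PER_WEEK


print(f"unsigned 32-bit: {weeks_until_overflow(32, signed=False):.2f} weeks")
print(f"signed 32-bit:   {weeks_until_overflow(32, signed=True):.2f} weeks")
```

An unsigned 32-bit counter wraps after roughly 7.1 weeks and a signed one after roughly 3.55 weeks, so neither lines up exactly with the reported 4..6 weeks.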

Comment 2 William L. Thomson Jr. (RETIRED) gentoo-dev 2008-01-28 19:08:13 UTC

This is more than likely something caused by your application rather than a bug specific to Firebird on Gentoo. I have also run this in production for the last ~7 years, which is why I now maintain it ;)

You might want to try -r4, as it drops the hard-coded cflags in favor of user-specified ones. Otherwise you will likely need to do some analysis on the db to see whether it's db or engine related, e.g. what the difference is between the oldest and newest transactions, etc.

I am aware of some issues with 2.1.x, but 2.0.3 should be pretty solid. There is a pretty decent-sized user base running it. I would imagine bugs like this would have been reported by now, considering -r2 has been in the tree since October 1st and went stable in November.

Based on your 4-6 week timeline for the problem, you should have run into this at least 3-4 times in that period. Yet you are only reporting it now? It seems like there is other info I am not privy to.

When did you start running that version of Firebird? What type of client application is it: third party, or developed in house? How are you interfacing with Firebird?

It really seems like either a problem with the application, with what it's doing to the db, or maybe a bug in the engine itself. But moving it from /opt would not cause a bug like that. Maybe cflags, if it's not client related.

Comment 3 Matthias Hanft 2008-01-28 19:39:14 UTC

Thank you for your comments so far. Some additional information:
- I have been running this Firebird server version since it became available as an ebuild (that is, when I did "emerge --sync" and "emerge -pv world" and discovered the new version, which I installed some days later). I usually check for updates once a week or so. (Sorry, I can't remember the Firebird version which was installed before. Most likely the ebuild before the current version.)
- I really _had_ this bug 3 to 4 times since then.
- I didn't report earlier because I didn't know about this bug tracking system yet. (A user in the firebird-support Yahoo group pointed me to it just today.)
- The client applications are Delphi applications developed in house, which run on some Windows computers within the in-house LAN.
- The client applications haven't changed for years.
- There are about 10 different client applications which use about 10 different databases on the server.
- When the bug appears, _all_ client applications hang, using _any_ database.
- In "normal mode", when you just "telnet server 3050" and type some letters, the server closes the connection immediately.
- In "bug mode", you can type as much as you want, the server leaves the connection open (and doesn't react in any other way either).
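
The telnet check described in the last two bullets could be automated roughly like this. It is a sketch only: the function name `probe`, the payload, and the timeout are illustrative choices, and it says nothing about why the server stops answering.

```python
import socket


def probe(host: str, port: int = 3050, payload: bytes = b"hello\r\n",
          timeout: float = 5.0) -> str:
    """Send a few junk bytes and report how the peer reacts.

    Returns "closed" if the peer drops the connection (the "normal mode"
    described above), "silent" if nothing comes back before the timeout
    (the "bug mode"), or "data" if the peer actually answers with bytes.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        try:
            sock.sendall(payload)
            data = sock.recv(1024)
        except socket.timeout:
            return "silent"   # connection stays open, no reaction at all
        except ConnectionError:
            return "closed"   # reset or broken pipe also counts as closed
        # An empty read means the peer closed the connection cleanly.
        return "data" if data else "closed"
```

Run from cron, something like `probe("server")` returning "silent" would flag the hang before the client applications notice it; "server" here is just the placeholder host name from the telnet example.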

Comment 4 William L. Thomson Jr. (RETIRED) gentoo-dev 2008-01-28 19:55:50 UTC

(In reply to comment #3)
>
> - The client applications haven't changed for years.

That right there likely explains your problems. Did you test the app with 2.0.x before upgrading production machines? Apps that connect directly to the engine like that will need to be modified at some point.

I have had a few people contact me directly about 1.5.x, because their app doesn't work right with 2.0.x. I have considered adding that version back to the tree.

I would recommend testing a binary version from upstream. If you run into the same problems, then your app needs to be updated. If not, then it is possibly a bug in the engine and/or something Gentoo specific.

Have the client-side libraries been updated at all? That might be part of the problem as well, if there are any .dlls etc. on Windows belonging to Firebird. I am not very familiar with Delphi, but I believe it connects directly via the Firebird C API.

Comment 5 Matthias Hanft 2008-01-28 20:55:18 UTC

The version just before the current one was already a 2.0.x version (maybe 2.0.1? Is there an "ebuild history" somewhere?). I was very careful when I went from 1.5.x to 2.0.x some time before: I rebuilt all applications as a precaution, updated the client library, used gbak to update the ODS, switched to dialect 3 where applicable, and so on. Everything was fine with 2.0.x (probably 2.0.1?) for a looong time. Then the _one_ update came to the current version 2.0.3 (including the switch from /opt to /usr), and from then on, bug 207932 appeared, as well as that instability after some weeks uptime...

Comment 6 William L. Thomson Jr. (RETIRED) gentoo-dev 2008-01-28 21:05:08 UTC

Again try -r4. If you look at the sed in the newer ebuilds you will see I am just changing paths. That's not likely to cause the instability you are experiencing. It has to be for other reasons. cflags might make more sense, but that has always been there or that way. Could be a system library or something else it's compiled against.

Unless I missed a path somewhere, but that is something that would cause initial runtime issues rather than an issue that only occurs after a particular client application has run for a long time. I don't believe splitting up Firebird is to blame for this problem.

Again, I still need stats details from the database itself to say for sure. They don't even have to be from the problem period; stats taken after a week or two has gone by would do. If the gap between new and old is too big, there could be a problem there. Plus other things, like how many processes do you end up with? CPU load, if any? UDFs in use? You really have to look at everything. It is very unlikely that changing paths has anything to do with this problem.

What does your firebird.log file look like? Are there any client disconnection errors where transactions might not be properly closed out?

Finally, if you use this stuff in production, I highly recommend you run a development or spare machine as ~arch, so you can run and test upcoming versions and provide feedback on stability issues before they are stabilized.

Comment 7 Matthias Hanft 2008-01-29 17:31:35 UTC

(In reply to comment #6)
> Again try -r4.

I'd like to, but how? (I'm not a portage guru.) In "normal state", I get
*  dev-db/firebird
      Latest version available: 2.0.3.12981.0-r2
      Latest version installed: 2.0.3.12981.0-r2
so there is nothing to update; and when I put 'ACCEPT_KEYWORDS="~x86"' into /etc/make.conf and look at "emerge -pv world", I get already 2.1:
[ebuild     U ] dev-db/firebird-2.1.0.16780_beta2-r3 [2.0.3.12981.0-r2] USE="doc examples -debug -xinetd" 13,191 kB

But a file /usr/portage/dev-db/firebird/firebird-2.0.3.12981.0-r4.ebuild _does_ exist - so it seems r4 is somewhere, somewhen, somehow...

> Again still need stats details from the database itself to say for sure.

There are several databases, but the most heavily used one generates this:
        Generation              906672
        Page size               4096
        ODS version             11.0
        Oldest transaction      906635
        Oldest active           906636
        Oldest snapshot         906636
        Next transaction        906666
        Bumped transaction      1
        Sequence number         0
        Next attachment ID      0
        Implementation ID       19
        Shadow count            0
        Page buffers            0
        Next header page        0
        Database dialect        3
        Creation date           Sep 23, 2007 10:03:05
        Attributes              force write
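
A minimal sketch of how the "gap between new and old" asked for in comment #6 could be read off this header output. Interpreting that gap as "Next transaction" minus "Oldest transaction" is an assumption about what was meant, and the helper name `parse_header` is made up for the example; only the four transaction counters from the output above are used.

```python
import re

# Transaction counters copied from the header output above.
HEADER = """\
        Oldest transaction      906635
        Oldest active           906636
        Oldest snapshot         906636
        Next transaction        906666
"""


def parse_header(text: str) -> dict:
    """Collect 'label   number' lines into a {label: int} mapping."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\D+?)\s+(\d+)\s*$", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


fields = parse_header(HEADER)
# Assumed meaning of the "gap": distance from the oldest (interesting)
# transaction, and from the oldest active one, to the next transaction.
gap_oldest = fields["Next transaction"] - fields["Oldest transaction"]
gap_active = fields["Next transaction"] - fields["Oldest active"]
print(f"next - oldest transaction: {gap_oldest}")
print(f"next - oldest active:      {gap_active}")
```

For the numbers posted here both differences are around 30, a small fraction of the roughly 900,000 transactions started so far.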

ps -ef shows 1 "fbguard -f" and 5 "fbserver". (I'm not quite sure - but in "bug state", there might have been more than 5 fbserver.) "top" shows (while a database request is made) this:

top - 18:23:04 up 5 days,  6:53,  2 users,  load average: 0.11, 0.08, 0.06
Tasks: 154 total,   1 running, 153 sleeping,   0 stopped,   0 zombie
Cpu(s): 16.1%us,  1.7%sy,  0.0%ni, 81.4%id,  0.7%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:    904572k total,   885364k used,    19208k free,    79888k buffers
Swap:  2008116k total,       88k used,  2008028k free,   523588k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20685 firebird  16   0 45388 5284 3648 S    1  0.6   4:57.13 fbserver
28829 firebird  15   0 45388 5284 3648 S    1  0.6   1:26.47 fbserver
    1 root      15   0  1476  524  456 S    0  0.1   0:01.84 init
    2 root      11  -5     0    0    0 S    0  0.0   0:00.00 kthreadd
[...and other 0% stuff...]

No UDFs in use at all.

/var/log/firebird/firebird.log: Only entries like
fileserver (Server)     Tue Jan 29 13:30:02 2008
        INET/inet_error: read errno = 9
fileserver (Server)     Tue Jan 29 13:43:37 2008
        INET/inet_error: select in packet_receive errno = 9
about once an hour on average. Do I have to worry about this? What is error 9? What is the difference between the two messages?

A spare machine for testing would be a good idea indeed - I'll try to set one up in the long run. But errors which only come out after many weeks are hard to find/trace anyway...

Comment 8 William L. Thomson Jr. (RETIRED) gentoo-dev 2008-01-30 04:25:45 UTC

(In reply to comment #7)
> (In reply to comment #6)
> > Again try -r4.
> 
> I'd like to, but how?

First off sync your tree, because -r3 has been gone for ~2 weeks now.

Second, keyword just the package, not the system
http://www.gentoo.org/doc/en/handbook/handbook-x86.xml?part=3&chap=3#doc_chap2

> /var/log/firebird/firebird.log: Only entries like
> fileserver (Server)     Tue Jan 29 13:30:02 2008
>         INET/inet_error: read errno = 9
> fileserver (Server)     Tue Jan 29 13:43:37 2008
>         INET/inet_error: select in packet_receive errno = 9

The first one is not good, the second one is likely very bad.

> about once an hour on average.

That is not good

> Do I have to worry about this? What is error
> 9? What is the difference between the two messages?

I'm pretty sure both are connection related. So I believe something isn't being closed properly, or something is wrong with client-side communication.

I see the following every so often. Not consistently, rarely more than 2-3 per day, and at times there are weeks between messages. No other error messages in the log.
INET/inet_error: read errno = 104

That is on a production server with ~10 databases and, on average, more than 250 or so connections in connection pools to the various databases in total, plus other direct, non-pooled client-side connections.
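
For reference, the numeric codes in these log lines are ordinary operating-system errno values, and they can be looked up with Python's standard `errno` and `os` modules. This is a small sketch; the helper name `describe_errno` is made up for the example, and the numbering of 104 is specific to Linux.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'number = symbolic name: system message' for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{code} = {name}: {os.strerror(code)}"


# 9 and 104 are the two codes quoted in this thread.
for code in (9, 104):
    print(describe_errno(code))
```

On Linux, 9 is EBADF (bad file descriptor) and 104 is ECONNRESET (connection reset by peer). EBADF would be consistent with the server calling read or select on a socket descriptor that is no longer valid, although nothing in this thread confirms that this is what happens.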


> A spare machine for testing would be a good idea indeed - I'll try to set one up
> in the long run. But errors which only come out after many weeks are hard to
> find/trace anyway..

Well I bet what you are seeing in the log could be leading to that. Does that log file ever fill up the partition it's on? Or is it rotated?

Either way those errors are not good, and likely what is causing your problem after prolonged runs. Maybe not so much the first one, but the packet error is not good.

Comment 9 William L. Thomson Jr. (RETIRED) gentoo-dev 2008-02-18 21:45:17 UTC

Since no one else has commented on this, it's leading me to believe this is a local issue. I will give it a few more days, but will likely be closing it soon as invalid. I think this is a client application issue causing the server to hang after a period of time, versus something specific to Firebird on Gentoo. If it's not an application problem, it could very likely be a core Firebird bug; again, not specific to Gentoo, but Firebird in general.

Might need to test actual binaries from upstream. Just for comparison.

Comment 10 Matthias Hanft 2008-02-20 18:36:15 UTC

Hi, I have done some updates since then:
- Linux kernel from 2.6.22-gentoo-r8 to 2.6.23-gentoo-r6
- Firebird from 2.0.3.12981.0-r2 to 2.0.3.12981.0-r4
- Client libraries gds32.dll from 2.0.1, 1.5.3 and even IB6.5 to 2.0.3

/var/log/firebird/firebird.log still looks like this (the latest entries):

[...]
fileserver (Server)     Wed Feb 20 16:30:05 2008
        INET/inet_error: read errno = 9
fileserver (Server)     Wed Feb 20 17:35:03 2008
        INET/inet_error: read errno = 9
fileserver (Server)     Wed Feb 20 18:10:04 2008
        INET/inet_error: read errno = 9
fileserver (Server)     Wed Feb 20 18:20:01 2008
        INET/inet_error: select in packet_receive errno = 9
fileserver (Server)     Wed Feb 20 18:30:02 2008
        INET/inet_error: select in packet_receive errno = 9
fileserver (Server)     Wed Feb 20 18:45:03 2008
        INET/inet_error: select in packet_receive errno = 9

I can't say anything about long-term stability yet (uptime is just 10 days so far).

Is there a detailed description of those errors anywhere out there? I have tried Google already, but I couldn't find any further explanation of what *is* going wrong at *what* point when those error messages appear.

Something comes to my mind right now: I just noticed that the time of the errors is always xx:x0 or xx:x5. That's exactly the time when a cron job on the Linux server performs a "curl ..." to the Windows server where the Firebird client is running, which then performs some database operation on the Linux server again. I don't know how this may interfere (and curl shows no error either), but this might be a starting point for some deeper investigation...
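
The timing observation can be checked mechanically against the six log entries quoted above. The sketch below assumes the cron job runs every five minutes and allows ten seconds of slack after each run; both values are guesses based on the timestamps, not settings taken from the actual crontab, and the function name is illustrative.

```python
from datetime import datetime

# Timestamps of the log entries quoted in this comment.
LOG_TIMES = [
    "Wed Feb 20 16:30:05 2008",
    "Wed Feb 20 17:35:03 2008",
    "Wed Feb 20 18:10:04 2008",
    "Wed Feb 20 18:20:01 2008",
    "Wed Feb 20 18:30:02 2008",
    "Wed Feb 20 18:45:03 2008",
]


def is_cron_aligned(stamp: str, period_min: int = 5, slack_s: int = 10) -> bool:
    """True if the timestamp falls within slack_s seconds after a period boundary."""
    t = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    return t.minute % period_min == 0 and t.second <= slack_s


aligned = [is_cron_aligned(stamp) for stamp in LOG_TIMES]
print(f"{sum(aligned)} of {len(aligned)} entries fall right after a 5-minute boundary")
```

All six entries land within a few seconds of a five-minute mark, which supports the suspicion that the errors are triggered by the cron-driven request rather than by interactive clients.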

On the other hand, as far as I can remember, I had those errors already when using Firebird 2.0.1 (in /opt), and there were no problems with stability at all... still strange...

Comment 11 William L. Thomson Jr. (RETIRED) gentoo-dev 2008-02-20 19:55:51 UTC

(In reply to comment #10)
>
> On the other hand, as far as I can remember, I had those errors already when
> using Firebird 2.0.1 (in /opt)

Ok, so likely those errors aren't causing the stability issue. Still possible though.

> and there were no problems with stability at
> all... still strange...

Well, moving it really wouldn't cause stability issues. That would be more of an all-or-nothing issue. What you are experiencing might be due to a dependency Firebird was compiled against, or potentially something else on the system.

That would make far more sense as a cause of stability issues than just changing paths.

Comment 12 William L. Thomson Jr. (RETIRED) gentoo-dev 2008-04-10 07:30:25 UTC

There has been no activity on this in a month. I am going to close it as invalid. Please comment and/or re-open if you find and confirm that the problem is Gentoo specific.