Bug 218378

Summary: os.listdir fails with random "cannot allocate memory" errors
Product: Gentoo Linux
Reporter: Marcin Kurek <morgoth6>
Component: [OLD] Core system
Assignee: Python Gentoo Team <python>
Status: RESOLVED FIXED
Severity: normal
CC: duaneg, filip.elgstedt, kernel, zmedico
Priority: High
Version: unspecified
Hardware: AMD64
OS: Linux
Whiteboard:
Package list:
Runtime testing required: ---
Attachments: emerge --info
Console output from example fail
.25 config file
cat /proc/meminfo
dmesg
strace -f -o /tmp/st/portage-mem.log -- emerge gtk-engines-qt
python -v /usr/bin/emerge
python-2.5.2-unicode-listdir.patch

Description Marcin Kurek 2008-04-19 09:38:59 UTC
Recently emerge started to fail here with the following error:

--------
<root@mordorpc portage> emerge -pv pygtk

These are the packages that would be merged, in order:

Calculating dependencies \Traceback (most recent call last):
  File "/usr/bin/emerge", line 7928, in <module>
    retval = emerge_main()
  File "/usr/bin/emerge", line 7922, in emerge_main
    myopts, myaction, myfiles, spinner)
  File "/usr/bin/emerge", line 7164, in action_build
    retval, favorites = mydepgraph.select_files(myfiles)
  File "/usr/bin/emerge", line 2476, in select_files
    expanded_atoms = self._dep_expand(root_config, x)
  File "/usr/bin/emerge", line 2280, in _dep_expand
    cp_set.update(db.cp_all())
  File "/usr/lib64/portage/pym/portage.py", line 7561, in cp_all
    for y in listdir(oroot+"/"+x, EmptyOnError=1, ignorecvs=1, dirsonly=1):
  File "/usr/lib64/portage/pym/portage.py", line 290, in listdir
    list, ftype = cacheddir(mypath, ignorecvs, ignorelist, EmptyOnError, followSymlinks)
  File "/usr/lib64/portage/pym/portage.py", line 226, in cacheddir
    list = os.listdir(mypath)
OSError: [Errno 12] Cannot allocate memory: '/usr/portage/net-mail'
--------

The failure seems to be quite random: I was able to run this command after retrying it 4 or 5 times. It happens for random packages and always ends up working after a few emerge reruns.

This machine has 2GB of RAM, and I saw the message even when booted in text mode without X, so I doubt it is a genuine out-of-memory situation.



Reproducible: Always

Steps to Reproduce:
Comment 1 Marcin Kurek 2008-04-19 09:39:22 UTC
Created attachment 150265 [details]
emerge --info
Comment 2 Marcin Kurek 2008-04-19 09:40:53 UTC
Created attachment 150266 [details]
Console output from example fail 

As you can see, the first three commands fail, the next one works fine, then another fails, and so on.
Comment 3 Marcin Kurek 2008-04-20 09:44:18 UTC
It seems to be kernel-related, as I can observe this problem only on the 2.6.25 kernel and not on 2.6.24.
Comment 4 Marcin Kurek 2008-04-20 09:44:49 UTC
Created attachment 150362 [details]
.25 config file
Comment 5 Marcin Kurek 2008-05-18 17:03:11 UTC
Hmmm, ping? This is really annoying when updating the system. Any ideas what this could be or how to debug it?
Comment 6 Zac Medico gentoo-dev 2008-05-18 18:08:46 UTC
Unless you show that the problem does not occur with a vanilla kernel (I'm not sure exactly which kernel sources you are using), it's probably safe to assume that it's an upstream kernel bug, so you should be looking to kernel.org for answers.
Comment 7 Duane Griffin 2008-05-21 12:12:41 UTC
If this is a kernel bug it should be assigned to the kernel team. We don't usually mark kernel issues resolved upstream until they have been reported on the kernel.org bugzilla.

Could you please provide your dmesg and "cat /proc/meminfo" from the system after it starts showing these symptoms, thanks.
Comment 8 Marcin Kurek 2008-05-22 08:01:19 UTC
I guess this could be a portage bug, since a kernel problem of this kind should show up in other applications too. The system has been perfectly stable for two days now under heavy use of deluge, firefox, and gcc, with no problems.

I cannot see anything suspicious in dmesg, but I will attach both files as suggested.
Comment 9 Marcin Kurek 2008-05-22 08:02:46 UTC
Created attachment 153911 [details]
cat /proc/meminfo
Comment 10 Marcin Kurek 2008-05-22 08:04:43 UTC
Created attachment 153913 [details]
dmesg

I think there is nothing unusual in it.
Comment 11 Duane Griffin 2008-05-22 11:51:57 UTC
These are from immediately after you saw an OOM? Nothing unusual, as you say.

It could be a bug in portage, but it is strange that it only manifests under 2.6.25. Just to confirm, booting back into 2.6.24 (without changing anything else) makes the problem go away, right? Could you upload your 2.6.24 config, please, so we can check that there weren't any significant changes?

I see you are running an unstable version of portage. Does the problem still occur if you switch to 2.1.4.4?
Comment 12 Marcin Kurek 2008-05-22 15:29:00 UTC
I tried to look at the strace output from a failing call, but I cannot see anything unusual.
Comment 13 Marcin Kurek 2008-05-22 15:30:23 UTC
Created attachment 153949 [details]
strace -f -o /tmp/st/portage-mem.log -- emerge gtk-engines-qt
Comment 14 Daniel Drake (RETIRED) gentoo-dev 2008-05-22 16:26:49 UTC
I looked at the Python source, I think the error is coming from Modules/posixmodule.c posix_listdir()
	if ((dirp = opendir(name)) == NULL) {
		return posix_error_with_allocated_filename(name);
	}

i.e. opendir() on something is returning NULL, probably /usr/portage/net-mail
opendir() is implemented in libc as a wrapper around open() or something like that

Then I looked at the strace logs, but it shows that opening /usr/portage/net-mail quite early on was successful.

It gets to this part:

11723 open("/usr/portage/sec-policy", O_RDONLY|O_NONBLOCK|O_DIRECTORY|0x80000) = 3
11723 fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
11723 getdents(3, /* 68 entries */, 4096) = 2568
11723 brk(0x2deb000)                    = 0x2d8a000
11723 mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8708afb000
11723 getdents(3, /* 0 entries */, 4096) = 0
11723 close(3)                          = 0
11723 write(2, "Traceback (most recent call last"..., 35) = 35

There are no errors here, and nothing to do with net-mail

libc's readdir() always ends up calling getdents() at least twice, since it only knows it has finished reading the directory when it gets a return value of 0. The only slightly unusual thing is the brk and mmap in the middle, but this seems to be just Python growing its data segment and allocating some anonymously mapped memory. I don't see why it would have any effect on anything here.

strange...
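
One detail in the trace above is easy to miss. On Linux the raw brk() system call returns the new program break on success and the unchanged current break on failure. Here the request was 0x2deb000 but the return value was 0x2d8a000, which suggests the heap could not be grown; the 1 MiB anonymous mmap() that follows would be consistent with the allocator falling back to another source of memory. The following is a minimal sketch for spotting that pattern in strace output. The function name brk_failed and the regular expression are illustrative assumptions and are not part of strace or portage.

```python
import re

# Matches strace lines such as: "11723 brk(0x2deb000)   = 0x2d8a000"
BRK_LINE = re.compile(
    r"brk\((?P<req>0x[0-9a-fA-F]+|0|NULL)\)\s*=\s*(?P<ret>0x[0-9a-fA-F]+)"
)


def brk_failed(strace_line):
    """Return True if the line records a brk() call that did not move the break.

    Returns None for lines that are not brk() calls, or that only query the
    current break (brk(0) / brk(NULL)).
    """
    match = BRK_LINE.search(strace_line)
    if match is None:
        return None
    requested = match.group("req")
    if requested in ("0", "NULL"):
        return None  # a query, not a resize request
    # The raw syscall returns the requested break on success and the old,
    # unchanged break on failure, so a mismatch means the request failed.
    return int(requested, 16) != int(match.group("ret"), 16)


if __name__ == "__main__":
    line = "11723 brk(0x2deb000)                    = 0x2d8a000"
    print(brk_failed(line))
```

A failed brk() does not by itself make the later allocation fail, but it is one plausible way for ENOMEM to end up in errno even though the program as a whole got the memory it asked for.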
Comment 15 Duane Griffin 2008-05-28 13:43:09 UTC
Very strange. The memory allocation is very suspicious, given the error reported by python, even though it succeeds(!)

I think the error is coming from just after the opendir, in the for loop immediately below. The strace logs show getdents are happening, so it must be inside the loop, doing readdir calls:

        for (;;) {
                Py_BEGIN_ALLOW_THREADS
                ep = readdir(dirp);
                Py_END_ALLOW_THREADS
                if (ep == NULL)
                        break;

Then outside the loop:

        if (errno != 0 && d != NULL) {
                /* readdir() returned NULL and set errno */
                closedir(dirp);
                Py_DECREF(d);
                return posix_error_with_allocated_filename(name); 
        }

Looking at the code one thing that jumps out is the use of errno directly to check whether the loop was terminated successfully or on error. It looks like PyEval_RestoreThread takes care not to modify errno, so that seems safe. However a quick look on the python issue tracker gives:

http://bugs.python.org/issue1608818

This could explain the problem, but only if the path given was unicode. Marcin, would you be able to recompile python with the patch given in that ticket? If you need assistance in doing so then I'd be happy to help. It would be very interesting to see if the problem goes away with it applied.

BTW, regarding net-mail, note that the reporter says it fails on random directories. In this case it seems to have failed reading "/usr/portage/sec-policy", but I doubt the particular directory matters much.
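
The stale-errno hazard described above can be illustrated with a toy model. This is a sketch in Python, not the C code from posixmodule.c and not the patch from the Python ticket: ToyLibc, allocate, listdir_unchecked and listdir_checked are hypothetical names, and the "checked" variant assumes the usual POSIX idiom of clearing errno before each readdir() call.

```python
import errno
import os


class ToyLibc:
    """Toy stand-in for libc state: a global errno plus one directory stream."""

    def __init__(self, entries):
        self.errno = 0
        self._entries = list(entries)

    def readdir(self):
        # Like POSIX readdir(): returns None at end of stream and, on a clean
        # end-of-directory, leaves errno exactly as it found it.
        return self._entries.pop(0) if self._entries else None

    def allocate(self):
        # Models an allocation that succeeds overall but leaves ENOMEM behind
        # (for example, a failed heap growth followed by a working fallback).
        self.errno = errno.ENOMEM


def listdir_unchecked(libc):
    """Mirrors the loop quoted above: errno is only inspected after the loop."""
    names = []
    while True:
        entry = libc.readdir()
        if entry is None:
            break
        libc.allocate()  # e.g. building the result object for this entry
        names.append(entry)
    if libc.errno != 0:
        raise OSError(libc.errno, os.strerror(libc.errno))
    return names


def listdir_checked(libc):
    """Clears errno before each readdir() so only readdir's own errors count."""
    names = []
    while True:
        libc.errno = 0
        entry = libc.readdir()
        if entry is None:
            if libc.errno != 0:
                raise OSError(libc.errno, os.strerror(libc.errno))
            break
        libc.allocate()
        names.append(entry)
    return names


if __name__ == "__main__":
    print(listdir_checked(ToyLibc(["net-mail", "sec-policy"])))
    try:
        listdir_unchecked(ToyLibc(["net-mail", "sec-policy"]))
    except OSError as exc:
        print("unchecked variant raised errno", exc.errno)
```

In the model, both variants read the same two entries successfully; only the unchecked one reports an error, because it attributes a leftover errno value to readdir().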
Comment 16 Marcin Kurek 2008-05-28 20:18:59 UTC
OK, I tried this patch and it seems to be a bit broken, as it makes python completely unusable. 'emerge --help' throws:

Traceback (most recent call last):
  File "/usr/bin/emerge", line 31, in <module>
    import portage
  File "/usr/lib64/portage/pym/portage.py", line 20, in <module>
    import copy, errno, os, re, shutil, time, types
ImportError: No module named time

Looking at the /usr/lib64/python2.5 directory, I can see that the lib-dynload directory is empty (but it was not empty in the portage workdir image).

About the net-mail directory: it does indeed fail on a random directory in the portage tree; only the stack trace stays similar (the same functions show up and it always ends with File "/usr/lib64/portage/pym/portage.py", line 226, in cacheddir
    list = os.listdir(mypath))

About unicode: my system uses unicode by default.
Comment 17 Marcin Kurek 2008-05-28 20:21:20 UTC
Created attachment 154619 [details]
python -v /usr/bin/emerge 

Verbose python output after patch
Comment 18 Duane Griffin 2008-05-31 12:58:01 UTC
(In reply to comment #16)
> OK, I tried this patch and it seems to be a bit broken, as it makes python
> completely unusable. 'emerge --help' throws:

Hmm, looks like something went wrong somewhere. The patch is really quite simple and limited in scope; it certainly shouldn't be causing that sort of trouble. I've just applied it here without any problem, this is what I did:

ebuild /usr/portage/dev-lang/python/python-2.5.2-r4.ebuild unpack
patch -d /var/tmp/portage/dev-lang/python-2.5.2-r4/work/Python-2.5.2 -p1 < proposed-patch.txt
ebuild /usr/portage/dev-lang/python/python-2.5.2-r4.ebuild compile install
sudo ebuild /usr/portage/dev-lang/python/python-2.5.2-r4.ebuild qmerge

> About unicode: my system uses unicode by default.

Ah, very interesting...
Comment 19 Duane Griffin 2008-05-31 13:42:17 UTC
(In reply to comment #18)
> ...I've just applied it here without any problem...

Whoa -- spoke too soon! Sorry about that, I didn't test correctly before. I get the same error you did. For anyone else playing along at home -- don't follow those previous instructions.
Comment 20 Duane Griffin 2008-05-31 15:56:58 UTC
Created attachment 154963 [details, diff]
python-2.5.2-unicode-listdir.patch

You were right -- the patch was a bit broken. I apologise for not inspecting it closer or testing it properly before asking you to try it. Here is a working and tested version, if you don't mind having another go.
Comment 21 Marcin Kurek 2008-06-04 07:05:57 UTC
This problem was highly random, so I cannot be 100% sure it's gone, but I have been using portage for a few days now with this patch and there have been no OOM messages.

I think that was it. Thanks for not closing this bug and for helping me out with this, as I probably would never have found issue1608818 on the python bug tracker, since it's quite ancient now.
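
Because the failure is intermittent, one way to gain more confidence than a few days of normal use is to call os.listdir() on every subdirectory of a tree many times and count the errors. The following is a minimal sketch under stated assumptions: stress_listdir and its rounds parameter are hypothetical names, and the script is written for a current Python 3 interpreter (on Python 2.5 the `except ... as` syntax would have to be `except OSError, exc`). On Python 3 every path is already a Unicode string, so the script shows the testing approach rather than reproducing the original 2.5 code path.

```python
import errno
import os
import sys


def stress_listdir(root, rounds=100):
    """List every subdirectory of `root` repeatedly and return the failures.

    Each failure is recorded as a (path, errno) tuple.
    """
    failures = []
    subdirs = [
        os.path.join(root, name)
        for name in sorted(os.listdir(root))
        if os.path.isdir(os.path.join(root, name))
    ]
    for _ in range(rounds):
        for path in subdirs:
            try:
                os.listdir(path)
            except OSError as exc:
                failures.append((path, exc.errno))
    return failures


if __name__ == "__main__":
    # Pass /usr/portage to exercise the tree from this report; the current
    # directory is used by default so the script runs anywhere.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    seen = stress_listdir(root, rounds=20)
    enomem = [f for f in seen if f[1] == errno.ENOMEM]
    print("%d failures, %d of them ENOMEM" % (len(seen), len(enomem)))
```

A run with zero ENOMEM failures does not prove the bug is gone, but a single ENOMEM on a machine with plenty of free memory would show that it is still present.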
Comment 22 Duane Griffin 2008-06-04 13:20:43 UTC
Excellent, glad to be of service.

Since this seems like a fairly critical python bug (for you and anyone else using unicode, anyway) I'll send it over to the Python team. I'm unfamiliar with their procedures, but they may want to add the patch to our patch set and/or try to push it upstream. I've also updated the ticket on the Python bug tracker and uploaded the working version of the patch.
Comment 23 Marcin Kurek 2008-06-18 07:39:16 UTC
I want to confirm that this fixed the portage problem for good :) The same problem also appeared on my gf's machine, so I wonder whether the patch can be pulled into the python ebuild as soon as possible?
Comment 24 Marcin Kurek 2008-06-25 16:34:39 UTC
I see this patch was not included in the new python 2.5.2-r5, as I started to observe the same problem here as soon as I updated python.
Comment 25 Rafal 2008-07-30 07:35:08 UTC
(In reply to comment #20)
> Created an attachment (id=154963) [edit]
> python-2.5.2-unicode-listdir.patch
> 
> You were right -- the patch was a bit broken. I apologise for not inspecting it
> closer or testing it properly before asking you to try it. Here is a working
> and tested version, if you don't mind having another go.
> 

Hello

I have this problem too, but I'm new to Gentoo and I don't know what I have to do with python-2.5.2-unicode-listdir.patch. Can somebody explain to me in easy steps what I have to do?
Comment 26 Rafal 2008-07-30 12:44:35 UTC
I downloaded Python from python.org and changed everything as shown on this page: http://bugs.gentoo.org/attachment.cgi?id=154963&action=diff
Then I ran ./configure, make, make install, with no effect. I always get this same report when I run emerge (something):

!!! Failed to complete portage imports. There are internal modules for
!!! portage and failure here indicates that you have a problem with your
!!! installation of portage. Please try a rescue portage located in the
!!! portage tree under '/usr/portage/sys-apps/portage/files/' (default).
!!! There is a README.RESCUE file that details the steps required to perform
!!! a recovery of portage.
    No module named _socket

Traceback (most recent call last):
  File "/usr/bin/emerge", line 28, in <module>
    import portage
  File "/usr/lib/portage/pym/portage.py", line 55, in <module>
    import getbinpkg
  File "/usr/lib/portage/pym/getbinpkg.py", line 10, in <module>
    import htmllib,HTMLParser,string,formatter,sys,os,xpak,time,tempfile,base64,urllib2
  File "/usr/lib/python2.5/urllib2.py", line 92, in <module>
    import httplib
  File "/usr/lib/python2.5/httplib.py", line 71, in <module>
    import socket
  File "/usr/lib/python2.5/socket.py", line 45, in <module>
    import _socket
ImportError: No module named _socket
Comment 27 Tiziano Müller (RETIRED) gentoo-dev 2008-07-31 14:20:16 UTC
Fixed in python-2.5.2-r7, sorry for the delay.
Comment 28 Zac Medico gentoo-dev 2008-09-20 15:03:32 UTC
*** Bug 238174 has been marked as a duplicate of this bug. ***