
Bug 787770

Summary: [TRACKER] sys-apps/portage: use database(s) with mmap and zero-copy support to reduce memory footprint
Product: Portage Development Reporter: Zac Medico <zmedico>
Component: Core Assignee: Portage team <dev-portage>
Status: CONFIRMED ---    
Severity: enhancement CC: flx.bier, gentoo, kingjon3377, mattst88, sam
Priority: Normal Keywords: Tracker
Version: unspecified   
Hardware: All   
OS: All   
See Also:
Package list:
Runtime testing required: ---
Bug Depends on:    
Bug Blocks: 835380    

Description Zac Medico gentoo-dev 2021-05-02 20:22:33 UTC
It would be useful to mmap package metadata cache entries in order to eliminate package metadata from the heap.
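
A minimal sketch of the idea, assuming a hypothetical cache entry stored as plain `KEY=value` lines. The file layout and the helper name `find_value` are assumptions made for illustration, not Portage's actual cache API. The entry is mapped read-only and sliced through a memoryview, so the bytes stay in the OS page cache and are only copied onto the Python heap when a value is explicitly materialized:

```python
import mmap
import os
import tempfile
from typing import Optional


def find_value(buf, key: bytes) -> Optional[memoryview]:
    """Return a zero-copy view of the value stored for key, or None.

    buf is any bytes-like object with a find() method (an mmap here).
    """
    needle = key + b"="
    if buf[:len(needle)] == needle:
        # The key is on the first line.
        start = len(needle)
    else:
        # Require a preceding newline so "LOT" does not match "SLOT".
        idx = buf.find(b"\n" + needle)
        if idx == -1:
            return None
        start = idx + 1 + len(needle)
    end = buf.find(b"\n", start)
    if end == -1:
        end = len(buf)
    # Slicing a memoryview does not copy the underlying bytes.
    return memoryview(buf)[start:end]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "entry")  # hypothetical cache entry
    with open(path, "wb") as f:
        f.write(b"EAPI=8\nSLOT=0\nKEYWORDS=amd64 arm64\n")

    with open(path, "rb") as f:
        # Length 0 maps the whole file; the mapping outlives the file object.
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    value = find_value(buf, b"SLOT")
    slot = bytes(value)  # the only copy, made on demand
    value.release()      # views must be released before closing the map
    buf.close()
```

Whether this actually reduces the footprint depends on how much of the metadata ends up converted into Python objects anyway; the sketch only shows the access pattern.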

For parallelization of dependency calculations (bug 660860), we'll probably want to use a DB with mmap support to store the dependency calculation state, so that concurrent processes can efficiently collaborate on it.

LMDB is a candidate since it supports mmap, but beware that writemap=True with an undersized map_size value will trigger SIGBUS in concurrent processes:

> What we have here is two bugs, one in py-lmdb and one in lmdb. The
> first bug is that a non-zero default value of map_size on Environment is
> inappropriate. Passing 0 is generally the right answer most of the time.
> That bug triggers a second bug in the underlying lmdb where opening a
> database with write_map=True and explicitly specifying a map_size too
> small will ftruncate the file out from underneath another open process.
Comment 1 Arfrever Frehtes Taifersar Arahesis 2021-05-03 12:02:04 UTC
There is possibly some support also in SQLite:
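
For reference, SQLite exposes memory-mapped I/O through `PRAGMA mmap_size`. The following is only a sketch using Python's standard `sqlite3` module; the table layout and the package name are made up for illustration. Note that the `sqlite3` module still copies each result into a Python object, so this reduces copying inside SQLite rather than giving zero-copy access from Python:

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "cache.sqlite"))

    # Request up to 64 MiB of memory-mapped I/O. The pragma reports the
    # limit actually in effect; it may be lower than requested, and no row
    # is returned if the library was built without mmap support.
    row = conn.execute("PRAGMA mmap_size = 67108864").fetchone()
    granted = row[0] if row else 0

    # Hypothetical schema, not Portage's real cache format.
    conn.execute("CREATE TABLE metadata (cpv TEXT PRIMARY KEY, slot TEXT)")
    conn.execute(
        "INSERT INTO metadata VALUES (?, ?)",
        ("sys-apps/example-1.0", "0"),
    )
    conn.commit()

    slot = conn.execute(
        "SELECT slot FROM metadata WHERE cpv = ?",
        ("sys-apps/example-1.0",),
    ).fetchone()[0]
    conn.close()
```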
Comment 2 Zac Medico gentoo-dev 2021-06-19 21:10:36 UTC
I want to create something like memcached or redis that's entirely based on files and uses zero copy. I'm not sure if portage will use it or not, but it's a related zero-copy / mmap idea.

This is the related discussion from #gentoo-portage today:

> [13:25:51] <zmedico> adelks: are you making heavy use of mmap? I want portage to use mmap more...
> [13:26:16] <adelks> zmedico: not at all as I don't even know what it is
> [13:26:18] <zmedico> heaps are overrated and mmap is awesome
> [13:26:27] <adelks> xD
> [13:27:46] <zmedico> adelks: see bug 787770
> [13:27:48] <willikins> "[TRACKER] sys-apps/portage: use database(s) with mmap support to reduce memory footprint"; Portage Development, Core; CONF; zmedico:dev-portage
> [13:29:30] <zmedico> adelks: the idea is that you try to access most things in a zero-copy mmap sort of way, which is basically as efficient as you can get
> [13:30:44] <adelks> zmedico: I am aware of memoryviews in Cython, does this concept exist in compiled languages ?
> [13:30:53] <adelks> I will read the bug report
> [13:31:23] <zmedico> I actually want to implement a something like memcached or redis that's entirely based on files and uses zero copy
> [13:31:33] <adelks> But in any case, when I will implement multi-threading, the database will for sure be shared in memory
> [13:32:23] <adelks> zmedico: that could improve emerge without even touching its code right ?
> [13:32:40] <adelks> would it be the same if the repository be loaded in ramfs ?
> [13:32:59] <adelks> I suppose not, although it would help ?
> [13:33:50] <zmedico> ramfs won't help
> [13:34:27] <adelks> got it zmedico
> [13:34:35] <zmedico> adelks: yeah if we change the underlying portage APIs to utilize mmeap then we don't have to touch a lot of code
> [13:34:40] <zmedico> *mmap
> [13:38:22] <zmedico> mmap effectively offloads more of the memory management to the OS
> [13:39:01] <adelks> zmedico: in my code, I want to read the files from the disk exactly once
> [13:39:37] <adelks> I was thinking of the RAM usage for that, it shouldn't be much ? I mean at most there are like 50 000 packages ? 100kb of metadata each ?
> [13:40:13] <zmedico> if you use mmap, access files just like they're RAM
> [13:40:33] <adelks> I was reading the doc now I understand better
> [13:40:53] <zmedico> and the OS *will* cache them in RAM when appropriate
> [13:41:19] <zmedico> so mmap gives you similar results to caching in RAM
> [13:41:34] <zmedico> but without eating heap memory
> [13:42:17] <zmedico> and it's zero-copy, which is *faster* than reading things into RAM!!!
> [13:42:49] <adelks> I wonder how filesystems intervene into this grand scheme
> [13:43:52] <zmedico> filesystems can trigger... SIGBUS as noted here:
> [13:44:51] <zmedico> so you don't want to shrink mmaped files
> [13:45:17] <zmedico> just don't, it's easy ;-P
> [13:46:36] <zmedico> portage overwrites cache files via atomic rename
> [13:47:15] <zmedico> so hopefully you'd never see a SIGBUS for a mmaped portage cache file
> [13:48:05] <zmedico> and if you saw one one day you'd be surprised :-D
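
The point about atomic rename can be illustrated with a small sketch, assuming a POSIX system. The helper `atomic_write` and the `.tmp` suffix are hypothetical and not Portage's actual implementation. Because the replacement is written to a new file and then renamed over the old path, an existing mapping keeps referring to the old, untruncated file, so a reader never has pages pulled out from under it:

```python
import mmap
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data to a temporary file, then rename it over path."""
    tmp_path = path + ".tmp"  # hypothetical naming scheme
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is an atomic rename on POSIX; the old file is never
    # truncated, it just loses its directory entry.
    os.replace(tmp_path, path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "entry")
    atomic_write(path, b"SLOT=0\nKEYWORDS=amd64 arm64\n")

    with open(path, "rb") as f:
        old_map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # Replace the entry with a *shorter* one. Truncating in place could
    # make reads of the old mapping fail with SIGBUS; renaming does not.
    atomic_write(path, b"SLOT=1\n")

    before = old_map[:]  # still the complete old content
    old_map.close()

    with open(path, "rb") as f:
        after = f.read()  # new readers see the new content
```

A reader that wants the updated entry has to reopen and remap the path; the old mapping never changes.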