Bug 787770 - [TRACKER] sys-apps/portage: use database(s) with mmap and zero-copy support to reduce memory footprint
Status: CONFIRMED
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Core
Hardware: All All
Importance: Normal enhancement with 1 vote
Assignee: Portage team
URL:
Whiteboard:
Keywords: Tracker
Depends on:
Blocks: 835380
Reported: 2021-05-02 20:22 UTC by Zac Medico
Modified: 2023-05-23 12:44 UTC
CC List: 5 users

See Also:
Package list:
Runtime testing required: ---


Attachments

Description Zac Medico gentoo-dev 2021-05-02 20:22:33 UTC
It would be useful to mmap package metadata cache entries in order to eliminate package metadata from the heap.
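A minimal sketch of the idea, using only Python's standard `mmap` module. The file name and the `KEY=value` layout are assumptions made for illustration, and `map_cache_entry` is a hypothetical helper, not an existing Portage API. The point is that lookups run against the mapping itself, and only the bytes that are actually needed get copied onto the heap.

```python
import mmap
import os
import tempfile


def map_cache_entry(path):
    """Map a cache file read-only and return (mmap object, memoryview)."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file. The mapping stays valid after
        # the file object is closed.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return mm, memoryview(mm)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical cache entry; the KEY=value layout is an assumption.
    path = os.path.join(tmp, "foo-1.0")
    with open(path, "wb") as f:
        f.write(b"EAPI=8\nSLOT=0\nKEYWORDS=amd64 arm64\n")

    mm, view = map_cache_entry(path)

    # find() scans the mapped pages in place; nothing is read into a
    # separate heap buffer first.
    start = mm.find(b"SLOT=") + len(b"SLOT=")
    end = mm.find(b"\n", start)

    slot_view = view[start:end]  # zero-copy slice of the mapping
    slot = bytes(slot_view)      # copy only the value that is needed

    # Views must be released before the mapping can be closed.
    slot_view.release()
    view.release()
    mm.close()
```

Pages backing the mapping live in the kernel page cache rather than in the Python heap, so they can be shared between processes and dropped by the OS under memory pressure.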

For parallelization of dependency calculations (bug 660860), we'll probably want to use a DB with mmap support to store the dependency calculation, so that concurrent processes can collaborate on it efficiently.

LMDB is a candidate since it supports mmap. Beware, however, that opening an environment with writemap=True and an undersized map_size value will trigger SIGBUS in concurrent processes:

https://github.com/jnwatson/py-lmdb/issues/269#issuecomment-729750375

> What we have here is two bugs, one in py-lmdb and one in lmdb. The
> first bug is that a non-zero default value of map_size on Environment is
> inappropriate. Passing 0 is generally the right answer most of the time.
> 
> That bug triggers a second bug in the underlying lmdb where opening a
> database with write_map=True and explicitly specifying a map_size too
> small will ftruncate the file out from underneath another open process.
Comment 1 Arfrever Frehtes Taifersar Arahesis 2021-05-03 12:02:04 UTC
There is possibly some support also in SQLite:
https://sqlite.org/mmap.html
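As a rough sketch of what that could look like from Python, the standard `sqlite3` module can request memory-mapped I/O with `PRAGMA mmap_size`. The table layout, file name, and the 256 MiB limit below are assumptions for illustration only. Note that this only lets SQLite itself read pages through the mapping; rows returned to Python are still copied into ordinary Python objects, so it is not zero-copy end to end.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "metadata.sqlite"))

    # Ask for up to 256 MiB of memory-mapped I/O. SQLite reports the
    # limit that is actually in effect, which may be lower (or absent)
    # if the library was built with mmap support restricted.
    row = conn.execute("PRAGMA mmap_size=268435456").fetchone()
    effective_mmap_size = row[0] if row else 0

    # Hypothetical schema: one row per package version.
    conn.execute("CREATE TABLE metadata (cpv TEXT PRIMARY KEY, slot TEXT)")
    conn.execute(
        "INSERT INTO metadata VALUES (?, ?)", ("app-misc/foo-1.0", "0")
    )
    conn.commit()

    slot = conn.execute(
        "SELECT slot FROM metadata WHERE cpv = ?", ("app-misc/foo-1.0",)
    ).fetchone()[0]
    conn.close()
```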
Comment 2 Zac Medico gentoo-dev 2021-06-19 21:10:36 UTC
I want to create something like memcached or redis that's entirely based on files and uses zero copy. I'm not sure whether portage will use it or not, but it's a related zero-copy / mmap idea.

This is the related discussion from #gentoo-portage today:

> [13:25:51] <zmedico> adelks: are you making heavy use of mmap? I want portage to use mmap more...
> [13:26:16] <adelks> zmedico: not at all as I don't even know what it is
> [13:26:18] <zmedico> heaps are overrated and mmap is awesome
> [13:26:27] <adelks> xD
> [13:27:46] <zmedico> adelks: see bug 787770
> [13:27:48] <willikins> https://bugs.gentoo.org/787770 "[TRACKER] sys-apps/portage: use database(s) with mmap support to reduce memory footprint"; Portage Development, Core; CONF; zmedico:dev-portage
> [13:29:30] <zmedico> adelks: the idea is that you try to access most things in a zero-copy mmap sort of way, which is basically as efficient as you can get
> [13:30:44] <adelks> zmedico: I am aware of memoryviews in Cython, does this concept exist in compiled languages ?
> [13:30:53] <adelks> I will read the bug report
> [13:31:23] <zmedico> I actually want to implement a something like memcached or redis that's entirely based on files and uses zero copy
> [13:31:33] <adelks> But in any case, when I will implement multi-threading, the database will for sure be shared in memory
> [13:32:23] <adelks> zmedico: that could improve emerge without even touching its code right ?
> [13:32:40] <adelks> would it be the same if the repository be loaded in ramfs ?
> [13:32:59] <adelks> I suppose not, although it would help ?
> [13:33:50] <zmedico> ramfs won't help
> [13:34:27] <adelks> got it zmedico
> [13:34:35] <zmedico> adelks: yeah if we change the underlying portage APIs to utilize mmeap then we don't have to touch a lot of code
> [13:34:40] <zmedico> *mmap
> [13:38:22] <zmedico> mmap effectively offloads more of the memory management to the OS
> [13:39:01] <adelks> zmedico: in my code, I want to read the files from the disk exactly once
> [13:39:37] <adelks> I was thinking of the RAM usage for that, it shouldn't be much ? I mean at most there are like 50 000 packages ? 100kb of metadata each ?
> [13:40:13] <zmedico> if you use mmap, access files just like they're RAM
> [13:40:33] <adelks> I was reading the doc https://docs.python.org/3/library/mmap.html now I understand better
> [13:40:53] <zmedico> and the OS *will* cache them in RAM when appropriate
> [13:41:19] <zmedico> so mmap gives you similar results to caching in RAM
> [13:41:34] <zmedico> but without eating heap memory
> [13:42:17] <zmedico> and it's zero-copy, which is *faster* than reading things into RAM!!!
> [13:42:49] <adelks> I wonder how filesystems intervene into this grand scheme
> [13:43:52] <zmedico> filesystems can trigger... SIGBUS as noted here: https://bugs.gentoo.org/787770#c0
> [13:44:51] <zmedico> so you don't want to shrink mmaped files
> [13:45:17] <zmedico> just don't, it's easy ;-P
> [13:46:36] <zmedico> portage overwrites cache files via atomic rename
> [13:47:15] <zmedico> so hopefully you'd never see a SIGBUS for a mmaped portage cache file
> [13:48:05] <zmedico> and if you saw one one day you'd be surprised :-D
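
The point about atomic rename at the end of the log can be illustrated with a small sketch. It assumes POSIX semantics (on Linux a rename swaps the directory entry and leaves the old inode alive while it is still mapped), and `atomic_write` is a hypothetical helper written for this example, not Portage's actual code. Because the old file is never truncated, a reader holding a mapping keeps seeing the complete old content even when the replacement is shorter.

```python
import mmap
import os
import tempfile


def atomic_write(path, data):
    """Write data to a temp file in the same directory, then rename it over path."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path))
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # Atomic on POSIX: readers see either the old or the new file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache-entry")
    old_content = b"SLOT=0\nKEYWORDS=amd64 arm64 x86\n"
    atomic_write(path, old_content)

    # A reader maps the current version of the file.
    with open(path, "rb") as f:
        old_map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # A writer replaces it with a shorter version. The mapped inode is
    # not shrunk, so the reader's pages stay backed by real data.
    atomic_write(path, b"SLOT=1\n")

    seen_by_old_map = old_map[:]  # still the full original content
    old_map.close()

    with open(path, "rb") as f:
        seen_by_new_open = f.read()  # the replacement content
```

Truncating or rewriting the file in place instead would shrink the mapped inode, which is the situation where touching the now-missing pages raises SIGBUS.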