| Summary: | limits.h - fail to create file name more than 134 chars in non-English locale (incomplete UTF-8 support?) | | |
|---|---|---|---|
| Product: | Gentoo Linux | Reporter: | Sergey S. Starikoff <Ikonta> |
| Component: | Current packages | Assignee: | Gentoo Toolchain Maintainers <toolchain> |
| Status: | RESOLVED INVALID | | |
| Severity: | normal | CC: | slyfox |
| Priority: | Normal | | |
| Version: | unspecified | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Package list: | | Runtime testing required: | --- |
**Description** (Sergey S. Starikoff, 2019-07-25 12:25:50 UTC)
**Comment #1 (Alexander Tsoy)**

(In reply to Sergey S. Starikoff from comment #0)
> $ echo "tttttттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттт.text" | wc -m
> 134

char = 1 byte in C, so you should use `wc -c` in the command above. Cyrillic letters occupy 2 bytes in UTF-8 encoding.

**Comment #2 (Sergey S. Starikoff)**

(In reply to Alexander Tsoy from comment #1)
> char = 1 byte in C, so you should use `wc -c` in the command above. Cyrillic letters occupy 2 bytes in UTF-8 encoding.

I remember that a Cyrillic letter in a UTF-8 locale is encoded as two bytes. But the quoted comment in limits.h promises chars, not bytes. That is not only confusing, it also causes real trouble, for example when I need to extract files with long Cyrillic names from an archive.

**Comment #3 (Sergei Trofimovich)**

On Linux, most filesystems except reiserfs have an internal limit of 255 bytes per filename: https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits

```
$ LANG=C strace -f touch tttttттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттттт.text
openat(AT_FDCWD, "ttttt\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202\321\202.text", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = -1 ENAMETOOLONG (File name too long)
```

btrfs example: https://elixir.bootlin.com/linux/latest/source/fs/btrfs/inode.c#L5788

```c
/*
 * we can actually store much bigger names, but lets not confuse the rest
 * of linux
 */
#define BTRFS_NAME_LEN 255
...
struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry)
{
	....
	if (dentry->d_name.len > BTRFS_NAME_LEN)
		return ERR_PTR(-ENAMETOOLONG);
```

It's an arbitrary Linux kernel limitation. It could be anything else.

**Comment #4 (Sergey S. Starikoff)**

(In reply to Sergei Trofimovich from comment #3)
> It's an arbitrary linux kernel limitation. It could be anything else.

The real issue behind this bug was an archive with unextractable files, and the question of how to properly resolve that.

**Comment #5 (Sergei Trofimovich)**

(In reply to Sergey S. Starikoff from comment #4)
> The real issue behind this bug was an archive with unextractable files, and the question of how to properly resolve that.

It depends on your desired end state.

1. Possible solution 1: explicit Linux kernel support. If your end goal is a filesystem with UTF-8 names longer than 255 bytes, you won't get it without patching the Linux kernel and breaking some APIs that assume a maximum file path. I suggest asking linux-fsdevel@vger.kernel.org whether it's feasible to add an extended mode that allows overgrown filename lengths, at the price of some interfaces being broken against such files (stat()?). It would likely have a rippling effect on libcs and beyond. It might not be an easy thing to do alone, but if enough people are on board with the idea, why not. A small precedent in a nearby area is the select() system call, which has no kernel limitation on bit-field size, yet most of userspace does not easily expose that functionality (FD_SET/FD_CLR).

2. Possible solution 2: use a single-byte locale to get past the unpacking. Something like:

   ```
   $ LANG=ru_RU.KOI8-R unzip foo
   $ LANG=ru_RU.KOI8-R luit ls
   $ LANG=ru_RU.KOI8-R luit mv foo bar
   ```

3. Possible solution 3: use a wrapper/tool to extract and rename individual files (or mangle the filenames inside the archive). An example wrapper is app-misc/mc, which lets you browse zip archives and copy out individual files under a user-specified target file name. It might be good enough to deal with an individual archive.