892485 – app-text/tessdata_fast and app-text/tessdata_best: add option to install language-neutral "script" training data

Status:	UNCONFIRMED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Bernard Cafarelli

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2023-01-29 06:40 UTC by GB
Modified:	2023-01-29 20:05 UTC
CC List:	1 user

See Also:
Package list:
Runtime testing required:	---

Description GB 2023-01-29 06:40:45 UTC

Tesseract's repositories provide not only language-specific training data but also script-specific data (e.g. Latin, Cyrillic). This can be useful when OCRing text that contains diacritics unusual for its language, e.g. German text with accents.

The mapping via L10N is nice, but even enabling every L10N flag doesn't install all of the available tessdata files.

Reproducible: Always

Steps to Reproduce:
1. Try to install tessdata_best or tessdata_fast.

Actual Results:
USE flags (via L10N) only allow installation of language-specific data.

Expected Results:
There should be a way to install script-specific training data.

The script-specific files are in https://github.com/tesseract-ocr/tessdata/tree/main/script and https://github.com/tesseract-ocr/tessdata_best/tree/main/script
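
To make the request more concrete, here is a minimal sketch of how download locations for the script models could be derived. It is not taken from the ebuilds. It assumes that each script model is stored as script/<Name>.traineddata in the upstream repositories and that GitHub's /raw/<ref>/ URL layout applies. The helper name script_url and the default ref "main" are hypothetical.

```python
from urllib.parse import quote

# Repositories named in this bug's title. Assumption: both ship a script/ directory.
REPOS = ("tessdata_fast", "tessdata_best")


def script_url(repo: str, script: str, ref: str = "main") -> str:
    """Build the assumed raw-download URL for one script model.

    `script` is a script name such as "Latin" or "Cyrillic".
    `ref` is a branch or tag. "main" is an assumption; a packaged
    version would more likely pin a release tag.
    """
    if repo not in REPOS:
        raise ValueError(f"unknown tessdata repository: {repo}")
    # Assumed layout: script/<Name>.traineddata below the repository root.
    return (
        f"https://github.com/tesseract-ocr/{repo}/raw/{ref}"
        f"/script/{quote(script)}.traineddata"
    )


if __name__ == "__main__":
    # Print the candidate URLs for the two scripts mentioned in the description.
    for repo in REPOS:
        for script in ("Latin", "Cyrillic"):
            print(script_url(repo, script))
```

Once such a file is installed into a script/ subdirectory of the tessdata directory, it should be selectable with "tesseract input.png output -l script/Latin".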