Bug 892485

Summary: app-text/tessdata_fast and app-text/tessdata_best: add option to install language-neutral "script" training data
Product: Gentoo Linux
Component: Current packages
Reporter: GB <g.brandl>
Assignee: Bernard Cafarelli <voyageur>
CC: jstein
Status: UNCONFIRMED ---
Severity: normal
Priority: Normal
Version: unspecified
Hardware: All
OS: Linux
Whiteboard:
Package list:
Runtime testing required: ---

Description GB 2023-01-29 06:40:45 UTC
Tesseract's repos provide not only language-specific training data but also script-specific data (e.g. Latin, Cyrillic). This can be useful when OCRing text that has unconventional diacritics, such as German text with accents.

Mapping the languages to L10N USE flags is nice, but even enabling every flag doesn't install all of the available tessdata files.

Reproducible: Always

Steps to Reproduce:
1. Try to install tessdata_best or tessdata_fast
Actual Results:  
USE flags (via L10N) only allow installation of language-specific data.

Expected Results:  
There should be a way to install script-specific training data.

The script-specific files are in https://github.com/tesseract-ocr/tessdata/tree/main/script and https://github.com/tesseract-ocr/tessdata_best/tree/main/script
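
As a stopgap until the ebuilds offer such an option, the sketch below shows where a script file could be fetched from and where it might be placed by hand. It is only an illustration under assumptions not stated in this report: the raw-download URL layout on the `main` branch, the `/usr/share/tessdata` data directory, and the helper names `script_data_url` and `install_path` are all hypothetical.

```python
from pathlib import PurePosixPath

# Assumption: GitHub serves the files under <repo>/raw/<ref>/script/.
BASE = "https://github.com/tesseract-ocr"
# The two packages named in this bug; both repos are assumed to have a script/ dir.
KNOWN_REPOS = ("tessdata_fast", "tessdata_best")


def script_data_url(repo: str, script: str, ref: str = "main") -> str:
    """Build the assumed raw download URL for one script traineddata file."""
    if repo not in KNOWN_REPOS:
        raise ValueError(f"unexpected repository: {repo}")
    return f"{BASE}/{repo}/raw/{ref}/script/{script}.traineddata"


def install_path(tessdata_dir: str, script: str) -> str:
    """Return the target path, keeping the upstream script/ subdirectory."""
    return str(PurePosixPath(tessdata_dir) / "script" / f"{script}.traineddata")


if __name__ == "__main__":
    # /usr/share/tessdata is an assumed location; check your installation.
    print(script_data_url("tessdata_best", "Latin"))
    print(install_path("/usr/share/tessdata", "Latin"))
```

If the file is kept in a `script/` subdirectory like this, Tesseract should be able to select it with `-l script/Latin`. Check `tesseract --list-langs` to confirm the name it reports.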