Bug 663564 - >=app-text/tesseract-4.0.0: add support for light/best data files
Summary: >=app-text/tesseract-4.0.0: add support for light/best data files
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal
Assignee: Bernard Cafarelli
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-08-14 08:18 UTC by Andreas Kirsch
Modified: 2021-08-20 09:48 UTC
CC List: 3 users

See Also:
Package list:
Runtime testing required: ---


Attachments
tessdata_legacy-4.0.0.ebuild (tessdata_legacy-4.0.0.ebuild,2.32 KB, text/plain)
2019-06-12 16:17 UTC, Marek Szuba
tessdata_best-4.0.0.ebuild (tessdata_best-4.0.0.ebuild,2.11 KB, text/plain)
2019-07-30 08:39 UTC, Marek Szuba
tessdata_fast-4.0.0.ebuild (tessdata_fast-4.0.0.ebuild,2.12 KB, text/plain)
2019-07-30 08:39 UTC, Marek Szuba
tessdata_best-4.0.0.ebuild (tessdata_best-4.0.0.ebuild,2.11 KB, text/plain)
2019-07-30 08:41 UTC, Marek Szuba

Description Andreas Kirsch 2018-08-14 08:18:36 UTC
Hi Gentoo Community,

This is related to my Bug #663482. Could you please provide additional USE flags for the package (tessdata, tessdata_best, tessdata_fast)?

The reason is that for Tesseract 4 there are three types of prepared training data:

https://github.com/tesseract-ocr/tessdata
https://github.com/tesseract-ocr/tessdata_best
https://github.com/tesseract-ocr/tessdata_fast

But I don't know whether it would be better to use additional L10N flags instead, for example: l10n_en l10n_en_fast l10n_en_best

Thanks in advance.

Best regards
Andreas
Comment 1 Bernard Cafarelli gentoo-dev 2018-08-14 10:00:14 UTC
Looking at https://github.com/tesseract-ocr/tesseract/wiki/Data-Files we should:
* use the _light variant in 4.0
* supporting _best would be nice (probably via a USE flag)
* maybe support the legacy one (currently in ebuild)
Comment 2 Bernard Cafarelli gentoo-dev 2019-06-04 13:26:52 UTC
Updating the title, as I will only get to this after the 4.0 bump itself.
Comment 3 Marek Szuba archtester gentoo-dev 2019-06-11 15:21:19 UTC
My gut feeling is that it would make sense to make the three types of training data available as separate packages, then have app-text/tesseract depend on the correct one through appropriate USE flags. To begin with, we could make the three data packages mutually exclusive so that they can use the same destination directory.
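
The selection logic proposed here can be modelled outside of Portage. The following is only a rough Python sketch of the idea, not ebuild code: the flag names `fast`, `best` and `legacy` and the choice of `fast` as the fallback are assumptions made for illustration, and the package names are the ones that were eventually committed for this bug.

```python
# Hypothetical flag names mapped to the data packages named in this bug.
DATA_PACKAGES = {
    "fast": "app-text/tessdata_fast",
    "best": "app-text/tessdata_best",
    "legacy": "app-text/tessdata_legacy",
}


def select_data_package(enabled_flags, default="fast"):
    """Return the single data package implied by the enabled flags.

    The three packages are treated as mutually exclusive because they
    would install into the same destination directory.
    """
    chosen = sorted(flag for flag in enabled_flags if flag in DATA_PACKAGES)
    if len(chosen) > 1:
        raise ValueError("mutually exclusive data flags enabled: " + ", ".join(chosen))
    # Fall back to the assumed default when no data flag is set.
    return DATA_PACKAGES[chosen[0] if chosen else default]
```

In a real ebuild this would be expressed with USE-conditional dependencies and blockers rather than with code like this.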
Comment 4 Marek Szuba archtester gentoo-dev 2019-06-12 16:17:29 UTC
Created attachment 579638 [details]
tessdata_legacy-4.0.0.ebuild

Right, here is my first draft of a tessdata_legacy ebuild. The ones for tessdata_best and tessdata_fast are essentially identical, the only difference other than supported languages (turns out that there is a difference even between _best and _fast) and RDEPEND constraints is that the latter two can use the name of the package directly as repo names.
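
To illustrate what a language mapping of this kind does, here is a small Python sketch (not the ebuild itself). The four entries are an illustrative subset chosen for this example, and the helper name `traineddata_files` is hypothetical; the attached ebuilds carry their own, much longer mapping, which differs between the three repositories.

```python
# Illustrative subset only: L10N codes mapped to Tesseract model names.
L10N_TO_TESSDATA = {
    "en": ["eng"],
    "de": ["deu"],
    "fr": ["fra"],
    "zh-CN": ["chi_sim"],
}


def traineddata_files(enabled_l10n):
    """List the .traineddata files to fetch for the enabled L10N codes.

    Codes without a known mapping are skipped silently.
    """
    files = []
    for code in sorted(enabled_l10n):
        for name in L10N_TO_TESSDATA.get(code, []):
            files.append(f"{name}.traineddata")
    return files
```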

One thing that is missing is the installation of script-detection trained models which are present in all three tessdata repositories. Does anyone know what these are useful for? If we do install them, IMHO we shouldn't bother with trying to map scripts to languages and simply install all of them.

We might also want to add a warning somewhere that only tessdata_best models can be used for retraining.
Comment 5 Bernard Cafarelli gentoo-dev 2019-07-30 07:41:09 UTC
Thanks, Marek! Yes, a runtime dependency sounds good (and it's nice that we have matches for all languages now, thanks for checking).

Let's try to get this integrated with the 4.1 bump; I should have time in the next few days.
Comment 6 Marek Szuba archtester gentoo-dev 2019-07-30 08:38:28 UTC
Attaching ebuilds for tessdata_best and tessdata_fast so that you needn't build their respective language mappings from scratch.

By the way, it might make sense to change URI_PREFIX from "https://github.com/tesseract-ocr/${MY_PN}/raw/${PV}/" to "https://raw.githubusercontent.com/tesseract-ocr/${PN}/${PV}/" so that the ebuild avoids triggering an HTTP redirect for every single downloaded file.
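
The difference between the two URL forms can be sketched in Python as follows. This is only an illustration: the function names are made up for this example, and it assumes that the path component after the repository name is the version tag and that each file is named after its model, for instance `eng.traineddata`.

```python
def redirecting_url(repo, version, filename):
    """github.com "raw" link, which answers with a redirect."""
    return f"https://github.com/tesseract-ocr/{repo}/raw/{version}/{filename}"


def direct_url(repo, version, filename):
    """raw.githubusercontent.com link, which serves the file directly."""
    return f"https://raw.githubusercontent.com/tesseract-ocr/{repo}/{version}/{filename}"


if __name__ == "__main__":
    # Compare both forms for one model file.
    print(redirecting_url("tessdata_fast", "4.0.0", "eng.traineddata"))
    print(direct_url("tessdata_fast", "4.0.0", "eng.traineddata"))
```

With more than a hundred files per package, using the direct form saves one extra round trip per file.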
Comment 7 Marek Szuba archtester gentoo-dev 2019-07-30 08:39:04 UTC
Created attachment 585110 [details]
tessdata_best-4.0.0.ebuild
Comment 8 Marek Szuba archtester gentoo-dev 2019-07-30 08:39:19 UTC
Created attachment 585112 [details]
tessdata_fast-4.0.0.ebuild
Comment 9 Marek Szuba archtester gentoo-dev 2019-07-30 08:41:30 UTC
Created attachment 585114 [details]
tessdata_best-4.0.0.ebuild

...and immediately after uploading one of them I spotted a mistake :-)
Comment 10 Marek Szuba archtester gentoo-dev 2019-07-30 08:44:56 UTC
Come to think of it, maybe it would make sense to create a simple eclass for these to avoid code duplication?
Comment 11 Bernard Cafarelli gentoo-dev 2019-07-30 18:52:58 UTC
Ack, I'll change URI_PREFIX before getting these in.

As for code duplication, I'm not sure 3 packages are worth an eclass. We can always update later; I kept your changes on the back burner for too long, so let's get them in first!
Comment 12 Larry the Git Cow gentoo-dev 2019-07-30 19:58:51 UTC
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=fa323ef580d791b1dc583fc6169238150b9d71d4

commit fa323ef580d791b1dc583fc6169238150b9d71d4
Author:     Bernard Cafarelli <voyageur@gentoo.org>
AuthorDate: 2019-07-30 19:57:12 +0000
Commit:     Bernard Cafarelli <voyageur@gentoo.org>
CommitDate: 2019-07-30 19:57:25 +0000

    app-text/tesseract: 4.1.0 bump
    
    This adds the ability to choose trained data files:
    * app-text/tessdata_fast: default and recommended for most users
    * app-text/tessdata_best: to trade a lot of speed for slightly better accuracy
    * app-text/tessdata_legacy: the only one that supports the legacy recognizer
    
    Closes: https://bugs.gentoo.org/663564
    Package-Manager: Portage-2.3.69, Repoman-2.3.16
    Signed-off-by: Bernard Cafarelli <voyageur@gentoo.org>

 app-text/tesseract/Manifest               |  1 +
 app-text/tesseract/metadata.xml           |  3 +-
 app-text/tesseract/tesseract-4.1.0.ebuild | 83 +++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+), 2 deletions(-)

Additionally, it has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=a7f9ece4bc650be7dc84a60e5f64eecf5b68a500

commit a7f9ece4bc650be7dc84a60e5f64eecf5b68a500
Author:     Bernard Cafarelli <voyageur@gentoo.org>
AuthorDate: 2019-07-30 19:37:44 +0000
Commit:     Bernard Cafarelli <voyageur@gentoo.org>
CommitDate: 2019-07-30 19:57:25 +0000

    app-text/tessdata_legacy: initial commit
    
    Thanks a lot to marecki for initial idea and ebuilds
    
    Bug: https://bugs.gentoo.org/663564
    Package-Manager: Portage-2.3.69, Repoman-2.3.16
    Signed-off-by: Bernard Cafarelli <voyageur@gentoo.org>

 app-text/tessdata_legacy/Manifest                  | 127 +++++++++++++++++++++
 app-text/tessdata_legacy/metadata.xml              |  16 +++
 .../tessdata_legacy/tessdata_legacy-4.0.0.ebuild   |  55 +++++++++
 3 files changed, 198 insertions(+)

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=87d7216af4d9960d6bac560553335a313dc70991

commit 87d7216af4d9960d6bac560553335a313dc70991
Author:     Bernard Cafarelli <voyageur@gentoo.org>
AuthorDate: 2019-07-30 19:36:44 +0000
Commit:     Bernard Cafarelli <voyageur@gentoo.org>
CommitDate: 2019-07-30 19:57:25 +0000

    app-text/tessdata_best: initial commit
    
    Thanks a lot to marecki for initial idea and ebuilds
    
    Bug: https://bugs.gentoo.org/663564
    Package-Manager: Portage-2.3.69, Repoman-2.3.16
    Signed-off-by: Bernard Cafarelli <voyageur@gentoo.org>

 app-text/tessdata_best/Manifest                   | 124 ++++++++++++++++++++++
 app-text/tessdata_best/metadata.xml               |  15 +++
 app-text/tessdata_best/tessdata_best-4.0.0.ebuild |  50 +++++++++
 3 files changed, 189 insertions(+)

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=f0b3b9b2cd3f4e1393108ea9c8eb181a5a29a1a8

commit f0b3b9b2cd3f4e1393108ea9c8eb181a5a29a1a8
Author:     Bernard Cafarelli <voyageur@gentoo.org>
AuthorDate: 2019-07-30 19:36:13 +0000
Commit:     Bernard Cafarelli <voyageur@gentoo.org>
CommitDate: 2019-07-30 19:57:24 +0000

    app-text/tessdata_fast: initial commit
    
    Thanks a lot to marecki for initial idea and ebuilds
    
    Bug: https://bugs.gentoo.org/663564
    Package-Manager: Portage-2.3.69, Repoman-2.3.16
    Signed-off-by: Bernard Cafarelli <voyageur@gentoo.org>

 app-text/tessdata_fast/Manifest                   | 124 ++++++++++++++++++++++
 app-text/tessdata_fast/metadata.xml               |  15 +++
 app-text/tessdata_fast/tessdata_fast-4.0.0.ebuild |  50 +++++++++
 3 files changed, 189 insertions(+)