Hi Gentoo Community, related to my Bug #663482: could you please provide additional USE flags for the package (tessdata tessdata_best tessdata_fast)? For tesseract 4 there are three types of prepared tesseract training data:
https://github.com/tesseract-ocr/tessdata
https://github.com/tesseract-ocr/tessdata_best
https://github.com/tesseract-ocr/tessdata_fast
But I don't know whether it would be better to use additional L10N flags instead. For example: l10n_en l10n_en_fast l10n_en_best
Thanks in advance. Best regards, Andreas
Looking at https://github.com/tesseract-ocr/tesseract/wiki/Data-Files we should:
* use the _light variant in 4.0
* supporting _best would be nice (probably via a USE flag)
* maybe support the legacy one (currently in the ebuild)
Updating the title, as I will only get to this after the 4.0 bump itself.
My gut feeling is that it would make sense to make the three types of training data available as separate packages, and then have app-text/tesseract depend on the correct one through appropriate USE flags. To begin with, we could make the three data packages mutually exclusive so that they can use the same destination directory.
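The mutual exclusion proposed above could be expressed with Portage blockers. What follows is only a minimal sketch of what a hypothetical app-text/tessdata_fast ebuild might declare; it is an assumption for illustration and not taken from the ebuilds attached to this bug.

```sh
# Hypothetical fragment of an app-text/tessdata_fast ebuild (illustration only).
# A leading "!" in a dependency atom is a blocker: the package manager will not
# keep this package installed alongside the blocked ones, so all three data
# packages could share the same destination directory.
RDEPEND="
	!app-text/tessdata_best
	!app-text/tessdata_legacy
"
```

The other two data packages would carry the matching blockers for their siblings.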
Created attachment 579638 [details] tessdata_legacy-4.0.0.ebuild
Right, here is my first draft of a tessdata_legacy ebuild. The ones for tessdata_best and tessdata_fast are essentially identical. Apart from the supported languages (it turns out that these differ even between _best and _fast) and the RDEPEND constraints, the only difference is that the latter two can use the package name directly as the repository name.
One thing that is missing is the installation of the script-detection trained models, which are present in all three tessdata repositories. Does anyone know what these are useful for? If we do install them, IMHO we shouldn't bother trying to map scripts to languages and should simply install all of them.
We might also want to add a warning somewhere that only tessdata_best models can be used for retraining.
Thanks, Marek! Yes, a runtime dependency sounds good (and it's nice that we have matches for all languages now, thanks for checking). Let's try to get this integrated with the 4.1 bump; I should have time in the next few days.
Attaching ebuilds for tessdata_best and tessdata_fast so that you needn't build their respective language mappings from scratch. By the way, it might make sense to change URI_PREFIX from "https://github.com/tesseract-ocr/${MY_PN}/raw/${PV}/" to "https://raw.githubusercontent.com/tesseract-ocr/${PN}/${PV}/" so that the ebuild avoids triggering an HTTP redirect for every single downloaded file.
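To illustrate the direct-download form: raw.githubusercontent.com URLs follow the pattern owner/repository/ref/path. The sketch below builds one such URL per language file. The helper name tessdata_uri and the language list are hypothetical and are not part of the attached ebuilds.

```sh
#!/bin/sh
# Sketch only: tessdata_uri is a hypothetical helper, not part of any eclass
# or of the ebuilds attached to this bug.

# Print the direct download URL for one trained-data file.
# $1 = repository name (e.g. tessdata_fast)
# $2 = tag or version (e.g. 4.0.0)
# $3 = tesseract language code (e.g. eng)
tessdata_uri() {
	printf 'https://raw.githubusercontent.com/tesseract-ocr/%s/%s/%s.traineddata\n' "$1" "$2" "$3"
}

# Print one URL per language code; the list here is only an example.
for lang in eng deu fra; do
	tessdata_uri tessdata_fast 4.0.0 "$lang"
done
```

In an ebuild the same prefix would normally be written once as a variable and reused for every SRC_URI entry; the function form is used here only so the pattern can be run outside Portage.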
Created attachment 585110 [details] tessdata_best-4.0.0.ebuild
Created attachment 585112 [details] tessdata_fast-4.0.0.ebuild
Created attachment 585114 [details] tessdata_best-4.0.0.ebuild
...and immediately after uploading one of them I spotted a mistake :-)
Come to think of it, maybe it would make sense to create a simple eclass for these to avoid code duplication?
Ack, I'll change URI_PREFIX before getting these in. As for code duplication, I'm not sure 3 packages are worth an eclass. We can always update later. I kept your changes on the back burner for too long, so let's get them in first!
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=fa323ef580d791b1dc583fc6169238150b9d71d4

commit fa323ef580d791b1dc583fc6169238150b9d71d4
Author:     Bernard Cafarelli <voyageur@gentoo.org>
AuthorDate: 2019-07-30 19:57:12 +0000
Commit:     Bernard Cafarelli <voyageur@gentoo.org>
CommitDate: 2019-07-30 19:57:25 +0000

    app-text/tesseract: 4.1.0 bump

    This adds the ability to choose trained data files:
    * app-text/tessdata_fast: default and recommended for most users
    * app-text/tessdata_best: to trade a lot of speed for slightly better accuracy
    * app-text/tessdata_legacy: the only one that supports the legacy recognizer

    Closes: https://bugs.gentoo.org/663564
    Package-Manager: Portage-2.3.69, Repoman-2.3.16
    Signed-off-by: Bernard Cafarelli <voyageur@gentoo.org>

 app-text/tesseract/Manifest               |  1 +
 app-text/tesseract/metadata.xml           |  3 +-
 app-text/tesseract/tesseract-4.1.0.ebuild | 83 +++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+), 2 deletions(-)

Additionally, it has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=a7f9ece4bc650be7dc84a60e5f64eecf5b68a500

commit a7f9ece4bc650be7dc84a60e5f64eecf5b68a500
Author:     Bernard Cafarelli <voyageur@gentoo.org>
AuthorDate: 2019-07-30 19:37:44 +0000
Commit:     Bernard Cafarelli <voyageur@gentoo.org>
CommitDate: 2019-07-30 19:57:25 +0000

    app-text/tessdata_legacy: initial commit

    Thanks a lot to marecki for initial idea and ebuilds

    Bug: https://bugs.gentoo.org/663564
    Package-Manager: Portage-2.3.69, Repoman-2.3.16
    Signed-off-by: Bernard Cafarelli <voyageur@gentoo.org>

 app-text/tessdata_legacy/Manifest                  | 127 +++++++++++++++++++++
 app-text/tessdata_legacy/metadata.xml              |  16 +++
 .../tessdata_legacy/tessdata_legacy-4.0.0.ebuild   |  55 +++++++++
 3 files changed, 198 insertions(+)

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=87d7216af4d9960d6bac560553335a313dc70991

commit 87d7216af4d9960d6bac560553335a313dc70991
Author:     Bernard Cafarelli <voyageur@gentoo.org>
AuthorDate: 2019-07-30 19:36:44 +0000
Commit:     Bernard Cafarelli <voyageur@gentoo.org>
CommitDate: 2019-07-30 19:57:25 +0000

    app-text/tessdata_best: initial commit

    Thanks a lot to marecki for initial idea and ebuilds

    Bug: https://bugs.gentoo.org/663564
    Package-Manager: Portage-2.3.69, Repoman-2.3.16
    Signed-off-by: Bernard Cafarelli <voyageur@gentoo.org>

 app-text/tessdata_best/Manifest                   | 124 ++++++++++++++++++++++
 app-text/tessdata_best/metadata.xml               |  15 +++
 app-text/tessdata_best/tessdata_best-4.0.0.ebuild |  50 +++++++++
 3 files changed, 189 insertions(+)

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=f0b3b9b2cd3f4e1393108ea9c8eb181a5a29a1a8

commit f0b3b9b2cd3f4e1393108ea9c8eb181a5a29a1a8
Author:     Bernard Cafarelli <voyageur@gentoo.org>
AuthorDate: 2019-07-30 19:36:13 +0000
Commit:     Bernard Cafarelli <voyageur@gentoo.org>
CommitDate: 2019-07-30 19:57:24 +0000

    app-text/tessdata_fast: initial commit

    Thanks a lot to marecki for initial idea and ebuilds

    Bug: https://bugs.gentoo.org/663564
    Package-Manager: Portage-2.3.69, Repoman-2.3.16
    Signed-off-by: Bernard Cafarelli <voyageur@gentoo.org>

 app-text/tessdata_fast/Manifest                   | 124 ++++++++++++++++++++++
 app-text/tessdata_fast/metadata.xml               |  15 +++
 app-text/tessdata_fast/tessdata_fast-4.0.0.ebuild |  50 +++++++++
 3 files changed, 189 insertions(+)