Google just released Tesseract OCR, an optical character recognition package originally developed by Hewlett-Packard and now released as open source software at SourceForge (http://sourceforge.net/projects/tesseract-ocr). I would very much like to try it out and hope it can be included as an ebuild.
Created an attachment (id=96085)
Ebuild
I tried to make this as robust as I could, but since this is my first ebuild, I'm sure there are things I could have done differently. I have mine in ${PORTDIR_OVERLAY}/media-gfx/tesseract
Created an attachment (id=96086)
Xterm path patch
Place this in the 'files' subdir. It corrects a hardcoded path in the source code: it changes /usr/bin/X11/xterm to /usr/bin/xterm. Mine is installed in ${PORTDIR_OVERLAY}/media-gfx/tesseract/files
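For anyone who wants to see what the path fix amounts to without opening the attachment, here is a minimal sketch of the same substitution done with sed. The helper name fix_xterm_path is hypothetical, and the thread does not say which source file holds the hardcoded path, so the file argument is an assumption. It also assumes GNU sed (for -i). The attached patch is the authoritative version.

```shell
# Hypothetical helper: rewrite the hardcoded xterm path in a single file.
# "$1" is whichever source file contains the path (not named in this thread).
fix_xterm_path() {
    sed -i 's|/usr/bin/X11/xterm|/usr/bin/xterm|g' "$1"
}
```

In an ebuild this kind of edit would normally live in the unpack phase, next to or instead of applying the patch file.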
Putting the ebuild aside, were you able to get tesseract processing correctly? Can you run the test image?
I'll say that I cannot. I've just downloaded and compiled the package (v1.01) from SF and it "hangs" using 100% CPU. strace shows these as the last few calls it makes when invoked as:
tesseract phototest.tif phototest batch

write(2, "Tesseract Open Source OCR Engine"..., 33Tesseract Open Source OCR Engine
) = 33
open("phototest.tif", O_RDONLY|O_LARGEFILE) = 6
read(6, "II*\0\10\0\0\0", 8) = 8
fstat64(6, {st_mode=S_IFREG|0644, st_size=38668, ...}) = 0
mmap2(NULL, 38668, PROT_READ, MAP_SHARED, 6, 0) = 0xb7f64000
fstat64(6, {st_mode=S_IFREG|0644, st_size=38668, ...}) = 0
brk(0x86b6000) = 0x86b6000
munmap(0xb7f64000, 38668) = 0
close(6) = 0
open("phototest.bl", O_RDONLY) = -1 ENOENT (No such file or directory)
open("phototest.vec", O_RDONLY) = -1 ENOENT (No such file or directory)
open("phototest.uzn", O_RDONLY) = -1 ENOENT (No such file or directory)
open("phototest.pd", O_RDONLY) = -1 ENOENT (No such file or directory)
times({tms_utime=10, tms_stime=4, tms_cutime=0, tms_cstime=0}) = 1838221478
brk(0x86d7000) = 0x86d7000
times({tms_utime=12, tms_stime=4, tms_cutime=0, tms_cstime=0}) = 1838221479
I tried this ebuild with tesseract 1.02, and it seems to work. Accuracy on phototest.tif was 100%. Unfortunately, the other pages I've tried haven't fared so well; I get a lot of typos. But it is producing recognizable text. Note that the ebuild says LICENSE="GPL-1" when it should have been LICENSE="Apache-2.0".
There are some useful scripts here that might help make it more user-friendly: http://www.groklaw.net/article.php?story=20061210115516438
It is already in Ubuntu, so this might help: http://packages.ubuntu.com/feisty/graphics/tesseract-ocr
I would like to confirm that tesseract-ocr works fine in conjunction with mail-filter/spamassassin-fuzzyocr-3.5.0 from this ebuild request: https://bugs.gentoo.org/show_bug.cgi?id=158445
However, since the filename on SourceForge is tesseract-<ver>.tar.gz, I created it as media-gfx/tesseract/tesseract-1.02.ebuild in my Portage overlay. This was easiest for me.
Created an attachment (id=108906)
tesseract-1.02.ebuild
New ebuild for tesseract. It installs tesseract to /usr/lib/tesseract and installs a wrapper in /usr/bin (this is more the Gentoo way of doing things).
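To illustrate the wrapper approach described above, here is a rough sketch of generating a wrapper script that forwards all arguments to the real binary. The helper name make_wrapper and the assumption that the binary is called tesseract directly inside the library directory are mine, not taken from the attached ebuild, which may lay things out differently.

```shell
# Hypothetical helper: write a wrapper script to "$1" that execs the real
# binary, assumed to be named "tesseract" inside the directory "$2".
make_wrapper() {
    wrapper=$1
    libdir=$2
    # ${libdir} is expanded now; \$@ stays literal so the wrapper
    # forwards its own arguments at run time.
    cat > "$wrapper" <<EOF
#!/bin/sh
exec "${libdir}/tesseract" "\$@"
EOF
    chmod +x "$wrapper"
}
```

With the layout from this comment, the call would look like make_wrapper /usr/bin/tesseract /usr/lib/tesseract, although inside an ebuild the file would be written under the image directory rather than to the live filesystem.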
Created an attachment (id=108907)
tesseract-1.02.ebuild
This needs some error trapping.
Created an attachment (id=108940)
tesseract-1.02.02022007.ebuild
Ebuild for a CVS pull of tesseract. Unlike the release version, this one compiles cleanly on amd64. It has been cleaned up quite a bit and should be ready for Portage.
Created an attachment (id=108943)
tesseract-1.02.02022007.ebuild
Oops, this one actually works properly.
Added to CVS as app-text/tesseract.
Patrick, just a few minutes too early. 1.0.3 was released half an hour ago ,-)