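There is plenty of material online about installing and using pytesser on Windows, but none for Linux.
Although the author does not say that pytesser has been tested on Linux, he does state that "The scripts should work in Linux as well."
Today I compiled, installed, and tried it out on my Ubuntu 9.10 machine. I ran into a few problems along the way and solved them, so I am recording the steps here: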
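- pytesser depends on PIL, so the PIL module has to be installed first. (Download the Imaging-1.1.7.tar.gz source package. python-dev must also be installed: sudo apt-get install python-dev.)
- pytesser calls tesseract, so tesseract has to be installed as well: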
First install these libraries with the package manager:
    sudo apt-get install libpng12-dev
    sudo apt-get install libjpeg62-dev
    sudo apt-get install libtiff4-dev
    sudo apt-get install zlib1g-dev
Then download the tesseract source package: http://tesseract-ocr.googlecode.com/files/tesseract-3.00.tar.gz
Unpack it and cd into the extracted directory, tesseract-3.00/.
Run ./configure with --prefix set to the path you want to install to, for example:
    ./configure --prefix=/home/pf-miles/installation/install/tesseract
Then run make && make install.
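Add the directory containing the tesseract executable to the PATH environment variable, for example:
    export PATH=$PATH:/home/pf-miles/installation/install/tesseract/bin
This path depends on the prefix you passed to configure above.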
Go to http://code.google.com/p/tesseract-ocr/downloads/list and download the latest eng.traineddata.gz file, unpack it, and put the resulting eng.traineddata into /home/pf-miles/installation/install/tesseract/share/tessdata. Note that the tesseract svn trunk also contains this file, but that copy cannot be used; it fails with this error:
    actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 55
For details see http://www.uluga.ubuntuforums.org/showthread.php?p=10248384. So be sure to use the copy downloaded from http://code.google.com/p/tesseract-ocr/downloads/list.
Try it out:
    pf-miles@pf-miles-desktop:~/downloads$ tesseract
    Usage:tesseract imagename outputbase [-l lang] [configfile [[+|-]varfile]...]
OK, tesseract is installed.
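The installation steps above leave three things that can go wrong: the binary is not where the prefix says it should be, eng.traineddata is missing from share/tessdata, or the bin directory never made it onto PATH. The following is a minimal sketch of a check for those three conditions, assuming Python 3.3 or later (for shutil.which). The function name check_tesseract_install and its parameters are made up for illustration and are not part of tesseract or pytesser.

```python
import os
import shutil


def check_tesseract_install(prefix, lang="eng", search_path=None):
    """Return a list of problems found for a tesseract install under `prefix`.

    `prefix` is the directory that was passed to ./configure --prefix=...
    `search_path` is a PATH-style string; None means use the real PATH.
    (Hypothetical helper, not part of tesseract or pytesser.)
    """
    problems = []

    # The executable should have been installed to <prefix>/bin/tesseract.
    binary = os.path.join(prefix, "bin", "tesseract")
    if not os.path.isfile(binary):
        problems.append("missing binary: " + binary)

    # The language data should sit in <prefix>/share/tessdata.
    traineddata = os.path.join(prefix, "share", "tessdata", lang + ".traineddata")
    if not os.path.isfile(traineddata):
        problems.append("missing language data: " + traineddata)

    # The command must also be reachable through PATH.
    if shutil.which("tesseract", path=search_path) is None:
        problems.append("tesseract is not on PATH")

    return problems
```

Calling check_tesseract_install("/home/pf-miles/installation/install/tesseract") returns an empty list when all three checks pass, and otherwise one message per failed check. It only looks at the file system; it does not detect the svn-trunk eng.traineddata problem described above.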
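- Download the pytesser package: http://pytesser.googlecode.com/files/pytesser_v0.0.1.zip (version 0.0.1 at the time of writing), unpack it, and cd into the extracted directory.
- The directory contains an image file, "phototest.tif", for testing. Write a Python script directly in that directory to test it: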
test.py:
    from pytesser import *
    im = Image.open('phototest.tif')
    text = image_to_string(im)
    print text
Run it, and the result is:
    pf-miles@pf-miles-desktop:~/downloads/pytesser$ python test.py 2>/dev/null
Thls IS a lot of 12 pornt text to test the
ocr code and see lf It works on all types
of frle format
lazy fox The qurck brown dog jumped
over the lazy fox The qulck brown dog
jumped over the lazy fox The QUICK
brown dog jumped over the lazy fox
The quick brown dog jumped over the
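Since pytesser works by calling the tesseract command, the same round trip can be sketched without it. The block below is an illustration only, not pytesser's actual code. It assumes Python 3, that tesseract is on PATH, and that tesseract writes its result to outputbase.txt as the usage line (imagename outputbase [-l lang]) suggests. The names build_tesseract_command and ocr_file are hypothetical.

```python
import os
import subprocess
import tempfile


def build_tesseract_command(image_path, output_base, lang=None):
    """Build the argument list: tesseract imagename outputbase [-l lang]."""
    cmd = ["tesseract", image_path, output_base]
    if lang:
        cmd += ["-l", lang]
    return cmd


def ocr_file(image_path, lang=None):
    """Run tesseract on an image file and return the recognized text.

    Hypothetical helper; assumes the tesseract binary is on PATH.
    """
    with tempfile.TemporaryDirectory() as tmp:
        output_base = os.path.join(tmp, "out")
        # Discard tesseract's stderr chatter, like `2>/dev/null` above.
        subprocess.run(
            build_tesseract_command(image_path, output_base, lang),
            check=True,
            stderr=subprocess.DEVNULL,
        )
        # tesseract appends ".txt" to the output base name.
        with open(output_base + ".txt", encoding="utf-8") as f:
            return f.read()
```

With the test image from the pytesser directory, print(ocr_file("phototest.tif")) would be the equivalent of test.py. Unlike image_to_string, this sketch takes a file path rather than a PIL image object, so it does not need PIL at all.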