Converting PNG to TIFF and character recognition with Tesseract
Denis Wen
Ranch Hand
Joined: Nov 11, 2008
Posts: 33
Hi,
I'm trying to have tesseract (http://code.google.com/p/tesseract-ocr/) read text from a TIFF image (converted from a PNG source, either with ImageIO on Linux or with Image Converter .EXE on Windows). The output text is either empty or looks like \\\\\\\\\\\\\\\\\\\\\HHHHHHHHHHHH\\\\\\\\\\\\\\\\\UU\\\\\\\\\\\\\\\H\W
Does anyone have an idea what could be causing the problem? I suspect it is related to the low contrast between the yellow background and the text, or to some attribute that needs to be set when converting from PNG.
Having a bit of experience with image processing (though not much with OCR), I would imagine that it's easier to perform OCR on a black-and-white image than on a colored image. If the images you need to work with are essentially bi-colored like the one shown above, converting it to B/W should not be hard to do, and may yield better results.
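To illustrate the B/W idea, here is a minimal sketch that maps each pixel to pure black or pure white based on its brightness. The class name Binarize, the method toBlackAndWhite and the threshold value are my own assumptions for illustration, not anything from tesseract or from the original post; the right threshold depends on the actual image.

```java
import java.awt.image.BufferedImage;

final class Binarize {
    // Hypothetical helper: returns a 1-bit black/white copy of src.
    // threshold is a brightness cutoff in the range 0-255 (an assumption; tune per image).
    static BufferedImage toBlackAndWhite(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Weighted brightness (common luma weights), integer arithmetic.
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                // Darker than the cutoff becomes black, everything else white.
                out.setRGB(x, y, luma < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }
}
```

With a cutoff around 128, a bright yellow background should end up white and dark text black, but that is worth checking on the real images.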
OK, I'll try that. What would you suggest as the best way to convert an image to grayscale? With ImageIO somehow?
Ulf Dittmer wrote:Having a bit of experience with image processing (though not much with OCR), I would imagine that it's easier to perform OCR on a black-and-white image than on a colored image. If the images you need to work with are essentially bi-colored like the one shown above, converting it to B/W should not be hard to do, and may yield better results.
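On the grayscale question, one possible approach is sketched below: read the file with ImageIO, draw it onto a BufferedImage of type TYPE_BYTE_GRAY, and write the result back out. The class name GrayscaleConverter and the method toGray are hypothetical names chosen for this example. The sketch writes PNG because that writer ships with every JDK; as far as I know, a built-in TIFF writer is only available in ImageIO from Java 9 on, so for older versions a plugin or an external converter would still be needed.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

final class GrayscaleConverter {
    // Hypothetical helper: returns an 8-bit grayscale copy of src.
    static BufferedImage toGray(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // Drawing onto a gray image lets Java 2D do the color conversion.
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: GrayscaleConverter <input> <output.png>");
            return;
        }
        BufferedImage src = ImageIO.read(new File(args[0]));
        if (src == null) {
            // ImageIO.read returns null when no registered reader handles the file.
            throw new IOException("Unsupported image format: " + args[0]);
        }
        ImageIO.write(toGray(src), "png", new File(args[1]));
    }
}
```

It would be run as something like java GrayscaleConverter input.png output.png, where both file names are placeholders. Whether grayscale alone is enough for tesseract, or whether a hard black/white threshold works better, is something to try on the actual images.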