ocr from a website - how and how difficult?

Denis Wen
Ranch Hand

Joined: Nov 11, 2008
Posts: 33
Hi,

Can anyone share their experience using OCR software with Java to parse web pages?
My prospective task could turn out to be parsing a website where, unfortunately, some of the important information is displayed as images. So I thought it would be better to first ask for advice on good (and possibly free) OCR software to use, and maybe hear about someone's experience with it.

Thanks
salvin francis
Ranch Hand

Joined: Jan 12, 2009
Posts: 928

Once you have converted a webpage to an image (which sounds quite abstract to me),

you should probably convert it to grayscale and then apply custom filters to push the contrast to a very high value.
The resulting output would be an image that's pure black and white, with text ready for OCR.
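
As a rough sketch of that preprocessing step, assuming a fixed luminance threshold is good enough for the images in question: the class name `OcrPreprocess`, the method `toBlackAndWhite` and the threshold of 128 are made up for illustration, and a real page may need a tuned or adaptive threshold.

```java
import java.awt.image.BufferedImage;

class OcrPreprocess {

    /**
     * Returns a 1-bit black and white copy of src.
     * Pixels darker than the threshold (0-255) become black, the rest white.
     * Transparency is ignored; only the RGB channels are looked at.
     */
    static BufferedImage toBlackAndWhite(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Grayscale step: weighted luminance (ITU-R BT.601 weights)
                int luminance = (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
                // "Very high contrast" step: snap each pixel to black or white
                out.setRGB(x, y, luminance < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }
}
```

You would load the PNG with `ImageIO.read(...)`, call `OcrPreprocess.toBlackAndWhite(image, 128)`, and write the result back out with `ImageIO.write(...)` before handing it to whatever OCR tool you end up using.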


I do not know of any current APIs that support OCR.

Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42355
I'd start by looking at the various packages that a search for "java ocr" brings up.


Denis Wen
Ranch Hand

Joined: Nov 11, 2008
Posts: 33
Thanks for the replies.

No, I was not going to convert the whole website into an image. There are just some pieces of information on the site that are shown as PNG images rather than as text, and those should be converted to text with OCR. At the moment I'm looking at Tesseract (http://code.google.com/p/tesseract-ocr/), since it's free (unlike the Java-coded Asprise) and quite simple, though it has to be called from the command line.
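
For what it's worth, calling it from Java could look roughly like the sketch below. It assumes the `tesseract` executable is on the PATH and relies on the usual `tesseract <image> <outputbase>` invocation, which writes the recognized text to `<outputbase>.txt`. The class and method names (`TesseractCli`, `buildCommand`, `recognize`) are hypothetical and only there for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class TesseractCli {

    /** Builds the command line: tesseract <image> <outputbase>. */
    static List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    /** Runs Tesseract on one image file and returns the recognized text. */
    static String recognize(Path image) throws IOException, InterruptedException {
        Path dir = Files.createTempDirectory("ocr");
        Path outputBase = dir.resolve("result");
        Path textFile = dir.resolve("result.txt"); // Tesseract appends ".txt" itself
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase))
                    .inheritIO() // pass Tesseract's console messages through
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8).trim();
        } finally {
            // Clean up the temporary output file and directory
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(dir);
        }
    }
}
```

So after downloading one of the PNGs to disk, something like `TesseractCli.recognize(Paths.get("price.png"))` (the file name is just an example) should give back the text, provided the image is clean enough for Tesseract to read.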
 