ocr from a website - how and how difficult?

Denis Wen
Ranch Hand

Joined: Nov 11, 2008
Posts: 33
Hi,

Can anyone share their experience of using OCR software with Java to parse webpages?
My prospective task could turn out to be parsing a website where, unfortunately, some of the important information is displayed as images. So I thought it would be better to first ask for advice on good (and possibly free) OCR software to use, and perhaps hear about someone's experience with it.

Thanks
salvin francis
Ranch Hand

Joined: Jan 12, 2009
Posts: 917

Once you have converted the webpage to an image (that step is quite abstract to me),

you should probably convert it to grayscale and then apply custom filters to raise the contrast to a very high value.
The resulting output would be a black-and-white image with text that is ready for OCR.
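
The grayscale-then-high-contrast step described above could be sketched roughly as follows. This is only an illustration, not something taken from a particular OCR library: the class name `OcrPreprocess`, the method `binarize`, and the fixed-threshold approach are all assumptions, and it assumes an opaque image with dark text on a light background.

```java
import java.awt.image.BufferedImage;

public class OcrPreprocess {

    /**
     * Returns a pure black-and-white copy of src: every pixel whose
     * luminance is at or above the threshold (0-255) becomes white,
     * everything else becomes black. Alpha is ignored (assumption).
     */
    public static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Weighted grayscale conversion (common luma weights).
                int gray = (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
                // Maximum contrast: snap each pixel to white or black.
                out.setRGB(x, y, gray >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return out;
    }
}
```

A threshold of around 128 is a reasonable starting point, but the right value depends on the images, so it is worth trying a few values on real samples.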


I do not know of any current APIs that support OCR.

My Website: [Salvin.in] Cool your mind:[Salvin.in/painting] My Sally:[Salvin.in/sally]
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41127
I'd start by looking at the various packages that a search for "java ocr" brings up.


Ping & DNS - my free Android networking tools app
Denis Wen
Ranch Hand

Joined: Nov 11, 2008
Posts: 33
Thanks for the replies.

No, I was not going to convert the whole website into an image. There are just some pieces of information on the website that are not in textual form but are PNG images, and those should be converted to text with OCR. At the moment I am looking at Tesseract (http://code.google.com/p/tesseract-ocr/), since it's free (unlike the Java-based Asprise) and quite simple, though it has to be called from the command line.
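
Calling Tesseract from Java through the command line might look something like the sketch below. It assumes a `tesseract` executable is on the PATH and relies on the basic `tesseract <image> <outputbase>` invocation, which writes the recognised text to `<outputbase>.txt`. The class and method names (`TesseractCli`, `buildCommand`, `recognize`) are made up for illustration, and the code needs Java 11 or later.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class TesseractCli {

    /** Builds the command line: tesseract <image> <outputbase>. */
    public static List<String> buildCommand(Path image, Path outputBase) {
        return List.of("tesseract", image.toString(), outputBase.toString());
    }

    /** Runs Tesseract on one image file and returns the recognised text. */
    public static String recognize(Path image) throws IOException, InterruptedException {
        // Tesseract appends ".txt" to the output base name itself.
        Path outputBase = Files.createTempFile("ocr", "");
        Path textFile = Path.of(outputBase + ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase))
                    .redirectErrorStream(true)
                    // Discard console output so the child cannot block on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return Files.readString(textFile).trim();
        } finally {
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }
}
```

With the PNGs from the site downloaded to disk, a call such as `TesseractCli.recognize(Path.of("price.png"))` would return the text of one image; `price.png` is just a placeholder file name.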
 
I agree. Here's the link: http://aspose.com/file-tools
 