How to search image content present in PDF file

srikanth savannagari
Greenhorn

Joined: Mar 22, 2011
Posts: 15

Hi all,

I am able to search the text content of a PDF using Apache Lucene, but my problem starts when the PDF
contains images: the search does not cover the content of those images. Does anybody know how to
search image content that is present in a PDF file?



Cheers
Srikanth
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42371
    
What do you mean by "searching the content of an image"? Do the images contain text, and you'd like to search that text? If so, that's a hard thing to do, and Lucene can't do it for you. You'd need to extract the images (maybe using a library like PDFBox) and then perform optical character recognition (OCR) on each image. That may provide you with text that you can index using Lucene.
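
To make the shape of that pipeline concrete, here is a rough sketch using only the JDK. It is not PDFBox or Lucene code. The `OcrEngine` interface is a hypothetical placeholder for whatever OCR library you end up choosing, the `InvertedIndex` class is a toy stand-in for a Lucene index, and the list of images is assumed to have already been extracted from the PDF (for example with PDFBox).

```java
import java.awt.image.BufferedImage;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ImageTextIndexSketch {

    /** Hypothetical OCR abstraction; a real OCR library would sit behind this. */
    interface OcrEngine {
        String recognize(BufferedImage image);
    }

    /** Toy in-memory inverted index, standing in for a Lucene index. */
    static class InvertedIndex {
        private final Map<String, Set<String>> postings = new HashMap<>();

        /** Splits the text into lower-case tokens and records the document id for each. */
        void add(String docId, String text) {
            for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
                }
            }
        }

        /** Returns the ids of all documents containing the given single term. */
        Set<String> search(String term) {
            return postings.getOrDefault(term.toLowerCase(), Collections.<String>emptySet());
        }
    }

    /** Runs OCR on every image taken from one PDF and indexes whatever text comes back. */
    static void indexImages(String docId, List<BufferedImage> images,
                            OcrEngine ocr, InvertedIndex index) {
        for (BufferedImage image : images) {
            String text = ocr.recognize(image);
            if (text != null && !text.isEmpty()) {
                index.add(docId, text);
            }
        }
    }

    public static void main(String[] args) {
        // Stub engine that returns canned text; a real engine would read the pixels.
        OcrEngine stubOcr = image -> "Invoice Total 42";
        // Placeholder image; in practice these would be the images pulled out of the PDF.
        List<BufferedImage> images =
                Arrays.asList(new BufferedImage(1, 1, BufferedImage.TYPE_INT_RGB));

        InvertedIndex index = new InvertedIndex();
        indexImages("report.pdf", images, stubOcr, index);
        System.out.println(index.search("invoice"));
    }
}
```

The point of the interface is that the indexing side never sees pixels, only the text the OCR step returns. Without some implementation of that step there is nothing to put into the index.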


syed aq
Greenhorn

Joined: Jul 16, 2011
Posts: 3
You need to use advanced PDF editing APIs like Aspose.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42371
    
Really? An editing API that knows how to do OCR? That *is* advanced.
srikanth savannagari
Greenhorn

Joined: Mar 22, 2011
Posts: 15
Hi Syed,

I have looked at the Aspose API. With it we can extract images from the PDF, but we can't extract the
content of those images.
Is there any other possibility?

Hi Ulf,

As per my requirements, I am not allowed to use OCR; that's why I am looking for a solution within Java.

Is there any option to parse the images using some API?

Thanks & Regards,
Srikanth
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42371
    
OCR is the process of extracting text from an image. In other words: no OCR --> no text. You will need to have that requirement changed (it sounds silly to begin with).
syed aq
Greenhorn

Joined: Jul 16, 2011
Posts: 3
I agree with Ulf too: you need to use OCR to extract text from images. You can find some Java OCR APIs.
sherazam khan
Ranch Hand

Joined: Mar 10, 2010
Posts: 303
I hope this may help regarding Aspose: Aspose.OCR for .NET is a character recognition component built to allow developers to add OCR functionality to their ASP.NET web applications, web services and Windows applications. It provides a simple set of classes for controlling character recognition tasks. It helps developers work with image (BMP, TIFF) files from within their own applications. It allows developers to extract text from images quickly and easily, saving the time and effort involved in developing an OCR solution from scratch. View more details at: http://www.aspose.com/categories/.net-components/aspose.ocr-for-.net/default.aspx

 
I agree. Here's the link: http://aspose.com/file-tools
 