How to search image content present in PDF file

 
srikanth savannagari
Greenhorn
Posts: 15

Hi All,

I am able to search the content of a PDF using Apache Lucene, but my problem starts when the PDF
contains images: the content of those images is not searched. Does anybody know how to
search image content that is present in a PDF file?



Cheers
Srikanth
 
Ulf Dittmer
Rancher
Posts: 42967
What do you mean by "searching the content of an image"? Do the images contain text, and you'd like to search in that text? If so, that's a hard thing to do, and Lucene can't do it for you. You'd need to extract the images (maybe using a library like PDFBox), and then perform optical character recognition (OCR) on each image. That may provide you with text that you can index using Lucene.
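
To make the extract, recognize, index sequence concrete, here is a minimal sketch in plain Java. Everything in it is an assumption for illustration: `OcrEngine` is a hypothetical interface standing in for whatever OCR library you end up using, `InvertedIndex` is a toy in-memory stand-in for a Lucene index, and the image bytes are assumed to have been extracted from the PDF already. It does not call PDFBox, Lucene, or any real OCR API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ImageTextIndexSketch {

    /** Hypothetical OCR abstraction: image bytes in, recognized text out. */
    interface OcrEngine {
        String recognize(byte[] image);
    }

    /** Toy in-memory inverted index (term -> document ids), standing in for Lucene. */
    static class InvertedIndex {
        private final Map<String, Set<String>> postings = new HashMap<>();

        void add(String docId, String text) {
            // Split on anything that is not a letter or digit; lower-case for matching.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
                }
            }
        }

        Set<String> search(String term) {
            return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
        }
    }

    /** Indexes the PDF's ordinary text plus whatever text OCR finds in its images. */
    static void indexPdf(String docId, String bodyText, List<byte[]> images,
                         OcrEngine ocr, InvertedIndex index) {
        index.add(docId, bodyText);
        for (byte[] image : images) {
            index.add(docId, ocr.recognize(image));
        }
    }

    public static void main(String[] args) {
        // Fake OCR for the demo only: pretends the image bytes are UTF-8 text.
        OcrEngine fakeOcr = image -> new String(image, StandardCharsets.UTF_8);
        InvertedIndex index = new InvertedIndex();
        List<byte[]> images = Collections.singletonList(
                "Quarterly revenue chart".getBytes(StandardCharsets.UTF_8));

        indexPdf("report.pdf", "Annual summary", images, fakeOcr, index);

        System.out.println(index.search("revenue")); // [report.pdf]
        System.out.println(index.search("missing")); // []
    }
}
```

The point of the sketch is only the shape of the pipeline: the text recognized in each image is fed into the same index, under the same document id, as the text extracted from the PDF itself. Without a real OCR step behind `OcrEngine`, there is nothing to index for the images.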
 
syed aq
Greenhorn
Posts: 3
You need to use an advanced PDF editing API like Aspose.
 
Ulf Dittmer
Rancher
Posts: 42967
Really? An editing API that knows how to do OCR? That *is* advanced.
 
srikanth savannagari
Greenhorn
Posts: 15
Hi Syed,

I have looked at the Aspose API. Using it we can extract images from the PDF, but we can't extract the
content of those images.
Is there any other possibility?

Hi Ulf,

As per my requirements, I am not allowed to use OCR; that's why I am looking for a solution within Java.

Is there any option to parse the images using an API?

Thanks & Regards,
Srikanth
 
Ulf Dittmer
Rancher
Posts: 42967
OCR is the process of extracting text from an image. In other words: no OCR --> no text. You will need to have that requirement changed (it sounds silly to begin with).
 
syed aq
Greenhorn
Posts: 3
I agree with Ulf: you need to use OCR to extract text from images. You can find some Java OCR APIs.
 
sherazam khan
Ranch Hand
Posts: 459
I hope this helps regarding Aspose: Aspose.OCR for .NET is a character recognition component that lets developers add OCR functionality to their ASP.NET web applications, web services and Windows applications. It provides a simple set of classes for controlling character recognition tasks and works with image files (BMP, TIFF) from within your own application. It lets developers extract text from images without building an OCR solution from scratch. View more details at: http://www.aspose.com/categories/.net-components/aspose.ocr-for-.net/default.aspx

 