
How to search image content present in PDF file

 
Greenhorn
Posts: 15

Hi All,

I am able to search the content of a PDF using Apache Lucene, but my problem starts when the PDF
contains images: the content of those images is not searched. Does anybody know how to
search image content that is present in a PDF file?



Cheers
Srikanth
 
Rancher
Posts: 43081
What do you mean by "searching the content of an image"? Do the images contain text, and you'd like to search in that text? If so, that's a hard thing to do, and Lucene can't do it for you. You'd need to extract the images (maybe using a library like PDFBox), and then perform optical character recognition (OCR) on each image. That may provide you with text that you can index using Lucene.
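A rough sketch of that extract, recognize, index flow, just to show how the pieces fit together. Everything named here is an assumption for illustration: `ImageExtractor` and `OcrEngine` are hypothetical interfaces that real code would implement on top of something like PDFBox and an OCR library, and `InvertedIndex` is a tiny in-memory stand-in for a Lucene index, not Lucene itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ImageTextPipeline {

    /** Hypothetical: returns the raw bytes of every image embedded in a PDF. */
    interface ImageExtractor {
        List<byte[]> extractImages(String pdfPath);
    }

    /** Hypothetical: turns image bytes into recognized text (the OCR step). */
    interface OcrEngine {
        String recognize(byte[] image);
    }

    /** Minimal in-memory stand-in for a full-text index such as Lucene. */
    static final class InvertedIndex {
        private final Map<String, Set<String>> postings = new HashMap<>();

        void add(String docId, String text) {
            // Lowercase and split on non-word characters to get simple tokens.
            for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
                }
            }
        }

        Set<String> search(String term) {
            return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
        }
    }

    /** Extracts each image, runs OCR on it, and indexes any text that comes back. */
    static void indexPdfImages(String pdfPath, ImageExtractor extractor,
                               OcrEngine ocr, InvertedIndex index) {
        for (byte[] image : extractor.extractImages(pdfPath)) {
            String text = ocr.recognize(image);
            if (text != null && !text.trim().isEmpty()) {
                index.add(pdfPath, text);
            }
        }
    }

    public static void main(String[] args) {
        // Fakes so the sketch runs without any PDF or OCR library:
        // the "image" is just UTF-8 text, and the "OCR" decodes it again.
        ImageExtractor fakeExtractor = path ->
                Collections.singletonList("Invoice total 42".getBytes(StandardCharsets.UTF_8));
        OcrEngine fakeOcr = bytes -> new String(bytes, StandardCharsets.UTF_8);

        InvertedIndex index = new InvertedIndex();
        indexPdfImages("report.pdf", fakeExtractor, fakeOcr, index);

        System.out.println(index.search("invoice")); // prints [report.pdf]
        System.out.println(index.search("receipt")); // prints []
    }
}
```

The point of keeping the two steps behind interfaces is that the indexing code does not care where the text came from; without some implementation of the recognition step, there is simply nothing to index.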
 
Greenhorn
Posts: 3
You need to use an advanced PDF editing API like Aspose.
 
Ulf Dittmer
Rancher
Posts: 43081
Really? An editing API that knows how to do OCR? That *is* advanced.
 
srikanth savannagari
Greenhorn
Posts: 15
Hi Syed,

I have looked at the Aspose API. With it we can extract images from the PDF, but we can't extract the
content of those images.
Is there any other possibility?

Hi Ulf,

As per my requirements, we are not allowed to use OCR; that's why I am looking for a solution within Java.

Is there any option to parse the images using an API?

Thanks & Regards,
Srikanth
 
Ulf Dittmer
Rancher
Posts: 43081
OCR is the process of extracting text from an image. In other words: no OCR --> no text. You will need to have that requirement changed (it sounds silly to begin with).
 
syed aq
Greenhorn
Posts: 3
I agree with Ulf: you need to use OCR to extract text from images. You can find some Java OCR APIs.
 
Ranch Hand
Posts: 714
I hope this helps regarding Aspose: Aspose.OCR for .NET is a character recognition component built to let developers add OCR functionality to their ASP.NET web applications, web services and Windows applications. It provides a simple set of classes for controlling character recognition tasks and works with image files (BMP, TIFF) from within your own applications. It lets developers extract text from images quickly and easily, saving the time and effort of developing an OCR solution from scratch. View more details at: http://www.aspose.com/categories/.net-components/aspose.ocr-for-.net/default.aspx
