I am able to search the content of a PDF using Apache Lucene, but my problem starts when the PDF
contains images: the content of those images is not searched. Does anybody know how to
search image content that is present in a PDF file?
What do you mean by "searching the content of an image" - do the images contain text in them, and you'd like to search in that text? If so, that's a hard thing to do, and Lucene can't do it for you. You'd need to extract the images (maybe using a library like PDFBox), and then perform optical character recognition (OCR) on each image. That may provide you with text that you can index using Lucene.
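To make the three steps concrete (extract images, run OCR, index the resulting text), here is a minimal sketch of how they fit together. It deliberately uses only the JDK: the `PageImage` class, the `ocr` function and the in-memory map are hypothetical stand-ins for whatever image extractor (for example PDFBox), OCR engine and Lucene index you end up using, not their real APIs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;

public class ImageTextPipeline {

    /** Hypothetical holder for one image extracted from a PDF page. */
    public static final class PageImage {
        public final int pageNumber;
        public final byte[] data;

        public PageImage(int pageNumber, byte[] data) {
            this.pageNumber = pageNumber;
            this.data = data;
        }
    }

    // term -> names of the PDFs containing it; a toy stand-in for a Lucene index.
    private final Map<String, Set<String>> index = new TreeMap<>();

    /**
     * Indexes the regular text of a PDF plus the text recognized in its images.
     * The ocr function is an assumption: plug in a real OCR engine here.
     */
    public void indexPdf(String pdfName, String bodyText,
                         List<PageImage> images, Function<PageImage, String> ocr) {
        addText(pdfName, bodyText);
        for (PageImage image : images) {
            addText(pdfName, ocr.apply(image));
        }
    }

    // Splits text into lowercase word tokens and records them for this PDF.
    private void addText(String pdfName, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(pdfName);
            }
        }
    }

    /** Returns the names of all indexed PDFs that contain the given term. */
    public Set<String> search(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        ImageTextPipeline pipeline = new ImageTextPipeline();
        // Fake OCR for demonstration only: always "recognizes" the same text.
        Function<PageImage, String> fakeOcr = image -> "Invoice total 42";
        pipeline.indexPdf("report.pdf", "Quarterly summary",
                Arrays.asList(new PageImage(1, new byte[0])), fakeOcr);
        System.out.println(pipeline.search("invoice"));
    }
}
```

The point of the sketch is that the search index only ever sees text: whatever is inside an image becomes searchable only after some OCR step has turned it into a string.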
I have seen the Aspose API. Using it, we can extract images from the PDF, but we can't extract the
content of those images.
Is there any possibility other than that?
Hi Ulf,
As per my requirements, I am not allowed to use OCR; that's why I am looking for a solution within Java.
Is there any option to parse the images using an API?
OCR is the process of extracting text from an image. In other words: no OCR --> no text. You will need to have that requirement changed (it sounds silly to begin with).
I hope this may help regarding Aspose: Aspose.OCR for .NET is a character recognition component that lets developers add OCR functionality to their ASP.NET web applications, web services and Windows applications. It provides a simple set of classes for controlling character recognition tasks and works with image files (BMP, TIFF) from within your own applications. It allows developers to extract text from images quickly and easily, saving the time and effort involved in developing an OCR solution from scratch. View more details at: http://www.aspose.com/categories/.net-components/aspose.ocr-for-.net/default.aspx