Highlight the words in files (doc, excel, pdf, etc.) using Java

mangala shenoy
Greenhorn

Joined: Jan 17, 2012
Posts: 8
I have a web application with a search feature. The search results are files of different types.

When a file is opened, I want the search words to be highlighted.

Can anybody please help me?
Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 513
Depends on what you've used to implement search inside files. How have you implemented it? What frameworks are you using?
mangala shenoy
Greenhorn

Joined: Jan 17, 2012
Posts: 8
I am using JSF and SDO, and for search I have my own logic, which works fine. Now, once the search results are displayed, I want to open a file with the search words highlighted.
I tried Lucene, JACOB, and inserting HTML tags in the file.

Nothing seems to work.
Tim Moores
Rancher

Joined: Sep 21, 2011
Posts: 2408
How are you displaying the search results? What do you mean by "inserting html tags in the file"?
mangala shenoy
Greenhorn

Joined: Jan 17, 2012
Posts: 8
I am putting the search results in session scope and displaying them in a JSF page using a JSF data table. I am providing a link to open each file.

Clicking the file name calls a function that reads the file and writes it to the ServletOutputStream.

I tried inserting HTML tags before writing to the output stream. For example, for .doc I used WordExtractor and then did a replaceAll on the word to wrap it in HTML tags.
Tim Moores
Rancher

Joined: Sep 21, 2011
Posts: 2408
So the output is HTML? Because if you stream the actual file contents then it's far from trivial to alter the file contents so that arbitrary words will be highlighted, especially for the structured document formats you mention. I predict you will end up not doing this.
Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 513
I agree with Tim. Extracting the contents of all hits, making a copy for every search query, and inserting HTML tags seems error-prone, inefficient and time-consuming. If you extract just the words and display them, you lose all the formatting. Even if you use something like toHtml to convert to formatted HTML, it will still look quite messed up in a lot of cases and not resemble the formatting of the original files at all.
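
For reference, the extract-then-tag approach discussed above boils down to something like the following minimal sketch. It assumes the text has already been extracted into a plain String (for example by POI's WordExtractor) and that the result is served as HTML, so all original formatting is lost. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: highlights a search term in already-extracted plain text.
final class PlainTextHighlighter {

    // Escapes the characters that are significant in HTML text content.
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wraps every case-insensitive occurrence of term in <mark> tags.
    // Pattern.quote keeps regex metacharacters in the term from being interpreted.
    static String highlight(String extractedText, String term) {
        if (term == null || term.isEmpty()) {
            return escapeHtml(extractedText);
        }
        Pattern p = Pattern.compile(Pattern.quote(term),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher m = p.matcher(extractedText);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Escape the text between hits, then the hit itself.
            out.append(escapeHtml(extractedText.substring(last, m.start())));
            out.append("<mark>").append(escapeHtml(m.group())).append("</mark>");
            last = m.end();
        }
        out.append(escapeHtml(extractedText.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("Java & java <b>", "java"));
    }
}
```

Escaping before tagging matters: a plain replaceAll on unescaped text treats the search word as a regex and lets markup characters from the document leak into the page.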

An ideal solution would be a document viewer that is capable of displaying multiple formats with formatting, searching them and highlighting hits.

I don't know of any perfect solution, but one I can think of is the Google Docs viewer. Here's an example document with a search query.
I would think your users expect a readable, formatted rendering of a document that feels like viewing it in a native viewer.
Using the Google Documents API with appropriate sharing permissions and some kind of use-once-then-throwaway URLs for your documents, I think it's possible to integrate Google Docs even for private documents.

Now, if highlighting and viewing a whole document is not a critical use case - just highlighting the area around the search hit is enough - then you can look into the snippet highlighting capabilities provided by Lucene and Solr.
Here's an example of what it can look like.
You mention search is your own logic - not sure what logic that is - but if it's feasible for you to move to Lucene or Solr, then they can provide this kind of snippet highlighting.
Solr gives you highlighting out of the box, along with document extraction through the Apache Tika framework (which is itself built over POI, PDFBox and other format-specific handlers).
With Lucene, you'd have to roll your own implementation that integrates Tika and then use the Highlighter API.
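
To make "snippet highlighting" concrete, here is a plain-Java stand-in for the idea. It is not Lucene's Highlighter API, only a sketch of the output such a highlighter produces: a short window of text around the first hit, with the hit marked. The class name, method name and radius parameter are assumptions for illustration.

```java
// Hypothetical stand-in for a search-result snippet highlighter.
final class SnippetSketch {

    // Case-insensitive indexOf that never shifts offsets, unlike lower-casing
    // both strings first (which can change string length for some characters).
    static int indexOfIgnoreCase(String text, String term) {
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns up to `radius` characters of context on each side of the first
    // hit, with the hit wrapped in <b> tags, or null when there is no hit.
    // HTML escaping of the surrounding text is omitted to keep the sketch short.
    static String snippet(String text, String term, int radius) {
        if (term.isEmpty()) {
            return null;
        }
        int hit = indexOfIgnoreCase(text, term);
        if (hit < 0) {
            return null;
        }
        int end = hit + term.length();
        int from = Math.max(0, hit - radius);
        int to = Math.min(text.length(), end + radius);
        return (from > 0 ? "..." : "")
                + text.substring(from, hit)
                + "<b>" + text.substring(hit, end) + "</b>"
                + text.substring(end, to)
                + (to < text.length() ? "..." : "");
    }

    public static void main(String[] args) {
        System.out.println(snippet("The quick brown fox jumps over the lazy dog", "FOX", 6));
    }
}
```

A real highlighter also handles analysis (stemming, multiple terms, phrase queries) and picks the best-scoring fragment rather than the first hit, which is why moving to Lucene or Solr is worth considering.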
Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 513
Just found this after typing my reply: You might also be interested in Aspose's capabilities for this. It's a commercial product.
mangala shenoy
Greenhorn

Joined: Jan 17, 2012
Posts: 8
No, the output need not be HTML. My requirement is that when the user opens a file, the search terms are highlighted.
I just tried putting in HTML tags. It didn't work.

I tried Lucene. With Lucene it is possible to highlight the words in search results, but I don't think it has an option for highlighting the terms in a file.

Tim Moores
Rancher

Joined: Sep 21, 2011
Posts: 2408
mangala shenoy wrote: I just tried putting in HTML tags. It didn't work.

Of course not. You can't mix and match structured document formats and HTML.

mangala shenoy wrote: But I don't think [Lucene] has an option for highlighting the terms in a file.

Correct. It can help you find the stuff that you want to highlight, but your code needs to do the highlighting, and it's different for each kind of document format. For PDF and DOC(X) in particular this will be hard, if not impossible. XLS(X) is a bit easier if you use the Apache POI library.
mangala shenoy
Greenhorn

Joined: Jan 17, 2012
Posts: 8
Can you please tell me how to integrate the Google Docs API in Java?
Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 513
If your documents are - or can be made - available at some public URLs, then it's a simple matter of showing a URL like this in your search results table:
<a href="https://docs.google.com/viewer?url=[your-document-url]&q=[search query]" target="_blank" rel="nofollow">View document</a>
It'll open the viewer in another window/tab.
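
Both placeholders in that link need URL-encoding, since the document URL and the search query can contain spaces, '&' and other reserved characters. A minimal sketch of building the link in Java follows. The viewer address and the url and q parameter names are taken from the link above; the class and method names are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds a viewer link for one search result.
final class ViewerLinkBuilder {

    // Viewer endpoint as given in the link above.
    static final String VIEWER = "https://docs.google.com/viewer";

    // Encodes both parameters so characters such as '&' or spaces in the
    // document URL or the query cannot break the viewer URL.
    static String viewerUrl(String documentUrl, String searchQuery) {
        return VIEWER + "?url=" + encode(documentUrl) + "&q=" + encode(searchQuery);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(viewerUrl("http://example.com/docs/a b.pdf", "spring action"));
    }
}
```

When the result is written into an href attribute by hand, the '&' between the parameters should additionally be HTML-escaped as &amp;amp;. JSF output components normally do that for you.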

But if your documents have some security requirements (can be opened only by select people, etc.), then you should go through their guides and prototype with their Java client library before integrating it. The viewer certainly seems to solve your problem, but you should check whether there are any security incompatibilities with your system. I have not integrated Google Docs myself so far, so my answer is based only on a shallow review of their documentation.
mangala shenoy
Greenhorn

Joined: Jan 17, 2012
Posts: 8
The documents are stored in a location on the server, and I need to open them from there. For the Google Docs viewer, the file has to be at some URL, right?

Do you know about JACOB? For doc and docx I could do it using JACOB. I am not able to do it for ppt and xls.
Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 513
By JACOB, are you referring to this project - the Java COM bridge?
I have not used it. As I understand it, it uses the COM interfaces exposed by Word, Excel, and other MS Office components via JNI. I guess MS Office has to be installed on your web server, and both have to be running on a Windows OS, for this solution to work. What problem are you facing exactly - any error information?
mangala shenoy
Greenhorn

Joined: Jan 17, 2012
Posts: 8
I didn't know that, thanks.

I tried the Google Docs viewer. The files have to be at some URL, right? So I think I can't use it.

So is there any other solution for my requirement?
Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 513
mangala shenoy wrote: I tried the Google Docs viewer. The files have to be at some URL, right? So I think I can't use it.
So is there any other solution for my requirement?


Yes, the Google Docs viewer requires some URL to get the contents. But this URL need not necessarily be a direct URL to a document.
It could be a plain servlet URL which reads the document from wherever it's stored and dumps the contents on the output stream.
Or the document could be uploaded to a Google Docs account (perhaps temporarily, then deleted). That uploaded document will have a URL which can be sent to the viewer.
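
The core of such a servlet is just a file-to-stream copy. The sketch below shows only that part using the JDK, so that it does not depend on a particular servlet API version. In a real servlet, `out` would be the response's output stream and the content type would be set first. The class name, method name and base-directory parameter are assumptions.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: the part of a document servlet that streams the file.
final class DocumentStreamer {

    // Copies the stored file to `out` and returns the number of bytes written.
    // Names that resolve outside baseDir (for example "../secret.doc") are
    // rejected, because the file name arrives in a request parameter.
    static long stream(Path baseDir, String fileName, OutputStream out) throws IOException {
        Path base = baseDir.toAbsolutePath().normalize();
        Path file = base.resolve(fileName).normalize();
        if (!file.startsWith(base) || !Files.isRegularFile(file)) {
            throw new IOException("No such document: " + fileName);
        }
        return Files.copy(file, out);
    }

    // Usage: java DocumentStreamer <baseDir> <fileName>   (writes to stdout)
    public static void main(String[] args) throws IOException {
        stream(Paths.get(args[0]), args[1], System.out);
    }
}
```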

If these are private docs and there are security constraints, then security-wise too, I can think of a bunch of options to make it secure (or at least as obscure as possible), though I have no idea which one would suit your system. Perhaps examine the request headers and allow the request only if it comes from the Google Docs viewer (hopefully the headers contain such information). Or check whether the Google Docs API's ACL permissions and the viewer can be integrated. Or authenticate first and then redirect to the viewer. If I were you, I'd prototype these approaches to see which one(s) give fairly foolproof security.
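
The use-once-then-throwaway URL idea mentioned earlier in the thread could be prototyped roughly as below: the search page issues a random token per link, and the document servlet serves the file only if the token is redeemed once, within a time limit. This is an untested-against-the-viewer sketch. Whether the viewer fetches a URL exactly once is something to verify while prototyping, and all names here are hypothetical.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory registry of single-use document links.
final class OneTimeLinks {

    private static final class Grant {
        final String documentId;
        final long expiresAtMillis;

        Grant(String documentId, long expiresAtMillis) {
            this.documentId = documentId;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentMap<String, Grant> grants = new ConcurrentHashMap<>();
    private final SecureRandom random = new SecureRandom();
    private final long ttlMillis;

    OneTimeLinks(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Creates an unguessable, URL-safe token bound to one document.
    String issue(String documentId) {
        byte[] bytes = new byte[24];
        random.nextBytes(bytes);
        String token = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        grants.put(token, new Grant(documentId, System.currentTimeMillis() + ttlMillis));
        return token;
    }

    // Returns the document id on the first redemption within the time limit,
    // otherwise null. remove() is atomic, so a token cannot be redeemed twice.
    // Tokens that are never redeemed are not purged here; a real system would
    // also clean those up periodically.
    String redeem(String token) {
        if (token == null) {
            return null;
        }
        Grant grant = grants.remove(token);
        if (grant == null || System.currentTimeMillis() > grant.expiresAtMillis) {
            return null;
        }
        return grant.documentId;
    }

    public static void main(String[] args) {
        OneTimeLinks links = new OneTimeLinks(60_000L);
        String token = links.issue("doc-42");
        System.out.println(links.redeem(token)); // first use
        System.out.println(links.redeem(token)); // second use
    }
}
```

The token would go into the servlet URL handed to the viewer, so the URL stops working after the first fetch or after the time limit, whichever comes first.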

I'm not aware of any other solution, but that doesn't mean none exists. You'll probably have to look at commercial offerings, like Aspose or box.com.

Perhaps other forum members know some good viewers.
You can try posting another question that asks specifically for web-based document viewers, instead of highlighting - maybe in the HTML forum (because the solution is likely to be a Flash implementation) - and then evaluate the suggestions you get.
 