Lucene search in attachments does not work

Heri Bender
Greenhorn

Joined: Jan 07, 2011
Posts: 16
Testing V2.3.5 on localhost (Tomcat 7 and MySQL).

The search does not seem to look into attachments. Since I know that Lucene is able to search many file formats: am I doing something wrong? Is there a config option to enable searching files?

Thanks for any reply.

Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41572
Actually, Lucene does not know anything about any file formats. It can index and search text that it gets handed by application code. JForum does not have provisions for that. If one wanted to add that, one approach would be to use a library like Apache Tika that knows how to extract text from many file formats.
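A minimal sketch of that division of labour, using only the JDK: the index sees nothing but plain text, and a separate extractor turns a file into text first. `TextExtractor`, `ToyIndex` and `AttachmentIndexingSketch` are hypothetical names made up for this illustration. They are not JForum, Lucene or Tika classes, and the plain-text extractor merely stands in for the role a library like Tika would play.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for a text extractor such as Apache Tika. */
interface TextExtractor {
    String extract(byte[] fileBytes);
}

/** Toy in-memory index: it only ever receives plain text, never files. */
class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Splits the text on non-letter, non-digit characters and records the post ID per token. */
    void add(int postId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(postId);
            }
        }
    }

    /** Returns the IDs of all posts whose indexed text contains the given single term. */
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}

class AttachmentIndexingSketch {
    /** Indexes the post text plus the extracted text of each attachment under the same post ID. */
    static void indexPost(ToyIndex index, TextExtractor extractor,
                          int postId, String postText, List<byte[]> attachments) {
        StringBuilder contents = new StringBuilder(postText);
        for (byte[] attachment : attachments) {
            contents.append('\n').append(extractor.extract(attachment));
        }
        index.add(postId, contents.toString());
    }

    public static void main(String[] args) {
        // Plain-text "extractor"; a real one would have to handle PDF, DOCX, ODT and so on.
        TextExtractor plainText = bytes -> new String(bytes, StandardCharsets.UTF_8);
        ToyIndex index = new ToyIndex();
        byte[] attachment = "Notes on Collaborative Housing".getBytes(StandardCharsets.UTF_8);
        indexPost(index, plainText, 22, "Und noch eine docx Beilage", List.of(attachment));
        System.out.println(index.search("housing"));
    }
}
```

Because the attachment text is appended to the post text before indexing, a term that occurs only in the attachment still leads back to the post.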


Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41572
Thinking about it some more, that would be an interesting feature. Since I happen to be a committer on the JForum2 project on GoogleCode, I may even be in a position to make it happen :-) No promises, though; I need to look into it more deeply first.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41572
I've just added that feature to the trunk, since I happened to have code lying around that uses Tika to index files. That was fun :-)
Heri Bender
Greenhorn

Joined: Jan 07, 2011
Posts: 16
Hey, great news. I will try it out. Thanks.
Heri Bender
Greenhorn

Joined: Jan 07, 2011
Posts: 16
Hi Ulf,

I updated to the trunk, but it does not compile:

GenericAnnouncementDAO: Preferences not found.

The previous revision (275) compiles, but three unit tests fail when I run "mvn install" (POPListenerTestCase).

Thanks for any hint on how to fix the unit tests, or how to run mvn install without executing them.

Heri

BTW: I have updated the German language file; many entries were missing. But I can neither commit it to SVN nor attach it here. Regardless of which extension I choose, it always complains that "the extension xy is not allowed" (props, txt, no extension at all, ...).
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41572
I apologize - I forgot to check in that particular file. Please try again now, it should work.

As to the German language properties file: no, you cannot attach it here, nor check it into SVN. I'll be happy to do that for you if you send it to me via the email address listed in my profile here.
Heri Bender
Greenhorn

Joined: Jan 07, 2011
Posts: 16
Finally got it running.

Two suggestions:

1. Attachments should be indexed automatically when the message is posted (configurable?). It seems to me that I have to start the indexing manually as administrator. From the log:

20:21:04.672 ( 98201) DEBUG [http-apr-8080-exec-9] LuceneIndexer.java:196 - Indexed Document<stored,indexed<post.id:22> stored,indexed<forum.id:1> stored,indexed<topic.id:1> stored,indexed<user.id:2> stored,indexed<date:20130830202104> indexed,tokenized<subject:Aw:Welcome to JForum> indexed,tokenized<contents:Und noch eine docx Beilage
und eine PDF Beilage
und eine ODt Beilage>>
20:22:10.019 (163548) INFO [http-apr-8080-exec-10] LuceneSearch.java:268 - searching for: Collaborative Housing
20:22:10.041 (163570) INFO [http-apr-8080-exec-10] LuceneSearch.java:155 - criteria=[+(contents:"Collaborative Housing" OR subject:"Collaborative Housing")]
20:22:10.074 (163603) INFO [http-apr-8080-exec-10] LuceneSearch.java:172 - hits=0
20:23:26.771 (240300) INFO [http-apr-8080-exec-8] SystemGlobals.java:306 - Key 'lucene.currently.indexing' is not found in F:\Program Files\apache-tomcat-7.0.33\webapps\jforum/WEB-INF/config/SystemGlobals.properties and F:\Program Files\apache-tomcat-7.0.33\webapps\jforum/WEB-INF/config/jforum-custom.conf
20:24:21.258 (294787) DEBUG [Thread-42] LuceneReindexer.java:128 - firstPostId=0
20:24:21.259 (294788) DEBUG [Thread-42] LuceneReindexer.java:130 - lastPostId=1000
20:24:21.261 (294790) DEBUG [Thread-42] LuceneReindexer.java:134 - dbFirstPostId=1
20:24:21.261 (294790) DEBUG [Thread-42] LuceneReindexer.java:135 - dbLastPostId=22
20:24:21.261 (294790) DEBUG [Thread-42] LuceneReindexer.java:144 - firstPostId=1
20:24:21.261 (294790) DEBUG [Thread-42] LuceneReindexer.java:145 - lastPostId=22
20:24:21.261 (294790) DEBUG [Thread-42] LuceneReindexer.java:157 - firstPostId=1
20:24:21.261 (294790) DEBUG [Thread-42] LuceneReindexer.java:158 - toPostId=22
20:24:21.267 (294796) DEBUG [Thread-42] LuceneIndexer.java:245 - indexing 8c2c937b2d569921de6e2352e96534b9_2.txt_
20:24:21.572 (295101) DEBUG [Thread-42] LuceneIndexer.java:245 - indexing dc3cdd22df5e502bbcd1a3d1bbe62088_2.pdf_
20:24:22.713 (296242) DEBUG [Thread-42] LuceneIndexer.java:245 - indexing d3c2544d5d514b6b511fe6e6624f148c_2.docx_
20:24:23.509 (297038) DEBUG [Thread-42] LuceneIndexer.java:245 - indexing 14dbf15f1e3ad73bb08e3c8c496d9ecc_2.odt_
20:24:23.655 (297184) INFO [Thread-42] LuceneReindexer.java:209 - **** Total: 2397 ms
20:24:35.256 (308785) INFO [http-apr-8080-exec-4] LuceneSearch.java:268 - searching for: Collaborative Housing
20:24:35.258 (308787) INFO [http-apr-8080-exec-4] LuceneSearch.java:155 - criteria=[+(+(contents:Collaborative OR subject:Collaborative)+(contents:Housing OR subject:Housing))]
20:24:35.292 (308821) INFO [http-apr-8080-exec-4] LuceneSearch.java:172 - hits=1


BTW1: I have chosen SearchStats.appendToIndex=true and SearchStats.checkMessageExists=off. Nevertheless, the log shows "indexing 8c2c937b2d569921de6e2352e96534b9_2.txt" although this file was already indexed.
BTW2: A "reindex all" button would be nice. Alternatively, the date fields should have a meaningful default (e.g. last week until today).

2. It would be nice if the search result page indicated that the search term was found in the attached document xy. My test message had three documents attached; the search result (for a term that occurs in only one of these three documents) shows only the link to the message. A link to the document in the search result page would be great.

Thanks for your effort.

Heri
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41572
Now that this discussion is entirely about the JForum version hosted on GoogleCode, it would sit better in the forums over at http://jforum.andowson.com/. Please open a new discussion topic there.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41572
Heri Bender wrote:1. Attachments should be indexed automatically when the message is posted (configurable?). It seems to me that I have to start the indexing manually as administrator.

Whenever a post is edited in any way, it is reindexed. I'm not sure how the output you posted indicates otherwise...?

Heri Bender wrote:I have chosen SearchStats.appendToIndex=true and SearchStats.checkMessageExists=off. Nevertheless, the log shows "indexing 8c2c937b2d569921de6e2352e96534b9_2.txt" although this file was already indexed.

You should not use "appendToIndex" unless you know exactly what you are doing. It causes the posts to be stored twice in the index, which is not what you want. The indexer has no concept of what it has indexed before, either posts or attachments, so it can't take that into account during an index run. (Unless you use "checkMessageExists", that is, but using that prevents reindexing, which is what you generally want when starting an index run on that page.)
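The duplication can be illustrated with a toy in-memory index. This is only a sketch: `ToyPostIndex` and its methods are hypothetical names, not JForum code. In Lucene itself, `IndexWriter.addDocument` always adds a new document, while `IndexWriter.updateDocument` first deletes the documents matching a given term; how JForum's indexer uses these calls is not shown in this thread.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy index storing one entry (post ID, text) per indexed document. */
class ToyPostIndex {
    private final List<Map.Entry<Integer, String>> docs = new ArrayList<>();

    /** Append-only: stores another copy even if the post ID is already present. */
    void append(int postId, String text) {
        docs.add(new AbstractMap.SimpleEntry<>(postId, text));
    }

    /** Delete-then-add: removes earlier copies of the post before storing it again. */
    void update(int postId, String text) {
        docs.removeIf(e -> e.getKey() == postId);
        append(postId, text);
    }

    /** Number of stored documents for the given post ID. */
    long copiesOf(int postId) {
        return docs.stream().filter(e -> e.getKey() == postId).count();
    }

    public static void main(String[] args) {
        ToyPostIndex appended = new ToyPostIndex();
        appended.append(22, "first run");
        appended.append(22, "second run");

        ToyPostIndex updated = new ToyPostIndex();
        updated.update(22, "first run");
        updated.update(22, "second run");

        System.out.println("append: " + appended.copiesOf(22) + ", update: " + updated.copiesOf(22));
    }
}
```

Running two index passes over the same post leaves two copies in the append-only case and one copy in the delete-then-add case.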

Heri Bender wrote:A "reindex all" button would be nice.

The easiest way to do that is to check "Recreate index from scratch", and use IDs from 0 to more than the number of posts you have. For example, if "Total posts in the database:" is 666, you can reindex from 0 to 700.

Heri Bender wrote:It would be nice if the search result page indicated that the search term was found in the attached document xy. My test message had three documents attached; the search result (for a term that occurs in only one of these three documents) shows only the link to the message. A link to the document in the search result page would be great.

That would be a major effort - currently the index has no way of differentiating between content from a post and content from an attachment.
Heri Bender
Greenhorn

Joined: Jan 07, 2011
Posts: 16
Ulf Dittmer wrote:Whenever a post is edited in any way, it is reindexed. I'm not sure how the output you posted indicates otherwise...?


The log entry at 20:22:10.019 shows that I started a search, and the following entries show that no hits were found. This happened after I had submitted the post. Afterwards I started the reindexing manually (with the parameters described), and the next search then found the term "Collaborative Housing". This led me to assume that the indexing of attachments must be triggered manually.

Heri Bender wrote:It would be nice if the search result page would indicate that the search term was found in the attached document xy.

Ulf Dittmer wrote:That would be a major effort - currently the index has no way of differentiating between content from a post and content from an attachment.


A suggestion that probably does not require that "major effort": when rendering the search result page, there could be a quick search for the term within the found message. If it is not present in the message text, the algorithm can assume it was found in an attached document and tell the user so.
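A rough sketch of that heuristic, assuming the rendering code has the post's subject, its text and the raw query string at hand. `HitSourceGuess` and its methods are hypothetical names for illustration, not part of JForum. The check is a naive case-insensitive substring match, so it can disagree with whatever analyzer the real index applies.

```java
import java.util.Locale;

class HitSourceGuess {
    /**
     * Returns true if every whitespace-separated query term occurs in the
     * subject or the post text (case-insensitive substring match).
     */
    static boolean explainedByPost(String subject, String postText, String query) {
        String haystack = (subject + "\n" + postText).toLowerCase(Locale.ROOT);
        for (String term : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty() && !haystack.contains(term)) {
                return false;
            }
        }
        return true;
    }

    /** Returns a note for the result page, or an empty string if the post itself explains the hit. */
    static String attachmentNote(String subject, String postText, String query) {
        return explainedByPost(subject, postText, query)
                ? ""
                : "Match probably found in an attached document";
    }

    public static void main(String[] args) {
        String subject = "Aw:Welcome to JForum";
        String text = "Und noch eine docx Beilage";
        System.out.println(attachmentNote(subject, text, "Collaborative Housing"));
        System.out.println(attachmentNote(subject, text, "docx Beilage"));
    }
}
```

The note can only say that some attachment probably matched; it cannot tell which of several attachments contained the term.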
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41572
Turns out there was a bug that caused attachments not to get indexed under certain circumstances - I have just checked in a fix for that.

(I've also checked in the German language properties file you had sent me; thanks for that.)
 
 