Lucene wildcard with multiple tokenized terms.

Devaka Cooray
ExamLab Creator
Saloon Keeper

Joined: Jul 29, 2008
Posts: 3019
    
  35

With Lucene 2.9, I have a tokenized, indexed field named 'title', and an indexed document with "coderanch moose" in that field. With this setup, each of the following queries finds the document:

  • coderanch
  • "coderanch moose"
  • coder*

Now, if I search for something like "coderanch mo*" (even with spaces escaped), it doesn't match anything. It might work if I added a separate non-tokenized field, but that would be pretty redundant. Is there a better way to make such a query work?


    Ulf Dittmer
    Marshal

    Joined: Mar 22, 2005
    Posts: 41144
        
      45
    What kind of Analyzer and what kind of Query are you using?


    Devaka Cooray

    The current implementation uses StandardAnalyzer. As for the query type, I'm not sure what counts as the 'type', but I use QueryParser.parse(String) to parse the query. I'm open to changing the analyzer and the queries. The only requirement is that I can search on terms both tokenized and untokenized. In other words, "moo*", "codera*", and something like "coderanch moo*" should all return the intended document. With StandardAnalyzer, I get nothing for "coderanch moo*".
    Ulf Dittmer
    The following code, which uses StandardAnalyzer and QueryParser, finds that term using Lucene 3.6.2. Maybe your code is somehow different?
    Ulf Dittmer
    Printing "query.toString()" is also helpful, as it shows what query will be executed. For example, searching for "coderanch moose" will search for "coderanch or moose", not for "coderanch and moose", and also not for the consecutive words "coderanch moose".
    Devaka Cooray

    Thanks Ulf. The code is almost the same, except that I have to search with a field name, as in caption:coderanch mo*. However, when I search for "coderanch mo*" that way, it looks like only the "coderanch" part is searched: the result is the same no matter what I use as the second, wildcarded term.
    Ulf Dittmer
    Not sure I follow. caption:coderanch mo* will find all documents that match either "coderanch" in "caption" or "mo*" in whatever the default search field is (because there is no field specified for it). If you want to search for the phrase "coderanch mo*" in "caption", then the syntax would be caption:"coderanch mo*".

    The entire syntax is explained in http://lucene.apache.org/core/3_6_1/queryparsersyntax.html
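
The field-scoping rule described above can be illustrated with a toy model. This is only a sketch in plain Java, not Lucene's actual parser: the class name FieldScopeSketch and the method resolveFields are hypothetical, and the model handles nothing beyond whitespace-separated clauses and double-quoted phrases. It shows which field each clause would be searched in when only the first clause carries a field prefix.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical, not part of Lucene): shows which field each
// clause of a query string is bound to under the rule described above.
class FieldScopeSketch {

    // Splits the query on spaces outside double quotes and prefixes every
    // clause that has no explicit "field:" with the default field.
    static List<String> resolveFields(String query, String defaultField) {
        List<String> clauses = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : query.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
            }
            if (c == ' ' && !inQuotes) {
                addClause(clauses, current.toString(), defaultField);
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        addClause(clauses, current.toString(), defaultField);
        return clauses;
    }

    private static void addClause(List<String> out, String clause, String defaultField) {
        if (clause.isEmpty()) {
            return;
        }
        // A clause has an explicit field if a colon appears before any quote.
        int colon = clause.indexOf(':');
        int quote = clause.indexOf('"');
        boolean hasField = colon > 0 && (quote < 0 || colon < quote);
        out.add(hasField ? clause : defaultField + ":" + clause);
    }

    public static void main(String[] args) {
        // "title" stands in for whatever default field the parser was created with.
        System.out.println(resolveFields("caption:coderanch mo*", "title"));
        System.out.println(resolveFields("caption:\"coderanch mo*\"", "title"));
    }
}
```

In this model the unquoted query yields two clauses, with mo* bound to the default field, while the quoted form stays a single clause on "caption".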
    Devaka Cooray

    Ulf Dittmer wrote:If you want to search for the phrase "coderanch mo*" in "caption", then the syntax would be caption:"coderanch mo*".

    That's what I need, but it didn't work: searching that way yielded no results. Could that be because of some bug in 2.9?
    Ulf Dittmer
    If you want to search for the phrase "coderanch mo*" in "caption", then the syntax would be caption:"coderanch mo*".

    Sorry, I need to retract that. PrefixQuery (and thus the standard QueryParser parsing prefix queries) does not support prefixes with phrases, only with terms. So you need to use a query parser that does support this, namely ComplexPhraseQueryParser which is part of the contrib/queryparser jar. For some reason that only worked with prefix phrases, not prefix terms, in my tests, so the solution needs to distinguish between those two cases:



    If you add 3 documents (say, "coderanch", "moose" and "coderanch moose") you can see the difference.
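
A minimal sketch of the case distinction described above, in plain Java with no Lucene dependency. The class PrefixCaseSketch, its enum and its methods are hypothetical names for illustration. It only decides which parser a given input would be handed to and builds the query string; the actual parsing call is left out. Routing plain phrases without a wildcard to the classic parser is an assumption, not something stated in the thread.

```java
// Hypothetical helper: decides which query parser a user input should go to.
class PrefixCaseSketch {

    enum Parser { CLASSIC, COMPLEX_PHRASE }

    // Removes one pair of surrounding double quotes, if present.
    static String stripQuotes(String s) {
        String t = s.trim();
        if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
            t = t.substring(1, t.length() - 1).trim();
        }
        return t;
    }

    // Prefix phrases (several words plus a wildcard) go to the complex
    // phrase parser; single terms, with or without a wildcard, and plain
    // phrases (assumption) go to the classic parser.
    static Parser chooseParser(String userInput) {
        String t = stripQuotes(userInput);
        boolean multiWord = t.split("\\s+").length > 1;
        boolean hasWildcard = t.indexOf('*') >= 0;
        return (multiWord && hasWildcard) ? Parser.COMPLEX_PHRASE : Parser.CLASSIC;
    }

    // Builds the query string: multi-word input is quoted as a phrase.
    static String toQueryString(String field, String userInput) {
        String t = stripQuotes(userInput);
        boolean multiWord = t.split("\\s+").length > 1;
        return multiWord ? field + ":\"" + t + "\"" : field + ":" + t;
    }

    public static void main(String[] args) {
        for (String input : new String[] {"coderanch", "mo*", "coderanch mo*"}) {
            System.out.println(chooseParser(input) + " <- " + toQueryString("title", input));
        }
    }
}
```

The resulting string would then be passed to the chosen parser's parse method. Special characters in the input are not escaped here.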
    Ulf Dittmer
    Does this address the issue?
    Devaka Cooray

    Sorry, I was completely away from almost everything.

    And yes, your suggestion was what helped. Thanks muchly! The only remaining problem was that I had to support both prefix phrases and prefix terms. I did that by extending your suggestion: the code takes the search query, splits it on spaces, and applies * only to the last 'word'. The rest of the words are grouped into a single phrase, which is searched as a separate phrase and added to the original search query.
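
One possible reading of the approach described above, sketched in plain Java. The class LastWordPrefixQuery and the method build are hypothetical names. Requiring both clauses with "+", and appending "*" when the last word lacks one, are assumptions; the thread does not say how the clauses were actually combined.

```java
import java.util.Arrays;

// Hypothetical sketch: the last word becomes a prefix term, the words
// before it become one quoted phrase.
class LastWordPrefixQuery {

    static String build(String field, String userInput) {
        String trimmed = userInput.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("empty query");
        }
        String[] words = trimmed.split("\\s+");

        // The wildcard is applied only to the last word.
        String last = words[words.length - 1];
        if (!last.endsWith("*")) {
            last = last + "*";
        }
        String prefixClause = "+" + field + ":" + last;
        if (words.length == 1) {
            return prefixClause;
        }

        // All preceding words are grouped into a single phrase clause.
        String phrase = String.join(" ", Arrays.copyOfRange(words, 0, words.length - 1));
        return "+" + field + ":\"" + phrase + "\" " + prefixClause;
    }

    public static void main(String[] args) {
        System.out.println(build("title", "coderanch mo"));
        System.out.println(build("title", "moo*"));
    }
}
```

Note that two separate clauses do not by themselves require the prefix term to directly follow the phrase, and special characters in the input are not escaped in this sketch.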
     
     