Lucene wildcard with multiple tokenized terms.

 
Devaka Cooray
ExamLab Creator
Marshal
With Lucene 2.9, I have a tokenized, indexed field named 'title'. One indexed document has "coderanch moose" in the title. With this setup, all of the following queries find it:

  • coderanch
  • "coderanch moose"
  • coder*

Now, if I search for something like "coderanch mo*" (even with the space escaped), it doesn't match anything. It might work if I added a separate non-tokenized field, but that would be pretty redundant. Is there a better way to make such a query work?
     
    Ulf Dittmer
    Rancher
    What kind of Analyzer and what kind of Query are you using?
     
    Devaka Cooray
    The current implementation uses a StandardAnalyzer. As for the query type, I'm not sure what to call it, but I use QueryParser.parse(String) to parse the query. I'm open to changing the analyzer and the queries. The only requirement is that I need to be able to search on terms that are both tokenized and untokenized. In other words, "moo*", "codera*", and something like "coderanch moo*" should all find the intended document. With the StandardAnalyzer, I get nothing for "coderanch moo*".
     
    Ulf Dittmer
    The following code, which uses StandardAnalyzer and QueryParser, finds that term using Lucene 3.6.2. Maybe your code is somehow different?
     
    Ulf Dittmer
    Printing "query.toString()" is also helpful, as it shows what query will be executed. For example, searching for "coderanch moose" will search for "coderanch or moose", not for "coderanch and moose", and also not for the consecutive words "coderanch moose".
     
    Devaka Cooray
    Thanks, Ulf. The code is almost the same, except that I have to search with a field name, as in caption:coderanch mo*. However, when I search for "coderanch mo*" that way, it looks like only the "coderanch" part is searched: I get the same result no matter what I use as the second, wildcard term.
     
    Ulf Dittmer
    Not sure I follow. caption:coderanch mo* will find all documents that match either "coderanch" in "caption" or "mo*" in whatever the default search field is (because there is no field specified for it). If you want to search for the phrase "coderanch mo*" in "caption", then the syntax would be caption:"coderanch mo*".

    The entire syntax is explained in http://lucene.apache.org/core/3_6_1/queryparsersyntax.html
     
    Devaka Cooray
    Ulf Dittmer wrote: If you want to search for the phrase "coderanch mo*" in "caption", then the syntax would be caption:"coderanch mo*".

    That's what I need, but it didn't work: searching that way yielded no results. Could that be because of a bug in 2.9?
     
    Ulf Dittmer
    If you want to search for the phrase "coderanch mo*" in "caption", then the syntax would be caption:"coderanch mo*".

    Sorry, I need to retract that. PrefixQuery (and thus the standard QueryParser parsing prefix queries) does not support prefixes with phrases, only with terms. So you need to use a query parser that does support this, namely ComplexPhraseQueryParser which is part of the contrib/queryparser jar. For some reason that only worked with prefix phrases, not prefix terms, in my tests, so the solution needs to distinguish between those two cases:



    If you add 3 documents (say, "coderanch", "moose" and "coderanch moose") you can see the difference.
     
    Ulf Dittmer
    Does this address the issue?
     
    Devaka Cooray
    Sorry, I was completely away from almost everything.

    And yes, your suggestion is what helped. Thanks muchly! The only remaining problem was that I had to support both prefix phrases and prefix terms. I did that by extending your suggestion: it takes the search query, splits it on spaces, and applies * only to the last 'word'. The rest of the words are grouped into a single phrase, so I can search for it as a separate phrase, added to the original search query.
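
For readers who want to try the same idea, here is a minimal sketch of the string-handling part only. It is one possible reading of the approach described in this post, not the poster's actual code: the class name PrefixQueryBuilder and its methods are hypothetical, it uses only the Java standard library (String.join needs Java 8 or later), and it does not escape Lucene special characters in the user input. It produces a query string; a multi-word result would still have to be handed to a parser that supports prefix phrases, such as the ComplexPhraseQueryParser mentioned earlier in the thread.

```java
public class PrefixQueryBuilder {

    /** True if the input has more than one word, i.e. it needs a phrase-capable parser. */
    public static boolean isPhrase(String userInput) {
        return userInput.trim().split("\\s+").length > 1;
    }

    /**
     * Builds a query string in which only the last word carries a trailing '*'.
     * A single word becomes   field:word*
     * Several words become    field:"first second last*"
     */
    public static String build(String field, String userInput) {
        String[] words = userInput.trim().split("\\s+");
        int lastIndex = words.length - 1;

        // Drop a '*' the user may already have typed, so it is not doubled.
        String last = words[lastIndex];
        if (last.endsWith("*")) {
            last = last.substring(0, last.length() - 1);
        }
        if (last.isEmpty()) {
            throw new IllegalArgumentException("nothing to search for");
        }
        words[lastIndex] = last + "*";

        if (words.length == 1) {
            return field + ":" + words[0];              // prefix term
        }
        return field + ":\"" + String.join(" ", words) + "\""; // prefix phrase
    }
}
```

With this sketch, build("caption", "mo") gives caption:mo* and build("caption", "coderanch mo") gives caption:"coderanch mo*". isPhrase can be used to choose between the standard QueryParser (prefix terms) and a phrase-capable parser (prefix phrases), which is the distinction Ulf pointed out.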
     