doubt on group() in Matcher class

saipavan vallabhaneni
Ranch Hand

Joined: Nov 14, 2008
Posts: 34



I have two doubts about this program, which looks for matches of zero or more digits:
  • The first match is found at index 0 because there are zero digits there. But why does the matcher return an empty string when the actual match is "a"? Isn't group() supposed to return the match that was found?
  • I am totally confused by the execution. Can anyone explain how the matching is done?


  • Pawan Arora
    Ranch Hand

    Joined: Sep 14, 2008
    Posts: 105
    The \\d metacharacter matches a digit, and the * quantifier, which is a greedy quantifier, means zero or more of them. That's why you get an empty string wherever no digit is found: zero digits still counts as a match. Use the + quantifier instead if you want at least one digit.
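    A minimal sketch of the difference, assuming the source string is "ab34ef" (the string mentioned later in this thread). The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlusVersusStar {
    // Collects "start:group" for every successful find() call.
    static List<String> matches(String regex, String source) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(source);
        while (m.find()) {
            found.add(m.start() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // \d* also reports zero-length matches; \d+ needs at least one digit.
        System.out.println(matches("\\d*", "ab34ef")); // [0:, 1:, 2:34, 4:, 5:, 6:]
        System.out.println(matches("\\d+", "ab34ef")); // [2:34]
    }
}
```

    With + the zero-length matches disappear, because a match must now contain at least one digit.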
    [ December 06, 2008: Message edited by: Pawan Arora ]
    Ankit Garg
    Sheriff

    Joined: Aug 03, 2008
    Posts: 9280

    Well, this is the behavior of greedy quantifiers that I have observed, though I haven't experimented with it much, so I may be wrong.

    When you use * with a character class like \d or \w, it becomes reluctant to find the matching pattern. It will start producing zero-length matches.

    But when you use * with dot (.), it becomes greedy. It tries to match the . with as many characters as it can. So if you try to find .*\\d, it starts to search from the right and matches the first digit that it finds...


    SCJP 6 | SCWCD 5 | Javaranch SCJP FAQ | SCWCD Links
    saipavan vallabhaneni
    Ranch Hand

    Joined: Nov 14, 2008
    Posts: 34
    Thanks Ankit and Pawan,

    Ankit, since greedy quantifiers read the entire source string and then work back from the rightmost position looking for a match, I was wondering how the start() method printed 0 in the first place. I expected 5 to be printed, because "f" is a perfect match as it has zero digits in it.
    I am really confused by this.

    Can anyone elaborate on the execution sequence?
    Ankit Garg
    Sheriff

    Joined: Aug 03, 2008
    Posts: 9280

    saipavan you got my point wrong. If you use this

    \\d*

    then it will look for zero or more occurrences of any digit. It will look into the string



    It will find zero occurrences of a digit at index 0,
    then it will find zero occurrences of a digit at index 1,
    then it will find two occurrences of a digit at index 2,
    then it will find zero occurrences of a digit at index 4,
    then it will find zero occurrences of a digit at index 5,
    then it will find zero occurrences of a digit at index 6.

    I hope this clears your doubt...
    saipavan vallabhaneni
    Ranch Hand

    Joined: Nov 14, 2008
    Posts: 34
    Thanks Ankit,

    But isn't it true that a greedy quantifier looks at the entire source string once, then works back from the right to find the match, and includes the part of the source to the left of the match in the final match?

    source: yyxxxyxx
    regex: .*xx
    output: yyxxxyxx (a match is found at the end, and the part of the source string before it is included in the output, since the entire source ends in xx)
    Henry Wong
    author
    Sheriff

    Joined: Sep 28, 2004
    Posts: 18117

    But isn't it true that a greedy quantifier looks at the entire source string once, then works back from the right to find the match, and includes the part of the source to the left of the match in the final match?


    Keep in mind that there are two things going on here. First, there is the regex, which includes a greedy quantifier that will try to match as much as possible, backing off only if the rest of the regex fails to match.

    Second, there is the logic of the find() method. The find() method determines where in the string to start matching. It "finds" matches from the start of the string to the end of the string, applying the regex at each starting position and returning the matches that it finds.

    Henry
    [ December 06, 2008: Message edited by: Henry Wong ]

    Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
    Henry Wong
    author
    Sheriff

    Joined: Sep 28, 2004
    Posts: 18117

    Using this example -- applying the principles from the previous post...



    The find() method will start at index 0, and apply the regex. The regex will greedily match the whole string with the ".*" portion of the regex -- but must then back off by the last two letters, so that the "xx" part of the regex can also match.

    On the next call, the find() method will then start at the end of the previous match, which is at index 8, and apply the regex. The regex will fail to match -- the ".*" portion can match (zero characters), but the "xx" portion can't match. So, the find method will return false.
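    The two calls described above can be sketched roughly as follows, assuming the example is the source "yyxxxyxx" and the regex ".*xx" from earlier in the thread. The parentheses around ".*" are added here only to expose what the greedy portion ended up matching; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyFindDemo {
    // Returns {whole match, text captured by the greedy ".*"} for the first find().
    static String[] firstMatch(String source) {
        Matcher m = Pattern.compile("(.*)xx").matcher(source);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(), m.group(1) };
    }

    // True only if a second find() call also succeeds.
    static boolean secondFindSucceeds(String source) {
        Matcher m = Pattern.compile("(.*)xx").matcher(source);
        return m.find() && m.find();
    }

    public static void main(String[] args) {
        String[] first = firstMatch("yyxxxyxx");
        System.out.println(first[0]);                        // yyxxxyxx
        System.out.println(first[1]);                        // yyxxxy
        System.out.println(secondFindSucceeds("yyxxxyxx"));  // false
    }
}
```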

    Henry
    [ December 06, 2008: Message edited by: Henry Wong ]
    saipavan vallabhaneni
    Ranch Hand

    Joined: Nov 14, 2008
    Posts: 34
    Thanks Henry,
    But in the first code snippet, with the source "ab34ef", why is the group() method returning null (when start() returns 0) instead of "a" (since it has zero or more digits)? The group() method returns the match that has been found, which in my guess is "a" rather than the null returned by group().
    Ankit Garg
    Sheriff

    Joined: Aug 03, 2008
    Posts: 9280

    I would still stick to my words. If you use * with a dot(.), then * will become greedy. But if you put * with a pattern, then * will be reluctant.

    See this example



    Just compile and run this program and you will see what I am trying to say...
    saipavan vallabhaneni
    Ranch Hand

    Joined: Nov 14, 2008
    Posts: 34
    Thanks Ankit,
    I am now a little more aware of how the methods work...
    Henry Wong
    author
    Sheriff

    Joined: Sep 28, 2004
    Posts: 18117

    Originally posted by saipavan vallabhaneni:
    Thanks Henry,
    But in the first code snippet, with the source "ab34ef", why is the group() method returning null (when start() returns 0) instead of "a" (since it has zero or more digits)? The group() method returns the match that has been found, which in my guess is "a" rather than the null returned by group().


    First of all, it is *not* returning null. It is returning a zero length string -- which is what was matched. And BTW, how can it return "a"? That doesn't even match!! But here is the complete explanation...

    The find() method will start at index 0, and apply the regex. The regex can't consume anything, as the character at this location isn't a digit -- but "zero digits" still satisfies the regex, so it matches a zero-length string.

    On the next call, the find() method should start at the end of the previous match, which is at index 0, but after a zero-length match it increments the index by 1, so it starts at index 1. The regex can't consume anything, as the character at this location isn't a digit -- but "zero digits" still satisfies the regex, so it matches a zero-length string.

    On the next call, the find() method should start at the end of the previous match, which is at index 1, but after a zero-length match it increments the index by 1, so it starts at index 2. The regex does find digits at this location, and greedily matches all of them -- it matches "34".

    On the next call, the find() method will start at the end of the previous match, which is at index 4, and apply the regex. The regex can't consume anything, as the character at this location isn't a digit -- but "zero digits" still satisfies the regex, so it matches a zero-length string.

    On the next call, the find() method should start at the end of the previous match, which is at index 4, but after a zero-length match it increments the index by 1, so it starts at index 5. The regex can't consume anything, as the character at this location isn't a digit -- but "zero digits" still satisfies the regex, so it matches a zero-length string.

    On the next call, the find() method should start at the end of the previous match, which is at index 5, but after a zero-length match it increments the index by 1, so it starts at index 6. There is no character left to consume at this location -- but "zero digits" still satisfies the regex, so it matches a zero-length string.

    Also, note that this index is at the end of the string. This is allowed because technically, it is possible to have a zero-length string at the end of the string. Weird, but true.

    On the next call, the find() method should start at the end of the previous match, which is at index 6, but after a zero-length match it increments the index by 1, so it starts at index 7. The regex can't match anything, as this location doesn't exist. It can't even match a zero-length string -- because this exceeds the length of the string.
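    The original snippet is not reproduced in this thread, so the following is only an assumed reconstruction: it applies "\\d*" to "ab34ef" and records start(), end() and group() for each successful find(). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthWalkthrough {
    // One entry per successful find(); group() yields "" for a zero-length match, not null.
    static List<String> trace(String regex, String source) {
        List<String> steps = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(source);
        while (m.find()) {
            steps.add("start=" + m.start() + " end=" + m.end()
                    + " group=\"" + m.group() + "\"");
        }
        return steps;
    }

    public static void main(String[] args) {
        // Expected starts, per the walkthrough above: 0, 1, 2, 4, 5, 6.
        for (String step : trace("\\d*", "ab34ef")) {
            System.out.println(step);
        }
    }
}
```

    Printing the group between quotes makes the zero-length matches visible, and shows that the match at index 0 is an empty string rather than "a" or null.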

    Henry
    [ December 07, 2008: Message edited by: Henry Wong ]
    Henry Wong
    author
    Sheriff

    Joined: Sep 28, 2004
    Posts: 18117

    I would still stick to my words. If you use * with a dot(.), then * will become greedy. But if you put * with a pattern, then * will be reluctant.


    Can you elaborate on what you mean by this statement?

    Greedy means to match as much as possible, but back off (match less) if matching that much causes the overall regex to fail. Reluctant means to match as little as possible, but match more if matching that little causes the overall regex to fail. Whether a match is greedy or reluctant depends on the quantifier -- not on what is being matched.

    Henry
    [ December 07, 2008: Message edited by: Henry Wong ]
    Punit Singh
    Ranch Hand

    Joined: Oct 16, 2008
    Posts: 952
    But in the first code snippet, with the source "ab34ef", why is the group() method returning null (when start() returns 0) instead of "a" (since it has zero or more digits)? The group() method returns the match that has been found, which in my guess is "a" rather than the null returned by group().


    As the regex is "\\d*", the matcher is trying to find digits, and "a" is not a digit.
    If "a" were a digit, then group() would surely have returned "a".
    But the matcher finds zero digits, meaning no digits, at index 0, so group() returns an empty string (not null).
    [ December 07, 2008: Message edited by: Punit Singh ]

    SCJP 6
    Ankit Garg
    Sheriff

    Joined: Aug 03, 2008
    Posts: 9280

    Hi Henry, I backed up what I said with an example. If I search for
    .*\\d
    in
    1bxfdsx3xss5

    then it would match the last 5 as .* would be greedy. But if you search for
    \\d*
    in
    1bxfdsx3xss5

    then it would match as little as possible. So it would give you empty matches at index 1, 2, 3, 4, etc. This is what I was trying to say. I may be wrong; as I said earlier, I have not experimented with this much...
    Henry Wong
    author
    Sheriff

    Joined: Sep 28, 2004
    Posts: 18117

    then it would match as little as possible. So it would give you empty matches at index 1, 2, 3, 4, etc. This is what I was trying to say. I may be wrong; as I said earlier, I have not experimented with this much...


    No... This is not what greedy means. Greedy doesn't mean that it matches a lot of stuff. Those empty matches at index 1, 2, etc., are greedy matches -- it is trying to match as much as possible, but there is simply little to match.
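    One way to check this, sketched here with hypothetical class and method names, is to run the same source through "\\d*" and through the reluctant form "\\d*?" and keep only the non-empty matches. If \\d* really were reluctant, both would behave the same.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDigitsDemo {
    // Returns only the non-empty groups produced by repeated find() calls.
    static List<String> nonEmptyMatches(String regex, String source) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(source);
        while (m.find()) {
            if (!m.group().isEmpty()) {
                found.add(m.group());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String source = "1bxfdsx3xss5";
        // Greedy: takes a digit wherever one is available.
        System.out.println(nonEmptyMatches("\\d*", source));  // [1, 3, 5]
        // Reluctant: settles for zero digits at every position.
        System.out.println(nonEmptyMatches("\\d*?", source)); // []
    }
}
```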

    I'll give a better example of the difference between greedy and reluctant in my next post.

    Henry
    Henry Wong
    author
    Sheriff

    Joined: Sep 28, 2004
    Posts: 18117

    Let's use an example mentioned in this topic...


    source: yyxxxyxx
    regex: .*xx


    This will do a greedy match of any character, and then match "xx" at the end.... with a call to find()... The ".*" portion is greedy, and hence, will try to match everything. However, it must back down two characters, because if it didn't, the "xx" portion of the regex would not match.

    Basically, the ".*" portion of the regex will match "yyxxxy", while the whole regex will match the whole string.

    Let's change the example to use a reluctant quantifier...


    source: yyxxxyxx
    regex: .*?xx


    This will do a reluctant match of any character, and then match "xx" after it, with a call to find(). The ".*?" portion is reluctant, and hence will try to match as little as possible -- ideally zero characters. However, it must match the two "y" characters, because if it didn't, the "xx" portion of the regex would not match.

    Basically, the ".*?" portion of the regex will match "yy", while the whole regex will match "yyxx" -- for the find() at index zero. The reluctant portion matches the bare minimum needed to allow the whole regex to match.
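    For anyone who wants to try both versions side by side, here is a small sketch (class and method names are made up for illustration) that collects every match find() returns for each regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVersusReluctant {
    // Collects the text of every match returned by repeated find() calls.
    static List<String> allMatches(String regex, String source) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(source);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String source = "yyxxxyxx";
        // Greedy: one match that swallows the whole string.
        System.out.println(allMatches(".*xx", source));  // [yyxxxyxx]
        // Reluctant: stops at the first "xx", so a later find() can match again.
        System.out.println(allMatches(".*?xx", source)); // [yyxx, xyxx]
    }
}
```

    Note that the reluctant version produces a second match, "xyxx", on the next find() call, which starts at index 4 where the first match ended.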

    Henry
    Vivek Gorade
    Greenhorn

    Joined: Nov 14, 2009
    Posts: 6
    Thank you very much for the detailed explanation. I was getting sick of the zero-length concept, but Henry's posts clarified everything.
     