aspose file tools
The moose likes Programmer Certification (SCJP/OCPJP) and the fly likes Greedy quantifier Big Moose Saloon
  Search | Java FAQ | Recent Topics
Register / Login
JavaRanch » Java Forums » Certification » Programmer Certification (SCJP/OCPJP)
Reply Bookmark "Greedy quantifier" Watch "Greedy quantifier" New topic
Author

Greedy quantifier

Jeena Jeen
Ranch Hand

Joined: Feb 11, 2009
Posts: 47
Hi ,
I am reading chapter6 of K&B 1.5 and I am not very clear about Greedy quantifiers.
Its says that "+", "*", and "?" are greedy quantifiers. And later it says for greedy quantifier search is from Right to Left.
So if I am right then for the code below there should be only one result and the result should start from position 8 and The matches should be at position: "8","7","6","4","3".
which is not true.
I don;t know what am I missing here . I mean is "+" not a greedy quantifier or Greedy quantifier doesn;t start from Right to Left or something else?


Can anyone please explain this or provide some good link about this?
Ruben Soto
Ranch Hand

Joined: Dec 16, 2008
Posts: 1032
Jeena,

What the authors mean when they say that greedy quantifiers match from right to left is not that the string is parsed from right to left, but that a greedy quantifier will match as much as possible, which means that if you have a regex "\\d+" and you have a string "a12345bcd", the regex engine will look at 1 and continue with 2, 3, 4, and 5 (so you go as far to the right as possible.) Then it sees b and that's not a match, so the match in this case will be "12345." In contrast, reluctant quantifiers won't continue looking towards the right once a match has been found, because they match as little as possible.


All code in my posts, unless a source is explicitly mentioned, is my own.
Ankit Garg
Saloon Keeper

Joined: Aug 03, 2008
Posts: 9189
    
    2

Jeena, AFAIK regex never starts matching from the right. The statement is just there to make you understand the concept. As Ruben said, regex engine keeps searching the search criteria even if a match which satisfies the expression is found. They try to fir as much of text into the match as it is possible...


SCJP 6 | SCWCD 5 | Javaranch SCJP FAQ | SCWCD Links
Ruben Soto
Ranch Hand

Joined: Dec 16, 2008
Posts: 1032
And, since we are talking about greedy and reluctant quantifiers:

?? will only match the empty string
*? will only match the empty string
+? will only match one instance of the item its quantifying
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16680
    
  19

Ruben Soto wrote:And, since we are talking about greedy and reluctant quantifiers:

?? will only match the empty string
*? will only match the empty string
+? will only match one instance of the item its quantifying


Example please. What do you mean that these qualifiers "will only match the empty string"? Whether they match an empty string is also dependent on the whole pattern, not just the qualifier.

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Ruben Soto
Ranch Hand

Joined: Dec 16, 2008
Posts: 1032
Henry Wong wrote:
Ruben Soto wrote:And, since we are talking about greedy and reluctant quantifiers:

?? will only match the empty string
*? will only match the empty string
+? will only match one instance of the item its quantifying


Example please. What do you mean that these qualifiers "will only match the empty string"? Whether they match an empty string is also dependent on the whole pattern, not just the qualifier.

Henry

I should have been a little more specific.

I think that "(anything)??" will only match the empty string. The same for "(anything)*?", ".??", ".*?", "1??", "a*?", etc. Since the lower boundary of ? and * is 0 occurrences, making them reluctant makes them only match zero instances (the empty string.)

If used as a part of a bigger pattern, they won't affect the result because of this reason.

For example, the pattern "1234.?(anything)??abc(anything)*?" is equivalent to the pattern "1234.?abc"

Something similar happens with +?, except since the lower boundary of + is 1, this will always cause it to only match one of whatever it is modifying.

This is my understanding although I might be wrong.
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16680
    
  19

I think that "(anything)??" will only match the empty string. The same for "(anything)*?", ".??", ".*?", "1??", "a*?", etc. Since the lower boundary of ? and * is 0 occurrences, making them reluctant makes them only match zero instances (the empty string.)

If used as a part of a bigger pattern, they won't affect the result because of this reason.

For example, the pattern "1234.?(anything)??abc(anything)*?" is equivalent to the pattern "1234.?abc"


Sorry, but this is not true. The reluctant qualifier will match as little as possible, while still allowing the overall pattern to match. If matching zero items causes the overall pattern to not match, it will match more... for example...

Let's say the pattern is ".*?AAA". This pattern matches zero or more of any characters, reluctantly, followed by "AAA'. Now, let's assume the string is "abcdefAAAhijklAAA". And you call the find() method...

It will try to match as little as possible, but if it matches zero characters, the bigger pattern will not match. The smallest amount that the ".*?" portion of the pattern can match is "abcdef" -- any less and the ".*?AAA" pattern won't match via the find() method.

Henry
Ruben Soto
Ranch Hand

Joined: Dec 16, 2008
Posts: 1032
Yes, you are correct. I see what you mean by "match as little as possible while allowing the overall pattern to match."

What I find confusing, is this type of result:

source: ab12345cd
pattern: \\d*? // Same results if pattern is \\d??
matches: (Empty string) at 0
(Empty string) at 1 // Why not match 12345 here?
(Empty string) at 2
(Empty string) at 3
(Empty string) at 4
(Empty string) at 5
(Empty string) at 6
(Empty string) at 7
(Empty string) at 8
(Empty string) at 9

I think that the strategy is "match as little as possible, while allowing the overall pattern to match starting at the earliest index of the source string." That is not as catchy, but seems to more accurately depict how the regex engine handles this.

Thanks,

Ruben
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16680
    
  19

I think that the strategy is "match as little as possible, while allowing the overall pattern to match starting at the earliest index of the source string." That is not as catchy, but seems to more accurately depict how the regex engine handles this.


Not really. There are two things going on -- and they are only somewhat related.

First, the regex. In this case, the whole pattern matches zero characters, so it will always match zero characters. There is no bigger pattern that will force it to match more -- so it will always match zero length in this example.

Second, the find() method. The method will start from beginning of the string looking for a match. If one is found, it will start the next match at the end of the one that is found. Unless, of course, the match is zero length, then it will go to the next character to look for the next match. And finally, if it doesn't find a match at the current location, it will move to the next location.

The behavior of the find() method is the same, regardless of the regex pattern. So, it is better to keep those two things in mind, and try to keep them separated.

Henry
Ruben Soto
Ranch Hand

Joined: Dec 16, 2008
Posts: 1032
Henry Wong wrote:
I think that the strategy is "match as little as possible, while allowing the overall pattern to match starting at the earliest index of the source string." That is not as catchy, but seems to more accurately depict how the regex engine handles this.


Not really. There are two things going on -- and they are only somewhat related.

First, the regex. In this case, the whole pattern matches zero characters, so it will always match zero characters. There is no bigger pattern that will force it to match more -- so it will always match zero length in this example.

Second, the find() method. The method will start from beginning of the string looking for a match. If one is found, it will start the next match at the end of the one that is found. Unless, of course, the match is zero length, then it will go to the next character to look for the next match. And finally, if it doesn't find a match at the current location, it will move to the next location.

The behavior of the find() method is the same, regardless of the regex pattern. So, it is better to keep those two things in mind, and try to keep them separated.

Henry

I understand what you mean by keeping what the regex considers a match and the behavior of the find() method separate, Henry. I was testing this using a program which does use the Matcher.find() method, and maybe I was extracting conclusions without seeing through the behavior of the find() method. Thanks for explaining!
 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: Greedy quantifier
 
Similar Threads
While loop question from SCJP 5.0
Regular expression - Greedy and Reluctant quantifiers
Difference Among Greedy, Reluctant, and Possessive Quantifiers
Quantifiers
SCJP 5.0 objective 3