Hi ,
I am reading chapter6 of K&B 1.5 and I am not very clear about Greedy quantifiers.
Its says that "+", "*", and "?" are greedy quantifiers. And later it says for greedy quantifier search is from Right to Left.
So if I am right then for the code below there should be only one result and the result should start from position 8 and The matches should be at position: "8","7","6","4","3".
which is not true.
I don;t know what am I missing here . I mean is "+" not a greedy quantifier or Greedy quantifier doesn;t start from Right to Left or something else?
Can anyone please explain this or provide some good link about this?
Ruben Soto
Ranch Hand
Joined: Dec 16, 2008
Posts: 1032
posted
0
Jeena,
What the authors mean when they say that greedy quantifiers match from right to left is not that the string is parsed from right to left, but that a greedy quantifier will match as much as possible, which means that if you have a regex "\\d+" and you have a string "a12345bcd", the regex engine will look at 1 and continue with 2, 3, 4, and 5 (so you go as far to the right as possible.) Then it sees b and that's not a match, so the match in this case will be "12345." In contrast, reluctant quantifiers won't continue looking towards the right once a match has been found, because they match as little as possible.
All code in my posts, unless a source is explicitly mentioned, is my own.
Jeena, AFAIK regex never starts matching from the right. The statement is just there to make you understand the concept. As Ruben said, regex engine keeps searching the search criteria even if a match which satisfies the expression is found. They try to fir as much of text into the match as it is possible...
Ruben Soto wrote:And, since we are talking about greedy and reluctant quantifiers:
?? will only match the empty string
*? will only match the empty string
+? will only match one instance of the item its quantifying
Example please. What do you mean that these qualifiers "will only match the empty string"? Whether they match an empty string is also dependent on the whole pattern, not just the qualifier.
Ruben Soto wrote:And, since we are talking about greedy and reluctant quantifiers:
?? will only match the empty string
*? will only match the empty string
+? will only match one instance of the item its quantifying
Example please. What do you mean that these qualifiers "will only match the empty string"? Whether they match an empty string is also dependent on the whole pattern, not just the qualifier.
Henry
I should have been a little more specific.
I think that "(anything)??" will only match the empty string. The same for "(anything)*?", ".??", ".*?", "1??", "a*?", etc. Since the lower boundary of ? and * is 0 occurrences, making them reluctant makes them only match zero instances (the empty string.)
If used as a part of a bigger pattern, they won't affect the result because of this reason.
For example, the pattern "1234.?(anything)??abc(anything)*?" is equivalent to the pattern "1234.?abc"
Something similar happens with +?, except since the lower boundary of + is 1, this will always cause it to only match one of whatever it is modifying.
This is my understanding although I might be wrong.
I think that "(anything)??" will only match the empty string. The same for "(anything)*?", ".??", ".*?", "1??", "a*?", etc. Since the lower boundary of ? and * is 0 occurrences, making them reluctant makes them only match zero instances (the empty string.)
If used as a part of a bigger pattern, they won't affect the result because of this reason.
For example, the pattern "1234.?(anything)??abc(anything)*?" is equivalent to the pattern "1234.?abc"
Sorry, but this is not true. The reluctant qualifier will match as little as possible, while still allowing the overall pattern to match. If matching zero items causes the overall pattern to not match, it will match more... for example...
Let's say the pattern is ".*?AAA". This pattern matches zero or more of any characters, reluctantly, followed by "AAA'. Now, let's assume the string is "abcdefAAAhijklAAA". And you call the find() method...
It will try to match as little as possible, but if it matches zero characters, the bigger pattern will not match. The smallest amount that the ".*?" portion of the pattern can match is "abcdef" -- any less and the ".*?AAA" pattern won't match via the find() method.
Henry
Ruben Soto
Ranch Hand
Joined: Dec 16, 2008
Posts: 1032
posted
0
Yes, you are correct. I see what you mean by "match as little as possible while allowing the overall pattern to match."
What I find confusing, is this type of result:
source: ab12345cd
pattern: \\d*? // Same results if pattern is \\d??
matches: (Empty string) at 0
(Empty string) at 1 // Why not match 12345 here?
(Empty string) at 2
(Empty string) at 3
(Empty string) at 4
(Empty string) at 5
(Empty string) at 6
(Empty string) at 7
(Empty string) at 8
(Empty string) at 9
I think that the strategy is "match as little as possible, while allowing the overall pattern to match starting at the earliest index of the source string." That is not as catchy, but seems to more accurately depict how the regex engine handles this.
I think that the strategy is "match as little as possible, while allowing the overall pattern to match starting at the earliest index of the source string." That is not as catchy, but seems to more accurately depict how the regex engine handles this.
Not really. There are two things going on -- and they are only somewhat related.
First, the regex. In this case, the whole pattern matches zero characters, so it will always match zero characters. There is no bigger pattern that will force it to match more -- so it will always match zero length in this example.
Second, the find() method. The method will start from beginning of the string looking for a match. If one is found, it will start the next match at the end of the one that is found. Unless, of course, the match is zero length, then it will go to the next character to look for the next match. And finally, if it doesn't find a match at the current location, it will move to the next location.
The behavior of the find() method is the same, regardless of the regex pattern. So, it is better to keep those two things in mind, and try to keep them separated.
Henry
Ruben Soto
Ranch Hand
Joined: Dec 16, 2008
Posts: 1032
posted
0
Henry Wong wrote:
I think that the strategy is "match as little as possible, while allowing the overall pattern to match starting at the earliest index of the source string." That is not as catchy, but seems to more accurately depict how the regex engine handles this.
Not really. There are two things going on -- and they are only somewhat related.
First, the regex. In this case, the whole pattern matches zero characters, so it will always match zero characters. There is no bigger pattern that will force it to match more -- so it will always match zero length in this example.
Second, the find() method. The method will start from beginning of the string looking for a match. If one is found, it will start the next match at the end of the one that is found. Unless, of course, the match is zero length, then it will go to the next character to look for the next match. And finally, if it doesn't find a match at the current location, it will move to the next location.
The behavior of the find() method is the same, regardless of the regex pattern. So, it is better to keep those two things in mind, and try to keep them separated.
Henry
I understand what you mean by keeping what the regex considers a match and the behavior of the find() method separate, Henry. I was testing this using a program which does use the Matcher.find() method, and maybe I was extracting conclusions without seeing through the behavior of the find() method. Thanks for explaining!