Win a copy of Design for the Mind this week in the Design forum!

# Regex Question - Greedy Quantifier ?

Arvind Darwin
Greenhorn
Posts: 10

For the above code when pattern is [\w]?xx and source is yxxx the output is 0:3:yxx

Since the pattern is greedy, it should start searching from rightmost and output should be 1:4:xxx

Why yxx is coming as output? yxx should be output for reluctant pattern [\w]??xx

Henry Wong
author
Marshal
Posts: 20992
76
Arvind Darwin wrote:

For the above code when pattern is [\w]?xx and source is yxxx the output is 0:3:yxx

Since the pattern is greedy, it should start searching from rightmost and output should be 1:4:xxx

Why yxx is coming as output? yxx should be output for reluctant pattern [\w]??xx

Okay, I'll ask... Where did you learn this?...

Since the pattern is greedy, it should start searching from rightmost and output should be 1:4:xxx

A greedy qualifier on a pattern matches as much as possible -- and backs off only to prevent the overall pattern from failing. Where did you get the definition that greedy means starts from the right?

Henry

vasudha shekar
Greenhorn
Posts: 16
Hi Arvind,

Yes you are right.

1. The greedy quantifier does work backwards until it finds the rightmost match.
2. It also makes sure that it includes everything in the source data upto and including the rightmost match.

Pattern : \w*xx
source : yxxx

In this case the output will be 0 3 yxxx

But the pattern which you have mentioned is as follows :

pattern : \w?xx
source: yxxx

In this case even though the qualifier is greedy , see what is means "\w?" looks for 0 or 1 character. \w? cannot match yx it can either match yxx or xxx . In this case, it opts for "yxx" beacuse of the second point mentioned above.

Arvind Darwin
Greenhorn
Posts: 10
vasudha shekar wrote:Hi Arvind,

Yes you are right.

1. The greedy quantifier does work backwards until it finds the rightmost match.
2. It also makes sure that it includes everything in the source data upto and including the rightmost match.

Pattern : \w*xx
source : yxxx

In this case the output will be 0 3 yxxx

But the pattern which you have mentioned is as follows :

pattern : \w?xx
source: yxxx

In this case even though the qualifier is greedy , see what is means "\w?" looks for 0 or 1 character. \w? cannot match yx it can either match yxx or xxx . In this case, it opts for "yxx" beacuse of the second point mentioned above.

Thanks Vasudha.

So with pattern : \w?xx and source: yxxx,
yxx is chosen because its either 0 or 1.
xxx is not chosen because when everything in the source data when included it becomes yxxx and does not confirm to pattern which is 0 or 1.
Please correct me if my understanding is wrong?

Thanks once again.

Arvind Darwin
Greenhorn
Posts: 10
Henry Wong wrote:
A greedy qualifier on a pattern matches as much as possible -- and backs off only to prevent the overall pattern from failing. Where did you get the definition that greedy means starts from the right?

Hi Henry,

I read this line from K&B but didnt really understand it completely hence I was not sure.
The greedy quantifier does in fact read the entire source data, and then it works backward (from the right) until it finds the rightmost match. At that point, it includes everything from earlier in the source data up to and including the data that is part of the rightmost match.

Henry Wong
author
Marshal
Posts: 20992
76
vasudha shekar wrote:
Yes you are right.

1. The greedy quantifier does work backwards until it finds the rightmost match.
2. It also makes sure that it includes everything in the source data upto and including the rightmost match.

Pattern : \w*xx
source : yxxx

In this case the output will be 0 3 yxxx

Well, this is definitely not true. It is not guaranted to find the rightmost match first. And it will also not "makes sure that it includes everything in the source data upto and including the rightmost match".

Here is an example that does not do either ...

Pattern : \w*xx
source : abc 123 yxxx yxxxxxxxxxxxxxxxxxxxxx

The first call to the find() method will return "yxxx". This is *not* the rightmost match. This is *not* the largest match. And this doesn't include the beginning of the source.

Henry

Kalmun Malvi
Greenhorn
Posts: 2
Henry Wong wrote:
Here is an example that does not do either ...

Pattern : \w*xx
source : abc 123 yxxx yxxxxxxxxxxxxxxxxxxxxx

The first call to the find() method will return "yxxx". This is *not* the rightmost match. This is *not* the largest match. And this doesn't include the beginning of the source.

Wong is right (and wrong is wrong). I would like to elaborate on this with an example to clarify any confusion. This was also my misinterpretation of the greedy quantifiers and their presumed behavior described in the K&B. As Wong stated, the rightmost match doesn't mean that the entire source text is consumed by the engine, then it starts reading backward (from right to left). Not at all. Taking the Wong's example, with a bit of modification so I can fit it on the same line:

Pattern : \w*xx
source : abc 123 yxxx yxxxxxxx

The engine reads from the left (index [0]) to see if it can find a match; "a" fits the "\w" pattern and since greedy quantifier "*" looks for "zero or more" of these "words," it matches forward. It carries on with consecutive number of matches (index [1] and [2]) until it hits index [3] which is a "white space", and everything that it has matched so far goes kaput. Remember, the reason it has come this far is because it hadn't matched the first "x" needed to advance further. It starts all over at the index [4], "1"; it matches "\w", and because of "*", it consumes all the way to the index [7], in which, once again, it fails because the character at this index ([7] is a "white space". Everything is discarded.

Index [8] looks promising which satisfies "\w" but still no "x"; it moves to index [9] -- woola! First "x" is matched, moving on to the next index, [10]. It's another "x"; great! I think a match has been found -- "yxx", but this is a greedy quantifier. The search does not end here. It continues on to consume another character, because it needs to feed the beast -- it is its nature. Now, we are at index [11] and what do you know, it's another "x", so our engine feasts on it, because taking into account the previous character at index [10] with index [11] (and everything that it has matched so far, index [8] and [9], "yx"), it now has another "xx" (index [10] and [11] this time) lined up.

It keeps going one character further, but at index [12], it hits a road block; it's a "white space" again, so now it backs one character off. That's what they mean by finding the rightmost match, because, for instance in this case, when the engine backs one character to the left [from the right to left], it, once again, has a match ("yxxx"). So it seizes it, and everyone is happy. The same happens for a match that starts from the index [13], all the way to [20], "yxxxxxxx". Notice, in this case, the engine goes all the way to the end because all the characters consumed to the end (index [20]), from index [13], it now satisfies the pattern.

This is exactly why they are called "greedy" quantifiers, because they eat up as many characters as possible, as they advance to the right, until the engine fails the matching, at which point, the engine works backward to again revert to the position where the matched substring was satisfying the pattern.

Also remember, that the engine starts reading from the left, that's why your first match's start() is [8] and end() is [11], which again, is a proof that the engine doesn't start from index [20] and works its way backward to index [13] as it was implied in the original post.

Now, what happens if we substitute the "*" quantifier with "?" and tweak the source (notice a space between "123" and "yxxx" is removed)?

Pattern : \w?xx
source : abc 123yxxx yxxxxxxx

Same process takes place. From index [0], the engine begins reading one character at a time but doesn't find any "x" until it encounters a white space at index [3] and starts all over from index [4]. It finds the a match for "\w" but the next index, [5], is a character "2" which is not what it was looking for, "x" (remember, "?" looks for "one or zero" match, so it can't afford having any character other than "x" after it satisfies "\w"). It bumps a reset pointer to the next character at index [5], but index [6] is "3", and the same faith follows it. This process continues until a match at index [8] is found to be an "x" (remember, at this point, the match has starting at index [7]). The next character read is another "x", at index [9]. Now, the "\w?xx" has been achieved, so the first match is [7] to [9] (end() actually shows [10] but this index is exclusive while [7] is inclusive).

The match is reset at index [10] which satisfies "\w" but index [11], a white space, throws the match away. Starting at index [12], the matching finds its one or zero "\w" ("y"), so it attempts to read the next character at index [13], which is an "x". Good! The next character is "x", index [14], and once again, pattern has been matched. So the second match is index [12] to [14] "yxx" (again end() shows [15] but it is exclusive).

Is there any other matches? You bet. Starting at index [15], it reads "x" which is good on "\w". it reads indexes [15] and [16] one character at a time. At that point, it has found another match -- [15] to [17] (end() prints [18] but that's exclusive), "xxx". The last ditch of attempt is to read index [18] which satisfies "\w" pattern, and index [19] which is an "x". At this point, the engine decides that this is also a match because "?" quantifier requires zero or one characters, and in order to satisfy the last two characters of the pattern, "xx", it takes indexes [18] and [19] as its match that has no "\w". We're done.

Performing the same matching with a reluctant quantifier would be different, as expected. Taken the first pattern, albeit modifying it to a reluctant quantifier, we discussed with the second example's source:

Pattern : \w*?xx
source : abc 123yxxx yxxxxxxx

This time, rather than indexes from [4] to [10] are declared as a match, the engine reads up to, and including, index [9], but since it is not a "greedy" quantifier, it reluctantly accepts whatever pattern it has successfully matched so far, "123yxx". So with reluctant quantifiers, as soon as a match is realized, it stops and marks it as its match. Continuing through, it reads index [10], which satisfies "\w" and attempts to read index [11], a white space, which fails the patter, and it is reset.

Index [12] seems to do us good with "\w", and the engine consumes the next character at index [13], "x". It has now provided a match for our first "x". Reading index [14], another "x", solidify our pattern, and the engine reluctantly accepts "yxx" -- index [12] to [14]. Once again, as soon as the engine finds its match from the source, it stops and find() returns true.

Starting at index [15], an "x", is read which satisfies "\w". The next character at [16] is read, another "x", and the engine declares "x" a match -- [15] to [16] (end() is [17] but that's exclusive). What happened? Wasn't it suppose to read another character, an "x" at index [17], to get one "x" for "\w" and two others for the end of "\w*?xx" pattern? Well, since the quantifier is "*", it can match one or zero characters. Moreover, since it is a "reluctant" quantifier, it prefers the shortest answer, which is "zero character plus 'xx'".

Same reasoning can be made for the next match, index [17] and [18]. Then the lonely "x" at index [19] doesn't have a chance to be of any match. The matching game is over.

I hope this helps.

Yalvin Duha
Ranch Hand
Posts: 41
I apologize to the forum as I had not realized I was posting under an old account of mine. Frankly, I had forgotten that I even had an account here in the first place. Apparently, my browser automatically logged me on (just updated it and transferred user/pass files), and I had assumed I was posting under this username account.

In either case, I hope my post comes handy, and above all, I didn't create more confusion among the "doubters."