Need help in regular expression

Maan Suraj
Ranch Hand

Joined: Dec 20, 2007
Posts: 299
I need some help with the regex below. I am going through the site below to learn regular expressions. So far it has been going well,
but I am stuck on the example below.

http://www.regular-expressions.info/brackets.html

The topic in the above link is "Backtracking Into Capturing Groups".

The string is <boo>bold</b> and the regex is <([A-Z][A-Z0-9]*)[^>].*?</\1>

I know I am getting it wrong. However, according to my understanding and the articles I have read on regex so far, I felt the regex should work as described below. Please correct me.

Regex: <([A-Z][A-Z0-9]*)[^>].*?</\1>
String: <boo>bold</b>




1) < consumes <

2) [A-Z] in the round brackets consumes b

3) [A-Z0-9]* in the round brackets consumes oo

Therefore, the first backreference stores boo

4) [^>] does not match >

Since the above token has a star, that is OK and we proceed to the next token of the regex. The position in the string remains the same.

5) > consumes > (the first one in the string)

6) .*? is lazy, so the regex engine skips this token at first

7) < does not match b

So the engine backtracks to point 6, and . consumes b. Backtracking occurs over and over in the same way until . has consumed "bold"

8) < consumes < (the second one)

9) \1, which I think must have the value boo, as mentioned in point 3

Therefore boo does not match b

10) So the engine will backtrack to point 6, and now . will consume "bold<"

11) < does not match /

So the engine backtracks, and I guess . will now consume "bold</b"
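To make the lazy behaviour in points 6 and 7 concrete, here is a minimal Java sketch. The class name, the helper firstGroup, and the sample string are illustrative assumptions, not taken from the article. It only shows that a lazy .*? grows one character at a time until the rest of the pattern matches, while a greedy .* grabs everything and then gives characters back.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyVsGreedy {
    // Hypothetical helper: returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Illustrative input with two tags, so the two quantifiers give different results.
        String input = "<b>one</b><b>two</b>";

        // Lazy: stops at the first position where </b> can follow.
        System.out.println(firstGroup("<b>(.*?)</b>", input)); // one

        // Greedy: runs to the end of the string, then backs off to the last </b>.
        System.out.println(firstGroup("<b>(.*)</b>", input));  // one</b><b>two
    }
}
```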





But somehow it gets confusing from here. Could anyone please help? The site mentioned below explains something else, and I am unable to follow it. Thanks in advance for your patience.

http://www.regular-expressions.info/brackets.html

The topic in the above link is "Backtracking Into Capturing Groups".


Man Suraj
Someone says "Impossible is nothing". The man next him says "Let me see you licking your elbow tip!"
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19725
    

Your regex is incorrect, and does not match your description of it. It is missing the ending > of the start tag. You can verify by adding some more capturing groups and then checking the results:
Output:
0: <boo>bold</b> (everything)
1: b ([A-Z][A-Z0-9]*)
2: o (non-optional [^>])
3: o>bold (everything up to </b>)

A quick fix in the regex: <([A-Z][A-Z0-9]*)[^>]*>.*?</\\1>
The [^>] is made optional by requiring it 0 or more times, and the closing > is added. If I keep the same capturing groups (around [^>]* and around .*?) the output is then this:
0: <boo>bold</b>
1: b (because you are looking for the end tag </b>)
2: oo ([^>]*)
3: bold (.*?)
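
For anyone who wants to reproduce the two outputs above, here is a minimal Java sketch. It is not the code from the original reply. The class name GroupDump and the helper groups are made up for illustration, and it assumes case-insensitive matching, since the tag in the test string is lower case while the regex uses [A-Z].

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDump {
    // Hypothetical helper: returns all groups (0..n) of the first match,
    // or an empty array if the regex does not match at all.
    static String[] groups(String regex, String input) {
        // Assumption: CASE_INSENSITIVE, so [A-Z] also matches the lower-case tag name.
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    static void dump(String regex, String input) {
        String[] g = groups(regex, input);
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + ": " + g[i]);
        }
    }

    public static void main(String[] args) {
        String input = "<boo>bold</b>";
        // Regex from the question, with extra groups around [^>] and .*?
        dump("<([A-Z][A-Z0-9]*)([^>])(.*?)</\\1>", input);
        // Quick fix, with the same extra groups around [^>]* and .*?
        dump("<([A-Z][A-Z0-9]*)([^>]*)>(.*?)</\\1>", input);
    }
}
```

In both cases group 1 ends up as b, not boo, because the engine backtracks into the first capturing group until </\1> can match the closing tag.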


SCJP 1.4 - SCJP 6 - SCWCD 5 - OCEEJBD 6
Maan Suraj
Ranch Hand

Joined: Dec 20, 2007
Posts: 299
Sorry, I am not getting it. Please see the link and the topic I mentioned in my first post. Also, I am more interested in knowing how the regex engine works in the above case than in the exact output. In my first post, I tried to put down my understanding on a token-by-token basis. I know it may not be completely correct, but it is not completely wrong either.

I would like an explanation along those lines, i.e. how the regex works. Thanks for all the effort you put into explaining, but if someone could explain it taking every token into account, it would be more helpful, so that I can zero in on the error in my understanding.


On the site I mentioned in my first post, they say the following, which I am unable to follow:

Let's take the regex <([A-Z][A-Z0-9]*)[^>]*>.*?</\1> without the word boundary and look inside the regex engine at the point where \1 fails the first time. First, .*? continues to expand until it has reached the end of the string, and </\1> has failed to match each time .*? matched one more character.

Then the regex engine backtracks into the capturing group. [A-Z0-9]* has matched oo, but would just as happily match o or nothing at all. When backtracking, [A-Z0-9]* is forced to give up one character. The regex engine continues, exiting the capturing group a second time. Since [A-Z][A-Z0-9]* has now matched bo, that is what is stored into the capturing group, overwriting boo that was stored before. [^>]* matches the second o in the opening tag. >.*?</ matches >bold<. \1 fails again.

The regex engine does all the same backtracking once more, until [A-Z0-9]* is forced to give up another character, causing it to match nothing, which the star allows. The capturing group now stores just b. [^>]* now matches oo. >.*?</ once again matches >bold<. \1 now succeeds, as does > and an overall match is found. But not the one we wanted.
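
The effect described in the quoted passage can be checked with a small Java sketch. The class name and the helper tagName are hypothetical, and case-insensitive matching is assumed because the sample tags are lower case. Without the word boundary, the engine backtracks into the group and reports b as the tag name. With \b after the group, the group cannot be shortened to bo or b (there is no word boundary between two letters), so no match is found for <boo>bold</b>.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Hypothetical helper: returns the captured tag name (group 1), or null if there is no match.
    static String tagName(String regex, String input) {
        // Assumption: CASE_INSENSITIVE, so [A-Z] matches lower-case tag names.
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String withoutBoundary = "<([A-Z][A-Z0-9]*)[^>]*>.*?</\\1>";
        String withBoundary = "<([A-Z][A-Z0-9]*)\\b[^>]*>.*?</\\1>";

        // Mismatched tags: backtracking shrinks the group from boo to bo to b.
        System.out.println(tagName(withoutBoundary, "<boo>bold</b>")); // b

        // \b blocks the shrinking, so the mismatched tags no longer match.
        System.out.println(tagName(withBoundary, "<boo>bold</b>"));    // null

        // Properly paired tags still match with the word boundary in place.
        System.out.println(tagName(withBoundary, "<b>bold</b>"));      // b
    }
}
```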




Thanks in advance!
 