• Post Reply
  • Bookmark Topic Watch Topic
  • New Topic

Recursive Regular Expression

 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I'm looking to parse expressions like this:

${ one ${ two ${ three } } }

I need to match first on the whole thing. Then I'll take the contents of the outside brackets and parse again. Matching on the innermost brackets first will not do the job for me. Any thoughts? Thanks!!
 
David Harkness
Ranch Hand
Posts: 1646
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I looked this up in Mastering Regular Expressions (p. 328, "Using a Dynamic Regex to Match Nested Pairs"). If you were using Perl, there's a solution in there. The book covers Java, but I'm not familiar with the native Java regex utilities yet.

As the section title implies, it uses a dynamic pattern, in this case a recursive pattern that includes itself. It works by having the included (dynamic) part of the pattern being created by arbitrary Perl code.

While not immediately helpful, you might find more information doing a Google search for dynamic patterns.
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thanks, I'll take a look. In googling for recursive I found a special operator R in some compilers. I probably went in a misleading direction with the title of this ... I need to match the outside expression with regex, then handle the inside with code.
 
Ilja Preuss
author
Sheriff
Posts: 14112
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Regular Expressions are greedy by default and work from left to right. So using the simple expression

"$\\{(.*)\\}"

should match the whole string and give you

" one ${ two ${ three } } "

as group 1.

Does that help???
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Ilja: that may work, but it will fail parsing something like

${one} ${two}

assuming the goal here is to grab "${one}" and "${two}" separately. This may or may not be an issue, depending on Stan's input...

Stan, can you put an upper limit on the nesting depth? The technique described in MRE will work in Java for any particular finite nesting depth. The cost is, the greater the depth, the more complex the regex. (Though it's not too difficult to generate - but it's real ugly to look at.) However java.util.regex does not have the dynamic regex construct from Perl (e.g. (??[expression]) ) which seems necessary to make the regex truly recursive. I'm not aware of any elegant Java-based regex to handle this. In Java it's probably best to write some non-regex code that simply iterates through each char in a String, counting occurances of "${" and "}", and saving a new capture group any time the latter count equals the former.
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I finally got the right Google search for this. The key is "balanced" for parens and such in programming languages. The general conclusion is that it's not possible, and can't be solved with a new bit of syntax because of the whole nature of the regex engine architecture. Whew!

What I'm doing is parsing Wiki markup with regular expressions and building a tree of document elements. I might have to process these nested macros with text substitution before building the dom. A typical one looks like:

${linksto ${page name}}

I can regex find the inner ${page name} pretty easily, replace ${page name} with Home (for example), then find the outer ${linksto Home}. If I do this before parsing all the other Wiki markup then the text produced by the macro will be parsed with no further effort. Guess that will work.
[ October 01, 2004: Message edited by: Stan James ]
 
Ilja Preuss
author
Sheriff
Posts: 14112
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Originally posted by Stan James:
The general conclusion is that it's not possible, and can't be solved with a new bit of syntax because of the whole nature of the regex engine architecture.


Or with other words: the language you want to parse isn't regular: http://en.wikipedia.org/wiki/Regular_grammar

It *is* context-free, though: http://en.wikipedia.org/wiki/Context-free_grammar
 
  • Post Reply
  • Bookmark Topic Watch Topic
  • New Topic