Regular Expression ignoring or clipping continuation lines

Mike Rainville
Ranch Hand

Joined: May 29, 2004
Posts: 36
This regular expression works as expected in "The Regex Coach":
(?m)^(\w+)(?:\s+)??((?:.*(?:[\n]^\s+)?.*)*)?

In a Java program (tested on 1.4.2_3 or 1.5 beta 1), it looks like this:
(?m)^(\\w+)(?:\\s+)??((?:.*(?:[\\n]^\\s+)?.*)*)?
_________________________ ^^^^^^^^^^^^^^ ___________ carets mark the section that does continuation
The regex group(1) captures the term "Budgie" and group(2) is its definition: "Active and amusing miniature parrot native to Australia"

EXCEPT that

in Java, the continuation line is ignored, and we get "Active and amusing miniature parrot". I am sure the problem is in here: (?:[\\n]^\\s+)

Inserting $ in almost every conceivable position had no effect.
I also tried .? after [\\n] just in case there was something after the newline and before ^ the beginning of the actual new line. It doesn't seem to matter if there are two, three, four or even five backslashes [\\\\\n].
Either the continuation is ignored or the program captures nothing at all.
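A note on the backslash experiments above: as far as java.util.regex is concerned, most of those spellings end up meaning the same thing, which may be why changing them had no visible effect. The small check below is only a sketch (the class name BackslashDemo and the helper matchesLinefeed are made up for illustration). It compiles several Java string literals and tests each against a single linefeed character.

```java
import java.util.regex.Pattern;

public class BackslashDemo {
    // Hypothetical helper: does this regex match a string holding one linefeed?
    static boolean matchesLinefeed(String regex) {
        return Pattern.compile(regex).matcher("\n").matches();
    }

    public static void main(String[] args) {
        // Each entry is a Java string literal; the comment shows what the regex engine receives.
        String[] literals = {
            "[\n]",      // class holding a raw linefeed character
            "[\\n]",     // class holding the regex escape \n
            "[\\\n]",    // backslash followed by a raw linefeed (an escaped linefeed)
            "[\\\\\n]",  // class holding a backslash and a raw linefeed
            "[\\\\n]"    // class holding a backslash and the letter n (no linefeed at all)
        };
        for (int i = 0; i < literals.length; i++) {
            System.out.println("literal " + i + " matches linefeed: " + matchesLinefeed(literals[i]));
        }
    }
}
```

Only the four-backslash form followed by a plain n stops matching a linefeed. So if the data really contains bare linefeeds, the number of backslashes is unlikely to be the culprit.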

I want to produce output like this: <term>Cat</term> <meaning>Natural loner</meaning>
from a definition file in which some lines are blank and should be ignored, and some lines have terms without definitions. I have added underscores for leading blanks, because the forum strips them out.

Cat Natural loner
Dog Best friend
Budgie Active and amusing miniature parrot
____native to Australia
____and popular as a house pet
Eagle American mascot

Any suggestions as to what I could try?

[ August 13, 2004: Message edited by: Mike Rainville ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Are you saying this is the sample input you're looking at?

Cat Natural loner
Dog Best friend
Budgie Active and amusing miniature parrot
native to Australia
Eagle American mascot

How is anyone supposed to know that "native to Australia" is a continuation of the previous line, rather than a definition of the word "native"? I mean, a human who already knows what the words mean will have little problem, but it seems your format does not indicate this. Unless we're supposed to look at capitalization: "native" is a continuation because it's lowercase, while "Eagle" is a new definition because it's in uppercase. Is that your intent?


"I'm not back." - Bill Harding, Twister
Mike Rainville
Ranch Hand

Joined: May 29, 2004
Posts: 36
I'm sorry. When I entered the data, I must have indented with nulls. The data always has at least four spaces at the beginning of a continuation line (b stands for a blank in what follows):


Budgie A miniature parrot
bbbbnative to Australia
Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
Try this:

It's difficult to say what's causing your regex to fail, with all those unnecessary question marks, asterisks and parentheses. With all that indeterminacy, it's no wonder you get different behavior in Regex Coach. (BTW, the possessive quantifiers - "++", "?+", etc. - probably won't work in Regex Coach; they're a Java innovation.)
Mike Rainville
Ranch Hand

Joined: May 29, 2004
Posts: 36
Thanks for trying. Your suggestion definitely works, and is at least as fast as the one I am working with; it also misses the continued lines, though. I truly suspect that there is a problem with the way Java is handling the \\n, which is common to both patterns.

Except for the highlighted continuation part, the pattern works in both Java and Regex Coach. The question marks are each necessary, because the real test data is a bit more unpredictable: sometimes there is just a term, sometimes the term is followed by varying amounts of whitespace, and there are a few blank lines to ignore.

Each character in the pattern solves an actual problem in the full data file. I made up the test data to highlight the continuation problem.
Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
Well, that regex works for me with the test data you provided. In fact, so does your regex, except that it matches too much when entries are separated by blank lines. Are you sure your data uses the linefeed character to separate lines? Maybe instead of "\\n" you should be using "(?:\\r?\\n|\\r)" to allow for DOS, Mac and Unix line separators.
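To make the line-separator point concrete, here is a minimal sketch. It assumes the data file uses DOS line endings, which is only a guess about the real file. The class name LineSeparatorDemo, the helper definitionOf, and both patterns are invented for this illustration and are simplified compared with the patterns discussed in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineSeparatorDemo {
    // Continuation lines are recognised only after a bare linefeed.
    static final Pattern LF_ONLY = Pattern.compile(
        "(?m)^(\\w+)[ \\t]+(.*(?:\\n[ \\t]+.*)*)");

    // Continuation lines are recognised after \r\n, \n, or \r.
    static final Pattern ANY_EOL = Pattern.compile(
        "(?m)^(\\w+)[ \\t]+(.*(?:(?:\\r?\\n|\\r)[ \\t]+.*)*)");

    // Returns the captured definition for the given term, or null if the term is not found.
    static String definitionOf(Pattern p, String term, String input) {
        Matcher m = p.matcher(input);
        while (m.find()) {
            if (m.group(1).equals(term)) {
                return m.group(2);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Sample data with DOS (\r\n) line endings: an assumption about the real file.
        String dos = "Dog Best friend\r\n"
                   + "Budgie Active and amusing miniature parrot\r\n"
                   + "    native to Australia\r\n"
                   + "Eagle American mascot\r\n";
        System.out.println("[" + definitionOf(LF_ONLY, "Budgie", dos) + "]");
        System.out.println("[" + definitionOf(ANY_EOL, "Budgie", dos) + "]");
    }
}
```

With \r\n data, the linefeed-only pattern stops at "miniature parrot", because the dot will not cross the carriage return and the \n in the pattern never gets a chance to match. That is the same symptom as the one reported above. The second pattern picks up the continuation line.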

How about providing a more representative sample of the data? When you say there may be blank lines to ignore, do you mean within an entry (term + definition), or between entries? In the meantime, try this platform-neutral regex that allows for terms without definitions:
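The pattern posted at this point is not preserved in this copy of the thread. As a stand-in, here is a rough sketch of what a platform-neutral pattern of this kind might look like; it should not be read as the regex that was actually posted. The class name GlossaryParser, the method toMarkup, the sample term "Owl", and the decision to emit an empty <meaning> element for a term with no definition are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GlossaryParser {
    // Any of the DOS, Unix, or old Mac line separators.
    static final String EOL = "(?:\\r?\\n|\\r)";

    // group(1): the term at the start of a line.
    // group(2): the optional definition, including any continuation lines
    //           (lines that start with spaces or tabs and hold some text).
    static final Pattern ENTRY = Pattern.compile(
        "(?m)^(\\w+)(?:[ \\t]+(\\S.*(?:" + EOL + "[ \\t]+\\S.*)*))?");

    static List<String> toMarkup(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = ENTRY.matcher(input);
        while (m.find()) {
            String meaning = "";
            if (m.group(2) != null) {
                // Fold each line break plus its indentation into a single space.
                meaning = m.group(2).replaceAll("[ \\t]*" + EOL + "[ \\t]+", " ").trim();
            }
            out.add("<term>" + m.group(1) + "</term> <meaning>" + meaning + "</meaning>");
        }
        return out;
    }

    public static void main(String[] args) {
        // Mixed line endings, a blank line, and a term without a definition.
        String input = "Cat Natural loner\r\n"
                     + "\r\n"
                     + "Dog Best friend\n"
                     + "Owl\n"
                     + "Budgie Active and amusing miniature parrot\r\n"
                     + "    native to Australia\r\n"
                     + "    and popular as a house pet\r\n"
                     + "Eagle American mascot\n";
        for (String line : toMarkup(input)) {
            System.out.println(line);
        }
    }
}
```

Blank lines fall out naturally because every match has to start with a word character at the beginning of a line. A term followed only by trailing whitespace is treated as having no definition.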


[ August 14, 2004: Message edited by: Alan Moore ]
Mike Rainville
Ranch Hand

Joined: May 29, 2004
Posts: 36
That did the trick, and it performs very well, too. I am deeply grateful for all your help.

May God bless,
Thank you,
Mike
Julian Kennedy
Ranch Hand

Joined: Aug 02, 2004
Posts: 823
Hi guys,

Regular expressions, or text patterns as some people insist on calling them these days, make my head spin. Takes me right back to the days of incomprehensible C code and Unix shell scripts; not like this lovely fluffy Java stuff.

Anyway, my question (probably a very stupid one) is: don't we have $ for the end of a line in regular expressions any more? Or would that just not fit here? I must admit that I haven't investigated due to aforementioned head spinning...

Jules
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
don't we have $ for the end of a line in regular expressions any more?

Yes, we do, and that would probably be the easier way to handle this:

(?m)^(\\w+)\\s+(?:.*+(?:$\\s+.*+)*+)

Note that $ could also be ^ or $^. The end of one line is the beginning of another - except when there is no more input. And we're requiring something else after the $, so that's not an issue.

Also, I favor possessive quantifiers (the "*+" and ".*+" above) whenever possible. There's no reason to allow backtracking here; in many cases backtracking just creates confusion and inefficiency anyway. (Though sometimes it's really, really useful.)
Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
Originally posted by Jim Yingst:

The requirement is that a continuation line start with several spaces, and this regex doesn't satisfy that. The '$' effectively does a lookahead, asserting that the next character is a line separator. Then the '\\s+' actually matches the line sep, and it isn't required to match anything else. If there are spaces following the line sep it will match them, but it will just as happily match more line separators, or nothing at all. In fact, when I try this regex, it consumes all the data in one gulp.
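The "one gulp" behavior is easy to reproduce. The sketch below runs the $\\s+ pattern from the earlier post against sample data with Unix line endings, next to a variant that names the line separator and the horizontal whitespace explicitly. The class name WhitespaceGulpDemo, the helper countMatches, and the STRICT variant are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceGulpDemo {
    // The pattern under discussion: \s+ after $ is free to match line separators.
    static final Pattern LOOSE = Pattern.compile(
        "(?m)^(\\w+)\\s+(?:.*+(?:$\\s+.*+)*+)");

    // Illustrative variant: a line separator must be followed by spaces or tabs.
    static final Pattern STRICT = Pattern.compile(
        "(?m)^(\\w+)[ \\t]+(?:.*+(?:(?:\\r?\\n|\\r)[ \\t]+.*+)*+)");

    // Counts how many separate entries the pattern finds in the input.
    static int countMatches(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String data = "Cat Natural loner\n"
                    + "Dog Best friend\n"
                    + "Budgie Active and amusing miniature parrot\n"
                    + "    native to Australia\n"
                    + "Eagle American mascot\n";
        System.out.println("loose:  " + countMatches(LOOSE, data));
        System.out.println("strict: " + countMatches(STRICT, data));
    }
}
```

The loose pattern finds a single match that swallows every entry, because each line separator satisfies \s+ and the next line is then consumed as if it were a continuation. The strict variant finds one match per term.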

Nice try, Jim, but you just have to be explicit about which kind of whitespace you want to match at each point.

BTW, in Perl 6, '\n' matches any kind of line separator (as if it were '(?:\r?\n|\r)') and the new '\h' shorthand matches only horizontal whitespace (space or tab). *sigh*
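For anyone reading this on a newer JDK: Java 8 and later added similar shorthands to java.util.regex, namely \R for any line break sequence and \h for horizontal whitespace. Neither exists in the 1.4.2 and 1.5 versions this thread was tested on. A short sketch, assuming Java 8 or later (the class name ModernShorthandDemo, the helper countEntries, and the sample data are made up):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ModernShorthandDemo {
    // \R matches any line break sequence, \h matches horizontal whitespace (Java 8+).
    static final Pattern ENTRY = Pattern.compile(
        "(?m)^(\\w+)\\h+(.*(?:\\R\\h+.*)*)");

    // Counts the entries (term plus definition) found in the input.
    static int countEntries(String input) {
        Matcher m = ENTRY.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Deliberately mixes \n, \r\n, and \r separators, and indents with a tab.
        String data = "Dog Best friend\n"
                    + "Budgie A miniature parrot\r\n"
                    + "\tnative to Australia\r"
                    + "Eagle American mascot";
        Matcher m = ENTRY.matcher(data);
        while (m.find()) {
            System.out.println(m.group(1) + " -> [" + m.group(2) + "]");
        }
    }
}
```

The captured definition still contains the raw line break and indentation, so it would need the same kind of cleanup as with the longer "(?:\\r?\\n|\\r)" spelling.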
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Good point, thanks. I forgot that \\s matches line separators too. Replacing $ with $^ would solve part of the problem, but it's unclear from the description how/if blank lines should be handled; oh well.

The requirement is that a continuation line start with several spaces, and this regex doesn't satisfy that.

Err, well, if we're going to be strict about that requirement, yours doesn't either. But it does do well enough for most inputs, unlike mine...
john von
Ranch Hand

Joined: Apr 13, 2004
Posts: 49
 
I agree. Here's the link: http://aspose.com/file-tools
 