Regular Expression ignoring or clipping continuation lines

 
Ranch Hand
Posts: 37
This regular expression works as expected in "The Regex Coach":
(?m)^(\w+)(?:\s+)??((?:.*(?:[\n]^\s+)?.*)*)?

In a Java program, (tested on 1.4.2_3 or 1.5 beta 1) it looks like this
(?m)^(\\w+)(?:\\s+)??((?:.*(?:[\\n]^\\s+)?.*)*)?
_________________________ ^^^^^^^^^^^^^^ ___________ carets mark the section that does continuation
The regex group(1) captures the term "Budgie" and group(2) is its definition: "Active and amusing miniature parrot native to Australia"

EXCEPT that

in Java, the continuation line is ignored, and we get "Active and amusing miniature parrot". I am sure the problem is in here: (?:[\\n]^\\s+)

Inserting $ in almost every conceivable position had no effect.
I also tried .? after [\\n] just in case there was something after the newline and before ^ the beginning of the actual new line. It doesn't seem to matter if there are two, three, four or even five backslashes [\\\\\n].
Either the continuation is ignored or the program captures nothing at all.
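For reference, here is how the backslash count plays out between a Java string literal and the regex engine. This is a minimal sketch using only java.util.regex; the class and method names are made up for illustration. One or two backslashes both end up matching a line feed, while four produce a pattern that looks for a literal backslash followed by the letter n.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are assumptions for illustration only.
class NewlineEscapes {

    // Returns true if the whole regex matches a single line feed character.
    static boolean matchesLineFeed(String regex) {
        return Pattern.matches(regex, "\n");
    }

    public static void main(String[] args) {
        // The Java compiler turns "\n" into a raw line feed before the regex engine sees it.
        System.out.println(matchesLineFeed("\n"));      // true
        // "\\n" reaches the regex engine as \n, which is its escape for a line feed.
        System.out.println(matchesLineFeed("\\n"));     // true
        // The same escape works inside a character class.
        System.out.println(matchesLineFeed("[\\n]"));   // true
        // "\\\\n" reaches the regex engine as \\n: a literal backslash followed by 'n'.
        System.out.println(matchesLineFeed("\\\\n"));   // false
    }
}
```

So "\\n" in the Java source is already the right spelling for a line feed; if the continuation is still dropped, the escaping is probably not the culprit.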

I want to produce output like this: <term>Cat</term> <meaning>Natural loner</meaning>
from a definition file in which some lines are blank (and should be ignored) and some terms have no definition. I have added underscores for leading blanks, because the forum strips them.

Cat Natural loner
Dog Best friend
Budgie Active and amusing miniature parrot
____native to Australia
____and popular as a house pet
Eagle American mascot

Any suggestions as to what I could try?

[ August 13, 2004: Message edited by: Mike Rainville ]
 
Wanderer
Posts: 18671
Are you saying this is the sample input you're looking at?

Cat Natural loner
Dog Best friend
Budgie Active and amusing miniature parrot
native to Australia
Eagle American mascot

How is anyone supposed to know that "native to Australia" is a continuation of the previous line, rather than a definition of the word "native"? I mean, a human who already knows what the words mean will have little problem, but it seems your format does not indicate this. Unless we're supposed to look at capitalization: "native" is a continuation because it's lowercase, while "Eagle" is a new definition because it's in uppercase. Is that your intent?
 
Mike Rainville
Ranch Hand
Posts: 37
I'm sorry. When I entered the data, I must have indented with nulls. The data always has at least four spaces at the beginning of a continuation line (b stands for a blank in what follows):


Budgie A miniature parrot
bbbbnative to Australia
 
Ranch Hand
Posts: 262
Try this:

It's difficult to say what's causing your regex to fail, with all those unnecessary question marks, asterisks and parentheses. With all that indeterminacy, it's no wonder you get different behavior in Regex Coach. (BTW, the possessive quantifiers - "++", "?+", etc. - probably won't work in Regex Coach; they're a Java innovation.)
 
Mike Rainville
Ranch Hand
Posts: 37
Thanks for trying. Your suggestion definitely works, and is at least as fast as the one I am working with; it also misses the continued lines, though. I truly suspect that there is a problem with the way Java is handling the \\n, which is common to both patterns.

Except for the highlighted continuation part, the pattern works in both Java and Regex Coach. The question marks are each necessary, because the real test data is a bit more unpredictable: sometimes there is just a term, sometimes it is followed by whitespace in varying amounts, and there are a few blank lines to ignore.

Each character in the pattern solves an actual problem in the full data file. I made up the test data to highlight the continuation problem.
 
Alan Moore
Ranch Hand
Posts: 262
Well, that regex works for me with the test data you provided. In fact, so does your regex, except that it matches too much when entries are separated by blank lines. Are you sure your data uses the linefeed character to separate lines? Maybe instead of "\\n" you should be using "(?:\\r?\\n|\\r)" to allow for DOS, Mac and Unix line separators.

How about providing a more representative sample of the data? When you say there may be blank lines to ignore, do you mean within an entry (term + definition), or between entries? In the meantime, try this platform-neutral regex that allows for terms without definitions:
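The line-separator advice can be sketched as follows. This is an illustrative pattern built on assumptions, not the regex from the post: the class name, method name, the sample term "Emu" and both patterns are hypothetical. It relies on the fact that in Java the dot does not match \r, so on DOS-style data a pattern that expects \n right after the definition text stops at the \r and drops the continuation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; all names here are assumptions for illustration.
class GlossaryParser {

    // CRLF (DOS), LF (Unix) or a lone CR (old Mac).
    static final String EOL = "(?:\\r?\\n|\\r)";

    // Continuation = line feed only, then at least one space or tab.
    static final Pattern LF_ONLY =
        Pattern.compile("(?m)^(\\w+)[ \\t]*(.*(?:\\n[ \\t]+.*)*)");

    // Continuation = any line separator, then at least one space or tab.
    static final Pattern ANY_EOL =
        Pattern.compile("(?m)^(\\w+)[ \\t]*(.*(?:" + EOL + "[ \\t]+.*)*)");

    // Collects term -> definition; a term without a definition maps to "".
    static Map<String, String> parse(Pattern p, String text) {
        Map<String, String> entries = new LinkedHashMap<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            // Fold each separator plus its indentation into a single space.
            String meaning = m.group(2).replaceAll(EOL + "[ \\t]+", " ");
            entries.put(m.group(1), meaning);
        }
        return entries;
    }

    public static void main(String[] args) {
        String crlf = "Cat Natural loner\r\n"
                    + "Dog Best friend\r\n"
                    + "\r\n"
                    + "Budgie Active and amusing miniature parrot\r\n"
                    + "    native to Australia\r\n"
                    + "Emu\r\n"
                    + "Eagle American mascot\r\n";
        for (Map.Entry<String, String> e : parse(ANY_EOL, crlf).entrySet()) {
            System.out.println("<term>" + e.getKey() + "</term> <meaning>"
                               + e.getValue() + "</meaning>");
        }
    }
}
```

On the CRLF sample, LF_ONLY yields "Active and amusing miniature parrot" for Budgie, while ANY_EOL also picks up "native to Australia". Blank lines are skipped because \w+ cannot match at the start of an empty line.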


[ August 14, 2004: Message edited by: Alan Moore ]
 
Mike Rainville
Ranch Hand
Posts: 37
That did the trick, and it performs very well, too. I am deeply grateful for all your help.

May God bless,
Thank you,
Mike
 
Ranch Hand
Posts: 823
Hi guys,

Regular expressions, or text patterns as some people insist on calling them these days, make my head spin. Takes me right back to the days of incomprehensible C code and Unix shell scripts; not like this lovely fluffy Java stuff.

Anyway, my question (probably a very stupid one) is: don't we have $ for the end of a line in regular expressions any more? Or would that just not fit here? I must admit that I haven't investigated due to aforementioned head spinning...

Jules
 
Jim Yingst
Wanderer
Posts: 18671
don't we have $ for the end of a line in regular expressions any more?

Yes, we do, and that would probably be the easier way to handle this:

(?m)^(\\w+)\\s+(?:.*+(?:$\\s+.*+)*+)

Note that $ could also be ^ or $^. The end of one line is the beginning of another - except when there is no more input. And we're requiring something else after the $, so that's not an issue.

Also, I favor possessive quantifiers (the ones with the extra +, as in .*+ above) whenever possible. There's no reason to allow backtracking here; in many cases backtracking just creates confusion and inefficiency anyway. (Though sometimes it's really, really useful.)
 
Alan Moore
Ranch Hand
Posts: 262
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

The requirement is that a continuation line start with several spaces, and this regex doesn't satisfy that. The '$' effectively does a lookahead, asserting that the next character is a line separator. Then the '\\s+' actually matches the line sep, and it isn't required to match anything else. If there are spaces following the line sep it will match them, but it will just as happily match more line separators, or nothing at all. In fact, when I try this regex, it consumes all the data in one gulp.
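The "one gulp" behaviour is easy to reproduce. The sketch below runs the regex quoted earlier in the thread against the sample data, next to a stricter variant that consumes the line feed explicitly and then insists on spaces or tabs. The class name, method name and the stricter pattern are assumptions for illustration, and the variant only handles \n line endings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and the STRICT pattern are assumptions.
class WhitespaceDemo {

    static final String DATA =
          "Cat Natural loner\n"
        + "Dog Best friend\n"
        + "Budgie Active and amusing miniature parrot\n"
        + "    native to Australia\n"
        + "Eagle American mascot";

    // As posted: after $, \s+ is satisfied by the line feed alone,
    // so every following line is treated as a continuation.
    static final Pattern LOOSE =
        Pattern.compile("(?m)^(\\w+)\\s+(?:.*+(?:$\\s+.*+)*+)");

    // Stricter variant: match the line feed, then require horizontal whitespace.
    static final Pattern STRICT =
        Pattern.compile("(?m)^(\\w+)[ \\t]+(?:.*+(?:\\n[ \\t]++.*+)*+)");

    // Counts how many separate entries the pattern finds in the text.
    static int countMatches(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches(LOOSE, DATA));   // 1: the whole file is one match
        System.out.println(countMatches(STRICT, DATA));  // 4: Cat, Dog, Budgie, Eagle
    }
}
```

With STRICT, the Budgie match spans two lines and stops before "Eagle", because [ \t]++ finds no indentation there.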

Nice try, Jim, but you just have to be explicit about which kind of whitespace you want to match at each point.

BTW, in Perl 6, '\n' matches any kind of line separator (as if it were '(?:\r?\n|\r)') and the new '\h' shorthand matches only horizontal whitespace (space or tab). *sigh*
 
Jim Yingst
Wanderer
Posts: 18671
Good point, thanks. I forgot that \\s matches line separators too. Replacing $ with $^ would solve part of the problem, but it's unclear from the description how/if blank lines should be handled; oh well.

The requirement is that a continuation line start with several spaces, and this regex doesn't satisfy that.

Err, well, if we're going to be strict about that requirement, yours doesn't either. But it does do well enough for most inputs, unlike mine...
 