Regular expression help

Luigi Plinge
Ranch Hand

Joined: Jan 06, 2011
Posts: 441

I'm after a regular expression that will capture words, defined as

- Letters A-Za-z
- including optional single "." at the end
- bounded by spaces or the beginning / end of input

My attempt so far is [...], but this doesn't work because \b turns out to treat the "." as an end-of-word boundary, so it would (wrongly) capture "y." from the token "y.o" as a word.

I know that \s represents a space, \A is the start of input, and \Z (or possibly \z?) is the end of input. I also tried [...], but that gives an exception.
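
The original attempt is not preserved in this thread, so the following is only a sketch of the \b problem described above. It assumes a pattern of the form \b[A-Za-z]+\.?\b (a guess, not necessarily what was posted); the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordBoundaryDemo {
    // Assumed pattern: the attempt from the post is missing, this is a guess at its shape.
    static final Pattern ATTEMPT = Pattern.compile("\\b[A-Za-z]+\\.?\\b");

    // Collects every substring the pattern finds in the input.
    static List<String> findAll(String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = ATTEMPT.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // \b sits between "." (non-word) and "o" (word), so "y." is accepted.
        System.out.println(findAll("y.o"));    // [y., o]
        // Between "." and " " there is no \b, so the trailing dot is dropped.
        System.out.println(findAll("ab. cd")); // [ab, cd]
    }
}
```

With this assumed pattern, \b is wrong in both directions: it accepts "y." inside "y.o" and loses the dot of "ab." when a space follows.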
Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14149
    

I tried to look up in the API documentation what \b (a word boundary) means exactly, but it looks like neither the API docs nor the tutorial specifies exactly what it means. So, I'd try something else instead that is defined more clearly.
Luigi Plinge wrote:... but that gives an exception.

When you get an exception, please tell us what exception, with the stack trace if possible - the more specific information you give us, the easier it is to help you.

I tried that line out and got: [...]

What did you mean by \A? That's not a valid escape sequence in regular expressions. (The same for \Z at the end).

Luigi Plinge

From the documentation of Pattern

Boundary matchers
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final terminator, if any
\z The end of the input
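
A small sketch of the input anchors from the list above, as they behave in java.util.regex. The class name and helper are made up for illustration. Note that in Java source the backslash has to be doubled ("\\A"); a single "\A" inside a string literal does not compile, which may be part of the confusion earlier in the thread.

```java
import java.util.regex.Pattern;

final class InputAnchorsDemo {
    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // \A and \z pin the match to the very start and very end of the input.
        System.out.println(found("\\Aabc\\z", "abc"));   // true
        // \Z tolerates one final line terminator, \z does not.
        System.out.println(found("\\Aabc\\Z", "abc\n")); // true
        System.out.println(found("\\Aabc\\z", "abc\n")); // false
    }
}
```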
James Sabre
Ranch Hand

Joined: Sep 07, 2004
Posts: 781

To my mind there is ambiguity in your requirement statement :-

1) Do you want to ignore anything that is not a 'word' ? For example, if the input is "Hello \\\\\\\\\\\\\\\ World" do you just want to extract "Hello" and "World" ?
2) Do you want to include any terminating '.' as part of the result? For example, if the input is "Hello. World" do you want to extract "Hello." and "World" or do you want to extract "Hello" and "World"?
3) Do you want to ignore the first word in your input if it is not prefixed by a space? For example, if the input is "Hello World" do you just want "World" since "Hello" is not prefixed by a space?
4) What do you want the result of input of "Hello.World" to be?

If it is difficult to create a formal specification then a good approach is to define a set of test cases and the result you expect. Make sure you consider the edge conditions such as those above.



Luigi Plinge

James -
1) yes
2) yes
3) no
4) ""

So the following are words :
"ab", "ab."

The following are not words :
"ab..", "a.b", ".ab", "a.b.", "a2b."

I got it working by defining 4 separate patterns, thus: [...]. Surely there is a better way, though?

Actually I think it would be a lot easier just to use the split(" ") method on the input String, and match each substring to "[A-Za-z]+\.?". But I'd be interested to hear if it's possible to form a regular expression that includes the split, and why my 2nd attempt in the OP is not valid.
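
The split-then-match idea above could look roughly like this. It is a minimal sketch: the class and method names are invented, and it assumes tokens are separated by single space characters as in the post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class SplitAndMatch {
    // Letters only, with at most one trailing dot.
    static final Pattern WORD = Pattern.compile("[A-Za-z]+\\.?");

    // Splits on spaces and keeps only the tokens that are words in full.
    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        for (String token : input.split(" ")) {
            // matches() requires the whole token to fit the pattern.
            if (WORD.matcher(token).matches()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("ab ab. ab.. a.b .ab a.b. a2b.")); // [ab, ab.]
        System.out.println(words("Hello.World"));                  // []
    }
}
```

Because matches() anchors at both ends of each token, no boundary matchers are needed at all in this version.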
James Sabre

I think the following covers all your use cases though I can still think of use cases that this will probably not cover :- [...]


You will need to improve the specification for me to spend any time refining this.

The complexity of the requirement means you will need to create a very very very good JUnit test harness.
Luigi Plinge

Thanks for that, James. It's not worth spending a lot of time as it isn't for anything important. Just a couple of questions:

1) What's the difference between the "<=" in group 0 and the "=" in group 2?
2) Why is the first group separate while the third group is nested in the second?
3) What are the cases you mention that it wouldn't cover?
James Sabre

Luigi Plinge wrote:Thanks for that, James. It's not worth spending a lot of time as it isn't for anything important. Just a couple of questions:

1) What's the difference between the "<=" in group 0 and the "=" in group 2?

You need to read the Javadoc for Pattern; in particular you need 'look ahead' and 'look behind'.

2) Why is the first group separate while the third group is nested in the second?


Since neither is a capturing group, both the 'look behind' term and the 'look ahead' term can be either inside or outside of the capturing group BUT in this case the capturing group is not needed anyway. My final version of the regex was [...], where one extracts group() rather than group(1).
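
The regex posted here is not preserved, so what follows is not James's version. It is only a sketch of the look-behind / look-ahead technique being discussed, under the assumption that a word must be preceded by the start of input or whitespace and followed by whitespace or the end of input. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundWords {
    // (?<=^|\s) is a look-behind: start of input or whitespace must come before.
    // (?=\s|$)  is a look-ahead: whitespace or end of input must come after.
    // Neither consumes characters and neither is a capturing group.
    static final Pattern WORD =
            Pattern.compile("(?<=^|\\s)[A-Za-z]+\\.?(?=\\s|$)");

    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            // No capturing group is needed, so group() is the whole word.
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("ab ab. ab.. a.b .ab a.b. a2b.")); // [ab, ab.]
        System.out.println(words("Hello. World"));                 // [Hello., World]
        System.out.println(words("y.o"));                          // []
    }
}
```

The negative forms (?<!\S) and (?!\S) are another common way to write much the same condition.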


3) What are the cases you mention that it wouldn't cover?


Can't remember now. At my age I have trouble remembering my own name.
Luigi Plinge

Many thanks for your help. Regular expressions suddenly don't seem so difficult.
 