
Regular expression help

 
Luigi Plinge
Ranch Hand
Posts: 441
IntelliJ IDE Scala Windows
I'm after a regular expression that will capture words, defined as

- Letters A-Za-z
- including optional single "." at the end
- bounded by spaces or the beginning / end of input

My attempt so far doesn't work, because \b turns out to treat the "." as an end-of-word boundary, so it would (wrongly) capture "y." from the token "y.o" as a word.

I know that \s represents a space, \A is the start of input, and \Z (or possibly \z?) is the end of input. I also tried a pattern built from those, but that gives an exception.
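
The \b behaviour described above can be reproduced with a small sketch. The pattern here is an assumption made up for illustration (the attempt from the post is not shown), so treat it as one plausible way to hit the problem rather than the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Hypothetical pattern for illustration only; not the pattern from the post.
    static final Pattern WORD = Pattern.compile("\\b[A-Za-z]+\\.?\\b");

    // Returns the first match found in the input, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = WORD.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // \b matches between "." (non-word character) and "o" (word character),
        // so "y." is reported as a word even though it is part of "y.o".
        System.out.println(firstMatch("y.o")); // prints y.
    }
}
```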
 
Jesper de Jong
Java Cowboy
Saloon Keeper
Posts: 15207
Android IntelliJ IDE Java Scala Spring
I tried to look up in the API documentation what \b (a word boundary) means exactly, but it looks like neither the API docs nor the tutorial specify it precisely. So I'd try something else that is defined more clearly.
Luigi Plinge wrote:... but that gives an exception.

When you get an exception, please tell us what exception, with the stack trace if possible - the more specific information you give us, the easier it is to help you.

I tried that line out and got:

What did you mean by \A? That's not a valid escape sequence in regular expressions. (The same for \Z at the end).
 
Luigi Plinge
Ranch Hand
Posts: 441
IntelliJ IDE Scala Windows
From the documentation of Pattern

Boundary matchers
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final terminator, if any
\z The end of the input
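
As a rough sketch of how these matchers behave in java.util.regex: inside a Java string literal each backslash has to be doubled, so \A is written "\\A". Whether missing escapes caused the exception mentioned earlier is only a guess, since the failing line is not shown. The class and method names below are made up for the example.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // True if the whole input is one word: letters plus an optional trailing ".".
    // \z is the very end of the input.
    static boolean isWordStrict(String input) {
        return Pattern.compile("\\A[A-Za-z]+\\.?\\z").matcher(input).find();
    }

    // Same check, but \Z also accepts a final line terminator after the word.
    static boolean isWordAllowingFinalNewline(String input) {
        return Pattern.compile("\\A[A-Za-z]+\\.?\\Z").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isWordStrict("ab."));                 // true
        System.out.println(isWordStrict("a.b"));                 // false
        System.out.println(isWordStrict("ab.\n"));               // false
        System.out.println(isWordAllowingFinalNewline("ab.\n")); // true
    }
}
```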
 
James Sabre
Ranch Hand
Posts: 781
Java Netbeans IDE Ubuntu
To my mind there is ambiguity in your requirement statement :-

1) Do you want to ignore anything that is not a 'word' ? For example, if the input is "Hello \\\\\\\\\\\\\\\ World" do you just want to extract "Hello" and "World" ?
2) Do you want to include any terminating '.' as part of the result? For example, if the input is "Hello. World" do you want to extract "Hello." and "World" or do you want to extract "Hello" and "World"?
3) Do you want to ignore the first word in your input if it is not prefixed by a space? For example, if the input is "Hello World" do you just want "World" since "Hello" is not prefixed by a space?
4) What do you want the result of input of "Hello.World" to be?

If it is difficult to create a formal specification then a good approach is to define a set of test cases and the result you expect. Make sure you consider the edge conditions such as those above.


 
Luigi Plinge
Ranch Hand
Posts: 441
IntelliJ IDE Scala Windows
James -
1) yes
2) yes
3) no
4) ""

So the following are words :
"ab", "ab."

The following are not words :
"ab..", "a.b", ".ab", "a.b.", "a2b."

I got it working by defining 4 separate patterns, although surely there is a better way?

Actually I think it would be a lot easier just to use the split(" ") method on the input String, and match each substring to "[A-Za-z]+\.?". But I'd be interested to hear if it's possible to form a regular expression that includes the split, and why my 2nd attempt in the OP is not valid.
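
The split-and-match idea could look roughly like this. It is a minimal sketch: the class and method names are invented for the example, and it assumes tokens are separated by single spaces as in the post.

```java
import java.util.ArrayList;
import java.util.List;

class SplitWords {
    // Splits on spaces and keeps only tokens made of letters
    // with an optional single "." at the end.
    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        for (String token : input.split(" ")) {
            // String.matches requires the whole token to match the pattern.
            if (token.matches("[A-Za-z]+\\.?")) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("ab ab. ab.. a.b .ab a.b. a2b.")); // [ab, ab.]
        System.out.println(words("Hello.World"));                   // []
    }
}
```

Because matches() tests the entire token, no boundary matchers are needed in the pattern itself.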
 
James Sabre
Ranch Hand
Posts: 781
Java Netbeans IDE Ubuntu
I think the following covers all your use cases though I can still think of use cases that this will probably not cover :-


You will need to improve the specification for me to spend any time refining this.

The complexity of the requirement means you will need to create a very very very good JUnit test harness.
 
Luigi Plinge
Ranch Hand
Posts: 441
IntelliJ IDE Scala Windows
Thanks for that, James. It's not worth spending a lot of time as it isn't for anything important. Just a couple of questions:

1) What's the difference between the "<=" in group 0 and the "=" in group 2?
2) Why is the first group separate while the third group is nested in the second?
3) What are the cases you mention that it wouldn't cover?
 
James Sabre
Ranch Hand
Posts: 781
Java Netbeans IDE Ubuntu
Luigi Plinge wrote:Thanks for that, James. It's not worth spending a lot of time as it isn't for anything important. Just a couple of questions:

1) What's the difference between the "<=" in group 0 and the "=" in group 2?

You need to read the Javadoc for Pattern; in particular you need 'look ahead' and 'look behind'.

2) Why is the first group separate while the third group is nested in the second?


Since neither is a capturing group, both the 'look behind' term and the 'look ahead' term can be either inside or outside of the capturing group, BUT in this case the capturing group is not needed anyway. My final version of the regex was -

where one extracts group() rather than group(1).
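
For readers following along, here is a sketch of how look-behind and look-ahead can express "bounded by spaces or the beginning / end of input" without a capturing group. The pattern is an illustrative assumption and not necessarily the regex from this post; it uses negative look-arounds on \S, so any whitespace (not only a space) counts as a boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundWords {
    // (?<!\S) : not preceded by a non-whitespace character (start of input or whitespace)
    // (?!\S)  : not followed by a non-whitespace character (end of input or whitespace)
    // Illustrative pattern only; an assumption, not the one posted in the thread.
    static final Pattern WORD = Pattern.compile("(?<!\\S)[A-Za-z]+\\.?(?!\\S)");

    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            // Look-arounds consume nothing, so group() is exactly the word.
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("ab ab. ab.. a.b .ab a.b. a2b.")); // [ab, ab.]
        System.out.println(words("Hello. World"));                  // [Hello., World]
    }
}
```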


3) What are the cases you mention that it wouldn't cover?


Can't remember now. At my age I have trouble remembering my own name.
 
Luigi Plinge
Ranch Hand
Posts: 441
IntelliJ IDE Scala Windows
Many thanks for your help. Regular expressions suddenly don't seem so difficult.
 