I'm after a regular expression that will capture words, defined as
- Letters A-Za-z
- including optional single "." at the end
- bounded by spaces or the beginning / end of input
My attempt so far is but this doesn't work because \b turns out to treat the "." as a word boundary, so it would (wrongly) capture "y." from the token "y.o" as a word.
I know that \s represents a whitespace character, \A is the start of input, and \Z (or possibly \z?) is the end of input. I also tried but that gives an exception.
I tried to look up in the API documentation what \b (a word boundary) means exactly, but it looks like neither the API docs nor the tutorial specifies exactly what it means. So I'd rather try something else that is defined more clearly.
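To make the \b problem concrete, here is a minimal sketch. The pattern in it is a hypothetical stand-in chosen for illustration, not the attempt from the post. Roughly speaking, \b matches wherever a word character sits next to a non-word character or the edge of the input, so the position between "." and "o" counts as a boundary:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Hypothetical pattern for illustration only (an assumption, not the original attempt).
    static final Pattern P = Pattern.compile("\\b[A-Za-z]+\\.?");

    // Collects every substring the pattern finds in the input.
    static List<String> findAll(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "y.o" is a single token and not a word, yet pieces of it are found.
        System.out.println(findAll("y.o")); // prints [y., o]
    }
}
```

Both "y." and "o" are reported, which is exactly the unwanted behaviour described above.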
Luigi Plinge wrote: ... but that gives an exception.
When you get an exception, please tell us what exception, with the stack trace if possible - the more specific information you give us, the easier it is to help you.
I tried that line out and got:
What did you mean by \A? That's not a valid escape sequence in regular expressions. (The same for \Z at the end).
Boundary matchers
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final terminator, if any
\z The end of the input
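For what it's worth, the \A, \Z and \z entries in the list above can be tried out with a small sketch like the following. The inputs are made up for illustration; note that inside a Java string literal each backslash has to be doubled.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // True if the whole input is letters, tolerating one final line terminator (\Z).
    static boolean lettersUpToFinalTerminator(String input) {
        return Pattern.compile("\\A[A-Za-z]+\\Z").matcher(input).find();
    }

    // True only if the input is letters right up to the very end (\z).
    static boolean lettersToVeryEnd(String input) {
        return Pattern.compile("\\A[A-Za-z]+\\z").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(lettersUpToFinalTerminator("abc\n")); // prints true
        System.out.println(lettersToVeryEnd("abc\n"));           // prints false
        System.out.println(lettersToVeryEnd("abc"));             // prints true
    }
}
```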
To my mind there is ambiguity in your requirement statement :-
1) Do you want to ignore anything that is not a 'word' ? For example, if the input is "Hello \\\\\\\\\\\\\\\ World" do you just want to extract "Hello" and "World" ?
2) Do you want to include any terminating '.' as part of the result? For example, if the input is "Hello. World" do you want to extract "Hello." and "World" or do you want to extract "Hello" and "World"?
3) Do you want to ignore the first word in your input if it is not prefixed by a space? For example, if the input is "Hello World" do you just want "World" since "Hello" is not prefixed by a space?
4) What do you want the result of input of "Hello.World" to be?
If it is difficult to create a formal specification then a good approach is to define a set of test cases and the result you expect. Make sure you consider the edge conditions such as those above.
The following are not words :
"ab..", "a.b", ".ab", "a.b.", "a2b."
I got it working by defining 4 separate patterns thus: although, surely there is a better way?
Actually I think it would be a lot easier just to use the split(" ") method on the input String, and match each substring to "[A-Za-z]+\.?". But I'd be interested to hear if it's possible to form a regular expression that includes the split, and why my 2nd attempt in the OP is not valid.
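The split-and-match idea could look something like this. It is only a sketch: the class and method names are invented for illustration, and the sample input reuses the non-words listed earlier.

```java
import java.util.ArrayList;
import java.util.List;

class SplitWords {
    // Splits on single spaces and keeps only tokens that are letters
    // with an optional single trailing ".".
    static List<String> words(String input) {
        List<String> out = new ArrayList<>();
        for (String token : input.split(" ")) {
            if (token.matches("[A-Za-z]+\\.?")) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello. World ab.. a.b .ab a.b. a2b. y.o"));
        // prints [Hello., World]
    }
}
```

Because String.matches has to match the entire token, no boundary matchers are needed here. Consecutive spaces produce empty tokens, which simply fail the match.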
Thanks for that, James. It's not worth spending a lot of time as it isn't for anything important. Just a couple of questions:
1) What's the difference between the "<=" in group 0 and the "=" in group 2?
2) Why is the first group separate while the third group is nested in the second?
3) What are the cases you mention that it wouldn't cover?
Luigi Plinge wrote: Thanks for that, James. It's not worth spending a lot of time as it isn't for anything important. Just a couple of questions:
1) What's the difference between the "<=" in group 0 and the "=" in group 2?
You need to read the Javadoc for Pattern; in particular you need 'look ahead' and 'look behind'.
2) Why is the first group separate while the third group is nested in the second?
Since neither is a capturing group, both the 'look behind' term and the 'look ahead' term can be either inside or outside of the capturing group BUT in this case the capturing group is not needed anyway. My final version of the regex was -
where one extracts group() rather than group(1) .
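For readers following along, here is a sketch of how a look-behind / look-ahead pattern without a capturing group might be written. The regex below is an assumption made up for illustration, not the regex posted in this thread: the look-behind requires the start of input or whitespace before the word, and the look-ahead requires whitespace or the end of input after it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundWords {
    // Assumed pattern for illustration: (?<=...) is a look-behind, (?=...) a look-ahead.
    // Neither consumes input, so the match itself is only the word.
    static final Pattern WORD =
            Pattern.compile("(?<=^|\\s)[A-Za-z]+\\.?(?=\\s|$)");

    static List<String> words(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            out.add(m.group()); // group(), i.e. the whole match; no capturing group needed
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello. World ab.. a.b .ab a.b. a2b. y.o"));
        // prints [Hello., World]
    }
}
```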
3) What are the cases you mention that it wouldn't cover?
Can't remember now. At my age I have trouble remembering my own name.