How to get a RegEx to extract only uppercase from string

 
Mike London
Bartender
Is there a way to extract just the upper cased words in a mixed upper and lower cased sentence?

Sample String: STRING two MORE

I want a RegEx that will give me STRING MORE

My current RegEx is: Pattern p2 = Pattern.compile("\\b(([A-Z]+)\\s*)+",Pattern.MULTILINE);

But this only works if the upper cased words are already next to each other.

Thanks for any suggestions.

- mike
 
fred rosenberger
lowercase baba
Have you considered that a regex may not be the right way to tackle this task?
 
Mike London
Bartender

fred rosenberger wrote:Have you considered that a regex may not be the right way to tackle this task?



Yep, after banging my head on a table, doing 2^N searches, etc., I certainly am.

Tools like RegExRx show that that RegEx works with the expected matches.

But, in Java, not so much.

Therefore, posting here was a sanity check & last resort to make sure I wasn't missing anything.

- mike
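
One possible explanation for the gap between the regex tool and Java, assuming the Java code called find() only once: tools of that kind usually list every match, while a Matcher hands back one match per find() call. The sketch below loops over find() with the pattern from the first post; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllMatches {
    // The pattern from the first post, unchanged.
    static final Pattern ORIGINAL =
            Pattern.compile("\\b(([A-Z]+)\\s*)+", Pattern.MULTILINE);

    // Collects every match; each find() call resumes after the previous match.
    static List<String> allMatches(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = ORIGINAL.matcher(input);
        while (m.find()) {
            // The pattern also swallows trailing whitespace, so trim it off.
            matches.add(m.group().trim());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("STRING two MORE")); // prints [STRING, MORE]
    }
}
```

Note that this pattern has no closing \b, so a mixed-case word such as "Two" still contributes its leading "T".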
 
fred rosenberger
lowercase baba
So I would think the correct question would be "How do I extract only uppercase letters from a string?" There are many ways, but which you'd use depends on the specific details.

Why not iterate through the string, character by character, and only print/keep upper case letters?

Why not do a substitution, replacing all non-uppercase letters with an empty string?

Why does it HAVE to be a regular expression?
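
Both non-regex-matcher suggestions can be sketched in a few lines. This is only an illustration with hypothetical names, and it works at the letter level as the question above is phrased, so spaces are dropped and a word like "Two" would contribute its "T".

```java
class UpperCaseLetters {
    // Walks the string and keeps only the upper case letters.
    static String keepByLoop(String input) {
        StringBuilder kept = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (Character.isUpperCase(c)) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    // Substitution variant: delete everything that is not A-Z.
    // For plain ASCII input this gives the same result as the loop.
    static String keepBySubstitution(String input) {
        return input.replaceAll("[^A-Z]", "");
    }

    public static void main(String[] args) {
        System.out.println(keepByLoop("STRING two MORE"));         // prints STRINGMORE
        System.out.println(keepBySubstitution("STRING two MORE")); // prints STRINGMORE
    }
}
```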
 
Mike London
Bartender

fred rosenberger wrote:So I would think the correct question would be "How do I extract only uppercase letters from a string?" There are many ways, but which you'd use depends on the specific details.

Why not iterate through the string, character by character, and only print/keep upper case letters?

Why not do a substitution, replacing all non-uppercase letters with an empty string?

Why does it HAVE to be a regular expression?



It certainly doesn't have to be a Regular Expression. I was just in a death-roll trying to get one to work. You probably know what  I mean (one more compile!!!).

Once I got your reply, though, I just used string.split() and checked each word against a simple RegEx. If it passes, I append it to the output string for the web service.

Done.

Thank you!

- mike
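
A minimal sketch of the split-and-check approach described in this post. It assumes words are separated by whitespace and that "simple RegEx" means something like [A-Z]+; neither detail is stated in the post, and the names are hypothetical.

```java
class SplitAndCheck {
    // Splits on whitespace and keeps the words made up only of A-Z.
    static String upperCaseWords(String sentence) {
        StringBuilder out = new StringBuilder();
        for (String word : sentence.trim().split("\\s+")) {
            if (word.matches("[A-Z]+")) {
                if (out.length() > 0) {
                    out.append(' '); // single space between kept words
                }
                out.append(word);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseWords("STRING two MORE")); // prints STRING MORE
    }
}
```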
 
Carey Brown
Bartender
Sounds like you've got it under control now. Here are some Java 8 ways to do it (I'm still on my Java 8 learning curve).
Output:
 
Junilu Lacar
Sheriff
I don't know if it's just me but it seems it would have been straightforward to do something like this:

Of course, you'd have to use this in conjunction with a Scanner that reads in each word or however it is you separate words, maybe even just use String.split() and iterate over the resulting array.
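
As a sketch of that idea, assuming the check is "every character is an upper case letter" and that a Scanner supplies the words: the method names below are invented for illustration and are not taken from the post.

```java
import java.util.Scanner;

class AllCapsCheck {
    // True when the word is non-empty and every character is an upper case letter.
    static boolean isAllUpperCase(String word) {
        return !word.isEmpty() && word.chars().allMatch(Character::isUpperCase);
    }

    // Reads whitespace-separated words with a Scanner and keeps the all-caps ones.
    static String upperCaseWords(String sentence) {
        StringBuilder out = new StringBuilder();
        try (Scanner scanner = new Scanner(sentence)) {
            while (scanner.hasNext()) {
                String word = scanner.next();
                if (isAllUpperCase(word)) {
                    if (out.length() > 0) {
                        out.append(' ');
                    }
                    out.append(word);
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseWords("STRING two MORE")); // prints STRING MORE
    }
}
```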
 
Mike London
Bartender

Carey Brown wrote:Sounds like you've got it under control now. Here are some Java 8 ways to do it (I'm still on my Java 8 learning curve).
Output:



Nice!

Every time I think I'm getting the hang of the Java 8 API, I see something like your reply.

I wonder how long it will take me to be this proficient!  :-(

Thanks very much!!

- mike
 
Stephan van Hulst
Saloon Keeper
Personally I think I would have used a regex after all:
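
A regex-only version could look roughly like the sketch below. It is a reconstruction based on how the approach is described later in the thread (an upper case word surrounded by word boundaries, collected in a while loop), not necessarily the code that was posted; the names are hypothetical.

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UpperCaseWordMatcher {
    // One or more A-Z with a word boundary on each side.
    static final Pattern UPPER_WORD = Pattern.compile("\\b[A-Z]+\\b");

    static String upperCaseWords(String sentence) {
        StringJoiner joined = new StringJoiner(" ");
        Matcher m = UPPER_WORD.matcher(sentence);
        while (m.find()) {
            joined.add(m.group());
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseWords("STRING two MORE")); // prints STRING MORE
    }
}
```

Because digits and underscores count as word characters for \b, a token such as "ABC1" is skipped entirely rather than cut down to "ABC".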
 
Mike London
Bartender

Stephan van Hulst wrote:Personally I think I would have used a regex after all:



Thanks.
 
Junilu Lacar
Sheriff
Stephan's solution is a little cleaner, IMO. I still don't think you need a (complicated) regex though

or

The only difference between the two is whether you're trimming a leading space or a trailing space.

Simpler yet, you can use Collectors:
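
The three stream variants described here can be sketched as follows, assuming words are separated by a single space and an all-caps word means [A-Z]+. The method names are invented; the two reduce() versions differ only in which side the extra space ends up on before trim().

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class StreamVariants {
    static boolean isAllCaps(String word) {
        return word.matches("[A-Z]+");
    }

    // reduce() that appends a space after each word, then trims the trailing one.
    static String reduceTrailing(String sentence) {
        return Arrays.stream(sentence.split(" "))
                .filter(StreamVariants::isAllCaps)
                .reduce("", (acc, word) -> acc + word + " ")
                .trim();
    }

    // reduce() that puts a space before each word, then trims the leading one.
    static String reduceLeading(String sentence) {
        return Arrays.stream(sentence.split(" "))
                .filter(StreamVariants::isAllCaps)
                .reduce("", (acc, word) -> acc + " " + word)
                .trim();
    }

    // Collectors.joining() inserts the separator only between elements.
    static String joining(String sentence) {
        return Arrays.stream(sentence.split(" "))
                .filter(StreamVariants::isAllCaps)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(joining("STRING two MORE")); // prints STRING MORE
    }
}
```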

 
Mike London
Bartender

Junilu Lacar wrote:Stephan's solution is a little cleaner, IMO. I still don't think you need a (complicated) regex though

or

The only difference between the two is whether you're trimming a leading space or a trailing space.

Simpler yet, you can use Collectors:



Amazing. Thank you!

Did you spend some time playing around to get the expected result as I would have or did you touch-type the code above?

- mike
 
Junilu Lacar
Sheriff

Mike London wrote:
Did you spend some time playing around to get the expected result as I would have or did you touch-type the code above?


I've been doing this long enough to not have to look at the keyboard when I type. The code was not all written just off the top of my head though. I always have to look up how to use Collectors, for example, but I did already know about reduce() without having to look it up. Besides, I always test out my code. No use posting something that doesn't actually work.
 
Junilu Lacar
Sheriff
If you want to strictly define words as any sequence of [A-Za-z] and ignore any non-word chars like commas, semicolons, apostrophes, and other kinds of punctuation, you can do this:

The expression on line 4 uses the \p{Alpha} POSIX character class. You can find more like these to experiment with here: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html -- which one you use all depends on how you want to define a "word" and what characters to consider as word boundaries.

The use of symbolic names is just to clarify intent.
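
A sketch along these lines, using the WORD_BOUNDARY name and the "[^\\p{Alpha}]" expression that come up later in the thread; the ALL_CAPS predicate and the class name are assumptions made for illustration. In Java, \p{Alpha} covers only ASCII letters unless the UNICODE_CHARACTER_CLASS flag is set.

```java
import java.util.Arrays;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class AlphaBoundary {
    // Anything that is not an ASCII letter separates words.
    static final String WORD_BOUNDARY = "[^\\p{Alpha}]";

    // A word qualifies when it consists only of A-Z (this also drops empty tokens).
    static final Predicate<String> ALL_CAPS = word -> word.matches("[A-Z]+");

    static String upperCaseWords(String sentence) {
        return Arrays.stream(sentence.split(WORD_BOUNDARY))
                .filter(ALL_CAPS)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(upperCaseWords("STRING, two; MORE!")); // prints STRING MORE
    }
}
```

With this definition a digit is a boundary too, so "ABC1" contributes "ABC".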
 
Mike London
Bartender

Junilu Lacar wrote:If you want to strictly define words as any sequence of [A-Za-z] and ignore any non-word chars like commas, semicolons, apostrophes, and other kinds of punctuation, you can do this:

The expression on line 4 uses the \p{Alpha} POSIX character class. You can find more like these to experiment with here: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html -- which one you use all depends on how you want to define a "word" and what characters to consider as word boundaries.

The use of symbolic names is just to clarify intent.



Wow, I'd never have known about the POSIX way of checking for Alpha.

This is actually the only one of the posted examples that totally works with numbers and other characters (of course, I didn't specify that in my original posting).

Thanks again very much!!!

- mike
 
Bartender
This is still somewhat dubious.  Why not simply have a method:


The difference is:

But it is up to OP to decide what to do with words like "BC1G5".
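
As a guess at the kind of difference in question (the method itself is not shown above, so this is an assumption): a check that compares a word with its upper-cased form treats "BC1G5" as upper case, while a letters-only regex does not. Both method names are hypothetical.

```java
import java.util.Locale;

class UpperCaseDefinitions {
    // Strict definition: only the letters A-Z are allowed.
    static boolean lettersOnly(String word) {
        return word.matches("[A-Z]+");
    }

    // Looser definition: the word contains nothing that upper-casing would change.
    static boolean noLowerCase(String word) {
        return !word.isEmpty() && word.equals(word.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(lettersOnly("BC1G5")); // prints false
        System.out.println(noLowerCase("BC1G5")); // prints true
    }
}
```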
 
Junilu Lacar
Sheriff
Mike,

After reviewing this entire thread, I just realized that you never actually posted your own solution code before others and I started giving you ours. If this is homework, then the cat's already out of the bag and there's no taking those solutions back. If you're going to submit any of these solutions as your homework, there's nothing we can do about it now. However, I would caution you that these are public forums and instructors are pretty good at finding plagiarized work.

We do have a standing policy about students doing their own homework.  Just sayin'...

If this isn't homework, then no harm, no foul.
 
Stephan van Hulst
Saloon Keeper

Junilu Lacar wrote:If you want to strictly define words as any sequence of [A-Za-z] and ignore any non-word chars like commas, semicolons, apostrophes, and other kinds of punctuation, you can do this:


I'm not actually quite sure how this is more clear than a simple regex that describes the exact pattern for an upper case word surrounded by word boundaries.
 
Mike London
Bartender

Junilu Lacar wrote:Mike,

After reviewing this entire thread, I just realized that you never actually posted your own solution code before others and I started giving you ours. If this is homework, then the cat's already out of the bag and there's no taking those solutions back. If you're going to submit any of these solutions as your homework, there's nothing we can do about it now. However, I would caution you that these are public forums and instructors are pretty good at finding plagiarized work.

We do have a standing policy about students doing their own homework.  Just sayin'...

If this isn't homework, then no harm, no foul.



Nope, it's not homework and I did include my Pattern statement in the very first posting.

Frankly, I didn't expect so many cool replies.

FWIW, here's my original code that worked fine:



Thanks again!

- mike
 
Junilu Lacar
Sheriff

Stephan van Hulst wrote:I'm not actually quite sure how this is more clear than a simple regex that describes the exact pattern for an upper case word surrounded by word boundaries.


I made no such claim nor was one intended to be implied. The solution you offered used a while loop. I just showed an alternative that didn't use an explicit while() loop but used a stream all the way instead. The context of the WORD_BOUNDARY symbolic constant was "define words as any sequence of [A-Za-z]"

OP should note that the "[^\\p{Alpha}]" expression is equivalent to  "[^A-Za-z]" so either one could be used. It's all up to you to decide what you think is more readable or consider as "simpler". I actually think that "[^A-Za-z]" is more straightforward but I wanted to point OP to the other possible character classes he might consider using.
 
Stephan van Hulst
Saloon Keeper
Why did you split the input string first if you're using the find() method anyway? Why must your word start with A-Z, but then the rest of the word is made up of any character as long as it's not a-z or a non-word character? Your word must also be at least two characters long. Why is the pattern multiline?
 
Stephan van Hulst
Saloon Keeper

Junilu Lacar wrote:I just showed an alternative that didn't use an explicit while() loop but used a stream all the way instead.


I see. As Rob reminded me in another thread, instead of using Arrays.stream() in combination with String.split(), you may want to use Pattern.splitAsStream() instead.
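
A short sketch of that suggestion: Pattern.splitAsStream() (available since Java 8) produces the stream directly, without the intermediate array. The boundary expression and the names are assumptions for illustration.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SplitAsStreamExample {
    // Any run of non-letters separates words.
    static final Pattern WORD_BOUNDARY = Pattern.compile("[^A-Za-z]+");

    static String upperCaseWords(String sentence) {
        return WORD_BOUNDARY.splitAsStream(sentence)
                .filter(word -> word.matches("[A-Z]+"))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(upperCaseWords("STRING two MORE")); // prints STRING MORE
    }
}
```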
 
Junilu Lacar
Sheriff
OP: good to know that this isn't homework. Thank you for clearing that up.

As you can see, there are as many ways to skin a cat as there are to paint a still life of a fruit bowl. Beauty is in the eye of the beholder, so take all these suggestions and decide for yourself how you'll take them, understand them, and use them to improve your own coding style.
 
Mike London
Bartender

Junilu Lacar wrote:OP: good to know that this isn't homework. Thank you for clearing that up.

As you can see, there are as many ways to skin a cat as there are to paint a still life of a fruit bowl. Beauty is in the eye of the beholder, so take all these suggestions and decide for yourself how you'll take them, understand them, and use them to improve your own coding style.



Yeah, I'm way past Grad School days now...

Thanks very much for all the great stuff here.

The Ranch Rocks!!!
 
Mike London
Bartender

Junilu Lacar wrote:

Stephan van Hulst wrote:I'm not actually quite sure how this is more clear than a simple regex that describes the exact pattern for an upper case word surrounded by word boundaries.


I made no such claim nor was one intended to be implied. The solution you offered used a while loop. I just showed an alternative that didn't use an explicit while() loop but used a stream all the way instead. The context of the WORD_BOUNDARY symbolic constant was "define words as any sequence of [A-Za-z]"

OP should note that the "[^\\p{Alpha}]" expression is equivalent to  "[^A-Za-z]" so either one could be used. It's all up to you to decide what you think is more readable or consider as "simpler". I actually think that "[^A-Za-z]" is more straightforward but I wanted to point OP to the other possible character classes he might consider using.



Phew! Agreed. The {Alpha} thing had me nervous since I'd never heard of it.

I'll update my code to use the equivalent RegEx as you noted.

Thanks again,

- mike
 
Darryl Burke
Bartender
What's wrong with sentence.replaceAll("(?:\\b)[a-z]*(?:\\b)", "")?  And if the double spaces left behind are an issue, that could be chained to .replace("  ", " ").
 
Stephan van Hulst
Saloon Keeper

Darryl Burke wrote:What's wrong with sentence.replaceAll("(?:\\b)[a-z]*(?:\\b)", "")?  And if the double spaces left behind are an issue, that could be chained to .replace("  ", " ").


That regular expression does not take into account that the sentence could consist of characters other than a-z, A-Z and spaces. Also, the second replace would have to be performed multiple times until there are no more changes.
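
To make both objections concrete, here is a hedged sketch of a removal-based variant. It swaps the chained replace("  ", " ") for a whitespace-collapsing replaceAll, because a single replace of two spaces turns a run of four spaces into two, not one. The names are hypothetical and the limitation on mixed-case words remains.

```java
class RemoveLowerCaseWords {
    static String upperCaseWords(String sentence) {
        return sentence
                .replaceAll("\\b[a-z]+\\b", "") // drop words made up only of a-z
                .replaceAll("\\s+", " ")        // collapse each whitespace run in one pass
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseWords("STRING two MORE")); // prints STRING MORE
        System.out.println(upperCaseWords("STRING Two MORE")); // prints STRING Two MORE
    }
}
```

The second call shows the remaining gap: "Two" is neither all lower case nor all upper case, so a removal-based approach leaves it in place.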
 
Carey Brown
Bartender
Here is my third attempt, which does match Mike's output.
Output:

 
Junilu Lacar
Sheriff

Carey Brown wrote:Output:


The output of the code that uses [^A-Za-z] (or [^\\p{Alpha}]) as the WORD_BOUNDARY with that as the input is this:

STRING MORE TOGETHER AB CD XYZ ABC A BB

I'm betting it's a copy-paste error in your test code that reports those results.

Also note the output from that code doesn't have all those extra spaces where non-AllCaps words were in the input string.
 
Junilu Lacar
Sheriff
The key to getting the "correct" output is really having a clear definition of what a "word" consists of.  If "ABC1" is to be considered a "word" then the WORD_BOUNDARY should be defined as "[^A-Za-z0-9]".
 
Stephan van Hulst
Saloon Keeper
Personally I don't really like the solutions where there is both a predicate to check if something is a valid word, and a boundary between two such words. It's redundant, and it introduces another opportunity for a bug (namely if the boundary does not complement the predicate).
 
Junilu Lacar
Sheriff
The mitigation to that concern is:
1. Test your code
2. Keep related things close together
3. Keep the scope of things as limited as possible

There's also the idea of keeping concerns separated. Is combining the two concerns of validating words and separating words really a good idea?
 
Junilu Lacar
Sheriff
Say somebody now turns around and says, "Well now I only want to see the words that are all lowercase." or "only the words that start with a capital letter."  The definition of "word" is still the same but now your filter condition is different.  Keeping the concerns separated allows you to change only that which really needs to be changed: the filter predicate, not the word boundary definition.
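
The separation argued for here can be sketched with the word boundary fixed in one place and the filter passed in as a parameter. All names below are invented for illustration; only the predicate changes between the three requests.

```java
import java.util.Arrays;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class WordFilter {
    // What separates words: any run of non-letters.
    static final String WORD_BOUNDARY = "[^A-Za-z]+";

    // Interchangeable filter conditions over the same definition of "word".
    static final Predicate<String> ALL_UPPER = word -> word.matches("[A-Z]+");
    static final Predicate<String> ALL_LOWER = word -> word.matches("[a-z]+");
    static final Predicate<String> CAPITALIZED = word -> word.matches("[A-Z].*");

    static String select(String sentence, Predicate<String> keep) {
        return Arrays.stream(sentence.split(WORD_BOUNDARY))
                .filter(keep)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String input = "STRING two More";
        System.out.println(select(input, ALL_UPPER));   // prints STRING
        System.out.println(select(input, ALL_LOWER));   // prints two
        System.out.println(select(input, CAPITALIZED)); // prints STRING More
    }
}
```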
 
Carey Brown
Bartender

Junilu Lacar wrote:Stephan's solution is a little cleaner, IMO. I still don't think you need a (complicated) regex though

or

The only difference between the two is whether you're trimming a leading space or a trailing space.

Simpler yet, you can use Collectors:


These were the three test cases I attributed to Junilu. Did I get this wrong?
 
Stephan van Hulst
Saloon Keeper
My point is that one of those concerns is not part of the requirement. The requirement is literally "find all words that match my definition of a word". That you split the input along some possibly incompatible word boundary first is not necessary.
 
Junilu Lacar
Sheriff

Carey Brown wrote:These were the three test cases I attributed to Junilu. Did I get this wrong?


All three versions are functionally equivalent. Look further down in the thread from there. There are a couple more snippets I think where the word boundary is not defined as " " but rather as [^A-Za-z].
 
Junilu Lacar
Sheriff

Stephan van Hulst wrote:My point is that one of those concerns is not part of the requirement. The requirement is literally "find all words that match my definition of a word". That you split the input along some possibly incompatible word boundary first is not necessary.


I disagree with your interpretation of the requirements. A word, as I understand it, can be any series of alphabetic characters. The words that should be sent to the output should only be those that are all uppercase.
 
Stephan van Hulst
Saloon Keeper
Okay, but that still doesn't necessitate splitting the input first.

We're always going on about using the right tool for the job, and in this case a pattern matcher is absolutely perfect. I don't know why we're writing more brittle code using streams.

[edit]

I misread your interpretation. Disregard my first sentence.
 
Carey Brown
Bartender

Junilu Lacar wrote:

Carey Brown wrote:These were the three test cases I attributed to Junilu. Did I get this wrong?


All three versions are functionally equivalent. Look further down in the thread from there. There are a couple more snippets I think where the word boundary is not defined as " " but rather as [^A-Za-z].


I modified the method to this:
And now the output is:

I'm not happy about having to use two regular expressions in my 3rd attempt. On the other hand, both of those expressions are now very simple.
 
Junilu Lacar
Sheriff
@Stephan: I don't see it that way, sorry. The pieces that work together are relatively close together in the source, they aren't that difficult to understand, and are streams really that brittle? I don't like mixing concerns and to me those two are different concerns. If you disagree, then we'll just have to leave it at that. No use going round in circles.  Besides, it's not like it's an earth-shaking problem. It's parsing out words and printing them out.  ¯\_(ツ)_/¯
 