
Regular expression to find words in a String

Nicholas Jordan
Ranch Hand

Joined: Sep 17, 2006
Posts: 1282
My mind is saturated; I am deep in a total rewrite. I need to build a few lines of java.util.regex to walk through a large buffer and pick up words, dropping the 's on plurals. This gets involved, and not all sources are consistent, so I seek suggestions and comments.

Jim Yingst

Joined: Jan 30, 2000
Posts: 18671
Are you talking about plurals, or possessives? Or both? Possessives have apostrophes, while plurals do not.

If you're talking about plurals, I think it's going to be nearly impossible to do this with a regex that is accurate in all cases, as there are too many special cases. Will you be OK with an expression that just often gets it right?

If you need to eliminate possessives but not plurals, that's probably more feasible, as there are fewer special cases there.
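
A minimal sketch of the possessive-only case, assuming a "word" is a run of Unicode letters and that only a trailing 's (straight or curly apostrophe) should be dropped. The class name WordScanner and the method words are made up for illustration, and plurals are deliberately left alone:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordScanner {
    // Group 1 is a run of Unicode letters. The optional non-capturing tail
    // swallows a possessive 's (straight or curly apostrophe) so it never
    // shows up in the captured word. This is an assumed definition of "word".
    private static final Pattern WORD =
            Pattern.compile("(\\p{L}+)(?:['\u2019]s\\b)?");

    static List<String> words(CharSequence buffer) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(buffer);
        while (m.find()) {
            out.add(m.group(1)); // the word without any possessive suffix
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [The, dog, bone, and, the, cats, toys]
        System.out.println(words("The dog's bone and the cats' toys"));
    }
}
```

This is the "often gets it right" kind of expression: contractions such as "don't" come out as two pieces ("don" and "t"), and "it's" is treated the same as a possessive.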

"I'm not back." - Bill Harding, Twister
Nicholas Jordan
Ranch Hand

Joined: Sep 17, 2006
Posts: 1282
I want to zip right past any issues that slow the search, using the 80/20 rule. For example, just now I was looking at several pages written by degreed authors writing in HTML 1.1 to have clean sample text to test on. It occurred to me that the problem of discerning what is inside a pair of <> (along with any punctuation, pictures + graphics and control characters that would need to be discarded) is the immediate next phase of regex building. Right now, for purposes of this post, we want a Bottle Rocket driven blind watchmaker on skids hot bonded to polytetrafluoroethylene pads running on floating rails covered with polished ice == skip anything that does not fit fast into the definition of a word, leaving off possessives, plurals, special cases and any permutation thereof I did not think of.
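
One rough way to skip whatever sits inside a pair of <> in the same pass is an alternation that matches a tag and throws it away, or matches a run of letters and keeps it. This is only a sketch under assumptions: the names MarkupSkipper and words are invented for the example, a tag is taken to be anything from < to the next >, and comments, scripts and entities such as &amp; are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MarkupSkipper {
    // Either a <...> tag (matched, then discarded) or a run of Unicode
    // letters (captured in group 1). Punctuation, digits and control
    // characters match neither branch, so find() simply steps over them.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|(\\p{L}+)");

    static List<String> words(CharSequence buffer) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(buffer);
        while (m.find()) {
            String word = m.group(1); // null when the match was a tag
            if (word != null) {
                out.add(word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Hello, world, cafés]
        System.out.println(words("<p class=\"x\">Hello, <b>world</b>! 42 caf\u00e9s</p>"));
    }
}
```

Because the tag branch comes first, attribute names and values inside a tag never reach the word branch, and the buffer is walked once with no backtracking over text that has already been skipped.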

This is the IV of a feedback loop to populate the registers. Human intervention can occur after we come up with something to look at. I have a 700-page book on Swing at hand, along with fifteen browser windows open, and am working on some ideas for a 2-dimensional language to take the feed from the operator and this phase. I just want to populate the registers with something that looks like a word to the normal human mind, Unicode or otherwise.
[ February 26, 2008: Message edited by: Nicholas Jordan ]