
Regex

Richard Teston
Ranch Hand

Joined: Feb 12, 2002
Posts: 89
Can anyone give me a regular expression pattern for finding words followed by a single period, like "end." but not "end.."? I tried "\\b[a-zA-Z]+\\.?" and "[a-zA-Z]+\\.?", but words with two periods are still counted among the matches. Does anybody here have any suggestions, or can anyone give me the right pattern? Thanks.


The Code is the Programmer
Phil Chuang
Ranch Hand

Joined: Feb 15, 2003
Posts: 251
You could use something like \\b[A-Za-z]+\\.+ which will get
asdf.
asdf..
asdf...
etc. (literally!)
and then just discard the ones with multiple periods?
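A rough sketch of that match-then-discard idea in Java follows. The class name, method name and test string are made up for illustration; only the pattern comes from the post above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SinglePeriodFilter {
    // The pattern from the post: a word followed by one or more periods.
    private static final Pattern WORD_WITH_PERIODS =
            Pattern.compile("\\b[A-Za-z]+\\.+");

    // Returns only the matches that end in exactly one period.
    public static List<String> wordsWithOnePeriod(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD_WITH_PERIODS.matcher(text);
        while (m.find()) {
            String match = m.group();
            // Discard "asdf..", "asdf..." and so on.
            if (!match.endsWith("..")) {
                result.add(match);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [fox.]
        System.out.println(wordsWithOnePeriod("The quick brown fox. jumps.."));
    }
}
```

Words with no period at all are skipped here, since \\.+ requires at least one period.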
[ October 10, 2003: Message edited by: Phil Chuang ]
Richard Teston
Ranch Hand

Joined: Feb 12, 2002
Posts: 89
..but logically the pattern "\\b[a-zA-Z]+\\.?\\b" should match all words "with one period" and those words "without a period", because everyone knows that the metacharacter "?" marks an optional item, which means the pattern above may match one period or nothing at all. Does this mean that the Java regular expression engine has a bug? Or maybe I have the wrong pattern? Please tell me which it is, because when I tried the pattern above with an underscore "_" instead (i.e. "\\b[a-zA-Z]+_?\\b"), the matching worked fine. Please enlighten me. Thanks.
Adrian Yan
Ranch Hand

Joined: Oct 02, 2000
Posts: 688
Hmm... you can't use a quantifier in this case, because it doesn't check the character that follows once your match ends.
Here is the one I tested. I have to apologize, because I ran this in Tcl.
[a-zA-Z]+([.])([^.])
Hope this helps.
Richard Teston
Ranch Hand

Joined: Feb 12, 2002
Posts: 89
Thanks for the pattern, Adrian, but I'm sorry, that doesn't work either; in fact it didn't find any word in the test string "The quick brown fox. jumps..". This is weird... I tried the pattern "\\." and it matches all the periods in my string. Does anybody have an idea?
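The thread doesn't show how the pattern was being applied, so the following is only a sketch under assumptions: the class name, helper method and second test string are invented for illustration. It runs Adrian's pattern through java.util.regex with Matcher.find(). Two things stand out: a successful match also swallows the character after the period, and Matcher.matches() fails because it needs the pattern to cover the entire input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FollowingCharDemo {
    // Adrian's pattern: letters, one period, then one character that is not a period.
    private static final Pattern P = Pattern.compile("[a-zA-Z]+([.])([^.])");

    // Collects every substring that find() locates in the text.
    public static List<String> findAll(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // True only if the pattern covers the whole input.
    public static boolean matchesWhole(String text) {
        return P.matcher(text).matches();
    }

    public static void main(String[] args) {
        String test = "The quick brown fox. jumps..";
        System.out.println(findAll(test));       // prints [fox. ] (note the trailing space)
        System.out.println(matchesWhole(test));  // prints false
        // A period at the very end of the input is missed: [^.] has nothing to consume.
        System.out.println(findAll("The end.")); // prints []
    }
}
```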
Phil Chuang
Ranch Hand

Joined: Feb 15, 2003
Posts: 251
And you'd think this would do it, but I can't get it to work:
"\\b[A-Za-z]+[.][^.]"
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Well there's something icky like
\b[A-Za-z]+\b(?:\s|$|\.(?:[^.]|$))
(Word folloed by space or end of line or [a . followed something other than ., or endo f line])
You can use negative lookahead to simplify:
\b[A-Za-z]+\b(?!\.\.)\.?
(Word not followed by .. but maybe followed by .)
Or also use a possessive quantifier (available in java.util.regex, but not most other regex libraries) to make it easier to avoid partial words:
\b[A-Za-z]++(?!\.\.)\.?
(Save as previous, really, but maybe a little faster)
All the above are regexes, not Java literals, so double each \ to make a literal.
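As a sketch, the two lookahead versions could be written as Java literals like this. The class name, helper method and test string are assumptions for illustration; only the two regexes come from the post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // \b[A-Za-z]+\b(?!\.\.)\.?  with each backslash doubled for the Java literal.
    public static final Pattern LOOKAHEAD =
            Pattern.compile("\\b[A-Za-z]+\\b(?!\\.\\.)\\.?");

    // \b[A-Za-z]++(?!\.\.)\.?  using a possessive quantifier instead of the second \b.
    public static final Pattern POSSESSIVE =
            Pattern.compile("\\b[A-Za-z]++(?!\\.\\.)\\.?");

    // Collects every substring that find() locates in the text.
    public static List<String> findAll(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String test = "The quick brown fox. jumps..";
        System.out.println(findAll(LOOKAHEAD, test));  // prints [The, quick, brown, fox.]
        System.out.println(findAll(POSSESSIVE, test)); // prints [The, quick, brown, fox.]
    }
}
```

Both versions keep words with no period and words with exactly one, and skip "jumps.." entirely rather than returning a partial match.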


"I'm not back." - Bill Harding, Twister
Mani Ram
Ranch Hand

Joined: Mar 11, 2002
Posts: 1140
Originally posted by Jim Yingst:

folloed, endo f line, Save as previous

Jim, are you in a Friday evening hurry?
[ October 10, 2003: Message edited by: Mani Ram ]

Mani
Quaerendo Invenietis
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
..but logically the pattern "\\b[a-zA-Z]+\\.?\\b" should match all words "with one period"
No, because after the period it looks for a word boundary \b, and it doesn't (usually) find one, because the word already ended before the '.'.
and those words "without a period", because everyone knows that the metacharacter "?" marks an optional item, which means the pattern above may match one period or nothing at all.
It also matches words with two periods, because it can simply ignore the period (the ? means it's not required to take it) and it's still at the word boundary, so the final \b matches.
Does this mean that the Java regular expression engine has a bug?
Dunno if there are other bugs, but this isn't one - it's a problem with your pattern.
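A small sketch of the behaviour described above, using the original pattern and the underscore variant from the earlier post. The class name, helper and the second test string are assumptions for illustration. One likely reason the underscore version appeared to work is that '_' counts as a word character, so there is a word boundary after "fox_" but not after "fox.".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPitfallDemo {
    // Collects every substring that find() locates in the text.
    public static List<String> findAll(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The optional period is given up so that the final \b can match,
        // which is why no period ever shows up in the results.
        // prints [The, quick, brown, fox, jumps]
        System.out.println(findAll("\\b[a-zA-Z]+\\.?\\b", "The quick brown fox. jumps.."));

        // '_' is a word character, so \b matches between '_' and the space.
        // prints [fox_]
        System.out.println(findAll("\\b[a-zA-Z]+_?\\b", "fox_ jumps__"));
    }
}
```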
The classic reference for learning about regexes is Mastering Regular Expressions by Jeffrey Friedl. Highly recommended. Also, Max Habibi (bartender here at the ranch) has the upcoming Real World Regular Expressions with Java 1.4, which will be worth checking out, focusing more specifically on Java's java.util.regex package. Also useful: if you use Eclipse, try the RegEx Tester plug-in. Or for more traditional regexes (no possessive quantifiers) you can use the Regex Coach. There are probably others; these are the ones I've tried.
[ October 10, 2003: Message edited by: Jim Yingst ]
Richard Teston
Ranch Hand

Joined: Feb 12, 2002
Posts: 89
Thanks for the explanation, Jim. I've been reading Mastering Regular Expressions by Jeffrey Friedl, but I haven't reached the part of the book about how DFA and NFA engines evaluate a regex. I thought my pattern was right, because you can obviously tell what this pattern --> \\b[a-zA-Z]+\\.?\\b really wants, but the regex engine does not interpret it that way. I really must study how regex engines evaluate regular expression patterns, though this of course depends on the engine. Anyway, does anyone know what kind of engine Java's regex package uses? Is it NFA, DFA, DFA (POSIX)...?
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
It's NFA, as are most other serious modern regex packages. But the issues with your pattern here have little to do with that, and more to do with what a word boundary is. In a string like "foo. " there's a word boundary between 'o' and '.', but not between '.' and ' '. So the only way your regex matches something is by not matching the "\\.?" part (since it's got a ?, this is OK). It can do this even if there is a '.' next in the target string, or even "..". The matcher tries to match your entire regex, and is willing and able to backtrack from an optional match (?) if it needs to, in order to be able to match the final \\b.
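To see where the word boundaries fall, here is a minimal sketch that lists every index at which \b matches. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryPositions {
    private static final Pattern BOUNDARY = Pattern.compile("\\b");

    // Returns each index in the text where the zero-width \b assertion matches.
    public static List<Integer> boundaries(String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = BOUNDARY.matcher(text);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Index 0 is before 'f', index 3 is between 'o' and '.'.
        // Nothing is reported at index 4, between '.' and ' '.
        System.out.println(boundaries("foo. ")); // prints [0, 3]
    }
}
```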
[ October 11, 2003: Message edited by: Jim Yingst ]
Richard Teston
Ranch Hand

Joined: Feb 12, 2002
Posts: 89
Thanks again, Jim, for the enlightenment. I was just curious about the Java regex engine, and you are right, it's not the issue with my pattern.
 