Seemingly simple regex making my head hurt

sever oon
Ranch Hand

Joined: Feb 08, 2004
Posts: 268
Hey all,

I have a seemingly simple regex problem that has me momentarily stumped. While I'm waiting for the aspirin to kick in, I was hoping one of y'all (that's JavaRanch-speak) might want to take a crack at it.

I'm trying to write a regex that matches a String according to the following rules:

1. Has one or more of the characters: a-zA-Z0-9
2. Can contain a dash, but NOT as the first or last character in the String.

So examples that would match:
abc
1
1ab4
a-bc
ab-c
a-----d

Examples that would NOT match:
-abc
abc-
a--b-11-

(The dashes cannot appear in the first or last position.)

I cannot depend upon the beginning and end of line markers (^ or $) because I'm planning on defining this regex as a constant and using this constant as part of a larger regex.

So, here's what I've got:



The problem: this regex requires that the matching String be at least two characters long. My first thought was to just put a question mark after the last character class, but then it would match Strings like "abc-" which end in a dash. Not acceptable.

Thanks all!
sev
Max Habibi
town drunk
( and author)
Sheriff

Joined: Jun 27, 2002
Posts: 4118

String regex = "(?=[^-])[\\p{Alnum}]*-*[\\p{Alnum}]+";


HTH,
M


Java Regular Expressions
David Weitzman
Ranch Hand

Joined: Jul 27, 2001
Posts: 1365
One of those days, eh? This is a case where playing with lookbehinds/lookaheads and other regex gadgets might be fun, but there's a simple solution that works and should be understandable by anyone familiar with basic regex syntax. I hope UBB code doesn't try to interpret this:

[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]

When you have a regex that handles most cases and misses a few, it's always an option to fill in the missing cases with an entirely separate regex. It gets a little more funky when you have a regex that matches too many cases.
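For anyone who wants to try this, here is a minimal sketch that runs the alternation above against the examples from the original question. The class name, the isToken helper, and the "id=...;" wrapper pattern are made up for illustration. One thing to watch when using the alternation as a constant inside a larger regex: wrap it in a non-capturing group, otherwise the | splits the whole larger expression in two.

```java
import java.util.regex.Pattern;

public class DashTokenDemo {
    // The alternation posted above, unchanged.
    static final String TOKEN = "[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]";

    // Hypothetical larger pattern: "id=" + token + ";".
    // The (?: ... ) group keeps the | local to the token.
    static final Pattern LARGER = Pattern.compile("id=(?:" + TOKEN + ");");

    // True when the whole string is a valid token.
    static boolean isToken(String s) {
        return Pattern.matches(TOKEN, s);
    }

    public static void main(String[] args) {
        String[] samples = {
            "abc", "1", "1ab4", "a-bc", "ab-c", "a-----d", // should match
            "-abc", "abc-", "a--b-11-"                     // should not match
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isToken(s));
        }
        System.out.println(LARGER.matcher("id=a-bc;").matches()); // true
        System.out.println(LARGER.matcher("id=abc-;").matches()); // false
    }
}
```

Note that matches() needs the whole input to match, so the engine falls through to the second alternative whenever the single-character one is not enough.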
Max Habibi
town drunk
( and author)
Sheriff

Joined: Jun 27, 2002
Posts: 4118
I prefer David's answer, because it's less complex. It's slightly less efficient, but I'm guessing it would be hard to measure the difference.

The only adjustment I would make, and this is stylistic, is the following



HTH,
M
[ June 08, 2004: Message edited by: Max Habibi ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Gotta watch those pesky backslashes, eh Max?

I suspect efficiency here also depends on the input - are successes or failures going to be more common? Are single-character inputs common, or rare?

There's also a difference in behavior between the two solutions. Max's original solution will not allow 123-45-67, while David's will. It's not 100% clear which of these is intended according to the instructions (which require that we allow "a dash" but say nothing about multiple dashes, although one example shows multiple consecutive dashes). Maybe it doesn't matter. But my guess is David's solution is correct. I'd probably formulate it as

"[a-zA-Z0-9]+(\\-+[a-zA-Z0-9]+)*"

or

"[a-zA-Z0-9]++(?:\\-++[a-zA-Z0-9]++)*+"

or

"\\p{Alnum}++(?:\\-++\\p{Alnum}++)*+"

I think the first is probably most understandable to people now, but the latter forms offer improvements I'd like to see more commonly used. Note that the possessive forms ++ and *+ aren't strictly necessary, and may be changed to + and * respectively - but I think in this case they lead to the fastest solution possible, eliminating unnecessary backtracking. That also helps readability, IMO, assuming the reader is familiar with possessive forms.
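A quick way to check the behavioral difference is to run all of the patterns with matches() on a few inputs. This is only a sketch: the class name and the full helper are invented for illustration, and the pattern strings are copied from the posts above.

```java
import java.util.regex.Pattern;

public class FormulationCompare {
    // Max's original pattern from earlier in the thread.
    static final Pattern MAX =
        Pattern.compile("(?=[^-])[\\p{Alnum}]*-*[\\p{Alnum}]+");
    // The three formulations listed above.
    static final Pattern GREEDY =
        Pattern.compile("[a-zA-Z0-9]+(\\-+[a-zA-Z0-9]+)*");
    static final Pattern POSSESSIVE =
        Pattern.compile("[a-zA-Z0-9]++(?:\\-++[a-zA-Z0-9]++)*+");
    static final Pattern POSIX =
        Pattern.compile("\\p{Alnum}++(?:\\-++\\p{Alnum}++)*+");

    // True when the pattern matches the entire input.
    static boolean full(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"a-----d", "123-45-67", "abc-", "-abc"};
        for (String s : inputs) {
            System.out.println(s
                + " max=" + full(MAX, s)
                + " greedy=" + full(GREEDY, s)
                + " possessive=" + full(POSSESSIVE, s)
                + " posix=" + full(POSIX, s));
        }
        // "123-45-67" is the only input here where MAX (false)
        // disagrees with the other three (true).
    }
}
```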
[ June 08, 2004: Message edited by: Jim Yingst ]

"I'm not back." - Bill Harding, Twister
Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
How is the regex going to be used? If you're using the matches() method (that is, the value to be matched makes up the entire target string), then either David's or Jim's regex will work. But if you use Matcher#find() to pluck the values out of a longer string, David's regex will stop after matching a single letter or digit. Also, given a target string like "test -123-456- test", both regexes will ignore the leading and trailing hyphens and return 123-456. If you don't want that to happen, you can use negative lookbehind to prevent it:

Note that the possessive quantifier is not there just for efficiency's sake; if the last thing the regex engine sees is a hyphen, we don't want it to back off and match something shorter, we want it to fail. You could also use negative lookahead for the trailing hyphen:

Here, again, if we weren't using possessive quantifiers, the lookahead would have to do more work:

Again, all this applies only if you're using find() rather than matches().
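The snippets that went with the post above are missing from this copy of the thread. The following is not a recovery of them, just a sketch of the find() behavior being described. The GUARDED and BACKOFF patterns are my own assumptions: a lookbehind that refuses to start in the middle of a token or right after a hyphen, plus a lookahead that refuses a trailing hyphen. The class and helper names are hypothetical too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindDemo {
    // David's and Jim's patterns, copied from earlier in the thread.
    static final String DAVID = "[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]";
    static final String JIM = "[a-zA-Z0-9]++(?:\\-++[a-zA-Z0-9]++)*+";

    // Assumed guard (not from the thread): no token character or hyphen
    // directly before the match, and no hyphen directly after it.
    static final String GUARDED = "(?<![a-zA-Z0-9-])" + JIM + "(?!-)";

    // Same guard with ordinary greedy quantifiers, to show the back-off.
    static final String BACKOFF =
        "(?<![a-zA-Z0-9-])[a-zA-Z0-9]+(?:\\-+[a-zA-Z0-9]+)*(?!-)";

    // Collects every substring that find() returns for the given regex.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String target = "test -123-456- test";
        System.out.println(findAll(DAVID, "ab-c"));   // [a, b, c]
        System.out.println(findAll(JIM, target));     // [test, 123-456, test]
        System.out.println(findAll(GUARDED, target)); // [test, test]
        System.out.println(findAll(GUARDED, "abc-")); // []
        System.out.println(findAll(BACKOFF, "abc-")); // [ab]
    }
}
```

The last two lines show why the possessive quantifiers matter here: with plain greedy quantifiers the engine gives back the c, the lookahead then sees a c instead of a hyphen, and the shorter ab gets through.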
 