I have a seemingly simple regex problem that has me momentarily stumped. While I'm waiting for the aspirin to kick in, I was hoping one of y'all (that's JavaRanch-speak) might want to take a crack at it.
I'm trying to write a regex that matches a String according to the following rules:
1. Has one or more of the characters: a-zA-Z0-9
2. Can contain a dash, but NOT as the first or last character in the String.
So examples that would match: abc 1 1ab4 a-bc ab-c a-----d
Examples that would NOT match: -abc abc- a--b-11-
(The dashes cannot appear in the first or last position.)
I cannot depend upon the beginning and end of line markers (^ or $) because I'm planning on defining this regex as a constant and using this constant as part of a larger regex.
So, here's what I've got:
The problem: this regex requires that the matching String be at least two characters long. My first thought was to just put a question mark after the last character class, but then it would match Strings like "abc-" which end in a dash. Not acceptable.
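The regex from the post above did not survive in this copy of the thread, so the following is only a sketch of the problem as described. The pattern `ATTEMPT` is an assumed shape (a letter or digit, then any mix of letters, digits and dashes, then a letter or digit), and the class and method names are hypothetical. It shows both symptoms: the two-character minimum, and what goes wrong when the last character class is made optional.

```java
import java.util.regex.Pattern;

public class TwoCharProblem {
    // Assumed shape of the first attempt; the original snippet is not in the thread.
    static final String ATTEMPT = "[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]";

    // The same attempt with a question mark after the last character class.
    static final String ATTEMPT_OPTIONAL_LAST = "[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]?";

    // True only if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches(ATTEMPT, "a-bc"));               // true
        System.out.println(matches(ATTEMPT, "1"));                  // false: needs two characters
        System.out.println(matches(ATTEMPT_OPTIONAL_LAST, "1"));    // true
        System.out.println(matches(ATTEMPT_OPTIONAL_LAST, "abc-")); // true: trailing dash slips through
    }
}
```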
One of those days, eh? This is a case where playing with lookbehinds/lookaheads and other regex gadgets might be fun, but there's a simple solution that works and should be understandable by anyone familiar with basic regex syntax. I hope UBB code doesn't try to interpret this:
When you have a regex that handles most cases and misses a few, it's always an option to fill in the missing cases as an entirely separate regex. It gets a little more funky when you have a regex that matches too many cases.
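The snippet that went with this post is missing here, so what follows is a reconstruction of the idea rather than the poster's exact regex: cover the single-character case with its own alternative and join it to the longer case with `|`. The names `TOKEN` and `PAIR`, and the colon-separated "larger regex", are made up for illustration. Because the constant is meant to be embedded in a bigger pattern, the sketch wraps the alternation in a non-capturing group so the `|` cannot leak into whatever surrounds it.

```java
import java.util.regex.Pattern;

public class SeparateCaseRegex {
    static final String ALNUM = "[a-zA-Z0-9]";

    // Either a single letter/digit, or two-plus characters that start and end
    // with a letter/digit. The (?: ... ) keeps the alternation self-contained.
    static final String TOKEN =
            "(?:" + ALNUM + "|" + ALNUM + "[a-zA-Z0-9-]*" + ALNUM + ")";

    static final Pattern TOKEN_PATTERN = Pattern.compile(TOKEN);

    // Hypothetical larger regex built from the constant: two tokens joined by a colon.
    static final Pattern PAIR = Pattern.compile(TOKEN + ":" + TOKEN);

    static boolean isToken(String s) {
        return TOKEN_PATTERN.matcher(s).matches();
    }

    static boolean isPair(String s) {
        return PAIR.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc", "1", "1ab4", "a-bc", "ab-c", "a-----d",
                                      "-abc", "abc-", "a--b-11-"}) {
            System.out.println(s + " -> " + isToken(s));
        }
        System.out.println(isPair("a-b:1"));  // true
        System.out.println(isPair("a-:b"));   // false
    }
}
```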
I prefer David's answer, because it's less complex. It's slightly less efficient, but I'm guessing it would be hard to measure the difference.
The only adjustment I would make, and this is stylistic, is the following
HTH, M [ June 08, 2004: Message edited by: Max Habibi ]
I suspect efficiency here also depends on the input - are successes or failures going to be more common? Are single-character inputs common, or rare?
There's also a difference in behavior between the two solutions. Max's original solution will not allow 123-45-67, while David's will. It's not 100% clear which of these is intended according to the instructions (which require that we allow "a dash", but say nothing about multiple dashes, except there's an example that shows multiple consecutive dashes). Maybe it doesn't matter. But my guess is David's solution is correct. I'd probably formulate it as
I think the first is probably most understandable to people now, but the latter forms offer improvements I'd like to see more commonly used. Note that the possessive forms ++ and *+ aren't really necessary, and may be changed to + and * respectively - but I think in this case they lead to the fastest solution possible, eliminating unnecessary backtracking. Which also helps readability, IMO, assuming the reader is familiar with possessive forms. [ June 08, 2004: Message edited by: Jim Yingst ]
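The formulations in this post are missing from this copy of the thread. As a hedged illustration only, one way to write "runs of letters/digits joined by runs of dashes" is shown below in a plain greedy form and a possessive form; these patterns are assumptions, not necessarily the ones originally posted. For whole-string matching with matches() the two agree on every example in the thread; the possessive version just refuses to backtrack.

```java
import java.util.regex.Pattern;

public class PossessiveRuns {
    // Greedy form: a run of letters/digits, then zero or more (dashes + run) groups.
    static final Pattern GREEDY =
            Pattern.compile("[a-zA-Z0-9]+(?:-+[a-zA-Z0-9]+)*");

    // Same structure with possessive quantifiers (++ and *+), which never give back
    // what they have consumed.
    static final Pattern POSSESSIVE =
            Pattern.compile("[a-zA-Z0-9]++(?:-++[a-zA-Z0-9]++)*+");

    static boolean greedy(String s) {
        return GREEDY.matcher(s).matches();
    }

    static boolean possessive(String s) {
        return POSSESSIVE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"abc", "1", "a-----d", "123-45-67", "-abc", "abc-", "a--b-11-"};
        for (String s : samples) {
            System.out.println(s + " -> greedy=" + greedy(s) + ", possessive=" + possessive(s));
        }
    }
}
```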
How is the regex going to be used? If you're using the matches() method (that is, the value to be matched makes up the entire target string), then either David's or Jim's regex will work. But if you use Matcher#find() to pluck the values out of a longer string, David's regex will stop after matching a single letter or digit. Also, given a target string like "test -123-456- test", both regexes will ignore the leading and trailing hyphens and return 123-456. If you don't want that to happen, you can use negative lookbehind to prevent it:
Note that the possessive quantifier is not there just for efficiency's sake; if the last thing the regex engine sees is a hyphen, we don't want it to back off and match something shorter, we want it to fail. You could also use negative lookahead for the trailing hyphen:
Here, again, if we weren't using possessive quantifiers, the lookahead would have to do more work:
Again, all this applies only if you're using find() rather than matches().
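The regexes from this post are also missing, so here is a sketch of the behavior it describes, using assumed patterns and hypothetical names throughout. `ALTERNATION` puts the single-character alternative first (consistent with the remark that find() stops after one letter or digit), `RUNS` is an assumed possessive formulation, `GUARDED` adds a negative lookbehind and a negative lookahead, and `GUARDED_GREEDY` is the same thing without possessive quantifiers, to show the "back off and match something shorter" problem.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindVersusMatches {
    // Single-character alternative first: find() is satisfied after one character.
    static final Pattern ALTERNATION =
            Pattern.compile("[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]");

    // Assumed possessive formulation: runs of letters/digits joined by dashes.
    static final String RUNS_REGEX = "[a-zA-Z0-9]++(?:-++[a-zA-Z0-9]++)*+";
    static final Pattern RUNS = Pattern.compile(RUNS_REGEX);

    // Reject a match that is preceded by a letter, digit or dash, or followed by a dash.
    static final Pattern GUARDED =
            Pattern.compile("(?<![a-zA-Z0-9-])" + RUNS_REGEX + "(?!-)");

    // Same guards, ordinary greedy quantifiers: the engine may back off to a shorter match.
    static final Pattern GUARDED_GREEDY =
            Pattern.compile("(?<![a-zA-Z0-9-])[a-zA-Z0-9]+(?:-+[a-zA-Z0-9]+)*(?!-)");

    // Collect every match that find() returns, in order.
    static List<String> findAll(Pattern p, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "test -123-456- test";
        System.out.println(findAll(ALTERNATION, "123-456")); // [1, 2, 3, 4, 5, 6]
        System.out.println(findAll(RUNS, text));             // [test, 123-456, test]
        System.out.println(findAll(GUARDED, text));          // [test, test]
        System.out.println(findAll(GUARDED, "123-456-"));        // []
        System.out.println(findAll(GUARDED_GREEDY, "123-456-")); // [123-45]
    }
}
```

With greedy quantifiers the lookahead would need to reject a following letter or digit as well as a dash to avoid that last result, which is presumably the extra work the post refers to.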