jdk1.4 regex help

Takeshi Toyohara
Greenhorn

Joined: Feb 03, 2002
Posts: 21
Howdy,
I'm just playing around with the new regex features and have a question or two.
I'm starting off by trying to create a nice, robust regex pattern to validate email addresses.
For now I'm only working on the part up through the @ sign.
Here is what I have so far:

or the equivalent

It seems to work fine, but it gets REALLY bogged down and slow when trying to process something like the following. Is there any way to optimize this, or am I SOL?
Thanks!
Steve Deadsea
Ranch Hand

Joined: Dec 03, 2001
Posts: 125
I don't have any experience with the 1.4 regexps, but I have been using the ORO matcher for a long time. I also have a regular expression that I use to validate email addresses.
It might surprise you that the following is an email address:
"steve deadsea"@mydomain.tld
(note the quotes and the space)
The two relevant RFCs are:
format for subdomain: RFC 1035, pg 7
format for email address: RFC 821, pg 29
Here is the regular expression that I eventually came up with. Note the (?: instead of ( so that the groups are non-capturing (the matcher does not have to record what each parenthesis matched), which can speed up performance a lot:
private static final String LETTER = "[a-zA-Z]";
private static final String DIGIT = "[0-9]";
private static final String LETTER_DIGIT = "[0-9a-zA-Z]";
private static final String LETTER_DIGIT_HYPHEN = "(?:[0-9a-zA-Z-])";
private static final String QUOTEDSTRING = "(?:[\"\\\"](?:[^\\\"]|(?:[\\\\][\\\"]))*[\\\"])";
private static final String ATOM = "(?:[\\!\\#-\\\'\\*\\+\\-\\/-9\\=\\?A-Z\\^-\\~]+)";
private static final String SUBDOMAIN = "(?:" + LETTER + "(?:" + LETTER_DIGIT_HYPHEN + "*" + LETTER_DIGIT + ")?)";
private static final String WORD = "(?:" + ATOM + "|" + QUOTEDSTRING + ")";
private static final String DOMAIN = "(?:" + SUBDOMAIN + "(?:[\\.]" + SUBDOMAIN + ")+)";
private static final String LOCALPART = "(?:" + WORD + "(?:[\\.]" + WORD + ")*)";
private static final String EMAIL = "(?:" + LOCALPART + "[\\@]" + DOMAIN + ")";
protected static final String EMAIL_ADDRESS = "^" + EMAIL + "$";
protected static final String EMAIL_ADDRESS_OPTIONAL = "^(?:" + EMAIL + "?)$";
Again, I have not tested this with the JDK regex package, but it should work as is or with a few minor modifications.
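For anyone who wants to try the constants above with java.util.regex, here is a minimal sketch. The class name EmailPatternDemo and the isValid helper are made up for illustration, and the unused DIGIT constant and the optional variant are left out; the pattern strings themselves are copied unchanged from the post.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper class; only the pattern strings come from the post above.
public class EmailPatternDemo {
    private static final String LETTER = "[a-zA-Z]";
    private static final String LETTER_DIGIT = "[0-9a-zA-Z]";
    private static final String LETTER_DIGIT_HYPHEN = "(?:[0-9a-zA-Z-])";
    private static final String QUOTEDSTRING = "(?:[\"\\\"](?:[^\\\"]|(?:[\\\\][\\\"]))*[\\\"])";
    private static final String ATOM = "(?:[\\!\\#-\\\'\\*\\+\\-\\/-9\\=\\?A-Z\\^-\\~]+)";
    private static final String SUBDOMAIN = "(?:" + LETTER + "(?:" + LETTER_DIGIT_HYPHEN + "*" + LETTER_DIGIT + ")?)";
    private static final String WORD = "(?:" + ATOM + "|" + QUOTEDSTRING + ")";
    private static final String DOMAIN = "(?:" + SUBDOMAIN + "(?:[\\.]" + SUBDOMAIN + ")+)";
    private static final String LOCALPART = "(?:" + WORD + "(?:[\\.]" + WORD + ")*)";
    private static final String EMAIL = "(?:" + LOCALPART + "[\\@]" + DOMAIN + ")";
    private static final String EMAIL_ADDRESS = "^" + EMAIL + "$";

    // Compile once and reuse; Pattern objects are immutable and thread-safe.
    private static final Pattern EMAIL_PATTERN = Pattern.compile(EMAIL_ADDRESS);

    // Returns true when the whole input matches the email pattern.
    static boolean isValid(String candidate) {
        return EMAIL_PATTERN.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "user@example.com",
            "\"steve deadsea\"@mydomain.tld",
            "user@localhost",       // no dot in the domain
            ".user@example.com"     // local part starts with a dot
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call, which matters if you validate many addresses.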
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Steve's reply looks good to me. Some other thoughts in response to the original post:
The two patterns given are not quite equivalent to each other. The \w character class matches a _ as part of its definition. So the second pattern will match a string that begins with _, while the first will not. Perhaps [\w&&[^_]] is what you were looking for?
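A quick way to see the difference, as a small sketch (the class and method names are just for illustration): \w accepts an underscore, while the intersection [\w&&[^_]] accepts the same characters minus the underscore.

```java
import java.util.regex.Pattern;

// Illustrative class name; shows \w versus \w with the underscore removed.
public class WordClassDemo {
    private static final Pattern WORD = Pattern.compile("\\w");
    private static final Pattern WORD_NO_UNDERSCORE = Pattern.compile("[\\w&&[^_]]");

    // True when the single-character string is matched by \w.
    static boolean isWordChar(String s) {
        return WORD.matcher(s).matches();
    }

    // True when the single-character string is matched by [\w&&[^_]].
    static boolean isWordCharNoUnderscore(String s) {
        return WORD_NO_UNDERSCORE.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a", "7", "_", "-"}) {
            System.out.println("'" + s + "' \\w=" + isWordChar(s)
                    + " [\\w&&[^_]]=" + isWordCharNoUnderscore(s));
        }
    }
}
```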
The ? in the middle of your patterns doesn't seem to do anything. If a [-_] is "missing", then that just means the neighboring [A-Za-z0-9]+ group will be able to grab more characters. I suspect that this unnecessary ? may make the whole pattern much more vague and flexible, leading the matcher to waste its time considering all the many ways a given pattern might match. Try dropping the ? to see how performance improves.
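To make that concrete, here is a rough sketch. The two patterns below are hypothetical stand-ins assembled from the pieces mentioned above ([A-Za-z0-9]+ runs separated by [-_]); they are not the patterns from the original post. Both accept the same strings, but with the optional separator a run of n letters can be split between the repeated groups in roughly 2^n ways, and a backtracking matcher may try many of them before giving up on a long input that has no @.

```java
import java.util.regex.Pattern;

// Hypothetical patterns for illustration only; not the original poster's.
public class OptionalSeparatorDemo {
    // Separator optional: a run of letters can be divided among repetitions many ways.
    private static final Pattern WITH_OPTIONAL =
            Pattern.compile("[A-Za-z0-9]+(?:[-_]?[A-Za-z0-9]+)*@");
    // Separator required: each repetition must start with - or _, so there is one way to split.
    private static final Pattern WITHOUT_OPTIONAL =
            Pattern.compile("[A-Za-z0-9]+(?:[-_][A-Za-z0-9]+)*@");

    static boolean matchesWithOptional(String s) {
        return WITH_OPTIONAL.matcher(s).matches();
    }

    static boolean matchesWithoutOptional(String s) {
        return WITHOUT_OPTIONAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Inputs are kept short on purpose so the slow pattern still finishes quickly.
        String[] samples = {"abc@", "a-b_c@", "a--b@", "-abc@", "abc-@", "abc"};
        for (String s : samples) {
            System.out.println(s + " optional=" + matchesWithOptional(s)
                    + " required=" + matchesWithoutOptional(s));
        }
    }
}
```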
I'm not sure what exactly the patterns are intended to do. Is the goal to exclude anything with - or _ at the beginning or end, but those chars are OK anywhere else? (Note that Steve's code does something slightly different here - I'm too lazy to read the RFCs right now, but I imagine his version is correct.)
Offhand, if speed is still a problem I might try something like this instead:
[a-zA-Z]|[a-zA-Z][a-zA-Z0-9_-]*[a-zA-Z0-9]
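As a small sketch of how that pattern behaves with Matcher.matches() (the class and method names are invented for the example): it accepts a single letter, or a letter followed by any mix of letters, digits, _ and -, as long as the last character is a letter or digit.

```java
import java.util.regex.Pattern;

// Illustrative class name; the pattern string is the one suggested above.
public class LocalPartSketch {
    private static final Pattern LOCAL_PART =
            Pattern.compile("[a-zA-Z]|[a-zA-Z][a-zA-Z0-9_-]*[a-zA-Z0-9]");

    // matches() requires the whole input to match one of the two alternatives.
    static boolean accepts(String s) {
        return LOCAL_PART.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a", "ab", "a_b-c9", "a-", "-a", "9a"}) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

Because there is only one repeated character class and no nested repetition, the matcher has very little to backtrack over when the input does not match.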


"I'm not back." - Bill Harding, Twister