Howdy, I'm just playing around with the new regex features and had a question or two. I'm starting off by trying to create a nice, robust regex pattern to validate email addresses, beginning with the part up through the @ sign. Here is what I have so far:
or the equivalent
It seems to work fine, but it gets REALLY bogged down and slow when trying to process something like the following. Is there any way to optimize this? Or am I SOL? Thanks!
I don't have any experience with the 1.4 regexps, but I have been using the ORO matcher for a long time. I also have a regular expression that I use to validate email addresses. It might surprise you that the following is an email address: "steve deadsea"@mydomain.tld (note the quotes and the space).

The two relevant RFCs are:
format for subdomain: RFC 1035 pg 7
format for email address: RFC 821 pg 29

Here is the regular expression that I eventually came up with. Note the (?: instead of ( so that the parenthesized groups are not captured, which can speed up performance a lot:

    private static final String LETTER = "[a-zA-Z]";
    private static final String DIGIT = "[0-9]";
    private static final String LETTER_DIGIT = "[0-9a-zA-Z]";
    private static final String LETTER_DIGIT_HYPHEN = "(?:[0-9a-zA-Z-])";
    private static final String QUOTEDSTRING = "(?:[\"\\\"](?:[^\\\"]|(?:[\\\\][\\\"]))*[\\\"])";
    private static final String ATOM = "(?:[\\!\\#-\\\'\\*\\+\\-\\/-9\\=\\?A-Z\\^-\\~]+)";
    private static final String SUBDOMAIN = "(?:" + LETTER + "(?:" + LETTER_DIGIT_HYPHEN + "*" + LETTER_DIGIT + ")?)";
    private static final String WORD = "(?:" + ATOM + "|" + QUOTEDSTRING + ")";
    private static final String DOMAIN = "(?:" + SUBDOMAIN + "(?:[\\.]" + SUBDOMAIN + ")+)";
    private static final String LOCALPART = "(?:" + WORD + "(?:[\\.]" + WORD + ")*)";
    private static final String EMAIL = "(?:" + LOCALPART + "[\\@]" + DOMAIN + ")";
    protected static final String EMAIL_ADDRESS = "^" + EMAIL + "$";
    protected static final String EMAIL_ADDRESS_OPTIONAL = "^(?:" + EMAIL + "?)$";

Again, I have not tested this with the JDK regex package, but it should work as is or with a few minor modifications.
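For anyone who wants to try these constants against the JDK regex package, here is a minimal sketch. The class name EmailRegexCheck, the isValid helper, and the sample inputs are made up for illustration; the pattern strings are copied from the post (the unused DIGIT constant is left out). It is only a quick check on a few inputs, not a claim that the pattern covers everything in the RFCs.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper class for illustration; only the pattern strings
// come from the post above.
final class EmailRegexCheck {
    private static final String LETTER = "[a-zA-Z]";
    private static final String LETTER_DIGIT = "[0-9a-zA-Z]";
    private static final String LETTER_DIGIT_HYPHEN = "(?:[0-9a-zA-Z-])";
    private static final String QUOTEDSTRING =
        "(?:[\"\\\"](?:[^\\\"]|(?:[\\\\][\\\"]))*[\\\"])";
    private static final String ATOM =
        "(?:[\\!\\#-\\\'\\*\\+\\-\\/-9\\=\\?A-Z\\^-\\~]+)";
    private static final String SUBDOMAIN =
        "(?:" + LETTER + "(?:" + LETTER_DIGIT_HYPHEN + "*" + LETTER_DIGIT + ")?)";
    private static final String WORD = "(?:" + ATOM + "|" + QUOTEDSTRING + ")";
    private static final String DOMAIN =
        "(?:" + SUBDOMAIN + "(?:[\\.]" + SUBDOMAIN + ")+)";
    private static final String LOCALPART =
        "(?:" + WORD + "(?:[\\.]" + WORD + ")*)";
    private static final String EMAIL =
        "(?:" + LOCALPART + "[\\@]" + DOMAIN + ")";

    // Compile once and reuse; compiling on every call is wasted work.
    private static final Pattern EMAIL_ADDRESS =
        Pattern.compile("^" + EMAIL + "$");

    // Returns true when the whole string matches the EMAIL pattern.
    static boolean isValid(String candidate) {
        return EMAIL_ADDRESS.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        // Sample inputs are assumptions chosen for this sketch.
        String[] samples = {
            "\"steve deadsea\"@mydomain.tld",
            "user@example.com",
            "user@@example.com",
            "user@localhost"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Compiling the Pattern once into a static field is a design choice of the sketch, not something the post specifies.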
Steve's reply looks good to me. Some other thoughts in response to the original post:

The two patterns given are not quite equivalent to each other. The \w character class includes _ as part of its definition, so the second pattern will match a string that begins with _, while the first will not. Perhaps [\w&&[^_]] is what you were looking for?

The ? in the middle of your patterns doesn't seem to do anything. If a [-_] is "missing", that just means the neighboring [A-Za-z0-9]+ group will be able to grab more characters. I suspect that this unnecessary ? makes the whole pattern much more ambiguous, leading the matcher to waste its time considering all the many ways the pattern might match a given string. Try dropping the ? to see how performance improves.

I'm not sure what exactly the patterns are intended to do. Is the goal to exclude anything with - or _ at the beginning or end, while allowing those chars anywhere else? (Note that Steve's code does something slightly different here. I'm too lazy to read the RFCs right now, but I imagine his version is correct.)

Offhand, if speed is still a problem I might try something like this instead:

    [a-zA-Z]|[a-zA-Z][a-zA-Z0-9_-]*[a-zA-Z0-9]
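A rough sketch of those two suggestions with java.util.regex, in case it helps. The class and method names (LocalPartSketch, isLocalPart, isWordWithoutUnderscore) are invented for the example, and matches() is used so each pattern has to cover the whole input. In the second pattern every character can be consumed in essentially one way, so the matcher should have very little to backtrack over on a long non-matching input.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are made up for this sketch.
final class LocalPartSketch {
    // \w with the underscore removed, using character-class intersection.
    private static final Pattern WORD_NO_UNDERSCORE =
        Pattern.compile("[\\w&&[^_]]+");

    // The pattern suggested above: a single letter, or a letter followed by
    // letters, digits, _ or -, ending in a letter or digit.
    private static final Pattern LOCAL_PART =
        Pattern.compile("[a-zA-Z]|[a-zA-Z][a-zA-Z0-9_-]*[a-zA-Z0-9]");

    // True when the whole string consists of \w characters other than '_'.
    static boolean isWordWithoutUnderscore(String s) {
        return WORD_NO_UNDERSCORE.matcher(s).matches();
    }

    // True when the whole string matches the suggested local-part pattern.
    static boolean isLocalPart(String s) {
        return LOCAL_PART.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Sample inputs are assumptions chosen for this sketch.
        String[] samples = {"a", "steve_deadsea", "a-b_c9", "steve-", "_steve"};
        for (String s : samples) {
            System.out.println(s + " -> " + isLocalPart(s));
        }
    }
}
```

If the pattern is embedded in a larger expression, the alternation would need to be wrapped in (?: ... ) so the | does not split the surrounding pattern.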