Please help me check this regex

Kanchan Yadav
Greenhorn

Joined: Nov 30, 2010
Posts: 5
Hi! I have been reading some posts & links on 'Regex'. I need to write one that checks first & last names. I know many are available on the web but just wanted to try on my own...
This regex should match names that begin with a capital letter, may contain dots & spaces, contain at least one multi-letter word (as in 'Java') & are at most 45 characters long.
It should also be able to match names like - A.B. Henry, A B Henry, A Henry, Alex H, Alex H., Alex Henry, Alex Williams Henry, etc.


I've created a package 'validator' in my application & named this class as 'FirstLastNameVerifier'. The code goes like this:

Thanks for your help.
Kanchan Yadav
Greenhorn

Joined: Nov 30, 2010
Posts: 5
And yes I have tried giving it the same inputs as mentioned above, but it doesn't seem to work. Thanks.
Matthew Brown
Bartender

Joined: Apr 06, 2010
Posts: 4370
    

Hi Kanchan, welcome to the Ranch!

I'm finding it hard to parse that regular expression, but I think you're over-complicating it.

Firstly, remember you don't need to validate everything in the same regular expression. For instance, I'd forget about trying to check the 45-character limit in it. I think you're better off checking .length() before trying the match.

I'm also not sure that it's worth trying to validate the rule that at least one of the groups has to be multi-character in the same regular expression. It may be possible, but only at the expense of complicating it, whereas two separate simple regular expressions would probably do that job.

The following expression matches all the names you've given above, although I haven't tested it properly against names it shouldn't validate:
([A-Z]([a-z]+|[.]?)[ ]*){2,}

A second regular expression to check that there is at least one multi-letter word in the name should be pretty simple.
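
A minimal sketch of the two-step approach described above: check the length first, then apply the expression from this post, then a second simple expression for the multi-letter word. The class name NameValidator, the method isValid and the second pattern [A-Z][a-z]+ are assumptions made for illustration and do not come from the thread.

```java
import java.util.regex.Pattern;

final class NameValidator {
    // Expression from the post above, compiled once and reused.
    private static final Pattern SHAPE =
            Pattern.compile("([A-Z]([a-z]+|[.]?)[ ]*){2,}");

    // Assumed second check: a capital followed by one or more lowercase letters.
    private static final Pattern MULTI_LETTER_WORD =
            Pattern.compile("[A-Z][a-z]+");

    private static final int MAX_LENGTH = 45;

    static boolean isValid(String name) {
        // Cheap length check before any regex work.
        if (name == null || name.length() > MAX_LENGTH) {
            return false;
        }
        // matches() tests the whole input; find() looks for the word anywhere.
        return SHAPE.matcher(name).matches()
                && MULTI_LETTER_WORD.matcher(name).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "A.B. Henry", "A B Henry", "A Henry", "Alex H",
            "Alex H.", "Alex Henry", "Alex Williams Henry", "A B"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Keeping the three checks separate means each one can be changed or tested without touching the others.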

Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41599
    
In this day and age your code should be able to handle Unicode, not just ASCII. So instead of "[A-Z]", use "\p{Lu}", instead of "[a-z]", use "\p{Ll}", and instead of "[A-Za-z]" you should use "\p{L}".
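
As a rough illustration of that substitution, the sketch below applies it to the expression Matthew posted earlier in this thread. The class name UnicodeNameShape and the sample name are assumptions; note that inside a Java string literal each backslash has to be doubled.

```java
import java.util.regex.Pattern;

final class UnicodeNameShape {
    // ASCII-only version, as posted earlier in the thread.
    static final Pattern ASCII =
            Pattern.compile("([A-Z]([a-z]+|[.]?)[ ]*){2,}");

    // Same shape with Unicode categories: \p{Lu} = uppercase letter,
    // \p{Ll} = lowercase letter.
    static final Pattern UNICODE =
            Pattern.compile("(\\p{Lu}(\\p{Ll}+|[.]?)[ ]*){2,}");

    public static void main(String[] args) {
        // \u00C9 is a capital E with acute accent, written as an escape
        // so the source file stays plain ASCII.
        String name = "\u00C9mile Zola";
        System.out.println("ASCII:   " + ASCII.matcher(name).matches());
        System.out.println("Unicode: " + UNICODE.matcher(name).matches());
    }
}
```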


Matthew Brown
Bartender

Joined: Apr 06, 2010
Posts: 4370
    

Oh, one other thing I was going to mention. If this is going to be for a real application, you need to realise that real names can get a bit more complicated. You'll need to cope with things like O'Leary, McDonald, Lloyd-George, etc. Validating names without excluding real people is not straightforward.
Kanchan Yadav
Greenhorn

Joined: Nov 30, 2010
Posts: 5
Thanks Matthew & Ulf.. My first post got some replies
Matthew - Actually, I tried with a simpler one but as I kept adding those conditions it got complicated. I'll try this one & let you know.

Ulf - That's definitely something new to learn & I will use it in my code.

By the way, I can not yet understand the use of '?' and '^' in regex. I'm trying to read more about them, but still clueless as to their application in my code. If you can suggest some good tutorial/link/URL.. please do so.

Thanks!
Vinoth Kumar Kannan
Ranch Hand

Joined: Aug 19, 2009
Posts: 276

Like Matthew, I too tried and came up with another possible regex solution - [A-Z][a-z]*(?:[\\s\\.][A-Z][a-z]*)*

The ultimate aim is simple - Match a multi-word string separated by either '.' or ' ', and starts with a capital letter.
Let's split the design into parts.
Write a regex that starts with a capital letter followed by any number of lowercase letters (possibly zero, as in 'A B Henry'). => [A-Z][a-z]*
Then, coming to the multi-word part, the words may be separated by either a '.' or a ' ' (as in A.B Henry), but not both. => ([A-Z][a-z]*[\\s\\.])+
...but with the regex constructed above, the string always has to end with a ' ' or a '.', otherwise it won't match. So we work around that by bringing [\\s\\.] to the front. => [A-Z][a-z]*(?:[\\s\\.][A-Z][a-z]*)*

The regex constructed above should now match all of your inputs except 'Alex H.' (the trailing '.') and 'A.B. Henry' (a '.' followed by a ' '), but I guess you know what to do to make the pattern match these too.

The '?:' defines a non-capturing group, as we are not interested in using back references here. Including '?:' spares the regex engine from storing the group's match, which can make matching slightly faster.
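
A small sketch for trying this pattern against the sample names from the first post. The class name SeparatorFirstDemo and the helper fullMatch are made up for illustration. Tracing the pattern by hand, a full match fails for 'Alex H.' (nothing follows the final '.') and also for 'A.B. Henry', because the separator class [\\s\\.] consumes exactly one character and so cannot cover a '.' followed by a ' '.

```java
import java.util.regex.Pattern;

final class SeparatorFirstDemo {
    // Pattern from this post; backslashes are doubled for the Java string literal.
    static final Pattern NAME =
            Pattern.compile("[A-Z][a-z]*(?:[\\s\\.][A-Z][a-z]*)*");

    // True only if the whole input matches the pattern.
    static boolean fullMatch(String input) {
        return NAME.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "A.B. Henry", "A.B Henry", "A B Henry", "A Henry",
            "Alex H", "Alex H.", "Alex Henry", "Alex Williams Henry"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + fullMatch(s));
        }
    }
}
```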


Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41599
    
Matthew makes a good point. If you can split the input fields into three fields (first name - middle name or initials - last name), then you could apply more specific checks, but if it's just one large string then it'll be tricky.

(The regexp you posted doesn't make this mistake, but it's good to remember that there are last names that have only two characters, "Ng" being an example.)
Matthew Brown
Bartender

Joined: Apr 06, 2010
Posts: 4370
    

It's also worth reading this.

For instance, I work in a university. Imagine how many of our systems struggled when we had a student join last year who only has a single name. Not just a preference for using a single name: legally speaking they have a single name.
Kanchan Yadav
Greenhorn

Joined: Nov 30, 2010
Posts: 5
Thanks Vinoth. I tried that regex & yes it does work in most of the situations.

Thanks Matthew. I went through that article. Yes, I would definitely need to keep in mind all those aspects & 'localization' etc. for later. Presently, I can continue with the min. of requirements that I've mentioned.

Thanks Ulf. In fact, the complicated regex is a result of my trying to get all things done & dealing with all conditions at once.

Thanks All for guiding me. I'm reading a few tutorials on 'regex' & trying to come up with a good one for this. Greenhorn that I am, it's taking me a little more time to figure out. But yes, please do watch this post; I'm sure to come up with something soon. I'm at it.
Kanchan Yadav
Greenhorn

Joined: Nov 30, 2010
Posts: 5
Hello Vinoth, just one query. I want to set this class as an input verifier for a text field on one of my Swing forms. Which would be the better way to code it: 1. using the 'match' method (as above), or 2. using the 'Pattern' & 'Matcher' classes? The code for the pre-compiled pattern is:



I have tried testing your regex with both the code samples & there are some differences in output.
1. Using the 'match' method returns 'false' with most of the input strings mentioned above.

2. Using the pattern class does match the input strings, but the 'matcher groups' are not the same as the input strings. For example, A.B. Williams would give the following output:

1. "String is A.B. Williams Match result is false"
2. "String is A.B. Williams Match result is true"


In another detail display, it gives
Matcher grp is: A Matcher start is: 0 Matcher end is: 1
Matcher grp is: B Matcher start is: 2 Matcher end is: 3
Matcher grp is: Williams Matcher start is: 5 Matcher end is: 13

(I understand that because of the use of ?:, we do not get the characters "." & " " as a matching group in the result.)

My question is why do the two cases treat the input strings differently. I'm not sure but could it be because 'match' does a complete match & would not settle for anything less whereas 'pattern' returns true even if a 'minimum' match is found?
Can you suggest how it is generally done & which would be a better way given the context?

Thanks for your help.
Vinoth Kumar Kannan
Ranch Hand

Joined: Aug 19, 2009
Posts: 276

I'm not sure but could it be because 'match' does a complete match & would not settle for anything less whereas 'pattern' returns true even if a 'minimum' match is found?

I assume that by 'match' you mean String's matches() method, and by 'pattern' you mean Pattern.compile(), Matcher.find() & Matcher.group().
So, you want to know the difference between them?
Go for the matches() method if you want to match your pattern against the whole input string, and go for Matcher.find() followed by Matcher.group(), to find any number of substring matches in the input string.
I suggest you read more of these API usages in the javadoc.

And one more thing: when you have a lot of strings to match against the same regex pattern, it is advisable to create a Pattern instance once and use Matcher.matches() or Matcher.find()/Matcher.group() on it.
String's matches(regex) compiles the pattern again on every call, so it is the costlier choice for a long list of strings. For just a few matches, it is fine.
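
The difference can be seen in a short sketch using the regex posted earlier in this thread. It is only an illustration: the class name MatchVsFind and its helper methods are assumptions, and it is not the code from the question, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchVsFind {
    static final String REGEX = "[A-Z][a-z]*(?:[\\s\\.][A-Z][a-z]*)*";

    // Compiled once; safe to reuse for every input string.
    static final Pattern NAME = Pattern.compile(REGEX);

    // Whole-input check, equivalent in result to input.matches(REGEX).
    static boolean wholeInputMatches(String input) {
        return NAME.matcher(input).matches();
    }

    // Collects every substring that find() locates, in order.
    static List<String> findAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = NAME.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "A.B. Williams";
        System.out.println("matches(): " + wholeInputMatches(input));
        System.out.println("find():    " + findAll(input));
    }
}
```

For an input verifier, where the whole field has to be a valid name, the matches() style is the one that answers the question; find() reporting true only says that some part of the text fits the pattern.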

...I understand that because of the use of ?:, we do not get the characters "." & " " as a matching group in the result...

No. When designing a pattern, anything within '(' and ')' is a capturing group: once it has matched, the regex engine remembers that match for the rest of its processing. To prevent the engine from remembering it, since we do not intend to use it again elsewhere in our pattern (as a back reference), we use '?:'. You can even remove it. Nothing is going to change, except possibly the performance when the input string is very large.
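
A brief sketch of what "remembering" means in practice, comparing the pattern with and without '?:'. The class name GroupKinds is made up for illustration. Both variants accept the same input; they differ only in whether a numbered group is available afterwards.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupKinds {
    // With '?:' the parentheses only group; nothing is stored.
    static final Pattern NON_CAPTURING =
            Pattern.compile("[A-Z][a-z]*(?:[\\s\\.][A-Z][a-z]*)*");

    // Without '?:' the parentheses form capturing group 1.
    static final Pattern CAPTURING =
            Pattern.compile("[A-Z][a-z]*([\\s\\.][A-Z][a-z]*)*");

    public static void main(String[] args) {
        String input = "A B Henry";
        Matcher plain = NON_CAPTURING.matcher(input);
        Matcher capturing = CAPTURING.matcher(input);

        // Same match result for both variants.
        System.out.println(plain.matches() + " / " + capturing.matches());

        // groupCount() counts capturing groups only.
        System.out.println(plain.groupCount() + " / " + capturing.groupCount());

        // A repeated capturing group keeps its last iteration.
        System.out.println("[" + capturing.group(1) + "]");
    }
}
```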
 