| Author |
Regular expressions - Split text using any chars except letters
|
Hesham Gneady
Ranch Hand
Joined: Feb 26, 2007
Posts: 66
|
|
Hello,
Suppose I have a JTextArea where a user enters some text in any language, and I want to grab each word the user entered, so I want to split that text on any non-letter character, like this:
This is okay for English text, but I also want this to work if the user enters German/Turkish/Arabic/... text.
Is this possible?
Thanks.
|
Hesham
|
 |
pete stein
Bartender
Joined: Feb 23, 2007
Posts: 1561
|
|
Why not just split on white space?
i.e.,
This will still leave punctuation marks present, though.
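A minimal sketch of the whitespace approach, not necessarily the snippet originally posted here. The class and method names are made up for illustration; `\\s+` is the standard whitespace pattern.

```java
import java.util.Arrays;

public class WhitespaceSplit {
    // Splits on runs of whitespace; punctuation stays attached to the words.
    static String[] splitOnWhitespace(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnWhitespace("Hello, world! How are you?")));
        // prints [Hello,, world!, How, are, you?]
    }
}
```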
|
 |
Hesham Gneady
Ranch Hand
Joined: Feb 26, 2007
Posts: 66
|
|
Yes, right... but each language has its own punctuation, which I want to split the text on too.
But if there is no other solution, then that's my second option.
|
 |
Campbell Ritchie
Sheriff
Joined: Oct 13, 2005
Posts: 32716
|
|
|
Have a look at the Java™ Tutorials section, particularly about the predefined character classes. You might be able to create a class for "not something" which might help.
|
 |
Hesham Gneady
Ranch Hand
Joined: Feb 26, 2007
Posts: 66
|
|
Thanks Campbell... I've read it.
But sorry, I don't get it. What's the difference between what you're suggesting and the code example I posted:
|
 |
Campbell Ritchie
Sheriff
Joined: Oct 13, 2005
Posts: 32716
|
|
|
What about \\w or \\W? a-zA-Z only works for English; other languages use different alphabets.
|
 |
Rob Spoor
Sheriff
Joined: Oct 27, 2005
Posts: 19216
|
|
Campbell Ritchie wrote: What about \\w or \\W? a-zA-Z only works for English; other languages use different alphabets.
\w is explicitly specified as "A word character: [a-zA-Z_0-9]". I tried it with é, but the é was treated as a character to split on.
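To illustrate that behaviour, here is a small sketch (hypothetical class and method names) that splits on `\\W+` with default pattern flags. The é is written as a Unicode escape so the source stays ASCII-only.

```java
import java.util.Arrays;

public class AsciiWordSplit {
    // With default flags, \W is the complement of [a-zA-Z_0-9],
    // so an accented letter counts as a delimiter.
    static String[] splitOnNonWord(String text) {
        return text.split("\\W+");
    }

    public static void main(String[] args) {
        // "cafe au lait" with an e-acute (U+00E9) at the end of the first word
        String text = "caf\u00E9 au lait";
        System.out.println(Arrays.toString(splitOnNonWord(text)));
        // prints [caf, au, lait]
    }
}
```

The accented letter is swallowed together with the following space, so the first word comes back truncated.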
|
|
 |
Hesham Gneady
Ranch Hand
Joined: Feb 26, 2007
Posts: 66
|
|
Well... temporarily I did this:
So I'm splitting the text on any English non-letter character. But as I said, this is not nice: if the user enters a non-letter character from another language, such as German or Arabic, it won't split the text.
Any ideas?
|
 |
Joanne Neal
Rancher
Joined: Aug 05, 2005
Posts: 3011
|
|
As I very rarely use them, I don't know much about regexes, but I was amazed that they didn't support matching of non-English characters.
So I had a little browse around the web and I found this.
I don't know if it will solve your problem, but from a non-regex expert reading of it, the Unicode Character Properties section looks like it may be useful.
|
Joanne
|
 |
Hesham Gneady
Ranch Hand
Joined: Feb 26, 2007
Posts: 66
|
|
Thanks Joanne.
I guess we did regular expressions an injustice; this code will do it:
\p{L} matches any Unicode letter.
For more info: Regular Expressions
Thanks
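A sketch of one way to use \p{L} for this, not necessarily the exact code from the post. It splits on `\\P{L}+` (capital P negates the category); the class and method names are invented for the example, and the sample words are written with Unicode escapes to keep the source ASCII-only.

```java
import java.util.Arrays;

public class UnicodeLetterSplit {
    // \P{L} matches any code point that is NOT in the Unicode "letter" category.
    static String[] splitOnNonLetters(String text) {
        return text.split("\\P{L}+");
    }

    public static void main(String[] args) {
        // A German word, another German word, and an Arabic word,
        // separated by a comma, spaces and an exclamation mark.
        String text = "Gr\u00FC\u00DFe, Welt! \u0645\u0631\u062D\u0628\u0627";
        String[] words = splitOnNonLetters(text);
        System.out.println(words.length); // prints 3
        System.out.println(Arrays.toString(words));
    }
}
```

Note that if the text starts with a delimiter, `split` returns an empty string as the first element, so callers may want to skip empty entries.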
|
 |
Campbell Ritchie
Sheriff
Joined: Oct 13, 2005
Posts: 32716
|
|
|
Damn! \\w wouldn't work. But well done, Joanne.
|
 |
Joanne Neal
Rancher
Joined: Aug 05, 2005
Posts: 3011
|
|
Actually, reading through the Unicode Character Properties section of that link, it appears accented characters can be either one or two Unicode code points.
To be absolutely sure of a match, I think you would also need to allow for a letter followed by a non-spacing mark.
|
 |
Hesham Gneady
Ranch Hand
Joined: Feb 26, 2007
Posts: 66
|
|
To be absolutely sure of a match, I think you would also need to allow for a letter followed by a non-spacing mark.
Let me check that I understand you correctly. Do you mean a word like this:
"tt/" or "tté3"
Right?
|
 |
Joanne Neal
Rancher
Joined: Aug 05, 2005
Posts: 3011
|
|
Again, "character" really means "Unicode code point". \p{L} matches a single code point in the category "letter". If your input string is à encoded as U+0061 U+0300, it matches a without the accent. If the input is à encoded as U+00E0, it matches à with the accent. The reason is that both the code points U+0061 (a) and U+00E0 (à) are in the category "letter", while U+0300 is in the category "mark".
So, if the à is encoded as U+0061 U+0300, then the first character is the letter a and the second character is not a letter (it is a mark), so it could break any regex that is looking for a string of letters.
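The difference between the two encodings can be checked with a short sketch (hypothetical class and method names), splitting both forms of the same word on non-letters:

```java
public class CombiningMarkSplit {
    // Splits on any run of code points outside the Unicode "letter" category.
    static String[] splitOnNonLetters(String text) {
        return text.split("\\P{L}+");
    }

    public static void main(String[] args) {
        String precomposed = "e\u00E0e";  // a-grave as the single code point U+00E0
        String decomposed = "ea\u0300e";  // a (U+0061) followed by combining grave U+0300
        System.out.println(splitOnNonLetters(precomposed).length); // prints 1
        System.out.println(splitOnNonLetters(decomposed).length);  // prints 2
    }
}
```

In the decomposed form the combining mark acts as a delimiter, so the word breaks into "ea" and "e".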
|
 |
Hesham Gneady
Ranch Hand
Joined: Feb 26, 2007
Posts: 66
|
|
Sorry for being late to reply, Joanne.
I think I understand you... if we have a word "eàe" where à = U+0061 U+0300, then it will be split into the words "ea" and "e"... right?
If so, I think this will do the job:
I didn't try this one, but I think it'll do the job. What do you think, Joanne?
|
 |
Joanne Neal
Rancher
Joined: Aug 05, 2005
Posts: 3011
|
|
Hesham Gneady wrote: Sorry for being late to reply, Joanne.
I think I understand you... if we have a word "eàe" where à = U+0061 U+0300, then it will be split into the words "ea" and "e"... right?
Yes
Hesham Gneady wrote: If so, I think this will do the job:
I didn't try this one, but I think it'll do the job. What do you think, Joanne?
As I said earlier, I'm not a regex expert. Best way to find out if it's right is to test it.
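In that spirit, here is a small test sketch. The pattern is an assumption, since the one proposed above is not shown: it treats both letters (\p{L}) and combining marks (\p{M}) as part of a word. The second method is an alternative that was not discussed in the thread: compose the text with `java.text.Normalizer` first, then split on non-letters. All class and method names are illustrative.

```java
import java.text.Normalizer;

public class LetterAndMarkSplit {
    // Treats letters (\p{L}) and combining marks (\p{M}) as word characters.
    static String[] splitKeepingMarks(String text) {
        return text.split("[^\\p{L}\\p{M}]+");
    }

    // Alternative: compose letter + mark pairs (NFC) before splitting on non-letters.
    static String[] normalizeThenSplit(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC).split("\\P{L}+");
    }

    public static void main(String[] args) {
        // First word uses a decomposed a-grave: a (U+0061) + combining grave (U+0300)
        String decomposed = "ea\u0300e, abc";
        System.out.println(splitKeepingMarks(decomposed).length);                 // prints 2
        System.out.println(normalizeThenSplit(decomposed)[0].equals("e\u00E0e")); // prints true
    }
}
```

Normalizing alone may not be enough for every input, because not every letter and mark combination has a precomposed form; keeping \p{M} in the character class covers those cases as well.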
|
 |