
Regular expressions - Split text using any chars except letters

Hesham Gneady
Ranch Hand

Joined: Feb 26, 2007
Posts: 66
Hello,

Suppose I have a JTextArea where a user enters some text in any language, and I want to grab each word the user entered. So I want to split that text on any non-letter character, like this:
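
The snippet from the original post is not preserved here. Going by the later replies, which mention `a-zA-Z`, a minimal sketch of that kind of split might look like the following. The class name, method name, and exact pattern are assumptions for illustration, not the poster's code.

```java
import java.util.Arrays;

public class AsciiSplitSketch {
    // Assumed pattern: split on runs of anything that is not an ASCII letter.
    static String[] words(String text) {
        return text.split("[^a-zA-Z]+");
    }

    public static void main(String[] args) {
        // English text splits into clean words.
        System.out.println(Arrays.toString(words("Hello, world! How are you?")));
        // Non-ASCII letters (U+00FC, U+00DF, U+00F6) are treated as delimiters,
        // so the German words are cut into pieces.
        System.out.println(Arrays.toString(words("Gr\u00fc\u00dfe aus K\u00f6ln")));
    }
}
```

With this pattern the German sample comes back as the fragments Gr, e, aus, K and ln, which is the problem the question is about.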

This is okay for English text, but I also want it to work if the user enters German, Turkish, Arabic, or any other text.
Is this possible?

Thanks.


Hesham
pete stein
Bartender

Joined: Feb 23, 2007
Posts: 1561
Why not just split on whitespace?
i.e.,


This will still leave punctuation marks present though.
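
The snippet for this suggestion is not preserved either. A minimal sketch of a whitespace split, assuming the `\s+` pattern and a hypothetical helper name, could be:

```java
import java.util.Arrays;

public class WhitespaceSplitSketch {
    // Trim first so leading whitespace does not produce an empty first token,
    // then split on runs of whitespace.
    static String[] tokens(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Punctuation stays attached to the neighbouring word.
        System.out.println(Arrays.toString(tokens("Hello, world! Wie geht's?")));
    }
}
```

As noted above, the tokens keep their punctuation: the first token here is "Hello," with the comma still attached.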
Hesham Gneady
Ranch Hand

Joined: Feb 26, 2007
Posts: 66
Yes, right ... but each language has its own punctuation, which I want to split on too.
If there is no other solution, then that's my second option.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38765
Have a look at the Java™ Tutorials section, particularly about the predefined character classes. You might be able to create a class for "not something" which might help.
Hesham Gneady
Ranch Hand

Joined: Feb 26, 2007
Posts: 66
Thanks Campbell ... I've read it.
But sorry, I don't get it. What's the difference between what you're suggesting and the code example I posted:

Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38765
What about \\w or \\W? a-zA-Z only works for English; other languages use different alphabets.
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19693

Campbell Ritchie wrote:What about \\w or \\W? a-zA-Z only works for English; other languages use different alphabets.

\w is explicitly specified as "A word character: [a-zA-Z_0-9]". I tried it with é, but the é was treated as a delimiter and the text was split on it.
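
That behaviour can be reproduced with a short check. This is a sketch with default `Pattern` flags and a made-up sample string, not the test that was actually run:

```java
import java.util.Arrays;

public class WordClassSketch {
    // Splits on runs of \W, i.e. anything outside [a-zA-Z_0-9] with default flags.
    static String[] splitOnNonWord(String text) {
        return text.split("\\W+");
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is outside [a-zA-Z_0-9],
        // so it is consumed as part of the delimiter.
        System.out.println(Arrays.toString(splitOnNonWord("caf\u00e9 au lait")));
    }
}
```

The word loses its last letter: the result is caf, au, lait.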


SCJP 1.4 - SCJP 6 - SCWCD 5 - OCEEJBD 6
Hesham Gneady
Ranch Hand

Joined: Feb 26, 2007
Posts: 66
Well ... for now I did this:

So I'm splitting the text on any English non-letter character. But as I said, this is not nice: if the user enters a non-letter character from another language, such as German or Arabic, it won't split the text there.

Any ideas?
Joanne Neal
Rancher

Joined: Aug 05, 2005
Posts: 3528
As I very rarely use them, I don't know much about regexes, but I was amazed that they didn't support matching of non-English characters.
So I had a little browse around the web and I found this.
I don't know if it will solve your problem, but from a non-expert's reading of it, the Unicode Character Properties section looks like it may be useful.


Joanne
Hesham Gneady
Ranch Hand

Joined: Feb 26, 2007
Posts: 66
Thanks Joanne.
I guess we were unfair to regular expressions; this code will do it:
\p{L} matches any Unicode letter.
For more info: Regular Expressions
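
The posted code is missing from this copy of the thread. A minimal sketch of a split built on `\p{L}`, assuming the negated form `\P{L}+` and hypothetical names, might be:

```java
import java.util.Arrays;

public class UnicodeLetterSplitSketch {
    // \P{L} (capital P) is the negation of \p{L}: any code point that is not a letter.
    static String[] words(String text) {
        return text.split("\\P{L}+");
    }

    public static void main(String[] args) {
        // German sample: the umlauts and sharp s now count as letters.
        System.out.println(Arrays.toString(words("Gr\u00fc\u00dfe aus K\u00f6ln!")));
        // Arabic sample: two words separated by an Arabic comma (U+060C) and a space.
        String arabic = "\u0645\u0631\u062d\u0628\u0627\u060c \u0639\u0627\u0644\u0645";
        System.out.println(words(arabic).length);
    }
}
```

One thing to watch: if the text starts with a delimiter, `String.split` returns an empty string as the first element, so callers may want to skip empty tokens.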

Thanks
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38765
Damn! \\w wouldn't work. But well done, Joanne.
Joanne Neal
Rancher

Joined: Aug 05, 2005
Posts: 3528
Actually, reading through the Unicode Character Properties section of that link, it appears accented characters can be either one or two Unicode code points.
To be absolutely sure of a match, I think you would also need to allow for a letter followed by a non-spacing mark.
Hesham Gneady
Ranch Hand

Joined: Feb 26, 2007
Posts: 66
Joanne Neal wrote:To be absolutely sure of a match, I think you would also need to allow for a letter followed by a non-spacing mark.

Let me check that I understand you correctly. Do you mean a word like this:
"tt/" or "tté3"

Right?
Joanne Neal
Rancher

Joined: Aug 05, 2005
Posts: 3528
Again, "character" really means "Unicode code point". \p{L} matches a single code point in the category "letter". If your input string is à encoded as U+0061 U+0300, it matches a without the accent. If the input is à encoded as U+00E0, it matches à with the accent. The reason is that both the code points U+0061 (a) and U+00E0 (à) are in the category "letter", while U+0300 is in the category "mark".


So, if the à is encoded as U+0061 U+0300, then the first character is the letter a and the second character is not a letter (it is a mark), and so it could break any regex that is looking for a string of letters.
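
The difference between the two encodings can be seen with a small experiment. This sketch assumes a split on `\P{L}+` (any run of non-letters); the class and method names are made up for illustration:

```java
import java.util.Arrays;

public class CombiningMarkDemo {
    // Split on runs of code points that are not in the category "letter".
    static String[] words(String text) {
        return text.split("\\P{L}+");
    }

    public static void main(String[] args) {
        // a-grave as the single code point U+00E0, between two e's.
        String precomposed = "e\u00e0e";
        // a-grave as U+0061 followed by the combining grave accent U+0300.
        String decomposed = "ea\u0300e";
        System.out.println(words(precomposed).length);
        System.out.println(Arrays.toString(words(decomposed)));
    }
}
```

The precomposed string stays one word, while the decomposed string is cut at the combining mark into "ea" and "e".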
Hesham Gneady
Ranch Hand

Joined: Feb 26, 2007
Posts: 66
Sorry for the late reply, Joanne.
I think I understand you ... if we have the word "eàe", where à = U+0061 U+0300, then it will be split into the words "ea" and "e" ... right?

If so, I think this will do the job:
I haven't tried this one, but I think it'll do the job. What do you think, Joanne?
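
The proposed code is not preserved in this copy. One way to allow for a letter followed by a mark, assuming marks (`\p{M}`) are simply treated as part of a word, is sketched below. The pattern and names are an illustration, not necessarily what was posted.

```java
import java.util.Arrays;

public class LetterOrMarkSplitSketch {
    // Split on runs of code points that are neither letters (\p{L}) nor marks (\p{M}).
    static String[] words(String text) {
        return text.split("[^\\p{L}\\p{M}]+");
    }

    public static void main(String[] args) {
        // Decomposed a-grave (U+0061 U+0300) no longer breaks the word.
        System.out.println(words("ea\u0300e").length);
        // Ordinary punctuation and spaces still separate words.
        System.out.println(Arrays.toString(words("ea\u0300e, tt\u00e9")));
    }
}
```

This keeps a combining mark attached to the letter before it. It also accepts a mark that is not preceded by a letter, which may or may not matter for the input at hand.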
Joanne Neal
Rancher

Joined: Aug 05, 2005
Posts: 3528
Hesham Gneady wrote:Sorry for the late reply, Joanne.
I think I understand you ... if we have the word "eàe", where à = U+0061 U+0300, then it will be split into the words "ea" and "e" ... right?


Yes

Hesham Gneady wrote:If so, I think this will do the job:
I haven't tried this one, but I think it'll do the job. What do you think, Joanne?


As I said earlier, I'm not a regex expert. Best way to find out if it's right is to test it.
 