
Regular expressions - Split text using any chars except letters

 
Hesham Gneady
Ranch Hand
Posts: 66
Hello,

Suppose I have a JTextArea where a user enters some text in any language, and I want to grab each word the user entered. So I want to split that text on any non-letter character, like this:

This is okay for English text, but I also want it to work if the user enters German, Turkish, Arabic, or other text.
Is this possible?

Thanks.
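
A minimal sketch of the kind of ASCII-only split being described, assuming a pattern such as `[^a-zA-Z]+`. The actual pattern from the post may differ, and the class and method names are hypothetical. It shows why the approach works for English but breaks words that contain other letters.

```java
import java.util.Arrays;

class AsciiSplitDemo {
    // Assumed pattern: split on runs of anything that is not an ASCII letter.
    static String[] splitAscii(String text) {
        return text.split("[^a-zA-Z]+");
    }

    public static void main(String[] args) {
        // English text splits into clean words.
        // Prints [Hello, world, How, are, you]
        System.out.println(Arrays.toString(splitAscii("Hello, world! How are you?")));

        // German text written with Unicode escapes (o-umlaut, sharp s, u-umlaut)
        // so the source file encoding does not matter.
        // The non-ASCII letters are treated as delimiters.
        // Prints [Gr, e, ber, alles]
        System.out.println(Arrays.toString(splitAscii("Gr\u00f6\u00dfe \u00fcber alles")));
    }
}
```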
 
pete stein
Bartender
Posts: 1561
Why not just split on whitespace?
i.e.,


This will still leave punctuation marks attached to the words, though.
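
A rough sketch of a whitespace split, assuming the `\s+` pattern with `String.split`; the class and method names are hypothetical. The output shows the punctuation staying attached to the words.

```java
import java.util.Arrays;

class WhitespaceSplitDemo {
    // Split on runs of whitespace; trim first so leading spaces
    // do not produce an empty first element.
    static String[] splitOnWhitespace(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Prints [Hello,, world!, How, are, you?]
        System.out.println(Arrays.toString(splitOnWhitespace("Hello, world!  How are you?")));
    }
}
```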
 
Hesham Gneady
Ranch Hand
Posts: 66
Yes, right... but each language has its own punctuation, which I want to use for splitting the text too.
But if there is no other solution, then that's my second option.
 
Campbell Ritchie
Sheriff
Pie
Posts: 47300
52
Have a look at the Java™ Tutorials section, particularly about the predefined character classes. You might be able to create a class for "not something" which might help.
 
Hesham Gneady
Ranch Hand
Posts: 66
Thanks Campbell... I've read it.
But sorry, I don't get it. What's the difference between what you're suggesting and the code example I posted:

 
Campbell Ritchie
Sheriff
Pie
Posts: 47300
52
What about \\w or \\W? a-zA-Z only works for English; other languages use different alphabets.
 
Rob Spoor
Sheriff
Pie
Posts: 20398
47
Campbell Ritchie wrote: What about \\w or \\W? a-zA-Z only works for English; other languages use different alphabets.

\w is explicitly specified as "A word character: [a-zA-Z_0-9]". I tried it with é, but the text was split on that character.
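
A small sketch that reproduces this, with hypothetical class and method names. By default `\W` treats é as a non-word character. As an aside that nobody in the thread mentions, Java 7 and later also offer the `Pattern.UNICODE_CHARACTER_CLASS` flag, which makes `\w` Unicode-aware; note that `\w` still accepts digits and underscores, so it is not a pure "letters only" class.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class WordCharDemo {
    // Default behaviour: \W is the complement of [a-zA-Z_0-9].
    static String[] splitDefault(String text) {
        return text.split("\\W+");
    }

    // With UNICODE_CHARACTER_CLASS (Java 7+), \w and \W follow Unicode rules.
    private static final Pattern UNICODE_NON_WORD =
            Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);

    static String[] splitUnicode(String text) {
        return UNICODE_NON_WORD.split(text);
    }

    public static void main(String[] args) {
        // "caf" + e-acute (U+00E9) + " au lait", written with an escape.
        String text = "caf\u00e9 au lait";

        // The e-acute acts as a delimiter: 3 parts, the first one is "caf".
        System.out.println(Arrays.toString(splitDefault(text)));

        // The e-acute stays inside the first word: 3 parts, the first has 4 chars.
        System.out.println(Arrays.toString(splitUnicode(text)));
    }
}
```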
 
Hesham Gneady
Ranch Hand
Posts: 66
Well... for now I did this:

So I'm splitting the text on any English non-letter character. But as I said, this is not nice: if the user enters a non-letter character from another language, such as German or Arabic, it won't split the text.

Any ideas?
 
Joanne Neal
Rancher
Pie
Posts: 3742
16
As I very rarely use them, I don't know much about regexes, but I was amazed that they didn't support matching of non-English characters.
So I had a little browse around the web and I found this.
I don't know if it will solve your problem, but from a non-regex expert reading of it, the Unicode Character Properties section looks like it may be useful.
 
Hesham Gneady
Ranch Hand
Posts: 66
Thanks Joanne.
I guess we were unfair to regular expressions; this code will do it:
\p{L} matches any Unicode letter.
For more info: Regular Expressions

Thanks
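
A minimal sketch of splitting on anything that is not a Unicode letter, assuming the pattern `[^\p{L}]+`; the exact code from the post may differ, and the class and method names are hypothetical. The sample strings use Unicode escapes so the source file encoding does not matter.

```java
import java.util.Arrays;

class UnicodeLetterSplitDemo {
    // Split on runs of code points that are not in the Unicode category "letter".
    static String[] splitOnNonLetters(String text) {
        return text.split("[^\\p{L}]+");
    }

    public static void main(String[] args) {
        // German: o-umlaut, sharp s and u-umlaut now stay inside their words.
        String german = "Gr\u00f6\u00dfe \u00fcber alles!";
        System.out.println(splitOnNonLetters(german).length); // 3

        // Arabic: two words separated by an Arabic comma (U+060C) and a space.
        String arabic = "\u0645\u0631\u062d\u0628\u0627\u060c \u0628\u0643";
        System.out.println(splitOnNonLetters(arabic).length); // 2

        System.out.println(Arrays.toString(splitOnNonLetters("one, two; three")));
    }
}
```

One thing to watch for: if the text starts with a delimiter, such as an opening quote or ¿, `split` returns an empty string as the first element, so callers may want to skip empty entries.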
 
Campbell Ritchie
Sheriff
Pie
Posts: 47300
52
Damn! \\w wouldn't work. But well done, Joanne.
 
Joanne Neal
Rancher
Pie
Posts: 3742
16
Actually, reading through the Unicode Character Properties section of that link, it appears accented characters can be either one or two Unicode code points.
To be absolutely sure of a match, I think you would also need to allow for a letter followed by a non-spacing mark.
 
Hesham Gneady
Ranch Hand
Posts: 66
Joanne Neal wrote: To be absolutely sure of a match, I think you would also need to allow for a letter followed by a non-spacing mark.

Let me check that I understand you right. Do you mean a word like this:
"tt/" or "tté3"

Right?
 
Joanne Neal
Rancher
Pie
Posts: 3742
16
Again, "character" really means "Unicode code point". \p{L} matches a single code point in the category "letter". If your input string is à encoded as U+0061 U+0300, it matches a without the accent. If the input is à encoded as U+00E0, it matches à with the accent. The reason is that both the code points U+0061 (a) and U+00E0 (à) are in the category "letter", while U+0300 is in the category "mark".


So, if the à is encoded as U+0061 U+0300, then the first code point is the letter a and the second is not a letter (it is a mark), and so it could break any regex that is looking for a string of letters.
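
A short sketch of this effect, assuming a split on `[^\p{L}]+`; the class and method names are hypothetical. The same visible text is split differently depending on whether the accented letters are precomposed or written as a base letter plus a combining mark.

```java
import java.util.Arrays;

class CombiningMarkDemo {
    // Split on runs of code points that are not Unicode letters.
    static String[] splitOnNonLetters(String text) {
        return text.split("[^\\p{L}]+");
    }

    public static void main(String[] args) {
        // Precomposed: e-acute is U+00E9 and a-grave is U+00E0.
        String precomposed = "d\u00e9j\u00e0 vu";
        // Decomposed: e + U+0301 and a + U+0300 (combining marks).
        String decomposed = "de\u0301ja\u0300 vu";

        // The precomposed letters stay inside the word.
        System.out.println(splitOnNonLetters(precomposed).length); // 2

        // The combining marks are not letters, so they act as delimiters.
        System.out.println(Arrays.toString(splitOnNonLetters(decomposed))); // [de, ja, vu]

        // U+0300 is in category "mark", not "letter".
        System.out.println("\u0300".matches("\\p{M}")); // true
        System.out.println("\u0300".matches("\\p{L}")); // false
    }
}
```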
 
Hesham Gneady
Ranch Hand
Posts: 66
Sorry for the late reply, Joanne.
I think I understand you... if we have a word "eàe" where à = U+0061 U+0300, then it will be split into the words "ea" and "e", right?

If so, I think this will do the job:
I didn't try this one, but I think it'll work. What do you think, Joanne?
 
Joanne Neal
Rancher
Pie
Posts: 3742
16
Hesham Gneady wrote: Sorry for the late reply, Joanne.
I think I understand you... if we have a word "eàe" where à = U+0061 U+0300, then it will be split into the words "ea" and "e", right?


Yes

Hesham Gneady wrote: If so, I think this will do the job:
I didn't try this one, but I think it'll work. What do you think, Joanne?


As I said earlier, I'm not a regex expert. The best way to find out if it's right is to test it.
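
One way such a test could look, as a sketch only. It assumes a candidate pattern of `[^\p{L}\p{M}]+` (letters plus marks), which may or may not be what was proposed above, and the class and method names are hypothetical. It also tries an alternative: normalizing the text to NFC with `java.text.Normalizer` before splitting on non-letters.

```java
import java.text.Normalizer;

class LetterMarkSplitCheck {
    // Candidate 1: treat Unicode marks as part of a word, alongside letters.
    static String[] splitLettersAndMarks(String text) {
        return text.split("[^\\p{L}\\p{M}]+");
    }

    // Candidate 2: compose base letter + mark pairs first (NFC),
    // then split on non-letters.
    static String[] splitNormalized(String text) {
        String composed = Normalizer.normalize(text, Normalizer.Form.NFC);
        return composed.split("[^\\p{L}]+");
    }

    public static void main(String[] args) {
        // The word from the thread: e, then a + U+0300 (combining grave), then e.
        String word = "ea\u0300e";

        // A plain non-letter split breaks the word in two.
        System.out.println(word.split("[^\\p{L}]+").length); // 2

        // Both candidates keep it as one word.
        System.out.println(splitLettersAndMarks(word).length); // 1
        System.out.println(splitNormalized(word).length);      // 1
    }
}
```

The two candidates return different strings for the same input: the first keeps the original code points, while the second returns the composed form, which matters if the words are later compared with other text.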
 