Regular expression pattern for Non-Ascii characters

Raghu Sha
Ranch Hand

Joined: Feb 02, 2013
Posts: 122
How do I write a regex pattern to find non-ASCII characters in input?
This includes TAB, "", punctuation...
Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14115
    

Tab and punctuation are certainly ASCII characters. So, what do you really mean by "non-ASCII" characters? You have to be precise if you want a good answer.


Raghu Sha
Ranch Hand

Joined: Feb 02, 2013
Posts: 122
Thanks..
First we can write a regex for the allowable characters below.
The remaining characters are non-ASCII.

Ascii characters
Char >= 32 && Char <= 255

Country specific allowable characters
0x15E,0x15F,0x162,0x163,0x102,0x103,0xCE,0xEE,0xC2,0xE2
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1050
    

Raghu Sha wrote:
Char >= 32 && Char <= 255


The ASCII character set does not include values greater than 127, and it does include characters less than 32, so it sounds like you don't actually mean ASCII.

Also, your last post seems to contradict your first post. Do you want to extract from a String the characters that are in your specified set, or to remove from a String the characters that are in your specified set?
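
For reference, a minimal sketch of what a strict ASCII test looks like in Java. It relies on the predefined \p{ASCII} class of java.util.regex.Pattern, which covers 0x00 to 0x7F; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class AsciiCheck {
    // \p{ASCII} is Java's predefined class for the range 0x00-0x7F,
    // so the negated class matches any character outside strict ASCII.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\p{ASCII}]");

    /** Returns true if the input contains at least one character above 0x7F. */
    static boolean containsNonAscii(String input) {
        return NON_ASCII.matcher(input).find();
    }

    public static void main(String[] args) {
        // Tab and punctuation are ASCII, so nothing is flagged here.
        System.out.println(containsNonAscii("tab\there, punctuation!")); // false
        // e-acute (0xE9) lies above 127 and is therefore not ASCII.
        System.out.println(containsNonAscii("caf\u00E9"));               // true
    }
}
```

Note that this flags every character in the 128 to 255 range, which is exactly why "Char >= 32 && Char <= 255" cannot be described as ASCII.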
Raghu Sha
Ranch Hand

Joined: Feb 02, 2013
Posts: 122
Thanks Richerd.
Sorry for confusing requirement.

I need to filter non-ASCII characters from user input using a regex pattern that is country specific.
The application supports multiple countries.

Could you please tell us your design approach for achieving this?

fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 11256
    

Raghu Sha wrote:I need to filter non-ASCII characters from user input using a regex pattern that is country specific.
The application supports multiple countries.

your REQUIREMENT is to use a regex? That is not a good requirement. It should tell you what you need to accomplish, but not dictate HOW you do it. I would go back to whoever wrote that spec and tell them to try again.


Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

I suspect the whole requirement is bogus, not just the part which requires the use of a regex. I suspect it's going to prevent me from using the characters é or ™ because I'm in an English-language environment and everybody knows that you don't use those characters in English.

But generally we can't control requirements given to us by higher-ups, and if the requirement is actually bogus there's nothing we can do about that. So my approach would be to ignore anything referring to "ASCII", since that appears to be a red herring, and just get a list of permitted characters for each language. It's easy enough to write a regex to match a list of characters -- even a regex klutz like me should be able to do it.
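
A rough sketch of that "list of permitted characters per language" idea in Java. Everything here is an assumption for illustration: the country codes, the basic character list, and the class and method names are hypothetical; only the ten Romanian code points come from the thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class PermittedChars {
    // Hypothetical table: country code -> every character that country permits.
    private static final Map<String, String> ALLOWED = new HashMap<>();
    static {
        String basic = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,-'";
        ALLOWED.put("EN", basic);
        // The Romanian letters listed earlier in the thread.
        ALLOWED.put("RO", basic + "\u015E\u015F\u0162\u0163\u0102\u0103\u00CE\u00EE\u00C2\u00E2");
    }

    /** Builds a pattern that matches any single character NOT permitted for the country. */
    static Pattern forbiddenPattern(String country) {
        String allowed = ALLOWED.get(country);
        if (allowed == null) {
            throw new IllegalArgumentException("No character list for " + country);
        }
        StringBuilder cls = new StringBuilder("[^");
        for (char c : allowed.toCharArray()) {
            // Write each character as a four-digit hex escape so that none of
            // them (for example '-' or '.') is interpreted as regex syntax.
            cls.append(String.format("\\u%04X", (int) c));
        }
        return Pattern.compile(cls.append(']').toString());
    }

    /** True if every character of the input is on the country's permitted list. */
    static boolean isAcceptable(String country, String input) {
        return !forbiddenPattern(country).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("RO", "\u015Etefan")); // true
        System.out.println(isAcceptable("EN", "\u015Etefan")); // false
    }
}
```

In real code the compiled patterns would probably be cached per country rather than rebuilt on every call.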
Raghu Sha
Ranch Hand

Joined: Feb 02, 2013
Posts: 122
How to achieve this?
Please give design approach.

Ivan Jozsef Balazs
Rancher

Joined: May 22, 2012
Posts: 867
    
Raghu Sha wrote:
Please give design approach.


The hints on the design seem to have been ignored by you.


Ascii characters
Char >= 32 && Char <= 255

Country specific allowable characters
0x15E,0x15F,0x162,0x163,0x102,0x103,0xCE,0xEE,0xC2,0xE2


What about this regexp?

^[ -\u00FF\u015E\u015F\u0162\u0163...]*$

That is
"beginning of string, then any character (from space to 0xFF or in the list of the 'country specific allowable characters') any number of times, and then the end of the string"
(I was too lazy to write them all; the three dots stand for the continuation.)
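
A sketch of how this pattern might be tried in Java. It assumes the character class is followed by * so that it can repeat, and it fills in the elided letters from the "country specific allowable characters" list quoted above (0xCE, 0xEE, 0xC2 and 0xE2 already fall inside the space to 0xFF range, so they need no separate entry). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class RomanianInputCheck {
    // Space (0x20) through 0xFF, plus the six listed letters that lie above 0xFF.
    // The backslashes are doubled so that the regex engine, not the Java
    // compiler, interprets the hex escapes.
    private static final Pattern ALLOWED_ONLY = Pattern.compile(
            "^[ -\\u00FF\\u015E\\u015F\\u0162\\u0163\\u0102\\u0103]*$");

    /** True if the whole input consists of allowed characters only. */
    static boolean isValid(String input) {
        return ALLOWED_ONLY.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Bucure\u015Fti")); // true
        System.out.println(isValid("\u0416"));         // false: Cyrillic letter
        System.out.println(isValid("a\tb"));           // false: tab is below 32
    }
}
```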

Ivan Jozsef Balazs
Rancher

Joined: May 22, 2012
Posts: 867
    
Country specific allowable characters


Which country is it? Romania?
Raghu Sha
Ranch Hand

Joined: Feb 02, 2013
Posts: 122
Yes it is Romania.
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1050
    

Raghu Sha wrote:Yes it is Romania.


But we still don't know your requirement! We don't know whether you want to remove the invalid characters, create a set of the invalid characters contained in your input, or simply say whether or not the input has invalid characters. Obviously the regexes for these three requirements are very closely related but not necessarily the same.

So, what is your input and what is your desired output?
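
To make the difference between those three requirements concrete, here is a small Java sketch. The allowed set is assumed to be printable ASCII (space through tilde) purely for illustration, since the real set has not been defined in the thread; the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InvalidCharOps {
    // Assumed allowed set: printable ASCII. The negated class matches anything else.
    private static final Pattern INVALID = Pattern.compile("[^ -~]");

    /** Requirement 1: remove the invalid characters. */
    static String removeInvalid(String input) {
        return INVALID.matcher(input).replaceAll("");
    }

    /** Requirement 2: collect the distinct invalid characters, in order of first appearance. */
    static Set<String> collectInvalid(String input) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = INVALID.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    /** Requirement 3: say whether the input has any invalid character. */
    static boolean hasInvalid(String input) {
        return INVALID.matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "a\u00E9b\u2122c\u00E9";
        System.out.println(removeInvalid(input));         // abc
        System.out.println(collectInvalid(input).size()); // 2
        System.out.println(hasInvalid(input));            // true
    }
}
```

All three share the same negated character class; only the Matcher call around it changes.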
Ivan Jozsef Balazs
Rancher

Joined: May 22, 2012
Posts: 867
    
Raghu Sha wrote:Yes it is Romania.


I happen to live in a neighbouring country, and though I do not speak Romanian, I somehow recognized the letters.
Are you sure about the requirement?
In texts, at least in people's names, other characters might also occur, given that people of other mother tongues
(using different extension letters to the Latin alphabet) also live there.

Romania used the Cyrillic alphabet for a while and (albeit a country of Orthodox faith) switched to Latin later.
Raghu Sha
Ranch Hand

Joined: Feb 02, 2013
Posts: 122
@Richerd.

It should filter Non-Ascii characters from user input.
If user enters, Non-Ascii characters in input, it shouldn't go to data/service layer. (remove those nonAscii chars)

Thanks
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1050
    

Raghu Sha wrote:@Richerd.


Err..... Richard.


It should filter Non-Ascii characters from user input.
If user enters, Non-Ascii characters in input, it shouldn't go to data/service layer. (remove those nonAscii chars)

Thanks


I thought this ASCII requirement had been discarded, since you have agreed that you don't actually mean ASCII! As far as I can see, you still have not defined the actual set of characters you wish to keep or the characters you wish to discard.

You need to use the String.replaceAll() method or the java.util.regex.Matcher.replaceAll() method. You need to spend some time learning about regular expressions in general and regular expressions in Java. Take a look at http://www.regular-expressions.info/tutorial.html and http://docs.oracle.com/javase/tutorial/essential/regex/.

P.S. Once you have defined the character set, the regex you need is trivial.