Remove bad characters from strings
|
Surendra Kumar
Ranch Hand
Joined: Jul 04, 2006
Posts: 87
|
|
Hi, I am trying to remove bad characters from strings. Some of our users copy and paste strings from mainframes and other sources, which brings in unwanted special characters (like an A with a ^ on top, or with two dots on top). Later, when we try to process such strings, some of our applications fail. These characters can't even be typed. Can someone please tell me how to remove them?
|
 |
Paul Clapham
Bartender
Joined: Oct 14, 2005
Posts: 16483
|
|
I can type those characters. And if it were up to me, I would fix the parts of my application that fail when faced with such characters. Speaking of which, why does the application fail? You might also want to read this article to make sure it's not your fault that those characters appear in your inputs. However, if you want to "cleanse" your data, then do this:
1. Make a list of all the characters that you choose to accept.
2. Apply some method that removes any other characters from your inputs. Perhaps a regular expression could be used (I don't know much about them), or just a little filter that copies only valid characters into a validated string.
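A minimal sketch of the two steps above, kept within Java 1.4 (StringBuffer, no generics). The class name AllowListFilter, the method keepAllowed, and the contents of the allow-list are all assumptions made for illustration; none of them come from this thread.

```java
// Hypothetical helper: keeps only characters that appear on an explicit allow-list.
class AllowListFilter {

    // Step 1: the characters we choose to accept (an assumed list, adjust as needed).
    private static final String ALLOWED =
        "abcdefghijklmnopqrstuvwxyz"
        + "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        + "0123456789"
        + " .,;:!?'\"()-_/\t\r\n";

    // Step 2: copy only the accepted characters into a new string.
    static String keepAllowed(String input) {
        StringBuffer out = new StringBuffer(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (ALLOWED.indexOf(c) >= 0) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escapes stand for e-acute, A-circumflex and o-umlaut.
        System.out.println(keepAllowed("Caf\u00E9 \u00C2ngstr\u00F6m 42"));
    }
}
```

Anything not on the list is silently dropped, so the list is worth reviewing with whoever owns the data before using this on real input.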
|
 |
Naseem Khan
Ranch Hand
Joined: Apr 25, 2005
Posts: 809
|
|
Check this thread: Searching a large text file...
|
 |
Surendra Kumar
Ranch Hand
Joined: Jul 04, 2006
Posts: 87
|
|
Thanks, Paul. I'm looking for some sample code that removes these unwanted characters. The application should accept only ASCII characters, and all other characters should be removed. Could you please provide an example of how to do this in Java 1.4? Thanks a lot.
|
 |
Chris Rutkowski
Greenhorn
Joined: Jun 26, 2006
Posts: 8
|
|
|
"A with ^ on top or two dots on top", and other characters like them, are perfectly legal ASCII characters. On a normal U.S. keyboard they can only be typed by holding ALT and entering a number combination on the number pad, but many foreign keyboards are set up differently, and if a language uses these characters, they have their own key just like any other letter. ANY printable character has an associated ASCII value.
|
 |
Surendra Kumar
Ranch Hand
Joined: Jul 04, 2006
Posts: 87
|
|
|
So, do they come under the old 128-character ASCII code?
|
 |
Henry Wong
author
Sheriff
Joined: Sep 28, 2004
Posts: 16692
|
|
Try... Henry
|
Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
|
 |
Surendra Kumar
Ranch Hand
Joined: Jul 04, 2006
Posts: 87
|
|
Hi Henry, thanks for the code. If you don't mind, could you please let me know what exactly this code does?
|
 |
Henry Wong
author
Sheriff
Joined: Sep 28, 2004
Posts: 16692
|
|
Originally posted by Surendra Nichenametla: Hi Henry, thanks for the code. If you don't mind, could you please let me know what exactly this code does?
It's a string substitution. The first parameter is the pattern to match, and the second is what to replace each match with. In this case, the second parameter specifies that each match should be replaced with nothing. As for the first parameter, it is a regular expression that defines what you want to match. Henry
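The snippet Henry posted is not preserved in this thread, so what follows is only a sketch of the kind of call he describes, not his actual code. It assumes String.replaceAll (available since Java 1.4) and an assumed pattern that matches every character outside the ASCII range; the names AsciiOnly and stripNonAscii are made up for illustration.

```java
// Hypothetical example of a regex-based substitution that drops non-ASCII characters.
class AsciiOnly {

    static String stripNonAscii(String input) {
        // First parameter: a regular expression. The character class
        // [^\x00-\x7F] matches any single character NOT in the range 0..127.
        // Second parameter: the replacement, here the empty string,
        // so every match is replaced with nothing.
        return input.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        // The escapes stand for e-acute and A-circumflex.
        System.out.println(stripNonAscii("R\u00E9sum\u00E9 for \u00C2ngel"));
    }
}
```

Note that this pattern keeps ASCII control characters (codes 0 to 31 and 127); a tighter pattern would be needed if those also cause trouble.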
|
 |
Ernest Friedman-Hill
author and iconoclast
Marshal
Joined: Jul 08, 2003
Posts: 24057
|
|
Originally posted by Chris Rutkowski: "A with ^ on top or two dots on top", and other characters like them, are perfectly legal ASCII characters.
No, actually. They're perfectly lovely characters, essential to many languages, but ASCII is a very restricted set of 128 specific characters (codes 0 through 127) that does not include umlauts, accents, etc. -- only those characters used in the English language are included, along with control characters and punctuation (see here). I'm not arguing that applications that only accept ASCII are appropriate in Java -- they're certainly not. I'm just being pedantic about our terms.
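To make the terminology concrete, here is a small sketch that tests whether a character falls inside the ASCII range. The class and method names (AsciiCheck, isAscii) are assumptions for illustration; the only fact relied on is that ASCII assigns codes 0 through 127.

```java
// Hypothetical helper: a char is ASCII if its numeric value is at most 127 (0x7F).
class AsciiCheck {

    static boolean isAscii(char c) {
        return c <= 0x7F;
    }

    public static void main(String[] args) {
        // 'A' is ASCII; A-circumflex (U+00C2) and A-umlaut (U+00C4) are not.
        char[] samples = { 'A', '\u00C2', '\u00C4' };
        for (int i = 0; i < samples.length; i++) {
            System.out.println((int) samples[i] + " ascii=" + isAscii(samples[i]));
        }
    }
}
```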
|
 |
Surendra Kumar
Ranch Hand
Joined: Jul 04, 2006
Posts: 87
|
|
Thanks, Henry. Hi Ernest, our app is trying to convert a text file to a PDF document, and this is where it's failing. It fails with the message: "1 non-printable character. failed to create pdf..." So I wanted to remove all these non-printable characters. Now, with the code snippet given by Henry, I am trying to test this app. Can you please let me know how to produce those extended ASCII characters using a conventional English keyboard?
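For testing, one option that needs no special keyboard is to write the characters in Java source as Unicode escapes (backslash, u, four hex digits). The sketch below builds such a test string and counts the characters that fall outside printable ASCII. What the PDF converter itself treats as "non-printable" is not stated in this thread, so the definition used here (codes 0x20 to 0x7E plus tab, carriage return and line feed) is an assumption, as are the names TestInput, isPrintableAscii and countUnprintable.

```java
// Hypothetical test helper: builds input with Unicode escapes and counts suspect characters.
class TestInput {

    // Assumed definition of "printable": space through tilde, plus common whitespace.
    static boolean isPrintableAscii(char c) {
        return (c >= 0x20 && c <= 0x7E) || c == '\t' || c == '\r' || c == '\n';
    }

    static int countUnprintable(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!isPrintableAscii(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Escapes for i-diaeresis (U+00EF) and A-circumflex (U+00C2).
        String sample = "na\u00EFve \u00C2 test";
        System.out.println("suspect characters: " + countUnprintable(sample));
    }
}
```

Running the cleanup step on a string like this and checking that the count drops to zero is a quick way to confirm the filter does what is expected before feeding the result to the PDF step.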
|
 |
Chris Rutkowski
Greenhorn
Joined: Jun 26, 2006
Posts: 8
|
|
Originally posted by Ernest Friedman-Hill: No, actually. They're perfectly lovely characters, essential to many languages, but ASCII is a very restricted set of 128 specific characters (codes 0 through 127) that does not include umlauts, accents, etc. -- only those characters used in the English language are included, along with control characters and punctuation (see here). I'm not arguing that applications that only accept ASCII are appropriate in Java -- they're certainly not. I'm just being pedantic about our terms.
Thanks for the info, Ernest. My mistake.
|
 |
 |