JavaRanch » Java Forums » Java » Java in General
Filter Non-Standard Characters

Anirvan Majumdar
Ranch Hand

Joined: Feb 22, 2005
Posts: 261
hi,

I have developed a transformation application which reads data from Excel sheets and then creates an XML document. However, during the operation certain indeterminate characters creep in. These characters are treated as "space" characters by Java but even if I do a trim then these don't get removed. On comparing them against " " also, nothing happens!

For example, here's a character which is wreaking quite a bit of havoc --> �
There's another one which looks like a small square.

Can anyone suggest how to deal with such occurrences and filter them out from a string using Java?

Doing it manually would be too much of an overhead.
TIA!
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42612
    
I think it would be better to determine where and why those characters "creep in", and stop them from doing so.


Anirvan Majumdar
Ranch Hand

Joined: Feb 22, 2005
Posts: 261
I understand that this problem could be solved rather conveniently if it were possible to fix it at the source. However, as matters stand, the Excel sheets can be created by anyone around the world. The people we are targeting dump database tables into these sheets. Since the dumps usually run to thousands of rows, I don't think it would be feasible for us to ask them to keep an eye out for such funny characters.

Ulf, your suggestion is more like a prevention but in my case the problem already manifests itself. I need to "solve" it.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42612
    
I understand, but if you knew where the characters came in you might see a pattern that would allow you to remove those characters based on position or character code.

These characters are treated as "space" characters by Java but even if I do a trim then these don't get removed.

In what way does Java treat those characters as space - is their code 32? Then trim() should remove them. If their code isn't 32 then they're clearly not space characters.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
The trim() method will only remove whitespace characters at the beginning or end of the String - removing multiple whitespace chars if present. If the characters are embedded elsewhere in the string, trim() will have no effect.
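
A small sketch of that behaviour (the class name TrimDemo and the sample strings are made up for illustration). It also shows that trim() only strips characters with codes up to 32 (U+0020), so a non-breaking space (code 160) survives even at the ends of the string:

```java
// Sketch: what String.trim() does and does not strip.
class TrimDemo {
    // Wraps the trimmed string in brackets so leftover invisible characters show up.
    static String show(String s) {
        return "[" + s.trim() + "]";
    }

    public static void main(String[] args) {
        // Ordinary spaces and tabs at the ends are removed.
        System.out.println(show("  \tabc  "));                 // [abc]
        // Whitespace embedded in the middle stays.
        System.out.println(show("a  b"));                      // [a  b]
        // A non-breaking space (code 160) is above code 32, so trim() leaves it:
        // 2 brackets + 2 non-breaking spaces + 3 letters = 7 characters.
        System.out.println(show("\u00A0abc\u00A0").length());  // 7
    }
}
```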

Anirvan, you can find out exactly what these characters are by casting each char to int and printing it out:
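
A minimal sketch of such a loop; the class name, method name and sample string are hypothetical, and the real input would be the text read from the Excel cell:

```java
// Sketch: print every character of a string next to its numeric code.
class CharCodes {
    // Builds one line per character: index, the character in quotes, and its int value.
    static String describe(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            out.append(i).append(": '").append(c).append("' = ").append((int) c).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: "ab", a non-breaking space (code 160), then "c".
        System.out.print(describe("ab\u00A0c"));
    }
}
```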

Here the quotes are useful since some characters are not visible, and others may cause you to skip a line.

Once you know the numeric value of each character, you can find out exactly what it is by looking up the value in a Unicode chart (such as at unicode.org). Once you know what it is, it will be easier to develop a sensible strategy for dealing with it.


Anirvan Majumdar
Ranch Hand

Joined: Feb 22, 2005
Posts: 261
Guess I finally settled for the "int" value mapping fix. Seems to work fine but I was hoping for a better implementation :-[

Here's what I finally settled for:
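
The code as originally posted is not shown here. Judging from the replies that follow, it kept only characters with low int values, so a sketch along those lines might look like this. The exact range (32 to 126) and all names are assumptions, not the original code:

```java
// Sketch (assumed, not the original post's code): keep only printable ASCII.
class AsciiFilter {
    // Copies characters whose code is between 32 (space) and 126 (~) and drops the rest.
    static String keepPrintableAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int code = (int) c;
            if (code >= 32 && code <= 126) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The non-breaking space (160) and the small square (9633) are dropped.
        System.out.println(keepPrintableAscii("a\u00A0b\u25A1c"));  // abc
    }
}
```

Note that a filter like this also drops tabs, line breaks and every accented letter.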


Thanks again!
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42612
    
This looks potentially dangerous. Can you guarantee that there will never be legitimate characters beyond 127?
Anirvan Majumdar
Ranch Hand

Joined: Feb 22, 2005
Posts: 261
Not unless you're concerned with only "printable" characters. Beyond 127 you get a whole set of Latin characters, the non-breaking space, and a wide bunch of zookie characters from an even wider bunch of world languages.
Of course, the implementation above would be a nightmare for someone developing an application with localization in mind. It's only meant for the nice characters one can see printed on a standard keyboard.
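
For anyone who does need to keep characters beyond 127, one possible alternative is to filter by character category instead of by numeric range. This is only a sketch under assumptions (class and method names are made up, and turning every kind of whitespace into a plain space may not suit every dataset):

```java
// Sketch: normalise odd spaces and drop control characters, keep everything else.
class SelectiveFilter {
    static String clean(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c) || Character.isSpaceChar(c)) {
                // Tabs, line breaks and the non-breaking space all become a plain space.
                out.append(' ');
            } else if (!Character.isISOControl(c)) {
                // Letters, digits, punctuation and accented characters are kept as-is.
                out.append(c);
            }
            // Remaining control characters (codes 0-31 and 127-159) are dropped.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Non-breaking space becomes a space, the bell character (7) is dropped,
        // and the accented e (233) is kept.
        System.out.println(clean("a\u00A0b\u0007c\u00E9"));
    }
}
```

After this step an ordinary trim() works as expected, since only plain spaces remain as whitespace.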
 