Filter Non-Standard Characters

 
Anirvan Majumdar
Ranch Hand
Posts: 261
hi,

I have developed a transformation application that reads data from Excel sheets and creates an XML document. During the operation, however, certain unidentified characters creep in. They appear to be treated as "space" characters by Java, but trim() does not remove them, and comparing them against " " does not match either!

For example, here's a character which is wreaking quite a bit of havoc --> �
There's another one which looks like a small square.

Can anyone suggest how to deal with such occurrences and filter them out from a string using Java?

Doing it manually would be too much of an overhead.
TIA!
 
Ulf Dittmer
Rancher
Posts: 42966
I think it would be better to determine where and why those characters "creep in", and stop them from doing so.
 
Anirvan Majumdar
Ranch Hand
Posts: 261
I understand that the problem could be solved rather conveniently if it were possible to fix it at the source. However, as matters stand, the Excel sheets can be created by anyone around the world. The people we are targeting dump database tables into these sheets. Since the dumps usually run to thousands of rows, I don't think it would be feasible for us to ask them to keep an eye out for such funny characters.

Ulf, your suggestion is more of a preventive measure, but in my case the problem has already manifested itself. I need to "solve" it.
 
Ulf Dittmer
Rancher
Posts: 42966
I understand, but if you knew where the characters came in you might see a pattern that would allow you to remove those characters based on position or character code.

Anirvan Majumdar wrote: "These characters are treated as "space" characters by Java but even if I do a trim then these don't get removed."

In what way does Java treat those characters as space - is their code 32? Then trim() should remove them. If their code isn't 32 then they're clearly not space characters.
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
The trim() method only removes characters at the beginning or end of the String (specifically, any run of characters whose code is 32 or lower). If the characters are embedded elsewhere in the string, or have a code above 32, trim() will have no effect.
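
To make that concrete, here is a small sketch. The class name `TrimDemo` and the helper `survivesTrim` are made up for illustration, and the non-breaking space (code 160) is only an assumed example of a space-like character; the thread has not yet established which character is actually involved.

```java
final class TrimDemo {
    /** True if trim() leaves the given character in place at both ends of a string. */
    static boolean survivesTrim(char c) {
        String s = c + "abc" + c;
        return s.trim().length() == s.length();
    }

    public static void main(String[] args) {
        System.out.println(survivesTrim(' '));        // false: code 32 is trimmed
        System.out.println(survivesTrim('\t'));       // false: code 9 is trimmed
        System.out.println(survivesTrim('\u00A0'));   // true: code 160 is left alone
        System.out.println("[" + "a b".trim() + "]"); // [a b]: embedded space untouched
    }
}
```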

Anirvan, you can find out exactly what these characters are by casting each char to int and printing it out:
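
A minimal sketch of that idea, assuming the cell text is already in a String. The names `CharCodeDump` and `describe` are illustrative only, and the sample input with a non-breaking space is an assumption, not data from the actual sheets:

```java
final class CharCodeDump {
    /** Lists every char of the text with its index, its quoted form, and its numeric code. */
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out.append(i).append(": '").append(c).append("' = ")
               .append((int) c)                              // decimal code
               .append(String.format(" (U+%04X)", (int) c))  // hex, for Unicode charts
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input: an 'a', a non-breaking space, and a 'b'.
        System.out.print(describe("a\u00A0b"));
    }
}
```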

Here the quotes are useful since some characters are not visible, and others may cause you to skip a line.

Once you know the numeric value of the characters, you can find out exactly what each character is by looking up the value in a Unicode chart (such as at unicode.org). Once you know what it is, it will be easier to develop a sensible strategy for dealing with it.
 
Anirvan Majumdar
Ranch Hand
Posts: 261
Guess I finally settled for the "int" value mapping fix. Seems to work fine but I was hoping for a better implementation :-[

Here's what I finally settled for:
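
Roughly speaking, the fix keeps only the characters whose int value falls in the printable ASCII range and drops everything else. A minimal sketch of that idea follows. It is an illustrative reconstruction, not necessarily the exact code used: the names `AsciiFilter` and `keepPrintableAscii` are hypothetical and the range of 32 to 126 is an assumption.

```java
final class AsciiFilter {
    /** Keeps chars with codes 32..126 (assumed range) and silently drops the rest. */
    static String keepPrintableAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 32 && c <= 126) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Non-breaking space and U+FFFD are removed; note that tabs and newlines go too.
        System.out.println(keepPrintableAscii("a\u00A0b\uFFFDc\n")); // abc
    }
}
```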


Thanks again!
 
Ulf Dittmer
Rancher
Posts: 42966
This looks potentially dangerous. Can you guarantee that there will never be legitimate characters beyond 127?
 
Anirvan Majumdar
Ranch Hand
Posts: 261
Not unless you're concerned with only "printable" characters. Beyond 127 what you get is a whole set of Latin characters, the non-breaking space, and a wide bunch of zookie characters from an even wider bunch of world languages.
Of course, the implementation above would be a nightmare for someone developing an application with localization in mind. It's only meant for the nice characters one can see printed on a standard keyboard.
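
For what it's worth, a less drastic filter is possible. The sketch below is only an illustration under assumptions: the class name `SelectiveFilter`, the method `clean`, and the choice of what to drop are all hypothetical. It turns every Unicode space (including the non-breaking space) into a plain space, drops control characters and the U+FFFD replacement character, and leaves accented and other non-ASCII letters alone.

```java
final class SelectiveFilter {
    /** Normalizes space-like characters and drops control/replacement characters. */
    static String clean(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isSpaceChar(cp) || Character.isWhitespace(cp)) {
                out.append(' ');            // NBSP, tab, newline, etc. become a plain space
            } else if (cp == 0xFFFD || Character.isISOControl(cp) || !Character.isDefined(cp)) {
                // dropped: replacement char, control chars, unassigned code points
            } else {
                out.appendCodePoint(cp);    // everything else, including non-ASCII letters
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("caf\u00E9\u00A0au\uFFFDlait\u0007"));
    }
}
```

Whether replacing or dropping is right for a given character still depends on what the dump actually contains, so it is worth printing the codes first.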
 