Remove special characters of any type from the file

Krish Yeruva
Ranch Hand

Joined: Sep 17, 2008
Posts: 58
Hi,
I have a file that contains unpredictable special characters; please see the attachment.
Can anyone help me out? Is there a script that can find and remove these characters?









[Spl.PNG]


Thanks & Regards
KITTU
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1067
Sounds to me like the application you are using to view the file is using the wrong character encoding. What is the application?
Krish Yeruva
Ranch Hand

Joined: Sep 17, 2008
Posts: 58
Hi Richard,
I am not viewing these files. My application has a lot of Unix jobs that use these files, and if a file contains special characters like these, the job fails. So I need a Unix command that finds and removes any kind of special character from the files. Can you please help me out?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42286
Do these characters belong in the files? If so, you'll need to handle them properly. If not, the easiest may be to not put them there in the first place.


Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1067
Krishnareddy Yeruva wrote:
I am not viewing these files.


You must be! The PNG file you attached shows the characters, so you are viewing the file somehow. What did you use to view it?
Krish Yeruva
Ranch Hand

Joined: Sep 17, 2008
Posts: 58
After the job failed because of those bad characters, I opened the file in EditPlus and Notepad. I can see the special characters in both editors.
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1067
Krishnareddy Yeruva wrote:After the job failed because of those bad characters, I opened the file in EditPlus and Notepad. I can see the special characters in both editors.


I assume you mean Windows Notepad. Try opening the file in Notepad and changing the character encoding (I don't use Notepad, but I know there is an option to set the encoding). Once you find out what the character encoding should be, it is easy enough to convert the file to the encoding your Linux scripts assume.
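As a rough sketch of that conversion step: assuming the file turns out to be Windows-1252 and the Linux jobs expect UTF-8 (both are guesses, since the thread never establishes the real encoding), the standard iconv utility can do the conversion. The function name and file names are made up for illustration.

```shell
# Convert stdin from an assumed source encoding to UTF-8 on stdout.
# WINDOWS-1252 is only a guess; substitute the real encoding once it is known.
to_utf8() {
    iconv -f WINDOWS-1252 -t UTF-8
}

# Usage (hypothetical file names):
#   to_utf8 < input.txt > input.utf8.txt
```

If iconv reports an illegal input sequence, the guessed source encoding is probably wrong, which is useful information in itself.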
Krish Yeruva
Ranch Hand

Joined: Sep 17, 2008
Posts: 58
Hi Ulf,

As Ulf mentioned:
Do these characters belong in the files? If so, you'll need to handle them properly. If not, the easiest may be to not put them there in the first place.

These characters don't belong in these files; they are getting into the files by mistake. So if there is a Unix script that removes these special characters, we could validate each file and strip such characters from it before the job runs.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42286
Look into the Unix/Linux "sed" utility; it can be used to remove characters from a file according to some regexp (assuming you can find a regexp that matches precisely the extraneous characters and no others).
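A minimal sketch of the sed approach, under a loud assumption: the only bytes that belong in the file are printable ASCII plus whitespace, so everything else counts as extraneous. If the files legitimately contain accented or other non-ASCII text, this regexp is too aggressive. The function name is hypothetical.

```shell
# Delete every byte that is not printable ASCII or whitespace.
# LC_ALL=C makes sed treat the input as single bytes, so the
# bracket expression does not depend on the user's locale.
# ASSUMPTION: "extraneous" means "anything outside printable ASCII/whitespace".
strip_with_sed() {
    LC_ALL=C sed 's/[^[:print:][:space:]]//g'
}

# Usage (hypothetical file names):
#   strip_with_sed < input.txt > cleaned.txt
```

Writing to a new file rather than editing in place keeps the original around for comparison while the regexp is still being tuned.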
Krish Yeruva
Ranch Hand

Joined: Sep 17, 2008
Posts: 58
Hi,
The main point is that these txt files are processed by Unix jobs. If we had a Unix script to find these special characters, we could remove them from the txt files.

Thanks in advance
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1067
Krishnareddy Yeruva wrote:Hi Ulf,

As Ulf mentioned:
Do these characters belong in the files? If so, you'll need to handle them properly. If not, the easiest may be to not put them there in the first place.

These characters don't belong in these files; they are getting into the files by mistake. So if there is a Unix script that removes these special characters, we could validate each file and strip such characters from it before the job runs.


The only thing that makes these characters 'special' is that you don't want them in your files, and this is where your problem starts. You have to define either the set of characters you want to remove or the set you can accept. Either way, you need to know the character encoding of the file, so that you know how to convert its bytes into characters before filtering. Once you know the character encoding, it is fairly straightforward to write a small program, in almost any language you will find on the Linux box, to filter the content of the file. But I stress: you need to know the character encoding to make this safe.

As Ulf says, the best approach is not to put the 'specials' there in the first place and in your position this would be my first point of attack. I would go back to the people who provided the files and ask them to provide clean files with a known character encoding.
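The "set you can accept" idea can be sketched as a pre-flight check that runs before the job. This is only an illustration under an assumed whitelist (tab, line feed, carriage return and printable ASCII, treated byte by byte); the real accepted set depends on the file's actual encoding, which the thread has not established. The function name is hypothetical.

```shell
# Print the number of bytes on stdin that fall outside the accepted set.
# ASSUMED accepted set: tab (\11), LF (\12), CR (\15), space through '~' (\40-\176).
# tr -d deletes the accepted bytes, so whatever is left is unwanted.
count_bad_bytes() {
    echo $(( $(LC_ALL=C tr -d '\11\12\15\40-\176' | wc -c) ))
}

# Usage (hypothetical file name): refuse to start the job on a dirty file.
#   if [ "$(count_bad_bytes < input.txt)" -gt 0 ]; then
#       echo "input.txt contains unexpected bytes" >&2
#       exit 1
#   fi
```

Failing the check loudly, instead of silently stripping bytes, also gives you evidence to take back to whoever supplies the files.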



Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 16145
Also, assuming that these really are "special characters" and not simply some sort of binary artefacts, the "tr" Unix/Linux utility can be used to translate them into something more meaningful.
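A hedged sketch of the tr approach, again assuming that only tab, line feed, carriage return and printable ASCII are wanted. One function deletes everything else; the other translates each unwanted byte into a visible '?' so the damage is easy to spot. Both function names are made up, and the single-character replacement relies on tr repeating the last character of the second set, which GNU and BSD tr do.

```shell
# Delete every byte outside the assumed accepted set
# (tab \11, LF \12, CR \15, space through '~' \40-\176).
strip_with_tr() {
    LC_ALL=C tr -cd '\11\12\15\40-\176'
}

# Replace every byte outside the accepted set with '?' instead of deleting it.
mask_with_tr() {
    LC_ALL=C tr -c '\11\12\15\40-\176' '?'
}

# Usage (hypothetical file names):
#   strip_with_tr < input.txt > cleaned.txt
#   mask_with_tr  < input.txt > masked.txt
```

Because tr works on bytes here, a multi-byte character in a UTF-8 file is removed or masked one byte at a time, which is another reason to settle the encoding question first.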


 
 